Cleansing and Conforming Data

Cleansing and conforming data is the process that changes data on its way from the source system(s) to the data warehouse. It can also identify and record errors about the data, and that information can in turn be used to fix how the source system(s) work(s).

Good source data quality is a matter of "data quality culture" and must be driven from the top of the organization. It is not simply a matter of implementing strong validation checks on input screens: no matter how strong such checks are, users can often find ways to circumvent them.

There is a nine-step guide for organizations that wish to improve data quality:

  • Declare a high-level commitment to a data quality culture
  • Drive process reengineering at the executive level
  • Spend money to improve the data entry environment
  • Spend money to improve application integration
  • Spend money to change how processes work
  • Promote end-to-end team awareness
  • Promote interdepartmental cooperation
  • Publicly celebrate data quality excellence
  • Continuously measure and improve data quality

Data Cleansing System

The essential job of this system is to strike a suitable balance between fixing dirty data and keeping the data as close as possible to the original data from the source production system; striking this balance is a challenge for the ETL architect.

The system should offer an architecture that can cleanse data, record quality events, and measure and control the quality of data in the data warehouse.

A good start is to perform a thorough data profiling analysis that helps define the required complexity of the data cleansing system and also gives an idea of the current data quality in the source system(s).
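
As an illustration, a minimal profiling pass might count empty values and distinct values per column of a source extract. The sketch below is in Python; the file name and the CSV input format are assumptions for illustration, not part of any particular product.

    import csv
    from collections import Counter

    SOURCE_FILE = "customers.csv"  # hypothetical source extract

    def profile(path):
        """Count rows, empty values and distinct values per column."""
        rows = 0
        empties = Counter()
        distinct = {}
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                rows += 1
                for col, val in row.items():
                    if val is None or val.strip() == "":
                        empties[col] += 1
                    distinct.setdefault(col, set()).add(val)
        for col, values in distinct.items():
            print(f"{col}: {empties[col]}/{rows} empty, "
                  f"{len(values)} distinct values")

    if __name__ == "__main__":
        profile(SOURCE_FILE)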

Quality Screens

Part of the data cleansing system is a set of diagnostic filters known as quality screens. Each implements a test in the data flow that, if it fails, records an error in the Error Event Schema. Quality screens are divided into three categories, each illustrated in the sketch following this list:

  • Column screens. These test an individual column, e.g. for unexpected values such as NULL values, non-numeric values that should be numeric, or out-of-range values.
  • Structure screens. These test the integrity of relationships between columns (typically foreign/primary keys) within the same table or across tables. They are also used to test that a group of columns is valid according to some structural definition it should adhere to.
  • Business rule screens. The most complex of the three. These test whether data, possibly spanning multiple tables, follow specific business rules. For example, if a customer is marked as a certain type of customer, the business rules that define this kind of customer should be adhered to.
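
A minimal sketch of the three screen types as predicates over rows, in Python. All column names, thresholds and rules below are hypothetical examples:

    def column_screen_amount(row):
        """Column screen: 'amount' must be present and numeric."""
        try:
            float(row["amount"])
            return True
        except (KeyError, TypeError, ValueError):
            return False

    def structure_screen_customer_fk(row, known_customer_ids):
        """Structure screen: customer_id must exist in the customer
        dimension (foreign key integrity)."""
        return row.get("customer_id") in known_customer_ids

    def business_rule_screen_vip(row):
        """Business rule screen: a customer marked 'VIP' must meet the
        (hypothetical) rule that defines a VIP customer."""
        if row.get("customer_type") != "VIP":
            return True  # rule only applies to VIP customers
        return float(row.get("lifetime_revenue", 0)) >= 10000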

When a quality screen records an error, it can either halt the dataflow process, send the faulty data somewhere other than the target system, or tag the data. The last option is considered the best solution, because the first requires someone to deal manually with the issue each time it occurs, and the second implies that data are missing from the target system (compromising integrity) and it is often unclear what should happen to these data.
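
Tagging can be sketched as follows: every row passes through all screens, failures are logged and recorded on the row itself, and the row still reaches the target system. The registry structure and names below are assumptions for illustration:

    def run_screens(row, screens, error_log):
        """Apply each quality screen to a row; failed screens are
        tagged on the row and logged instead of halting the flow
        or dropping the row."""
        tags = []
        for screen in screens:
            if not screen["test"](row):
                tags.append(screen["name"])
                error_log.append({"screen": screen["name"],
                                  "severity": screen["severity"],
                                  "row": dict(row)})
        row["audit_tags"] = tags  # tagged row continues to the target
        return row

    # Hypothetical usage with a single column screen.
    screens = [{"name": "amount_numeric", "severity": "warning",
                "test": lambda r: str(r.get("amount", ""))
                                  .replace(".", "", 1).isdigit()}]
    error_log = []
    run_screens({"amount": "abc"}, screens, error_log)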

Error Event Schema

This schema is where all error events thrown by quality screens are recorded. It consists of an Error Event Fact table with foreign keys to three dimension tables representing the date (when), the batch job (where) and the screen (which test produced the error). It also holds information about exactly when the error occurred and the severity of the error. In addition, there is an Error Event Detail Fact table with a foreign key to the main table; it contains detailed information about the table, record and field in which the error occurred, and the error condition.
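
The schema might be expressed roughly as below, here using Python's built-in sqlite3 module. The column names are simplified assumptions and not taken verbatim from any reference design:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE error_event_fact (
        error_event_id   INTEGER PRIMARY KEY,
        date_key         INTEGER NOT NULL,  -- when: date dimension
        batch_key        INTEGER NOT NULL,  -- where: batch job dimension
        screen_key       INTEGER NOT NULL,  -- who: quality screen dimension
        event_timestamp  TEXT NOT NULL,     -- exact time of the error
        severity         TEXT NOT NULL
    );
    CREATE TABLE error_event_detail_fact (
        detail_id        INTEGER PRIMARY KEY,
        error_event_id   INTEGER NOT NULL REFERENCES error_event_fact,
        table_name       TEXT,              -- where the error occurred
        record_id        TEXT,
        field_name       TEXT,
        error_condition  TEXT
    );
    """)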

Sources

  • Kimball, R., Ross, M., Thornthwaite, W., Mundy, J., Becker, B. The Data Warehouse Lifecycle Toolkit, Wiley Publishing, Inc., 2008. ISBN 978-0-470-14977-5.
  • Olson, J. E. Data Quality: The Accuracy Dimension, Morgan Kaufmann, 2002. ISBN 1558608915.
