Slowly changing dimension

Slowly changing dimension

Dimension is a term in data management and data warehousing that refers to logical groupings of data such as geographical location, customer information, or product information. Slowly Changing Dimensions (SCD) are dimensions that have data that slowly changes.

For example, you may have a Dimension in your database that tracks the sales records of your company's salesmen. Creating sales reports seems simple enough, until a sales person is transferred from one regional office to another. How do you record such a change in your sales Dimension?

You could sum or average the sales by salesperson, but if you use that to compare the performance of sales people, that might give misleading information. If the sales person that was transferred used to work in a hot market where sales were easy, and now works in a market where sales are infrequent, his totals will look much stronger than the other salespeople in his new region, even if they are just as good. Or you could create a second sales-person record and treat the transferred person as a new sales person, but that creates problems also.

Dealing with these issues involves SCD management methodologies referred to as Type 0, 1, 2, 3, 4, and 6. Type 6 SCDs are also sometimes called Hybrid SCDs.

The most common slowly changing dimensions are Types 1, 2, and 3.

Type 0

Type 0 is used somewhat infrequently, to refer to an SCD where no effort has been made to deal with the issues.

Type 1

The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. This is most appropriate when correcting certain types of data errors, such as the spelling of a name. (Assuming you won't ever need to know how it used to be misspelled in the past.)

Another example would be of a database table that keeps supplier information.

Another popular method for tuple versioning is to add effective date columns.

To query the Star Schema using the historic reference data as of the date of the delivery, the query looks like this:SELECTdelivery.delivery_cost,supplier.supplier_name,supplier.supplier_state

FROM delivery

INNER JOIN supplier ON delivery.supplier_key = supplier.supplier_key

WHEREdelivery.delivery_date >=supplier.start_dateAND delivery.delivery_date To query using the order date, the SQL simply needs to be changed to reference the Order_Date.

To query the Star Schema using the current reference data, the query looks like this:SELECTdelivery.delivery_cost,supplier.supplier_name,supplier.supplier_state

FROM delivery

INNER JOIN supplier ON delivery.supplier_key = supplier.supplier_key

WHEREsupplier.start_date <= NOW()AND supplier.end_date > NOW()

Notes
* NOW() is PostgreSQL syntax, getdate() is SQL Server syntax, for Oracle use SYSDATE

* Caution - If the WHERE clause restricting the rows in the dimension table is not present, then the query will potentially return duplicate rows and give the wrong answers, so this technique should be used with care.

* Some Business Intelligence tools do not handle generating complex joins such as this well.

* The Etl processes needed to create this table also need to be carefully designed to ensure that there are no overlaps in the time periods for each distinct item of reference data.

* The >= and < clause is required to ensure no time period is omitted.

* Often a view is created over the table which can filter out the rows or columns. This simplifies joins to the table if only the current rows are needed for certain queries. This view could be materialized to a physical table if storage space is not a problem. This can be done automatically by most modern DBMS's and automatically kept up to date.

Type 6 / Hybrid - Alternative Implementation

One of the drawbacks of the above implementation is that a many-to-many relationship between the dimension and the fact table is never resolved into the data model.This many-to-many relationship can't be resolved at a logical and Physical Data Model Standpoint. The way it is designed, it will be only resolved at runtime, at a report level, when end-user fills in the 'As At Date' parameter. The consequence is that no referential integrity can be enforced at a RDBMS level between the fact and the dimension table.A variant of this SCD type 6 exists that implements all the advantages of SCD6 without this inconvenience.

This variant is based on the dimension table primary key made of a surrogate key and a version number (and not a Start date).The version number of the current dimension record would always be 0.Before a dimension record gets updated, the version 0 is copied over a new record (version n+1) and the version 0 can be updated.

In the fact table, when a fact is added, it can be only using the current dimension record, Consequently, all the fact records will have a version number equals to 0.

The advantage of this SCD Type 6 implementation is that the many to many between the fact and dimension is resolved, the priority given to the AS IS version (version number=0).

Another advantage is that the RDBMS can ensure the referential integrity between the fact table and the dimension table. Finally, when using a Design Tool like Erwin, Powerdesigner...in the Physical Model, fact tables are linked to dimension tables.The Dimension surrogate key + the dimension version number is a compound primary key , and can be stored into the fact table , even though the version number will be always 0.

In the implementation using the Start date in the primary key, all the tables in the Physical Data Model will look like standalone tables and no integrity constraint can be enforced by the RDBMS (even though it is not always desirable).

Combining Types

Different types can be applied to different columns of a table. For example, we can apply Type 1 to Supplier Name column and Type 2 to Supplier State column of the same table, the Supplier table.

Notes

References

*Ralph Kimball, Margy Ross: "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition)", John Wiley & Sons, ISBN 0-471-20024-7

*Bruce Ottmann, Chris Angus: "Data processing system", US Patent Office, Patent Number [http://patft.uspto.gov/netacgi/nph-Parser?u=%2Fnetahtml%2Fsrchnum.htm&Sect1=PTO1&Sect2=HITOFF&p=1&r=1&l=50&f=G&d=PALL&s1=7003504.PN.&OS=PN/7003504&RS=PN/7003504 7,003,504] . February 21, 2006

*Ralph Kimball:"Kimball University: Handling Arbitrary Restatements of History" [http://www.informationweek.com/showArticle.jhtml?articleID=204800027&pgno=1&queryText=] . December 9, 2007


Wikimedia Foundation. 2010.

Игры ⚽ Поможем написать реферат

Look at other dictionaries:

  • Slowly Changing Dimensions — Unter dem Begriff Slowly Changing Dimensions (deutsch: sich langsam verändernde Dimensionen) werden im Data Warehousing Methoden zusammengefasst, um Änderungen in Dimensionstabellen zu erfassen und ggf. historisch zu dokumentieren. Im… …   Deutsch Wikipedia

  • Dimension (data warehouse) — This article is about a dimension in a data warehouse. For other uses, see dimension (disambiguation). In a data warehouse, a dimension is a data element that categorizes each item in a data set into non overlapping regions. A data warehouse… …   Wikipedia

  • Dimension à évolution lente — En informatique décisionnelle, on parle d une dimension à évolution lente (slowly changing dimension) lorsqu une dimension peut subir des changements de description des membres. Un client peut changer d’adresse, se marier, ... Un produit peut… …   Wikipédia en Français

  • Dimension table — In data warehousing, a dimension table is one of the set of companion tables to a fact table. The fact table contains business facts or measures and foreign keys which refer to candidate keys (normally primary keys) in the dimension tables.… …   Wikipedia

  • Degenerate dimension — Contents 1 The Kimball definition 2 Other uses of the term 3 See also 4 External reference 5 Notes …   Wikipedia

  • Data warehouse — Overview In computing, a data warehouse (DW) is a database used for reporting and analysis. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store for additional operations… …   Wikipedia

  • Scd — The abbreviation, acronym, or initialism SCD has several different meanings:In medicine* Sperm chromatin dispersion test * Semen collection device [cite journal |author=Sofikitis NV, Miyagawa I |title=Endocrinological, biophysical, and… …   Wikipedia

  • Change data capture — In databases, change data capture (CDC) is a set of software design patterns used to determine (and track) the data that has changed so that action can be taken using the changed data. Also, Change data capture (CDC) is an approach to data… …   Wikipedia

  • Open Source Business Intelligence — L OSBI, acronyme de Open Source Business Intelligence, regroupe l ensemble des solutions et techniques liées au décisionnel et dont le modèle s appuie sur l Open Source. Ce concept, malgré la ressemblance, n a aucun rapport avec l Open Source… …   Wikipédia en Français

  • SCD — Sudden Cardiac Death (Medical » Physiology) Sickle Cell Disease (Medical » Physiology) * Specific Carbohydrate Diet (Miscellaneous » Food) * Scientific Computing Division (Business » General) * Sports Collectors Digest (Community » Media) *… …   Abbreviations dictionary

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”