Monday 27 June 2011

Problem with History - a look at surrogate key usage

A surrogate key is a substitute for the natural primary key.

It is simply a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it be unique for each row in the table.

Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the primary keys of dimension tables. The key can come from an Informatica sequence generator, an Oracle sequence, or a SQL Server IDENTITY column.
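As a minimal sketch (in Python, with hypothetical names), the idea is just a sequence that hands out the next number as each dimension row arrives, standing in for the sequence generator or IDENTITY column mentioned above:

```python
# Minimal sketch of surrogate key assignment. The function name and row shape
# are illustrative; in practice the sequence would be an Informatica sequence
# generator, an Oracle SEQUENCE, or a SQL Server IDENTITY column.
from itertools import count

def assign_surrogate_keys(rows, key_seq=None):
    """Attach a unique SID to each incoming dimension row."""
    seq = key_seq or count(start=1)
    return [dict(row, sid=next(seq)) for row in rows]

customers = [{"cust_no": "C100"}, {"cust_no": "C200"}]
dim_rows = assign_surrogate_keys(customers)
# dim_rows[0]["sid"] == 1, dim_rows[1]["sid"] == 2
```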

It is useful because the natural primary key (e.g., Customer Number in the Customer table) can change, and such changes make updates more difficult.

Some tables have columns such as AIRPORT_NAME or CITY_NAME which the business users regard as primary keys, but not only can these values change, indexing on a numeric value is usually better. You could instead create a surrogate key called, say, AIRPORT_ID. This would be internal to the system, and as far as the client is concerned you might display only the AIRPORT_NAME.



Another benefit you can get from surrogate keys (SIDs) is:

Tracking the SCD - Slowly Changing Dimension.

Let me give you a simple, classical example:

On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that is what would be in your Employee Dimension), and has turnover allocated to him under Business Unit 'BU1.' But on the 2nd of June, Employee 'E1' is moved from Business Unit 'BU1' to Business Unit 'BU2.' All the new turnover has to belong to the new Business Unit 'BU2,' but the old turnover should still belong to Business Unit 'BU1.'

If you used the natural business key 'E1' for your employee within your data warehouse, everything would be allocated to Business Unit 'BU2,' even what actually belongs to 'BU1.'

If you use surrogate keys, you could create on the 2nd of June a new record for the Employee 'E1' in your Employee Dimension with a new surrogate key.

This way, in your fact table, you have your old data (before 2nd of June) with the SID of the Employee 'E1' + 'BU1.' All new data (after 2nd of June) would take the SID of the employee 'E1' + 'BU2.'

You could think of a Slowly Changing Dimension as an enlargement of your natural key: the natural key of the employee was Employee Code 'E1,' but for you it becomes Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2.' The difference from a simple natural-key enlargement is that you might not have every part of the new key within your fact table, so you might not be able to join on the enlarged key - hence the need for another ID.
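A small sketch (in Python, with made-up dates and SIDs) shows how the lookup plays out: the dimension holds two rows for 'E1', one per business unit, each with its own SID and validity window, and each fact row picks up the SID in effect on its transaction date.

```python
# Hypothetical sketch of the SCD type 2 lookup described above.
from datetime import date

employee_dim = [
    {"sid": 1, "emp": "E1", "bu": "BU1",
     "valid_from": date(2002, 1, 1), "valid_to": date(2002, 6, 1)},
    {"sid": 2, "emp": "E1", "bu": "BU2",
     "valid_from": date(2002, 6, 2), "valid_to": date(9999, 12, 31)},
]

def lookup_sid(emp, txn_date):
    """Find the dimension SID whose validity window covers the transaction date."""
    for rec in employee_dim:
        if rec["emp"] == emp and rec["valid_from"] <= txn_date <= rec["valid_to"]:
            return rec["sid"]
    return None

# Turnover booked in March lands on BU1's record; turnover in July on BU2's.
assert lookup_sid("E1", date(2002, 3, 15)) == 1
assert lookup_sid("E1", date(2002, 7, 15)) == 2
```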



The Problem with History
Performing an initial load of a type 2 dimension from a history-tracking source
A source system that keeps its own history presents interesting problems when used to load a type 2 history-tracking dimension table. The problem I am referring to arose in an actual project where we had a source system (SAP) which was keeping its own history of changes to an entity in a comprehensive, audit-trail sort of way. It appeared to have been tracking every change to the entity in question. But, in the data warehouse, for the corresponding dimension we only wanted to track history for a subset of attributes, treating the rest as type 1 attributes. This article focuses on the initial load, not the continuing maintenance, of a dimension from a set of history-tracking source records.

Let’s say that, in the source system (an HR application, for instance), we have the records shown in Figure 1.
 
[Figure 1: source records]

In this table, EMPKEY is the "person key" or "person ID," which is a surrogate key in the source system itself. Social Security Number, which is often thought of as a key, is a simple attribute. The primary key of the table is EMPKEY + VALID_FROM. A real HR system would obviously have more attributes than this, but for the purposes of this article all we need is a mix of type 1 and type 2 attributes.

The table tells a tale. Jo, a female living in Michigan, is hired on 12/3/1998. On 12/27/1998, the HR staffers discover they entered her SSN incorrectly and change it; the source system automatically creates a new record with a new VALID_FROM date of 12/28/1998. Seven years go by until, in April of 2005, Jo, who has always felt uncomfortable as a female, goes into the online HR portal and changes her first name to "Joe"; the source system dutifully tracks the change. On August 8, Joe finally gets her operation, and an HR staff member changes his sex from "F" to "M" and his name to "Joseph." On February 13, Joseph decides he really prefers to go by "Joe" again, until July 5, when he flips his name back to "Joseph" and transfers to the company's Big Apple office. On Christmas Eve, weary of the brusque and chilly streets of Manhattan, Joseph returns to Michigan. Jim or James, the other person in this table, has a simpler history: he changes from "Jim" to "James" on 3/16/2004 and then moves to Indiana on 6/23/2007. He is apparently let go on 8/31/2007. (In this HR system, if your latest record doesn't have a NULL in VALID_TO, it means you're terminated as of that date.)

The business users have stated that they don’t care about tracking or reporting on historical changes to first names or SSNs, but they do care about sex and state. In data warehousing terms, FIRSTNAME and SSN are type 1 attributes, and SEX and STATE are type 2.

We are doing an initial load of our PERSON dimension. Again, we do not cover, in this article, how to deal with comparisons between a history-tracking, type 2-ish source and an already-populated dimension table.

Ultimately, our PERSON dimension table should look like Figure 2.

[Figure 2: target PERSON dimension]

DIM_K is just a standard dimension table surrogate key; only those records for the two people in question are displayed. Note that where we have a currently active record with no "natural" or source end date (the employee is still employed, the person is still alive, etc.), we enter a fake "high date" to make querying and joins easier. (This sort of thing will, doubtless, cause a massive "Year 9999 Problem.") Our PERSON dimension table tracks history for those attributes we care about, but washes out the history for the type 1 attributes. To get to this blissful state, we need to sift out the type 1 changes from the type 2 and only make new dimension records where we have a type 2 change.

In what follows, we’ll stick to fairly simple set logic. We’ll use no cursors or other looping-type constructs. While an ETL tool like Informatica or DataStage would be handy, any of this could be done fairly easily in straight SQL. We’ll also keep all the steps simple, easy to understand and discrete. It’s possible to create enormous heaps of nested SQL to do everything in one statement, but it's best to keep everything understandable. We create and label "tables," but whether you choose to actually create real database tables, or the tables are just record sets inside a flow, the process remains essentially unchanged.

Our first step is to order and number our records. Depending on what system you're using, there will be any number of ways to do this, but start by ordering your source records by EMPKEY, VALID_FROM and adding a row-numbering RANK column (see Figure 3).

[Figure 3: source records ordered and ranked]
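The ordering-and-numbering step can be sketched in Python (the sample values below are abbreviated stand-ins for the records in Figure 1; column names follow the article):

```python
# Order by EMPKEY, VALID_FROM and attach a 1-based RANK column.
src = [
    {"empkey": 2, "valid_from": "2004-03-16"},
    {"empkey": 1, "valid_from": "1998-12-03"},
    {"empkey": 1, "valid_from": "1998-12-28"},
]

src_ranked = [
    dict(row, rank=i)
    for i, row in enumerate(
        sorted(src, key=lambda r: (r["empkey"], r["valid_from"])), start=1)
]
# Rows are now ordered by EMPKEY, VALID_FROM and numbered 1..n.
```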

Next, we’ll need the earliest and latest records for each natural key. The earliest ones will form a sort of "base" for future joins; the latest ones carry the values we’ll use for the type 1 attributes. (The business rule here is that the latest records hold the type 1 attributes to use; other rules would demand different coding.)

NK_EARLIEST
select
s.*
from
src_ranked s inner join
(select key, min(rank) r from src_ranked group by key) mins
on s.rank = mins.r
[Figure 4: NK_EARLIEST result]
NK_LATEST
select
s.*
from
src_ranked s inner join
(select key, max(rank) r from src_ranked group by key) maxes
on s.rank = maxes.r
[Figure 5: NK_LATEST result]
The existence of this NK_LATEST table will serve us well later on.
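The same earliest/latest selection can be sketched in Python: pick the lowest- and highest-ranked row per natural key (assumes a `src_ranked` list like the one built in the ordering step; values here are abbreviated):

```python
# NK_EARLIEST / NK_LATEST in miniature: min- and max-ranked row per key.
from itertools import groupby

src_ranked = [
    {"empkey": 1, "rank": 1}, {"empkey": 1, "rank": 2},
    {"empkey": 2, "rank": 3}, {"empkey": 2, "rank": 4},
]

def per_key(rows, pick):
    # groupby requires rows already sorted by the grouping key, which they
    # are, since src_ranked is ordered by EMPKEY first.
    return [pick(list(grp), key=lambda r: r["rank"])
            for _, grp in groupby(rows, key=lambda r: r["empkey"])]

nk_earliest = per_key(src_ranked, min)
nk_latest = per_key(src_ranked, max)
```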
Next, let’s make a table of the type 2 columns only. Within these we’ll need to sift out the records for a given natural key where there’s no change.

TYPE2_COLS
select
rank
,key
,sex
,state
,valid_from
,valid_to
from
src_ranked
order by
rank
[Figure 6: TYPE2_COLS result]
In the table in Figure 6, the column values that will form the VALID_FROMs in our dimension are highlighted in red, and those that will supply the VALID_TOs are in blue.

For our next trick, inner join this table to itself, each record to the record one above it, in rank, where the key (natural key) is the same and where there’s something different about the type 2 attributes:

TYPE2_CHANGES
select
t2c1.rank rank1
,t2c1.key key1
,t2c1.sex sex1
,t2c1.state state1
,t2c1.valid_from valid_from1
,t2c1.valid_to valid_to1
,t2c2.rank rank2
,t2c2.key key2
,t2c2.sex sex2
,t2c2.state state2
,t2c2.valid_from valid_from2
,t2c2.valid_to valid_to2
from
type2_cols t2c1 inner join type2_cols t2c2
on t2c1.rank + 1 = t2c2.rank
and t2c1.key = t2c2.key
where
t2c1.sex <> t2c2.sex or t2c1.state <> t2c2.state
[Figure 7: TYPE2_CHANGES result]
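The self-join can be sketched in Python: pair each row with the next-ranked row of the same natural key and keep only pairs where a type 2 attribute actually differs (the sample rows are abbreviated stand-ins):

```python
# TYPE2_CHANGES in miniature: adjacent-row pairs with a type 2 difference.
type2_cols = [
    {"rank": 1, "key": 1, "sex": "F", "state": "MI"},
    {"rank": 2, "key": 1, "sex": "F", "state": "MI"},  # type 1-only change row
    {"rank": 3, "key": 1, "sex": "M", "state": "MI"},  # sex changed here
]

type2_changes = [
    (a, b)
    for a in type2_cols for b in type2_cols
    if a["rank"] + 1 == b["rank"] and a["key"] == b["key"]
    and (a["sex"] != b["sex"] or a["state"] != b["state"])
]
# Only the (rank 2, rank 3) pair survives; the type 1-only change drops out.
```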
We can see that we’re very close to what we’re looking for in the VALID_TO1 and VALID_FROM2 columns. But what we want (leaving out the irrelevant columns) is shown in Figure 8.

[Figure 8: desired VALID_FROM/VALID_TO pairs]
To get there, do a quasi-pivot of TYPE2_CHANGES, like this:
TYPE2_C_PIV
select * from
(
select
rank1 rank
,key1 key
,sex1 sex
,state1 state
,valid_to1 valid_date
from
type2_changes
union
select
rank2 rank
,key2 key
,sex2 sex
,state2 state
,valid_from2 valid_date
from
type2_changes
) s
order by rank
[Figure 9: TYPE2_C_PIV result]
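The quasi-pivot is just "each change pair contributes two rows": the old side carries its VALID_TO and the new side its VALID_FROM. A Python sketch (abbreviated sample data):

```python
# TYPE2_C_PIV in miniature: unpivot each change pair into two dated rows.
type2_changes = [
    ({"rank": 2, "key": 1, "valid_to": "2005-08-07"},
     {"rank": 3, "key": 1, "valid_from": "2005-08-08"}),
]

type2_c_piv = sorted(
    [{"rank": old["rank"], "key": old["key"], "valid_date": old["valid_to"]}
     for old, _ in type2_changes] +
    [{"rank": new["rank"], "key": new["key"], "valid_date": new["valid_from"]}
     for _, new in type2_changes],
    key=lambda r: r["rank"])
```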
Add the handy-dandy, presaved NK_EARLIEST and NK_LATEST records back in, and order by rank:
TYPE2_C_ALL
select * from
(
select * from type2_c_piv
union
select
rank
,key
,sex
,state
,valid_from valid_date
from nk_earliest
union
select
rank
,key
,sex
,state
,IsNull(valid_to,'12/31/9999') valid_date
from nk_latest
) s
order by rank, valid_date
[Figure 10: TYPE2_C_ALL result]

Put an ordinal or row number on these rows again:

[Figure 11: rows renumbered]
The odd-numbered records contain the VALID_FROM dates, and the next-higher-numbered row contains the VALID_TO dates. Join ‘em, Dano. The following assumes the availability of a mod (modulus) operator to find the odd-numbered rows; there are certainly other ways to find odd numbers.

PRE_DIM
select
t1.key
,t1.sex
,t1.state
,t1.valid_date 'valid_from'
,t2.valid_date 'valid_to'
from
type2_c_all_ranked t1 inner join type2_c_all_ranked t2
on t1.new_rank + 1 = t2.new_rank
where
mod (t1.new_rank,2) = 1
[Figure 12: PRE_DIM result]
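The odd/even pairing can be sketched in Python (sample rows are abbreviated; the modulus test mirrors the SQL):

```python
# PRE_DIM in miniature: odd rows carry VALID_FROM, the following even row
# carries the matching VALID_TO.
type2_c_all_ranked = [
    {"new_rank": 1, "key": 1, "valid_date": "1998-12-03"},
    {"new_rank": 2, "key": 1, "valid_date": "2005-08-07"},
    {"new_rank": 3, "key": 1, "valid_date": "2005-08-08"},
    {"new_rank": 4, "key": 1, "valid_date": "9999-12-31"},
]

pre_dim = [
    {"key": a["key"], "valid_from": a["valid_date"], "valid_to": b["valid_date"]}
    for a, b in zip(type2_c_all_ranked[::2], type2_c_all_ranked[1::2])
    if a["new_rank"] % 2 == 1  # same odd-row test as mod(new_rank, 2) = 1
]
```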
The remainder is trivial. Join to the "latest" records on the natural key to pick up the type 1 dimension, and, while it’s handy, set the current flag:

Add_T1 ("add type 1")
select
pd.key as 'nat_k'
,l.first_name
,l.ssn
,pd.sex
,pd.state
,pd.valid_from
,pd.valid_to
,case when pd.valid_from = l.valid_from then 'Y' else 'N' end as 'current_flag'
from
pre_dim pd inner join nk_latest l
on pd.key = l.key
Then, generate surrogate keys for your target dimension table and load it.
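The wrap-up steps can be sketched in Python (sample values and the `nk_latest` shape are illustrative): join PRE_DIM back to the latest records for the type 1 attributes, set the current flag, and mint surrogate keys from a sequence.

```python
# Final assembly in miniature: type 1 attributes, current flag, surrogate key.
from itertools import count

pre_dim = [
    {"key": 1, "valid_from": "1998-12-03", "valid_to": "2005-08-07"},
    {"key": 1, "valid_from": "2005-08-08", "valid_to": "9999-12-31"},
]
# Latest source record per natural key, indexed by key for the join.
nk_latest = {1: {"first_name": "Joseph", "ssn": "123-45-6789",
                 "valid_from": "2005-08-08"}}

seq = count(start=1)
dim = [
    {"dim_k": next(seq), "nat_k": row["key"],
     "first_name": nk_latest[row["key"]]["first_name"],
     "ssn": nk_latest[row["key"]]["ssn"],
     "valid_from": row["valid_from"], "valid_to": row["valid_to"],
     # Current record = the one whose VALID_FROM matches the latest record's.
     "current_flag": "Y" if row["valid_from"] == nk_latest[row["key"]]["valid_from"] else "N"}
    for row in pre_dim
]
```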




