Monday 27 June 2011

To Surrogate Key or Not

Design Challenge
Information Management Magazine, August 2006
A great way to sharpen our analysis and modeling skills is to continuously address real-world scenarios. A modeling scenario, along with suggested solutions, appears each month in this Design Challenge column. The scenario is emailed to more than 1,000 designers who are up to the challenge. Many of the responses, including my own, are consolidated here. If you would like to become a Design Challenger and have the opportunity to submit modeling solutions, please contact the author.

[Design Challenge scenario image not recovered]
The Response
A surrogate key is a unique identifier for a table, a numeric counter introduced early in the physical design stage. No meaning is implied by a surrogate key value. For example, you can't look at a month identifier of 1 and assume that it represents January. A surrogate key allows for more efficient navigation across structures and facilitates integration across applications. However, before telling the manager to always use surrogates, we need to consider the drawbacks: surrogates can take more time and effort to trace back to business terms, and they add complexity to the development phase. Therefore, we need guidelines for determining where it makes sense to create a surrogate key.

Create a surrogate key if the entity has a natural key and any of the following conditions hold:

  • There are multiple sources;
  • The natural key is inefficient; or
  • The natural key is recycled.

Surrogate versus Virtual
A natural key is ideally how the business would identify an entity instance, and a surrogate key is a substitute for this natural key. For example, a natural key for an organization could be taxpayer identifier or DUNS (D&B) Number. If no natural key exists, we can create a virtual key. A virtual key is a unique numeric counter used when it is not possible to identify an entity instance. For example, a manufacturing company has the concept of an adjustment. An adjustment is when a pallet is moved between shelves in a warehouse. The business has no way of identifying this kind of transaction, so the virtual key adjustment identifier is created.

Create a Surrogate if Multiple Sources Exist
If the entity you are modeling contains data from multiple sources, there is usually value in creating a surrogate key. For example, assume you learn that both Henry Winkler the student and Hank Winkler the instructor are really the same person. You can add this person as a unique row to the person entity identified by a person identifier surrogate key. This person can then play the two roles of instructor and student.

Create a Surrogate if the Natural Key is Inefficient
If the natural key is inefficient, create a surrogate key. Inefficient means that the data elements which make up the natural key have performance or space issues when joining across tables. This surfaces most frequently when joining a reference to a transaction table, such as a dimension to a fact table. For example, the business might identify a promotion by a promotion code and promotion start date. Although the code and start date have little impact on a promotion table containing only a few thousand rows, there will be an impact on the multimillion row order and credit tables that reference promotion.

Create a Surrogate if the Natural Key Values are Recycled
While working with a manufacturing plant, I learned that one of their key business concepts is identified by a code whose values repeat every 18 months. We decided to use a surrogate key as a buffer layer and create an alternate key on this natural key code plus a date. As an aside, it is a very good practice to always define an alternate key on the natural key when using a surrogate.
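The surrogate-plus-alternate-key practice can be sketched in a few lines. The following uses SQLite from Python; the table and column names (product_run, run_code, effective_dt) are hypothetical stand-ins, but the pattern is the one described above: a surrogate primary key, with a unique alternate key on the recycled natural code plus a date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table: surrogate primary key, plus a unique
# "alternate key" on the recycled natural code + effective date.
cur.execute("""
    CREATE TABLE product_run (
        run_sid      INTEGER PRIMARY KEY,   -- surrogate key
        run_code     TEXT NOT NULL,         -- recycled natural code
        effective_dt TEXT NOT NULL,         -- disambiguates recycled values
        UNIQUE (run_code, effective_dt)     -- alternate key on the natural key
    )
""")

cur.execute("INSERT INTO product_run (run_code, effective_dt) VALUES ('A17', '2005-01-01')")
# The same code recycled 18 months later is allowed, because the date differs:
cur.execute("INSERT INTO product_run (run_code, effective_dt) VALUES ('A17', '2006-07-01')")

# A true duplicate of the natural key is rejected by the alternate key:
try:
    cur.execute("INSERT INTO product_run (run_code, effective_dt) VALUES ('A17', '2005-01-01')")
except sqlite3.IntegrityError:
    print("duplicate natural key rejected")
```

The surrogate key stays meaningless; the alternate key is what protects the data from silently merging two different business instances that happen to share a recycled code.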

Every set of guidelines has exceptions, and for surrogate keys a very real exception is the limitation of application development tools. There is technology in use that requires surrogates while other tools prohibit them.

Steve Hoberman is one of the world's most well-known data modeling gurus. He taught his first data modeling class in 1992 and has educated more than 10,000 people about data modeling and business intelligence techniques since then.

Problem with History - a look at surrogate key usage

A surrogate key is a substitution for the natural primary key.

It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it be unique for each row in the table.

Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the primary keys of dimension tables. They can use an Informatica sequence generator, an Oracle sequence, or SQL Server IDENTITY values to generate the surrogate key.

It is useful because the natural primary key (e.g., Customer Number in a Customer table) can change, and this makes updates more difficult.

Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these values change, indexing on a numeric value is usually more efficient. You could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system, and as far as the client is concerned you might display only the AIRPORT_NAME.



Another benefit you can get from surrogate keys (SIDs) is:

Tracking the SCD - Slowly Changing Dimension.

Let me give you a simple, classical example:

On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that is what would be in your Employee dimension), and this employee has turnover allocated to him under Business Unit 'BU1.' But on the 2nd of June, Employee 'E1' is moved from Business Unit 'BU1' to Business Unit 'BU2.' All new turnover has to belong to the new Business Unit 'BU2,' but the old turnover should still belong to Business Unit 'BU1.'

If you used the natural business key 'E1' for your employee within your data warehouse, everything would be allocated to Business Unit 'BU2,' even what actually belongs to 'BU1.'

If you use surrogate keys, you could create on the 2nd of June a new record for the Employee 'E1' in your Employee Dimension with a new surrogate key.

This way, in your fact table, you have your old data (before 2nd of June) with the SID of the Employee 'E1' + 'BU1.' All new data (after 2nd of June) would take the SID of the employee 'E1' + 'BU2.'

You could consider a slowly changing dimension as an enlargement of your natural key: the natural key of the employee was Employee Code 'E1,' but for you it becomes Employee Code + Business Unit: 'E1' + 'BU1' or 'E1' + 'BU2.' The difference from a true natural key enlargement, though, is that your fact table might not carry every part of the new key, so you might not be able to join on the enlarged key; hence you need another identifier.
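The SID mechanics described above can be sketched in a few lines of Python. This is a minimal, hypothetical in-memory version: the dicts stand in for Employee dimension rows, and the lookup stands in for the fact-load join that picks the SID current on the transaction date.

```python
# Each business-unit change creates a new dimension row for the same
# natural key 'E1', with a fresh surrogate key (SID) and its own
# validity interval. ISO date strings compare correctly as text.
employee_dim = [
    {"sid": 1, "emp_code": "E1", "business_unit": "BU1",
     "valid_from": "2002-01-01", "valid_to": "2002-06-01"},
    {"sid": 2, "emp_code": "E1", "business_unit": "BU2",
     "valid_from": "2002-06-02", "valid_to": "9999-12-31"},
]

def employee_sid(emp_code, event_date):
    """Pick the dimension row that was current on the fact's event date."""
    for row in employee_dim:
        if (row["emp_code"] == emp_code
                and row["valid_from"] <= event_date <= row["valid_to"]):
            return row["sid"]
    raise LookupError(f"no dimension row for {emp_code} on {event_date}")

# Turnover booked before and after the move lands on different SIDs,
# so old turnover stays with BU1 and new turnover goes to BU2.
print(employee_sid("E1", "2002-03-15"))  # -> 1 (the BU1-era row)
print(employee_sid("E1", "2002-07-20"))  # -> 2 (the BU2-era row)
```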



The Problem with History
Performing an initial load of a type 2 dimension from a history-tracking source
A source system that keeps its own history presents interesting problems when used to load a type 2 history-tracking dimension table. The problem I am referring to arose in an actual project where we had a source system (SAP) which was keeping its own history of changes to an entity in a comprehensive, audit-trail sort of way. It appeared to have been tracking every change to the entity in question. But, in the data warehouse, for the corresponding dimension we only wanted to track history for a subset of attributes, treating the rest as type 1 attributes. This article focuses on the initial load, not the continuing maintenance, of a dimension from a set of history-tracking source records.

Let’s say that, in the source system (an HR application, for instance), we have the records shown in Figure 1.
 
[Figure 1: source records for the two employees]

In this table, EMPKEY is the "person key" or "person ID," which is a surrogate key in the source system itself. Social Security Number, which is often thought of as a key, is a simple attribute. The primary key of the table is EMPKEY + VALID_FROM. A real HR system would obviously have more attributes than this, but, for purposes of this article, all we need is a mix of type 1 and type 2 attributes.

The table tells a tale. Jo, a female living in Michigan, gets hired on 12/3/1998. On 12/27/1998, the HR staffers discover they entered her SSN incorrectly and change it; the source system automatically creates a new record, with a new VALID_FROM date of 12/28/1998. Seven years go by until, in April of 2005, Jo, who’s always felt uncomfortable as a female, goes into the online HR portal and changes her first name to "Joe"; the source system dutifully tracks the change. On August 8, Joe finally gets his operation, and an HR staff member changes his sex from "F" to "M" and his name to "Joseph." On February 13, Joseph decides he really prefers to go by "Joe" again, until July 5, when he flips his name back to "Joseph" and transfers to the company’s Big Apple office. On Christmas Eve, weary of the brusque and chilly streets of Manhattan, Joseph returns to Michigan. Jim or James, the other person in this table, has a simpler history: he changes from "Jim" to "James" on 3/16/2004 and then moves to Indiana on 6/23/2007. He’s apparently let go on 8/31/2007. (In this HR system, if your latest record doesn’t have a NULL in VALID_TO, it means you’re terminated as of that date.)

The business users have stated that they don’t care about tracking or reporting on historical changes to first names or SSNs, but they do care about sex and state. In data warehousing terms, FIRSTNAME and SSN are type 1 attributes, and SEX and STATE are type 2.

We are doing an initial load of our PERSON dimension. Again, we do not cover, in this article, how to deal with comparisons between a history-tracking, type 2-ish source and an already-populated dimension table.

Ultimately, our PERSON dimension table should look like Figure 2.

[Figure 2: target PERSON dimension rows]

DIM_K is just a standard dimension table surrogate key; only those records for the two people in question are displayed. Note that where we have a currently active record with no "natural" or source end date (the employee is still employed, the person is still alive, etc.), we enter a fake "high date" to make querying and joins easier. (This sort of thing will, doubtless, cause a massive "Year 9999 Problem.") Our PERSON dimension table tracks history for those attributes we care about, but washes out the history for the type 1 attributes. To get to this blissful state, we need to sift out the type 1 changes from the type 2 and only make new dimension records where we have a type 2 change.

In what follows, we’ll stick to fairly simple set logic. We’ll use no cursors or other looping-type constructs. While an ETL tool like Informatica or DataStage would be handy, any of this could be done fairly easily in straight SQL. We’ll also keep all the steps simple, easy to understand and discrete. It’s possible to create enormous heaps of nested SQL to do everything in one statement, but it's best to keep everything understandable. We create and label "tables," but whether you choose to actually create real database tables, or the tables are just record sets inside a flow, the process remains essentially unchanged.

Our first step is to order and number our records. Depending on what system you’re using, there will be any number of ways to do this, but start by ordering your source records by EMPKEY, VALID_FROM and add a row numbering RANK column (see Figure 3).

[Figure 3: source records ordered by EMPKEY, VALID_FROM with a RANK column]
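As a minimal sketch of this ordering-and-numbering step, here are hypothetical (empkey, valid_from) tuples standing in for the source rows; a sort plus enumerate plays the role of the RANK column:

```python
# Unordered source rows: (empkey, valid_from). These values are
# illustrative only, not the actual figure data.
src = [
    (103, "2004-03-16"),
    (101, "1998-12-03"),
    (103, "1999-06-01"),
    (101, "1998-12-28"),
]

# Order by EMPKEY, VALID_FROM and attach a 1-based RANK.
# Tuple comparison gives us the compound sort for free.
src_ranked = [
    (rank, empkey, valid_from)
    for rank, (empkey, valid_from) in enumerate(sorted(src), start=1)
]

for row in src_ranked:
    print(row)
```

In SQL this is typically a ROW_NUMBER() OVER (ORDER BY empkey, valid_from), in Informatica a Sorter plus a Sequence Generator; any mechanism that yields a dense, ordered numbering works.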

Next, we’ll need the earliest and latest records for each natural key. The earliest ones will form a sort of "base" for future joins; the latest ones carry the values we’ll use for the type 1 attributes. (The business rule here is that the latest records hold the type 1 attributes to use; other rules would demand different coding.)

NK_EARLIEST
select
s.*
from
src_ranked s inner join
(select key, min(rank) r from src_ranked group by key) mins
on s.key = mins.key and s.rank = mins.r
[Figure: NK_EARLIEST result]
NK_LATEST
select
s.*
from
src_ranked s inner join
(select key, max(rank) r from src_ranked group by key) maxes
on s.key = maxes.key and s.rank = maxes.r
[Figure: NK_LATEST result]
The existence of this NK_LATEST table will serve us well later on.
Next, let’s make a table of the type 2 columns only. Within these we’ll need to sift out the records for a given natural key where there’s no change.

TYPE2_COLS
select
rank
,key
,sex
,state
,valid_from
,valid_to
from
src_ranked
order by
rank
[Figure 6: TYPE2_COLS result]
In the table in Figure 6, the column values that will form the VALID_FROMs in our dimension are highlighted in red, and those that will supply the VALID_TOs are in blue.

For our next trick, inner join this table to itself, each record to the record one above it, in rank, where the key (natural key) is the same and where there’s something different about the type 2 attributes:

TYPE2_CHANGES
select
t2c1.rank rank1
,t2c1.key key1
,t2c1.sex sex1
,t2c1.state state1
,t2c1.valid_from valid_from1
,t2c1.valid_to valid_to1
,t2c2.rank rank2
,t2c2.key key2
,t2c2.sex sex2
,t2c2.state state2
,t2c2.valid_from valid_from2
,t2c2.valid_to valid_to2
from
type2_cols t2c1 inner join type2_cols t2c2
on t2c1.rank + 1 = t2c2.rank
and t2c1.key = t2c2.key
where
t2c1.sex <> t2c2.sex or t2c1.state <> t2c2.state
[Figure 7: TYPE2_CHANGES result]
We can see that we’re very close to what we’re looking for in the VALID_TO1 and VALID_FROM2 columns. But what we want (leaving out the irrelevant columns) is shown in Figure 8.

[Figure 8: desired VALID_FROM/VALID_TO intervals]
To get there, do a quasi-pivot of TYPE2_CHANGES, like this:
TYPE2_C_PIV
select * from
(
select
rank1 rank
,key1 key
,sex1 sex
,state1 state
,valid_to1 valid_date
from
type2_changes
union
select
rank2 rank
,key2 key
,sex2 sex
,state2 state
,valid_from2 valid_date
from
type2_changes
) s
order by rank
[Figure: TYPE2_C_PIV result]
Add the handy-dandy, presaved NK_EARLIEST and NK_LATEST records back in, and order by rank:
TYPE2_C_ALL
select * from
(
select * from type2_c_piv
union
select
rank
,key
,sex
,state
,valid_from valid_date
from nk_earliest
union
select
rank
,key
,sex
,state
,IsNull(valid_to,'12/31/9999') valid_date
from nk_latest
) s
order by rank, valid_date
[Figure: TYPE2_C_ALL result]

Put an ordinal or row number on these rows again:

[Figure: TYPE2_C_ALL rows renumbered with NEW_RANK]
The odd-numbered records contain the VALID_FROM dates, and the next-higher-numbered row contains the VALID_TO dates. Join ‘em, Dano. The following assumes the availability of a mod (modulus) operator to find the odd-numbered rows; there are certainly other ways to find odd numbers.

PRE_DIM
select
t1.key
,t1.sex
,t1.state
,t1.valid_date valid_from
,t2.valid_date valid_to
from
type2_c_all_ranked t1 inner join type2_c_all_ranked t2
on t1.new_rank + 1 = t2.new_rank
where
mod (t1.new_rank,2) = 1
[Figure: PRE_DIM result]
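The odd/even pairing can be sketched outside SQL as well. Here a hypothetical list of ranked dates stands in for one employee's TYPE2_C_ALL rows; odd-numbered rows (1-based) supply VALID_FROM, and each is joined to the next row for VALID_TO:

```python
# Hypothetical ranked dates for one natural key, in NEW_RANK order.
# Rows 1 and 3 carry the interval starts; rows 2 and 4 carry the ends.
ranked_dates = ["1998-12-03", "2005-08-08", "2005-08-09", "9999-12-31"]

# Keep only odd-numbered rows (1-based), pairing each with the row
# above it -- the Python analogue of "mod(new_rank, 2) = 1" plus the
# self-join on new_rank + 1.
intervals = [
    (ranked_dates[i], ranked_dates[i + 1])
    for i in range(len(ranked_dates))
    if (i + 1) % 2 == 1
]
print(intervals)
```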
The remainder is trivial. Join to the "latest" records on the natural key to pick up the type 1 attributes and, while it’s handy, set the current flag:

Add_T1 ("add type 1")
select
pd.key as nat_k
,l.first_name
,l.ssn
,pd.sex
,pd.state
,pd.valid_from
,pd.valid_to
,case when pd.valid_from = l.valid_from then 'Y' else 'N' end as current_flag
from
pre_dim pd inner join nk_latest l
on pd.key = l.key
Then, generate surrogate keys for your target dimension table and load it.
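That final step can be sketched as follows. The pre_dim rows below are hypothetical stand-ins for the PRE_DIM output, and a plain counter plays the role of the DIM_K sequence (an Oracle sequence, SQL Server IDENTITY, or ETL-tool generator in practice):

```python
# Hypothetical PRE_DIM rows for one person: natural key, type 2
# attributes, and the stitched-together validity dates.
pre_dim = [
    {"nat_k": 101, "sex": "F", "state": "MI",
     "valid_from": "1998-12-03", "valid_to": "2005-08-08"},
    {"nat_k": 101, "sex": "M", "state": "MI",
     "valid_from": "2005-08-09", "valid_to": "9999-12-31"},
]

# Generate DIM_K as a simple 1-based counter and "load" the dimension.
person_dim = [
    {"dim_k": dim_k, **row}
    for dim_k, row in enumerate(pre_dim, start=1)
]

for row in person_dim:
    print(row["dim_k"], row["nat_k"], row["valid_from"], row["valid_to"])
```

The only property the generator must guarantee is uniqueness; nothing downstream should ever read meaning into a DIM_K value.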