Monday, 27 June 2011

To Surrogate Key or Not

To Surrogate Key or Not
Design Challenge
Information Management Magazine, August 2006
A great way to sharpen our analysis and modeling skills is to continuously address real-world scenarios. A modeling scenario along with suggested solutions appears each month in this Design Challenge column. The scenario is emailed to more than 1,000 designers up to the challenge. Many of the responses, including my own, are consolidated here. If you would like to become a Design Challenger and have the opportunity to submit modeling solutions

Picture (Metafile)
The Response
A surrogate key is a unique identifier for a table, a numeric counter introduced early in the physical design stage. No meaning is implied by a surrogate key value. For example, you can't look at a month identifier of 1 and assume that it represents January. A surrogate key allows for more efficient navigation across structures and facilitates integration across applications. However, before telling the manager to always use surrogates, we need to consider the drawbacks: it can take more time and effort to navigate structures as well as add more complexity to the development phase. Therefore, we need guidelines in determining where it makes sense to create a surrogate key.

Create a surrogate key if the entity has a natural key and any of the following conditions hold:

  • There are multiple sources;
  • The natural key is inefficient; or
  • The natural key is recycled.

Surrogate versus Virtual
A natural key is ideally how the business would identify an entity instance, and a surrogate key is a substitute for this natural key. For example, a natural key for an organization could be taxpayer identifier or DUNS (D&B) Number. If no natural key exists, we can create a virtual key. A virtual key is a unique numeric counter used when it is not possible to identify an entity instance. For example, a manufacturing company has the concept of an adjustment. An adjustment is when a pallet is moved between shelves in a warehouse. The business has no way of identifying this kind of transaction, so the virtual key adjustment identifier is created.

Create a Surrogate if Multiple Sources Exist
If the entity you are modeling contains data from multiple sources, there is usually value in creating a surrogate key. For example, assume you learn that both Henry Winkler the student and Hank Winkler the instructor are really the same person. You can add this person as a unique row to the person entity identified by a person identifier surrogate key. This person can then play the two roles of instructor and student.

Create a Surrogate if the Natural Key is Inefficient
If the natural key is inefficient, create a surrogate key. Inefficient means that the data elements which make up the natural key have performance or space issues when joining across tables. This surfaces most frequently when joining a reference to a transaction table, such as a dimension to a fact table. For example, the business might identify a promotion by a promotion code and promotion start date. Although the code and start date have little impact on a promotion table containing only a few thousand rows, there will be an impact on the multimillion row order and credit tables that reference promotion.

Create a Surrogate if the Natural Key Values are Recycled
While working with a manufacturing plant, I learned that one of their key business concepts is identified by a code whose values repeat every 18 months. We decided to use a surrogate key as a buffer layer and create an alternate key on this natural key code plus a date. As an aside, it is a very good practice to always define an alternate key on the natural key when using a surrogate.

Every set of guidelines has exceptions, and for surrogate keys a very real exception is the limitation of application development tools. There is technology in use that requires surrogates while other tools prohibit them.

Steve Hoberman is one of the world's most well-known data modeling gurus. He taught his first data modeling class in 1992 and has educated more than 10,000 people about data modeling and business intelligence techniques since then.

No comments: