Reference Domains Part III: Collecting Classifications

zamaes
Last updated: 2011/09/13 at 5:00 PM

This is part three in the series about reference domains (a.k.a. classifications). In the first part we looked at the nature of classifications. In part two, we discussed a recommended approach to designing the target structures in the EDW to accommodate them. Here we will look at the work of collecting and documenting reference domains.
 
The compilation of a master list of reference domains and their values straddles the analysis and design phases of development. For external classifications, those captured from the source system, the work must be done in the analysis phase, while internal classifications, being dependent on the design of the target data structures, are collected during the design phase.
 
As mentioned in part one, the external reference domains may first be encountered as code values in a source table; perhaps a status of a bank account, or a categorization of a customer. If you’re lucky, the word “code”, “type”, or “indicator” will appear in the column name. But it’s equally possible that the name will not make it immediately apparent that this is a classification. The analyst will need to determine it from the documentation or from the values contained within the data. Cryptic or abbreviated codes can take the form of numbers or character-based strings. It is important for the analyst to read the clues in the table layout, the documentation and the data to identify potential classifications and then follow up with the data steward or external provider of the source data to find the descriptions that explain the codes.
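As a rough illustration of that profiling step, the sketch below scans a source extract and flags low-cardinality columns as candidate classifications. The file name, column handling, and cardinality threshold are all hypothetical assumptions; the point is simply to surface columns whose distinct values look like codes worth following up on.

```python
import csv
from collections import defaultdict

# Hypothetical source extract; the file name and threshold are illustrative only.
SOURCE_FILE = "bank_account_extract.csv"
MAX_DISTINCT = 50  # low-cardinality columns are candidate classifications

distinct_values = defaultdict(set)

with open(SOURCE_FILE, newline="") as f:
    for row in csv.DictReader(f):
        for column, value in row.items():
            distinct_values[column].add((value or "").strip())

for column, values in sorted(distinct_values.items()):
    if 0 < len(values) <= MAX_DISTINCT:
        # Short, code-like values warrant a follow-up with the data steward
        # or external provider to obtain the descriptions behind the codes.
        print(f"Candidate classification: {column} -> {sorted(values)}")
```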
 
The code descriptions may accompany the codes in the same source table. However, even if this is the case, selecting a distinct set of the codes and descriptions from the table may not return a full set of potential values, only those that are currently being used. For example, pulling a distinct set of status types from a bank account table may return the values “open” and “closed”, but miss the fact that sometimes accounts are briefly “pending”. For this reason, it’s very important to find either a master table in the source that contains the full set of possible codes and values, or the documentation where the full set is listed. In either case, it’s essential to examine the list of occurring values and compare it against the master. The master may be out of sync with the newest list of allowable values, or there may be aberrant values appearing in the data which will need to be dealt with.
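To make that comparison concrete, here is a minimal sketch that checks the codes actually occurring in an account extract against a master list of allowable status codes. The two code sets are hypothetical stand-ins for a SELECT DISTINCT over the transaction table and the source's master/reference table; the logic simply reports what is missing from each side.

```python
# Hypothetical code sets; in practice these come from the transaction table
# and from the source's master/reference table or documentation.
occurring_codes = {"O", "C", "X"}            # statuses actually seen in account records
master_codes = {"O": "Open",                 # full set of allowable statuses
                "C": "Closed",
                "P": "Pending"}

unused = set(master_codes) - occurring_codes     # allowable but never occurring
aberrant = occurring_codes - set(master_codes)   # occurring but not allowable

print("Allowable but not currently occurring:", sorted(unused))    # ['P']
print("Occurring but missing from the master:", sorted(aberrant))  # ['X']
```

Either result is a finding: unused values may mean the master is ahead of the data, while aberrant values will need to be dealt with before or during the load.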
 
In some cases, no descriptions will exist. Business users may know the meaning of the codes without needing a written definition. In these cases, it is best practice to create the descriptions and include them in the reference value table. Codes should have explicit descriptions wherever possible, to avoid miscommunication and potential errors. It is a simple matter to misinterpret a status code of “A” as “Active” when it actually means “Archived”.
  
Internal reference domains are quite different, in that they are entirely under the control of the EDW design team. The codes and descriptions are all generated and maintained internally. They take the same form as the external classifications, being codes and descriptions, and need to be compiled together with them in a master document and stored in the same reference tables in the EDW.
For compiling the master list, I recommend a pair of spreadsheets: one listing the reference domains (e.g., Account Type, Marital Status) and the other listing the reference values for each domain. Here are some points to consider as you create and maintain this list; a short sketch after the list illustrates several of them:
  1. Manually set the surrogate keys
    Rather than simply passing the assignment of surrogate keys over to the ETL process, use this master list as an opportunity to control the keys being applied. In most cases, each domain will contain a relatively short list of values (<50), so this is a simple task. The control and predictability it supplies can be useful for development, testing and usage of the database.
  2. Reserve ranges of values for the surrogate keys
    It’s a good idea to leave gaps between the key ranges assigned to the domains so that a reference domain can grow or be corrected over time. These ranges will not be uniform; one domain may reserve twenty sequential numbers while another will require a thousand.
  3. Create dummy values for “Unknowns”
    Many times, incoming records will not find a match in the allowable domain of values. Rather than leave these values null, a surrogate key corresponding to an “unknown” reference value within the domain can be applied.
  4. Create dummy values at finer grain
    In some cases it may be prudent to distinguish between different types of “unknown” values. For example, records for which a classification is not required could be marked “Not Applicable”, while an incoming value that is outside the allowable range could be assigned “Out of Range”. When applying this finer grain, consideration should be given to what is practical as well as what will add value.
  5. Create a load-ready spreadsheet
    Some external classifications can be loaded from tables, while others, along with all the internal classifications, will need to be entered manually. In both cases, a spreadsheet can be used to control the process. I have seen an implementation where a single spreadsheet contained both manual entries and references to tables, and was used as input to the ETL load process.
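As a minimal sketch of points 1 through 4, the snippet below builds a small in-memory version of the reference-value list: surrogate keys are assigned by hand, each domain starts at a reserved base number with room to grow, and dummy rows cover the “unknown” cases. The domain names, key ranges, and dummy values are illustrative assumptions, not a prescribed layout; in practice these rows would come from the pair of spreadsheets described above.

```python
# Illustrative master list of reference values; the domains, key ranges, and
# dummy rows are assumptions for this sketch, not a prescribed layout.
REFERENCE_VALUES = [
    # (surrogate_key, domain, code, description)
    (-1,  "Account Status", "UNK", "Unknown"),         # no match found on load
    (-2,  "Account Status", "N/A", "Not Applicable"),  # classification not required
    (-3,  "Account Status", "OOR", "Out of Range"),    # value outside allowable set
    (100, "Account Status", "O",   "Open"),            # keys 100-119 reserved for this domain
    (101, "Account Status", "C",   "Closed"),
    (102, "Account Status", "P",   "Pending"),
    (200, "Marital Status", "S",   "Single"),          # keys 200-249 reserved for this domain
    (201, "Marital Status", "M",   "Married"),
]

def lookup_key(domain: str, code: str) -> int:
    """Return the manually assigned surrogate key, or the 'Unknown' dummy key."""
    for key, dom, cde, _ in REFERENCE_VALUES:
        if dom == domain and cde == code:
            return key
    return -1  # fall back to the 'Unknown' dummy value rather than leaving it null

print(lookup_key("Account Status", "P"))  # 102
print(lookup_key("Account Status", "Z"))  # -1 (Unknown)
```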

In the final part of this series we will look at the collection of metadata related to reference domains and how it can be used to assist the governance of data content.
