Reference Domains Part III: Collecting Classifications

zamaes
This is part three in the series about reference domains (a.k.a. classifications). In the first part we looked at the nature of classifications. In part two, we discussed a recommended approach to designing the target structures in the EDW to accommodate them. Here we will look at the work of collecting and documenting reference domains.
 
The compilation of a master list of reference domains and their values straddles the analysis and design phases of development. For external classifications, those captured from the source system, the work must be done in the analysis phase, while internal classifications, being dependent on the design of the target data structures, are collected during the design phase.
 
As mentioned in part one, the external reference domains may first be encountered as code values in a source table; perhaps a status of a bank account, or a categorization of a customer. If you’re lucky, the word “code”, “type”, or “indicator” will appear in the column name. But it’s equally possible that the name will not make it immediately apparent that this is a classification. The analyst will need to determine it from the documentation or from the values contained within the data. Cryptic or abbreviated codes can take the form of numbers or character-based strings. It is important for the analyst to read the clues in the table layout, the documentation and the data to identify potential classifications and then follow up with the data steward or external provider of the source data to find the descriptions that explain the codes.
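As an illustration of this profiling step, the sketch below flags candidate classification columns in a source extract by code-like column names and low cardinality. It is a minimal example in Python/pandas; the file name, the hint words and the 50-value threshold are assumptions to be adjusted by the analyst, not rules.

```python
import pandas as pd

# Hypothetical source extract; in practice this comes from the source system.
accounts = pd.read_csv("account_extract.csv")

CODE_HINTS = ("code", "type", "indicator", "status", "flag")
MAX_DISTINCT = 50  # assumption: classifications rarely exceed a few dozen values

candidates = []
for col in accounts.columns:
    distinct = accounts[col].dropna().unique()
    name_hint = any(hint in col.lower() for hint in CODE_HINTS)
    low_cardinality = 0 < len(distinct) <= MAX_DISTINCT
    if name_hint or low_cardinality:
        candidates.append({
            "column": col,
            "distinct_values": len(distinct),
            "sample": list(distinct[:5]),
            "name_hint": name_hint,
        })

# The analyst reviews this list against the documentation and the data steward's input.
print(pd.DataFrame(candidates))
```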
 
The code descriptions may accompany the codes in the same source table. However, even if this is the case, selecting a distinct set of the codes and descriptions from the table may not return a full set of potential values, only those that are currently being used. For example, pulling a distinct set of status types from a bank account table may return the values “open” and “closed”, but miss the fact that sometimes accounts are briefly “pending”. For this reason, it’s very important to find either a master table in the source that contains the full set of possible codes and values, or the documentation where the full set is listed. In either case, it’s essential to examine the list of occurring values and compare it against the master. The master may be out of sync with the newest list of allowable values, or there may be aberrant values appearing in the data which will need to be dealt with.
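A quick way to make that comparison is to diff the set of codes actually occurring in the data against the master. The sketch below is a minimal example; the extract and column names are hypothetical.

```python
import pandas as pd

# Hypothetical extracts: the transactional table and the source's master code table.
accounts = pd.read_csv("account_extract.csv")
status_master = pd.read_csv("account_status_master.csv")  # columns: status_code, description

occurring = set(accounts["status_code"].dropna().unique())
allowed = set(status_master["status_code"].dropna().unique())

aberrant = occurring - allowed   # values in the data with no entry in the master
unused = allowed - occurring     # allowable values not currently in use (e.g. "pending")

print("Aberrant codes to investigate:", sorted(aberrant))
print("Allowable but unused codes:", sorted(unused))
```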
 
In some cases, no descriptions will exist. Business users may know the meaning of the codes without needing a documented definition. In these cases, it is best practice to create the descriptions and include them in the reference value table. Codes should have explicit descriptions wherever possible, to avoid miscommunication and potential errors. It is a simple matter to misinterpret a status code of “A” as “Active”, when it actually means “Archived”.
  
Internal reference domains are quite different, in that they are entirely in the control of the EDW design team. The codes and descriptions are all generated and maintained internally. They take the same form as the external classifications, being codes and descriptions, and need to be compiled together with them in a master document and stored in the same reference tables in the EDW.
For compiling the master list, I recommend a pair of spreadsheets, one with a list of the reference domains (e.g., Account Type, Marital Status etc.) and the other a list of reference values for each domain. Here are some points to consider as you create and maintain this list:
  1. Manually set the surrogate keys
    Rather than simply pass the setting of surrogate keys over to the ETL process, this master list is an opportunity to control the keys being applied. In most cases, each domain will contain a relatively short list of values (<50) and this will be a simple task. The control and predictability it supplies can be useful for development, testing and usage of the database.
  2. Reserve ranges of values for the surrogate keys
    It’s a good idea to leave gaps between the surrogate-key ranges assigned to each domain, so a reference domain can grow or be corrected over time. These ranges will not be uniform; one domain may reserve twenty sequential numbers while another will require a thousand.
  3. Create dummy values for “Unknowns”
    Many times, incoming records will not match any value in the allowable domain. Rather than leave the key null, a surrogate key corresponding to an “unknown” reference value within the domain can be applied.
  4. Create dummy values at finer grain
    In some cases it may be prudent to distinguish between different types of “unknown” values. For example, records for which a classification is not required could be marked “Not Applicable”, while an incoming value that is out of the allowable range could be assigned “Out of Range”. When applying this finer grain, consideration should be given to what is practical as well as what will add value (the lookup sketch after this list shows one way to apply these dummy keys).
  5. Create load-ready spreadsheet
    Some external classifications can be loaded from tables, while others, along with all the internal classifications, will need to be manually entered. In both cases, a spreadsheet can be used to control the process. I have seen an implementation where a single spreadsheet contained both manual entries and references to tables, and was used as input to the ETL load process. A minimal sketch of such a spreadsheet-driven load follows below.
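To make points 1 through 5 concrete, the sketch below shows one possible shape for the pair of spreadsheets and a simple load into a reference table. The domain names, key ranges and dummy values are illustrative assumptions, not a prescribed layout, and SQLite stands in for the EDW.

```python
import sqlite3
import pandas as pd

# Sheet 1: the reference domains, each with a reserved surrogate-key range (point 2).
domains = pd.DataFrame([
    {"domain": "Account Type",   "key_from": 100, "key_to": 199},
    {"domain": "Marital Status", "key_from": 200, "key_to": 219},
])

# Sheet 2: the reference values, with surrogate keys set manually (point 1)
# and dummy rows for unmatched or inapplicable records (points 3 and 4).
values = pd.DataFrame([
    {"ref_key": 100, "domain": "Account Type",   "code": "UNK", "description": "Unknown"},
    {"ref_key": 101, "domain": "Account Type",   "code": "N/A", "description": "Not Applicable"},
    {"ref_key": 102, "domain": "Account Type",   "code": "OOR", "description": "Out of Range"},
    {"ref_key": 110, "domain": "Account Type",   "code": "CHQ", "description": "Chequing"},
    {"ref_key": 111, "domain": "Account Type",   "code": "SAV", "description": "Savings"},
    {"ref_key": 200, "domain": "Marital Status", "code": "UNK", "description": "Unknown"},
    {"ref_key": 210, "domain": "Marital Status", "code": "M",   "description": "Married"},
    {"ref_key": 211, "domain": "Marital Status", "code": "S",   "description": "Single"},
])

# In practice these frames would come straight from the spreadsheet (e.g. pd.read_excel)
# and the load would target the EDW; this is the load-ready input of point 5.
conn = sqlite3.connect("edw_sketch.db")
domains.to_sql("reference_domain", conn, if_exists="replace", index=False)
values.to_sql("reference_value", conn, if_exists="replace", index=False)
```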
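As a companion to points 3 and 4, the lookup below shows one way an ETL step might apply the dummy keys when an incoming code has no match. The rules here (null means “Unknown”, an explicit marker means “Not Applicable”, anything else unmatched is “Out of Range”) are assumptions for illustration only.

```python
# Hypothetical dummy surrogate keys for the Account Type domain,
# matching the spreadsheet sketch above.
UNKNOWN_KEY, NOT_APPLICABLE_KEY, OUT_OF_RANGE_KEY = 100, 101, 102

# Lookup built from the reference_value rows for the Account Type domain.
account_type_lookup = {"CHQ": 110, "SAV": 111}

def resolve_account_type(code):
    """Return the surrogate key for an incoming account-type code."""
    if code is None:
        return UNKNOWN_KEY            # point 3: never leave the key null
    if code == "N/A":
        return NOT_APPLICABLE_KEY     # point 4: classification not required
    return account_type_lookup.get(code, OUT_OF_RANGE_KEY)  # point 4: aberrant value

# Example usage against a few incoming records.
for incoming in ["CHQ", None, "N/A", "XYZ"]:
    print(incoming, "->", resolve_account_type(incoming))
```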

In the final part of this series we will look at the collection of metadata related to reference domains and how it can be used to assist the governance of data content.
