Reference Domains Part III: Collecting Classifications

This is part three in the series about reference domains (a.k.a. classifications). In the first part we looked at the nature of classifications. In part two, we discussed a recommended approach to designing the target structures in the EDW to accommodate them. Here we will look at the work of collecting and documenting reference domains.

The compilation of a master list of reference domains and their values straddles the analysis and design phases of development.

The compilation of a master list of reference domains and their values straddles the analysis and design phases of development. For external classifications, those captured from the source system, the work must be done in the analysis phase, while internal classifications, being dependent on the design of the target data structures, are collected during the design phase.

As mentioned in part one, the external reference domains may first be encountered as code values in a source table; perhaps a status of a bank account, or a categorization of a customer. If you’re lucky, the word “code”, “type”, or “indicator” will appear in the column name. But it’s equally possible that the name will not make it immediately apparent that this is a classification. The analyst will need to determine it from the documentation or from the values contained within the data. Cryptic or abbreviated codes can take the form of numbers or character-based strings. It is important for the analyst to read the clues in the table layout, the documentation and the data to identify potential classifications and then follow up with the data steward or external provider of the source data to find the descriptions that explain the codes.

The code descriptions may accompany the codes in the same source table. However, even if this is the case, selecting a distinct set of the codes and descriptions from the table may not return a full set of potential values, only those that are currently being used. For example, pulling a distinct set of status types from a bank account table may return the values “open” and “closed”, but miss the fact that sometimes accounts are briefly “pending”. For this reason, it’s very important to find either a master table in the source that contains the full set of possible codes and values, or the documentation where the full set is listed. In either case, it’s essential to examine the list of occurring values and compare it against the master. The master may be out of sync with the newest list of allowable values, or there may be aberrant values appearing in the data which will need to be dealt with.

In some cases, no descriptions will exist. Business users may know the meaning of the codes, without requiring the definition. In these cases, it is best practice to create the description and include it in the reference value table. Codes should have explicit descriptions wherever possible, to avoid miscommunication and potential errors. It is a simple matter to misinterpret a status code of “A” as “Active”, when it actually means “Archived”.

Internal reference domains are quite different, in that they are entirely in the control of the EDW design team. The codes and descriptions are all generated and maintained internally. They take the same form of the external classifications, being codes and descriptions, and need to be compiled together with them in a master document and stored in the same reference tables in the EDW.

For compiling the master list, I recommend a pair of spreadsheets, one with a list of the reference domains (e.g., Account Type, Marital Status etc.) and the other a list of reference values for each domain. Here are some points to consider as you create and maintain this list:

Manually set the surrogate keys
Rather than simply pass the setting of surrogate keys over to the ETL process, this master list is an opportunity to control the keys being applied. In most cases, each domain will contain a relatively short list of values (<50) and this will be a simple task. The control and predictability it supplies can be useful for development, testing and usage of the database.
Reserve ranges of values for the surrogate keys
It’s a good idea to leave room between the domains of values to allow the reference domain to grow or be corrected over time. These ranges will not be uniform; one domain will reserve twenty sequential numbers while another will require a thousand.
Create dummy values for “Unknowns”
Many times, incoming records will not find a match to the allowable domain of values. Rather than leave these values null, a surrogate key corresponding to an “unknown” reference value within the domain can be applied.
Create dummy values at finer grain
In some cases it may be prudent to distinguish between different types of “unknown” values. For example, records for which a classification is not required could be marked “Not Applicable”, while an incoming value that is out of the allowable range could be assigned “Out of Range”. When applying this finer grain consideration should be given to what is practical as well as what will add value.
Create load-ready spreadsheet
Some external classifications can be loaded from tables, while others, along with all the internal classifications will need to be manually entered. In both cases, a spreadsheet can be used to control the process. I have seen an implementation where a single spreadsheet contained both manual entries and references to tables, and was used as input to the ETL load process.

In the final part of this series we will look at the collection of metadata related to reference domains and how it can be used to assist the governance of data content.