Referential Treatment – The Open Source Reference Data Trend

February 22, 2010
160 Views

Reference data can be used in a huge number of data quality and data enrichment processes.  The simplest example is a table that contains cities and their associated postal codes – you can use an ETL process to make sure that all your customer records that contain 02026 for a postal code always refer to the standardized “Dedham, MA” for the city and state, not variations like “Deadham Mass”  or “Dedam, Massachusetts”.

Reference data is not limited to customer address, however. If everyone were to use the same reference data for parts, you could easily exchange procurement data between partners.  If only certain values are allowed in any given table, it would support validation.  By having standards for supply chain data, procurement, supply chain, finance and accounting data, processes are more efficient.  Organizations like the ISO and ECCMA are working on that.

Availability of Reference Data
In the past, it was difficult to get your hands on reference data. Long ago, no one wanted to share reference data with you…

Reference data can be used in a huge number of data quality and data enrichment processes.  The simplest example is a table that contains cities and their associated postal codes – you can use an ETL process to make sure that all your customer records that contain 02026 for a postal code always refer to the standardized “Dedham, MA” for the city and state, not variations like “Deadham Mass”  or “Dedam, Massachusetts”.

Reference data is not limited to customer address, however. If everyone were to use the same reference data for parts, you could easily exchange procurement data between partners.  If only certain values are allowed in any given table, it would support validation.  By having standards for supply chain data, procurement, supply chain, finance and accounting data, processes are more efficient.  Organizations like the ISO and ECCMA are working on that.

Availability of Reference Data
In the past, it was difficult to get your hands on reference data. Long ago, no one wanted to share reference data with you – you had to send your customer data to a service provider and get the enriched data back.  Others struggled to develop reference data on their own. Lately I’m seeing more and more high quality reference data available for free on the Internet.   For data jockeys, these are good times.

GeoNames
A good example of this is GeoNames.  The GeoNames geographical database is available for download free of charge under a creative commons attribution license. According to the web site, it “aggregates over 100 different data sets to build a list containing over eight million geographical names and consists of 7 million unique features whereof 2.6 million populated places and 2.8 million alternate names. The data is accessible free of charge through a number of web services and a daily database export. “

GeoNames combines geographical data such as names of places in various languages, elevation, population and others from various sources. All lat/long coordinates are in WGS84 (World Geodetic System 1984). Like Wikipedia, users may manually edit, correct and add new names.

US Census Data
Another rich set of reference data is the US Census “Gazetteer” data. Courtesy of the US government, you can download a database with the following fields:

  • Field 1 – State Fips Code
  • Field 2 – 5-digit Zipcode
  • Field 3 – State Abbreviation
  • Field 4 – Zipcode Name
  • Field 5 – Longitude in Decimal Degrees (West is assumed, no minus sign)
  • Field 6 – Latitude in Decimal Degrees (North is assumed, no plus sign)
  • Field 7 – 2000 Population (100%)
  • Field 8 – Allocation Factor (decimal portion of state within zipcode)

So, our Dedham, MA entry includes this data:

  • “25”,”02026″,”MA”,”DEDHAM”,71.163741,42.243685,23782,0.003953

It’s Really Exciting!
When I talk about reference data at parties, I immediately see eyes glaze over and it’s clear that my fellow party-goers want to escape my enthusiasm for it.  But this availability of reference data is really great news! Together with the open source data integration tools like Talend Open Studio, we’re starting to see what I like to call “open source reference data” becoming available. It all makes the price of improving data quality much lower and our future much brighter.

There’s so much to talk about with regard to reference data and so many good sources.  I plan to make more posts on this topic, but feel free to post your beloved reference data sources here in the comments section.

Link to original post