By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
SmartData Collective
  • Analytics
    AnalyticsShow More
    construction analytics
    5 Benefits of Analytics to Manage Commercial Construction
    5 Min Read
    benefits of data analytics for financial industry
    Fascinating Changes Data Analytics Brings to Finance
    7 Min Read
    analyzing big data for its quality and value
    Use this Strategic Approach to Maximize Your Data’s Value
    6 Min Read
    data-driven seo for product pages
    6 Tips for Using Data Analytics for Product Page SEO
    11 Min Read
    big data analytics in business
    5 Ways to Utilize Data Analytics to Grow Your Business
    6 Min Read
  • Big Data
  • BI
  • Exclusive
  • IT
  • Marketing
  • Software
Search
© 2008-23 SmartData Collective. All Rights Reserved.
Reading: Adventures in Data Profiling (Part 8)
Share
Notification Show More
Latest News
cloud-centric companies using network relocation
Cloud-Centric Companies Discover Benefits & Pitfalls of Network Relocation
Cloud Computing
construction analytics
5 Benefits of Analytics to Manage Commercial Construction
Analytics
database compliance guide
Four Strategies For Effective Database Compliance
Data Management
Digital Security From Weaponized AI
Fortifying Enterprise Digital Security Against Hackers Weaponizing AI
Security
DevOps on cloud
Optimizing Cost with DevOps on the Cloud
Development
Aa
SmartData Collective
Aa
Search
  • About
  • Help
  • Privacy
Follow US
© 2008-23 SmartData Collective. All Rights Reserved.
SmartData Collective > Big Data > Data Mining > Adventures in Data Profiling (Part 8)
Data Mining

Adventures in Data Profiling (Part 8)

JimHarris
Last updated: 2009/12/01 at 3:36 AM
JimHarris
14 Min Read
SHARE
- Advertisement -

Understanding your data is essential to using it effectively and improving its quality – and to achieve these goals, there is simply no substitute for data analysis. This post is the conclusion of a vendor-neutral series on the methodology of data profiling.

Contents
Adventures in Data ProfilingAdventures in Data ProfilingCounts and PercentagesValues and FormatsDrill-down AnalysisConclusionThank YouThe Complete Series

Data profiling can help you perform essential analysis such as:

- Advertisement -
  • Provide a reality check for the perceptions and assumptions you may have about the quality of your data
  • Verify your data matches the metadata that describes it
  • Identify different representations for the absence of data (i.e., NULL and other missing values)
  • Identify potential default values
  • Identify potential invalid values
  • Check data formats for inconsistencies
  • Prepare meaningful questions to ask subject matter experts

Data profiling can also help you with many of the other aspects of domain, structural and relational integrity, as well as determining functional dependencies, identifying redundant storage, and other important data architecture considerations.

Adventures in Data Profiling

This series was carefully designed as guided adventures in data profiling in order to provide the necessary framework for demonstrating and discussing …

More Read

analyzing big data for its quality and value

Use this Strategic Approach to Maximize Your Data’s Value

7 Data Lineage Tool Tips For Preventing Human Error in Data Processing
Preserving Data Quality is Critical for Leveraging Analytics with Amazon PPC
Quality Control Tips for Data Collection with Drone Surveying
3 Huge Reasons that Data Integrity is Absolutely Essential


Understanding your data is essential to using it effectively and improving its quality – and to achieve these goals, there is simply no substitute for data analysis. This post is the conclusion of a vendor-neutral series on the methodology of data profiling.

Data profiling can help you perform essential analysis such as:

- Advertisement -
  • Provide a reality check for the perceptions and assumptions you may have about the quality of your data
  • Verify your data matches the metadata that describes it
  • Identify different representations for the absence of data (i.e., NULL and other missing values)
  • Identify potential default values
  • Identify potential invalid values
  • Check data formats for inconsistencies
  • Prepare meaningful questions to ask subject matter experts

Data profiling can also help you with many of the other aspects of domain, structural and relational integrity, as well as determining functional dependencies, identifying redundant storage, and other important data architecture considerations.

Adventures in Data Profiling

This series was carefully designed as guided adventures in data profiling in order to provide the necessary framework for demonstrating and discussing the common functionality of data profiling tools and the basic methodology behind using one to perform preliminary data analysis.

In order to narrow the scope of the series, the scenario used was a customer data source for a new data quality initiative had been made available to an external consultant with no prior knowledge of the data or its expected characteristics. Additionally, business requirements had not yet been documented, and subject matter experts were not currently available.

This series did not attempt to cover every possible feature of a data profiling tool or even every possible use of the features that were covered. Both the data profiling tool and data used throughout the series were fictional. The “screen shots” were customized to illustrate concepts and were not modeled after any particular data profiling tool.

This post summarizes the lessons learned throughout the series, and is organized under three primary topics:

- Advertisement -
  1. Counts and Percentages
  2. Values and Formats
  3. Drill-down Analysis

Counts and Percentages

One of the most basic features of a data profiling tool is the ability to provide counts and percentages for each field that summarize its content characteristics:

Data Profiling Summary

  • NULL – count of the number of records with a NULL value 
  • Missing – count of the number of records with a missing value (i.e., non-NULL absence of data, e.g., character spaces) 
  • Actual – count of the number of records with an actual value (i.e., non-NULL and non-Missing) 
  • Completeness – percentage calculated as Actual divided by the total number of records 
  • Cardinality – count of the number of distinct actual values 
  • Uniqueness – percentage calculated as Cardinality divided by the total number of records 
  • Distinctness – percentage calculated as Cardinality divided by Actual

Completeness and uniqueness are particularly useful in evaluating potential key fields and especially a single primary key, which should be both 100% complete and 100% unique. In Part 2, Customer ID provided an excellent example.

Distinctness can be useful in evaluating the potential for duplicate records. In Part 6, Account Number and Tax ID were used as examples. Both fields were less than 100% distinct (i.e., some distinct actual values occurred on more than one record). The implied business meaning of these fields made this an indication of possible duplication.

Data profiling tools generate other summary statistics including: minimum/maximum values, minimum/maximum field sizes, and the number of data types (based on analyzing the values, not the metadata). Throughout the series, several examples were provided, especially in Part 3 during the analysis of Birth Date, Telephone Number and E-mail Address.

- Advertisement -

Values and Formats

In addition to counts, percentages, and other summary statistics, a data profiling tool generates frequency distributions for the unique values and formats found within the fields of your data source.

A frequency distribution of unique values is useful for:

  • Fields with an extremely low cardinality, indicating potential default values (e.g., Country Code in Part 4)
  • Fields with a relatively low cardinality (e.g., Gender Code in Part 2)
  • Fields with a relatively small number of known valid values (e.g., State Abbreviation in Part 4)

A frequency distribution of unique formats is useful for:

  • Fields expected to contain a single data type and/or length (e.g., Customer ID in Part 2)
  • Fields with a relatively limited number of known valid formats (e.g., Birth Date in Part 3)
  • Fields with free-form values and a high cardinality (e.g., Customer Name 1 and Customer Name 2 in Part 7)

Cardinality can play a major role in deciding whether you want to be shown values or formats since it is much easier to review all of the values when there are not very many of them. Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values, as we saw throughout the series (e.g., Telephone Number in Part 3).

Some fields can also be analyzed using partial values (e.g., in Part 3, Birth Year was extracted from Birth Date) or a combination of values and formats (e.g., in Part 6, Account Number had an alpha prefix followed by all numbers).

- Advertisement -

Free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field. This analysis technique is often necessary since not only is the cardinality of free-form fields usually very high, but they also tend to have a very high distinctness (i.e., the exact same field value rarely occurs on more than one record). 

Additionally, the most frequently occurring formats for free-form fields will often collectively account for a large percentage of the records with an actual value in the field. Examples of free-form field analysis were the focal points of Part 5 and Part 7.

We also saw examples of how valid values in a valid format can have an invalid context (e.g., in Part 3, Birth Date values set in the future), as well as how valid field formats can conceal invalid field values (e.g., Telephone Number in Part 3).

Part 3 also provided examples (in both Telephone Number and E-mail Address) of how you should not mistake completeness (which as a data profiling statistic indicates a field is populated with an actual value) for an indication the field is complete in the sense that its value contains all of the sub-values required to be considered valid. 

Drill-down Analysis

A data profiling tool will also provide the capability to drill-down on its statistical summaries and frequency distributions in order to perform a more detailed review of records of interest. Drill-down analysis will often provide useful data examples to share with subject matter experts.

Performing a preliminary analysis on your data prior to engaging in these discussions better facilitates meaningful dialogue because real-world data examples better illustrate actual data usage. As stated earlier, understanding your data is essential to using it effectively and improving its quality.

Various examples of drill-down analysis were used throughout the series. However, drilling all the way down to the record level was shown in Part 2 (Gender Code), Part 4 (City Name), and Part 6 (Account Number and Tax ID).

Conclusion

Fundamentally, this series posed the following question: What can just your analysis of data tell you about it?

Data profiling is typically one of the first tasks performed on a data quality initiative. I am often told to delay data profiling until business requirements are documented and subject matter experts are available to answer my questions. 

I always disagree – and begin data profiling as soon as possible.

I can do a better job of evaluating business requirements and preparing for meetings with subject matter experts after I have spent some time looking at data from a starting point of blissful ignorance and curiosity.

Ultimately, I believe the goal of data profiling is to not to find answers, but instead, to discover the right questions.

Discovering the right questions is a critical prerequisite for effectively discussing data usage, relevancy, standards, and the metrics for measuring and improving quality. All of which are necessary in order to progress from just profiling your data, to performing a full data quality assessment (which I will cover in a future series on this blog).

A data profiling tool can help you by automating some of the grunt work needed to begin your analysis. However, it is important to remember that the analysis itself can not be automated – you need to review the statistical summaries and frequency distributions generated by the data profiling tool and more important – translate your analysis into meaningful reports and questions to share with the rest of your team. 

Always remember that well performed data profiling is both a highly interactive and a very iterative process.

Thank You

I want to thank you for providing your feedback throughout this series. 

As my fellow Data Gazers, you provided excellent insights and suggestions via your comments. 

The primary reason I published this series on my blog, as opposed to simply writing a whitepaper or a presentation, was because I knew our discussions would greatly improve the material.

I hope this series proves to be a useful resource for your actual adventures in data profiling.

The Complete Series

  • Adventures in Data Profiling (Part 1) – Series Introduction 
  • Adventures in Data Profiling (Part 2) – Customer ID and Gender Code 
  • Adventures in Data Profiling (Part 3) – Birth Date, Telephone Number and E-mail Address 
  • Adventures in Data Profiling (Part 4) – City Name, State Abbreviation, Zip Code and Country Code 
  • Adventures in Data Profiling (Part 5) – Postal Address Line 1 and Postal Address Line 2 
  • Adventures in Data Profiling (Part 6) – Account Number and Tax ID 
  • Adventures in Data Profiling (Part 7) – Customer Name 1 and Customer Name 2

Link to original post

TAGGED: data profiling, data quality
JimHarris December 1, 2009
Share this Article
Facebook Twitter Pinterest LinkedIn
Share
- Advertisement -

Follow us on Facebook

Latest News

cloud-centric companies using network relocation
Cloud-Centric Companies Discover Benefits & Pitfalls of Network Relocation
Cloud Computing
construction analytics
5 Benefits of Analytics to Manage Commercial Construction
Analytics
database compliance guide
Four Strategies For Effective Database Compliance
Data Management
Digital Security From Weaponized AI
Fortifying Enterprise Digital Security Against Hackers Weaponizing AI
Security

Stay Connected

1.2k Followers Like
33.7k Followers Follow
222 Followers Pin

You Might also Like

analyzing big data for its quality and value
Big Data

Use this Strategic Approach to Maximize Your Data’s Value

6 Min Read
Big Data

7 Data Lineage Tool Tips For Preventing Human Error in Data Processing

6 Min Read
data quality and role of analytics
Data Quality

Preserving Data Quality is Critical for Leveraging Analytics with Amazon PPC

8 Min Read
data collection with drone use
Data Collection

Quality Control Tips for Data Collection with Drone Surveying

9 Min Read

SmartData Collective is one of the largest & trusted community covering technical content about Big Data, BI, Cloud, Analytics, Artificial Intelligence, IoT & more.

data-driven web design
5 Great Tips for Using Data Analytics for Website UX
Big Data
AI and chatbots
Chatbots and SEO: How Can Chatbots Improve Your SEO Ranking?
Artificial Intelligence Chatbots Exclusive

Quick Link

  • About
  • Contact
  • Privacy
Follow US

© 2008-23 SmartData Collective. All Rights Reserved.

Removed from reading list

Undo
Go to mobile version
Welcome Back!

Sign in to your account

Lost your password?