By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
SmartData CollectiveSmartData CollectiveSmartData Collective
  • Analytics
    AnalyticsShow More
    customer experience analytics
    Using Data Analysis to Improve and Verify the Customer Experience and Bad Reviews
    6 Min Read
    data analytics and CRO
    Data Analytics is Crucial for Website CRO
    9 Min Read
    analytics in digital marketing
    The Importance of Analytics in Digital Marketing
    8 Min Read
    benefits of investing in employee data
    6 Ways to Use Data to Improve Employee Productivity
    8 Min Read
    Jira and zendesk usage
    Jira Service Management vs Zendesk: What Are the Differences?
    6 Min Read
  • Big Data
  • BI
  • Exclusive
  • IT
  • Marketing
  • Software
Search
© 2008-23 SmartData Collective. All Rights Reserved.
Reading: Getting Your Data Freq On
Share
Notification Show More
Font ResizerAa
SmartData CollectiveSmartData Collective
Font ResizerAa
Search
  • About
  • Help
  • Privacy
Follow US
© 2008-23 SmartData Collective. All Rights Reserved.
SmartData Collective > Big Data > Data Mining > Getting Your Data Freq On
Data Mining

Getting Your Data Freq On

JimHarris
Last updated: 2009/07/24 at 4:38 AM
JimHarris
9 Min Read
SHARE

One of the most basic features of a data profiling tool is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources. 

Contents
Freq’ing Awesome AnalysisHow can a data profiling tool help you? Let me count the . …Freq’ing Awesome AnalysisHow can a data profiling tool help you? Let me count the waysShow Me the Value (or the Format)ConclusionRelated Posts

Data profiling is often performed during a data quality assessment and involves much more than reviewing the output generated by a data profiling tool and a data quality assessment obviously involves much more than just data profiling. 

However, in this post I want to focus on some of the benefits of using a data profiling tool.

Freq’ing Awesome Analysis

Data profiling can help you perform essential analysis such as:

  • Verifying data matches the metadata that describes it
  • Identifying missing values
  • Identifying potential default values
  • Identifying potential invalid values
  • Checking data formats for inconsistencies
  • Preparing meaningful questions to ask subject matter experts

Data profiling can also help you with many of the other aspects of domain, structural and relational integrity, as well as determining functional dependencies, identifying redundant storage and other important data architecture considerations.

How can a data profiling tool help you? Let me count the . …

One of the most basic features of a data profiling tool is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources. 

Data profiling is often performed during a data quality assessment and involves much more than reviewing the output generated by a data profiling tool and a data quality assessment obviously involves much more than just data profiling. 

However, in this post I want to focus on some of the benefits of using a data profiling tool.

Freq’ing Awesome Analysis

Data profiling can help you perform essential analysis such as:

  • Verifying data matches the metadata that describes it
  • Identifying missing values
  • Identifying potential default values
  • Identifying potential invalid values
  • Checking data formats for inconsistencies
  • Preparing meaningful questions to ask subject matter experts

Data profiling can also help you with many of the other aspects of domain, structural and relational integrity, as well as determining functional dependencies, identifying redundant storage and other important data architecture considerations.

How can a data profiling tool help you? Let me count the ways

Data profiling tools provide counts and percentages for each field that summarize its content characteristics such as:

  • NULL – count of the number of records with a NULL value
  • Missing – count of the number of records with a missing value (i.e. non-NULL absence of data e.g. character spaces)
  • Actual – count of the number of records with an actual value (i.e. non-NULL and non-missing)
  • Completeness – percentage calculated as Actual divided by the total number of records
  • Cardinality – count of the number of distinct actual values
  • Uniqueness – percentage calculated as Cardinality divided by the total number of records
  • Distinctness – percentage calculated as Cardinality divided by Actual

The absence of data can be represented many different ways with NULL being most common for relational database columns. However, character fields can contain all spaces or an empty string and numeric fields can contain all zeroes. Consistently representing the absence of data is a common data quality standard.

Completeness and uniqueness are particularly useful in evaluating potential key fields and especially a single primary key, which should be both 100% complete and 100% unique. Required non-key fields may often be 100% complete but a low cardinality could indicate the presence of potential default values.

Distinctness can be useful in evaluating the potential
for duplicate records.  For example, a Tax ID field may be less than
100% complete (i.e. not every record has one) and therefore also less
than 100% unique (i.e. it can not be considered a potential single
primary key because it can not be used to uniquely identify every
record).  If the Tax ID field is also less than 100% distinct (i.e.
some distinct actual values occur on more than one record), then this
could indicate the presence of potential duplicate records.

Data profiling tools will often generate many other useful summary statistics for each field including: minimum/maximum values, minimum/maximum field sizes, and the number of data types (based on analyzing the values, not the metadata).

Show Me the Value (or the Format)

A frequency distribution of the unique formats found in a field is sometimes more useful than the unique values.

A frequency distribution of unique values is useful for:

  • Fields with an extremely low cardinality (i.e. indicating potential default values)
  • Fields with a relatively low cardinality (e.g. gender code and source system code)
  • Fields with a relatively small number of valid values (e.g. state abbreviation and country code)

A frequency distribution of unique formats is useful for:

  • Fields expected to contain a single data type and/or length (e.g. integer surrogate key or ZIP+4 add-on code)
  • Fields with a relatively limited number of valid formats (e.g. telephone number and birth date)
  • Fields with free-form values and a high cardinality  (e.g. customer name and postal address)

Cardinality can play a major role in deciding whether or not you want to be shown values or formats since it is much easier to review all of the values when there are not very many of them. Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values.

Some fields can also be alternatively analyzed using partial values (e.g. birth year extracted from birth date) or a combination of values and formats (e.g. account numbers expected to have a valid alpha prefix followed by all numbers). 

Free-form fields (e.g. personal name) are often easier to analyze as formats constructed by parsing and classifying the individual values within the field (e.g. salutation, given name, family name, title).

Conclusion

Understanding your data is essential to using it effectively and improving its quality.  In order to achieve these goals, there is simply no substitute for data analysis.

A data profiling tool can help you by automating some of the grunt work needed to begin this analysis. However, it is important to remember that the analysis itself can not be automated – you need to review the statistical summaries and frequency distributions generated by the data profiling tool and more importantly – translate your analysis into meaningful reports and questions to share with the rest of the project team. Well performed data profiling is a highly interactive and iterative process.

Data profiling is typically one of the first tasks performed on a data quality project. This is especially true when data is made available before business requirements are documented and subject matter experts are available to discuss usage, relevancy, standards and the metrics for measuring and improving data quality. All of which are necessary to progress from profiling your data to performing a full data quality assessment. However, these are not acceptable excuses for delaying data profiling.

Therefore, grab your favorite caffeinated beverage, settle into your most comfortable chair, roll up your sleeves and…

Get your data freq on! 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures
in Data Profiling (Part 4)

Schrödinger’s Data Quality

Data Gazers

Link to original post

TAGGED: data profiling, data quality
JimHarris July 24, 2009 July 24, 2009
Share This Article
Facebook Twitter Pinterest LinkedIn
Share

Follow us on Facebook

Latest News

Cloud-Based Marketing
Smart Video Bloggers Are Leveraging Cloud-Based Marketing Tools
Cloud Computing IT Marketing
technology and security
Technology in Physical Security: A Guide to Business Safety
Exclusive IT Security
ai for stopping credit card theft
AI Can Manage Credit Card Cybersecurity Risks
IT Security
ai can help with nurse burnout
Breakthroughs in AI Are Helping to Prevent Nurse Burnout
Artificial Intelligence Exclusive

Stay Connected

1.2k Followers Like
33.7k Followers Follow
222 Followers Pin

You Might also Like

big data and agile
Big DataExclusive

Startups Use Data and Agile for Portfolio Management

5 Min Read
analyzing big data for its quality and value
Big Data

Use this Strategic Approach to Maximize Your Data’s Value

6 Min Read
data lineage tool
Big Data

7 Data Lineage Tool Tips For Preventing Human Error in Data Processing

6 Min Read
data quality and role of analytics
Data Quality

Preserving Data Quality is Critical for Leveraging Analytics with Amazon PPC

8 Min Read

SmartData Collective is one of the largest & trusted community covering technical content about Big Data, BI, Cloud, Analytics, Artificial Intelligence, IoT & more.

ai chatbot
The Art of Conversation: Enhancing Chatbots with Advanced AI Prompts
Chatbots
giveaway chatbots
How To Get An Award Winning Giveaway Bot
Big Data Chatbots Exclusive

Quick Link

  • About
  • Contact
  • Privacy
Follow US
© 2008-23 SmartData Collective. All Rights Reserved.
Go to mobile version
Welcome Back!

Sign in to your account

Lost your password?