Cookies help us display personalized product recommendations and ensure you have great shopping experience.

By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
SmartData CollectiveSmartData Collective
  • Analytics
    AnalyticsShow More
    image fx (67)
    Improving LinkedIn Ad Strategies with Data Analytics
    9 Min Read
    big data and remote work
    Data Helps Speech-Language Pathologists Deliver Better Results
    6 Min Read
    data driven insights
    How Data-Driven Insights Are Addressing Gaps in Patient Communication and Equity
    8 Min Read
    pexels pavel danilyuk 8112119
    Data Analytics Is Revolutionizing Medical Credentialing
    8 Min Read
    data and seo
    Maximize SEO Success with Powerful Data Analytics Insights
    8 Min Read
  • Big Data
  • BI
  • Exclusive
  • IT
  • Marketing
  • Software
Search
© 2008-25 SmartData Collective. All Rights Reserved.
Reading: Getting Your Data Freq On
Share
Notification
Font ResizerAa
SmartData CollectiveSmartData Collective
Font ResizerAa
Search
  • About
  • Help
  • Privacy
Follow US
© 2008-23 SmartData Collective. All Rights Reserved.
SmartData Collective > Big Data > Data Mining > Getting Your Data Freq On
Data Mining

Getting Your Data Freq On

JimHarris
JimHarris
9 Min Read
SHARE

One of the most basic features of a data profiling tool is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources. 

Contents
Freq’ing Awesome AnalysisHow can a data profiling tool help you? Let me count the . …Freq’ing Awesome AnalysisHow can a data profiling tool help you? Let me count the waysShow Me the Value (or the Format)ConclusionRelated Posts

Data profiling is often performed during a data quality assessment and involves much more than reviewing the output generated by a data profiling tool and a data quality assessment obviously involves much more than just data profiling. 

However, in this post I want to focus on some of the benefits of using a data profiling tool.

Freq’ing Awesome Analysis

Data profiling can help you perform essential analysis such as:

More Read

Can We Learn From Anti-Social Users?
Igniting the New Intelligence
Blog Boy wins…
Revealing Human Nature through Social Media Measurement
Getting Ready for the Post-Season: Numerati Baseball
  • Verifying data matches the metadata that describes it
  • Identifying missing values
  • Identifying potential default values
  • Identifying potential invalid values
  • Checking data formats for inconsistencies
  • Preparing meaningful questions to ask subject matter experts

Data profiling can also help you with many of the other aspects of domain, structural and relational integrity, as well as determining functional dependencies, identifying redundant storage and other important data architecture considerations.

How can a data profiling tool help you? Let me count the . …

One of the most basic features of a data profiling tool is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources. 

Data profiling is often performed during a data quality assessment and involves much more than reviewing the output generated by a data profiling tool and a data quality assessment obviously involves much more than just data profiling. 

However, in this post I want to focus on some of the benefits of using a data profiling tool.

Freq’ing Awesome Analysis

Data profiling can help you perform essential analysis such as:

  • Verifying data matches the metadata that describes it
  • Identifying missing values
  • Identifying potential default values
  • Identifying potential invalid values
  • Checking data formats for inconsistencies
  • Preparing meaningful questions to ask subject matter experts

Data profiling can also help you with many of the other aspects of domain, structural and relational integrity, as well as determining functional dependencies, identifying redundant storage and other important data architecture considerations.

How can a data profiling tool help you? Let me count the ways

Data profiling tools provide counts and percentages for each field that summarize its content characteristics such as:

  • NULL – count of the number of records with a NULL value
  • Missing – count of the number of records with a missing value (i.e. non-NULL absence of data e.g. character spaces)
  • Actual – count of the number of records with an actual value (i.e. non-NULL and non-missing)
  • Completeness – percentage calculated as Actual divided by the total number of records
  • Cardinality – count of the number of distinct actual values
  • Uniqueness – percentage calculated as Cardinality divided by the total number of records
  • Distinctness – percentage calculated as Cardinality divided by Actual

The absence of data can be represented many different ways with NULL being most common for relational database columns. However, character fields can contain all spaces or an empty string and numeric fields can contain all zeroes. Consistently representing the absence of data is a common data quality standard.

Completeness and uniqueness are particularly useful in evaluating potential key fields and especially a single primary key, which should be both 100% complete and 100% unique. Required non-key fields may often be 100% complete but a low cardinality could indicate the presence of potential default values.

Distinctness can be useful in evaluating the potential
for duplicate records.  For example, a Tax ID field may be less than
100% complete (i.e. not every record has one) and therefore also less
than 100% unique (i.e. it can not be considered a potential single
primary key because it can not be used to uniquely identify every
record).  If the Tax ID field is also less than 100% distinct (i.e.
some distinct actual values occur on more than one record), then this
could indicate the presence of potential duplicate records.

Data profiling tools will often generate many other useful summary statistics for each field including: minimum/maximum values, minimum/maximum field sizes, and the number of data types (based on analyzing the values, not the metadata).

Show Me the Value (or the Format)

A frequency distribution of the unique formats found in a field is sometimes more useful than the unique values.

A frequency distribution of unique values is useful for:

  • Fields with an extremely low cardinality (i.e. indicating potential default values)
  • Fields with a relatively low cardinality (e.g. gender code and source system code)
  • Fields with a relatively small number of valid values (e.g. state abbreviation and country code)

A frequency distribution of unique formats is useful for:

  • Fields expected to contain a single data type and/or length (e.g. integer surrogate key or ZIP+4 add-on code)
  • Fields with a relatively limited number of valid formats (e.g. telephone number and birth date)
  • Fields with free-form values and a high cardinality  (e.g. customer name and postal address)

Cardinality can play a major role in deciding whether or not you want to be shown values or formats since it is much easier to review all of the values when there are not very many of them. Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values.

Some fields can also be alternatively analyzed using partial values (e.g. birth year extracted from birth date) or a combination of values and formats (e.g. account numbers expected to have a valid alpha prefix followed by all numbers). 

Free-form fields (e.g. personal name) are often easier to analyze as formats constructed by parsing and classifying the individual values within the field (e.g. salutation, given name, family name, title).

Conclusion

Understanding your data is essential to using it effectively and improving its quality.  In order to achieve these goals, there is simply no substitute for data analysis.

A data profiling tool can help you by automating some of the grunt work needed to begin this analysis. However, it is important to remember that the analysis itself can not be automated – you need to review the statistical summaries and frequency distributions generated by the data profiling tool and more importantly – translate your analysis into meaningful reports and questions to share with the rest of the project team. Well performed data profiling is a highly interactive and iterative process.

Data profiling is typically one of the first tasks performed on a data quality project. This is especially true when data is made available before business requirements are documented and subject matter experts are available to discuss usage, relevancy, standards and the metrics for measuring and improving data quality. All of which are necessary to progress from profiling your data to performing a full data quality assessment. However, these are not acceptable excuses for delaying data profiling.

Therefore, grab your favorite caffeinated beverage, settle into your most comfortable chair, roll up your sleeves and…

Get your data freq on! 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures
in Data Profiling (Part 4)

Schrödinger’s Data Quality

Data Gazers

Link to original post

TAGGED:data profilingdata quality
Share This Article
Facebook Pinterest LinkedIn
Share

Follow us on Facebook

Latest News

image fx (2)
Monitoring Data Without Turning into Big Brother
Big Data Exclusive
image fx (71)
The Power of AI for Personalization in Email
Artificial Intelligence Exclusive Marketing
image fx (67)
Improving LinkedIn Ad Strategies with Data Analytics
Analytics Big Data Exclusive Software
big data and remote work
Data Helps Speech-Language Pathologists Deliver Better Results
Analytics Big Data Exclusive

Stay Connected

1.2kFollowersLike
33.7kFollowersFollow
222FollowersPin

You Might also Like

Sun Tzu and the Art of Data Quality (Part 3)

5 Min Read

“Entry Point: The Call Center or the Death Star

8 Min Read

La Trahison des Données

6 Min Read

DQ is 1/3 Process Knowledge + 1/3 Business Knowledge + 1/3 Intuition

5 Min Read

SmartData Collective is one of the largest & trusted community covering technical content about Big Data, BI, Cloud, Analytics, Artificial Intelligence, IoT & more.

AI chatbots
AI Chatbots Can Help Retailers Convert Live Broadcast Viewers into Sales!
Chatbots
data-driven web design
5 Great Tips for Using Data Analytics for Website UX
Big Data

Quick Link

  • About
  • Contact
  • Privacy
Follow US
© 2008-25 SmartData Collective. All Rights Reserved.
Go to mobile version
Welcome Back!

Sign in to your account

Username or Email Address
Password

Lost your password?