© 2008-25 SmartData Collective. All Rights Reserved.

Turbo-Charge Data Scientist Productivity with a Data Catalog

Andrew Ahn

The average salary of a data scientist in the U.S. is nearly $130,000, a figure that’s bound to climb as the shortage of people with the requisite skills persists. With that much money at stake, any company would want to get the maximum value out of its investment, yet by most accounts data scientists spend about 80% of their time on the routine, monotonous work of finding and organizing data.

They have no choice. Corporations have adopted data lakes enthusiastically, but without good governance and quality control procedures, those data lakes quickly become data swamps. Duplication, inconsistency, omissions, data quality issues, format incompatibilities, acceptable use policies, and permission problems are just some of the obstacles data scientists must navigate to whip information into shape so they can do the analyses and find the insights that matter to the business.

And that’s if they can find the data in the first place. In many organizations, silos have grown up over the years that make important data difficult or impossible to track down. Even if data scientists can locate the right information, they may wait weeks for the owners to make it available. Then begins the laborious task of correcting errors, harmonizing formats, filling in gaps, and resolving conflicts. It’s not surprising that this grunt work can consume most of an expensive data scientist’s time.

Why Data Catalogs Are the Solution

Organizations that are serious about data science need to be serious about data catalogs. Today’s technology enables machines to discover and classify data wherever it lives in the organization. And machine learning technology makes catalogs smarter as they work. With a little help from a human to resolve questions and inconsistencies, data catalogs can quickly learn to make their own decisions without human intervention.

A good rule of thumb is to assume that 80% of the effort is going to center around data-integration activities… A similar 80% of the effort within data integration is to identify and profile data sources.

— Boris Evelson, Forrester Research, March 25, 2015

Forrester Research: Boost Your Business Insights By Converging Big Data And BI

Data catalogs help data scientists in areas other than just information discovery. They’re one of the best ways to identify duplicate or inconsistent information, cutting down on a laborious human task. Tags applied automatically or by humans through crowdsourcing can help data scientists decide if a given dataset is useful or extraneous without requiring them to dig into the data itself. The catalog can also indicate permissions and data governance standards that tell whether it’s OK to use a given set of records.

How Catalogs Ease the Burden on Data Scientists

Data swamps present a formidable challenge to data scientists. Without a clear definition of data types, intended usage, and quality rating, scientists are left to make their best guess about what to use and what to disregard.

Unfortunately, poor data quality is a rampant problem. Experian’s 2017 Global Data Management Benchmark Report found that fewer than half of the organizations surveyed trust their data to make important business decisions. The most frequently cited cause of poor data quality is human error, such as sloppy data entry. Then there is poorly identified data. For example, a string of eight digits may be part of a phone number or Social Security number, an account number, or a date. A smart data catalog can discover and tag the information that’s most relevant to the task, eliminating guesswork and the risk of bad decisions.
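To make the eight-digit ambiguity concrete, here is a minimal Python sketch that lists the readings a bare digit string could support. It is an illustration only, not how any particular catalog product works; the function name `candidate_types` and the tag names are hypothetical.

```python
from datetime import datetime

# Compact date layouts to try (an assumed, illustrative set).
DATE_FORMATS = ("%Y%m%d", "%m%d%Y", "%d%m%Y")

def candidate_types(value: str) -> list[str]:
    """Return every plausible reading of an eight-digit string."""
    if len(value) != 8 or not value.isdigit():
        return []
    tags = []
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)  # raises ValueError if not a valid date
            tags.append(f"date({fmt})")
        except ValueError:
            pass
    # Any eight-digit string could also be an opaque identifier,
    # such as an account number or a fragment of a longer number.
    tags.append("identifier")
    return tags

print(candidate_types("12052017"))
print(candidate_types("99999999"))
```

The first value parses as a date under two different layouts (December 5 or May 12) and could still be an identifier, which is exactly the guesswork that a tag recorded in the catalog removes.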

Copy sprawl is another challenge. In a perfect world, organizations would have only one “golden copy” of their data, but the reality is that duplication is rampant in most organizations. Sales managers want customer data to populate their customer relationship management systems. Marketing wants it for a lead nurturing program. The support team wants it to build their service history database.

International Data Corp. has estimated that up to 60% of storage capacity in a typical enterprise consists of these kinds of copies, but fewer than 20% of organizations have copy-management standards. Gartner analyst Dave Russell estimates many companies keep between 30 and 40 copies of business data for purposes ranging from backups to regulatory compliance.

As each group gets its own extract of production data, the costs and risks grow. Updates to one copy aren’t reflected in the others, creating discontinuity. No one knows what the truth is, which makes analyzing data for critical business decisions a risky affair.

An enterprise data catalog brings order out of this chaos by “fingerprinting” data and tagging backups and extracts so that there’s never any confusion about which copies are valid. A catalog doesn’t prevent copies from being made, but it can designate ownership, flag data that’s been modified, and even specify rules about how those copies can be used.
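One simple way to picture fingerprinting is a content hash: two extracts with the same rows produce the same digest, and any edit changes it. The Python sketch below assumes that approach for illustration; real catalogs may use quite different techniques, and the names `fingerprint` and `copy_status` and the sample records are hypothetical.

```python
import hashlib

def fingerprint(rows):
    """Order-independent SHA-256 fingerprint of a dataset's rows."""
    digest = hashlib.sha256()
    # Sort the serialized rows so row order does not change the result.
    # Joining with commas is naive (values containing commas could collide).
    for line in sorted(",".join(map(str, row)) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

def copy_status(golden_rows, copy_rows):
    """Report whether a copy still matches the golden copy."""
    same = fingerprint(golden_rows) == fingerprint(copy_rows)
    return "in sync" if same else "modified"

golden = [(1, "Ada", "ada@example.com"), (2, "Grace", "grace@example.com")]
crm_extract = [(2, "Grace", "grace@example.com"), (1, "Ada", "ada@example.com")]
marketing_extract = [(1, "Ada", "ada@example.com"), (2, "Grace", "g.hopper@example.com")]

print(copy_status(golden, crm_extract))        # same rows, different order
print(copy_status(golden, marketing_extract))  # one email was edited
```

Storing the digest alongside each copy's owner and usage rules would let a catalog flag modified extracts without comparing the data row by row.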

Stricter Privacy Rules Make Data Catalogs Even More Important

The need for a data catalog will become even more pronounced as new privacy rules take effect in Europe and elsewhere. These regulations place strict limits on how personal data may be used for purposes like profiling and segmentation. Information may need to be anonymized or deleted depending on the permissions that have been granted by the subject individual. This directly impacts the types of data science applications that can be used.

For example, a marketing organization may want to target promotions at individual households. Residents who have given permission for such contact may receive customized offers, while those who haven’t may receive only general promotions or may not be contacted at all. A data catalog can specify at a fine level of granularity what kinds of information may be used for targeting, thereby avoiding large fines for the company. The data scientist is protected when legitimate usage is defined by the data catalog.
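The household example can be sketched as a permission check that runs before any targeting. This is a simplified Python illustration; the consent labels (`personalized_offers`, `general_promotions`) and the `Household` class are assumptions made for the example, not terms taken from any regulation or product.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Household:
    household_id: str
    # Purposes the residents have agreed to (hypothetical labels).
    consents: set = field(default_factory=set)

def campaign_for(household: Household) -> Optional[str]:
    """Pick the most specific campaign the recorded consents allow."""
    if "personalized_offers" in household.consents:
        return "customized"
    if "general_promotions" in household.consents:
        return "general"
    return None  # no permission on record: do not contact

households = [
    Household("H1", {"personalized_offers", "general_promotions"}),
    Household("H2", {"general_promotions"}),
    Household("H3"),
]

for h in households:
    print(h.household_id, campaign_for(h))
```

Keeping the rule in one place, driven by consent metadata, means the data scientist does not have to decide case by case whether a record may be used.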

Data catalogs set the ground rules for how data is stored and labeled across an organization. This is particularly useful for companies that have grown rapidly through mergers and acquisitions, a phenomenon that tends to stoke the data silo problem. Introducing a catalog gives those companies a chance to get a clean start with a unified view that applies to all data.

When you do the math, the benefits of a data catalog quickly exceed the costs. For example, if a catalog can save 30% of a data scientist’s time that’s currently wasted on searching and prepping, that’s roughly $39,000 per year at a $130,000 salary. And that’s not even taking into account the business benefits of having that person working in a satisfying, challenging job doing what you hired him or her for.
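The arithmetic behind that estimate is easy to check and to scale to a team. The short Python sketch below uses the salary and the 30% figure from the article; the function name `annual_savings` and the ten-person team are hypothetical.

```python
def annual_savings(salary: float, share_recovered: float, headcount: int = 1) -> int:
    """Yearly value of recovered time, rounded to whole dollars."""
    return round(salary * share_recovered * headcount)

# One data scientist at the article's average salary.
print(annual_savings(130_000, 0.30))

# A hypothetical team of ten.
print(annual_savings(130_000, 0.30, headcount=10))
```

Comparing that figure with the yearly cost of a catalog gives a rough break-even point, although the calculation leaves out the less tangible benefits.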

Andrew Ahn is Vice President of Product Management for Waterline Data. He is an Apache Atlas committer and was the lead at Hortonworks for Hadoop governance strategy. Prior work includes product and governance responsibilities at ICE/NYSE Euronext, spanning 12 countries and 23 market centers.
