SmartData Collective
© 2008-23 SmartData Collective. All Rights Reserved.

Turbo-Charge Data Scientist Productivity with a Data Catalog

Andrew Ahn
Last updated: 2018/03/22 at 8:36 PM

The average salary of a data scientist in the U.S. is nearly $130,000, a figure that’s bound to climb as the shortage of people with the requisite skills persists. With that much money at stake, any company would want to get the maximum value from those skills, but by most accounts, data scientists typically spend 80% of their time on the routine and monotonous tasks of finding and organizing data.

They have no choice. Corporations have adopted data lakes enthusiastically, but without good governance and quality control procedures, those data lakes quickly become data swamps. Duplication, inconsistency, omissions, data quality issues, format incompatibilities, acceptable use policies, and permission problems are just some of the obstacles data scientists must navigate to whip information into shape so they can do the analyses and find the insights that matter to the business.

And that’s if they can find the data in the first place. In many organizations, silos have grown up over the years that make important data difficult or impossible to track down. Even if data scientists can locate the right information, they may wait weeks for the owners to make it available. Then begins the laborious task of correcting errors, harmonizing formats, filling in gaps, and resolving conflicts. It’s not surprising that this grunt work can consume most of an expensive data scientist’s time.

Why Data Catalogs Are the Solution

Organizations that are serious about data science need to be serious about data catalogs. Today’s technology enables machines to discover and classify data wherever it lives in the organization. And machine learning technology makes catalogs smarter as they work. With a little help from a human to resolve questions and inconsistencies, data catalogs can quickly learn to make their own decisions without human intervention.

A good rule of thumb is to assume that 80% of the effort is going to center around data-integration activities… A similar 80% of the effort within data integration is to identify and profile data sources.

— Boris Evelson, Forrester Research, March 25, 2015

Forrester Research: Boost Your Business Insights By Converging Big Data And BI

Data catalogs help data scientists in areas other than just information discovery. They’re one of the best ways to identify duplicate or inconsistent information, cutting down on a laborious human task. Tags applied automatically or by humans through crowdsourcing can help data scientists decide if a given dataset is useful or extraneous without requiring them to dig into the data itself. The catalog can also indicate permissions and data governance standards that tell whether it’s OK to use a given set of records.

How Catalogs Ease the Burden on Data Scientists

Data swamps present a formidable challenge to data scientists. Without a clear definition of data types, intended usage, and quality rating, scientists are left to make their best guess about what to use and what to disregard.

Unfortunately, poor data quality is a rampant problem. Experian’s 2017 Global Data Management Benchmark Report found that fewer than half of the organizations surveyed trust their data to make important business decisions. The most frequently cited cause of poor data quality is human error, such as sloppy data entry. Then there is poorly identified data. For example, a string of eight digits may be part of a phone number or Social Security number, an account number, or a date. A smart data catalog can discover and tag the information that’s most relevant to the task, eliminating guesswork and the risk of bad decisions.
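
To make the eight-digit ambiguity concrete, here is a minimal Python sketch of rule-based tagging. It is not how any particular catalog product works: the function name, the tag names, and the single YYYYMMDD date layout are all assumptions chosen for illustration.

```python
from datetime import date


def candidate_tags(value: str) -> list[str]:
    """Return the plausible meanings of an eight-digit string.

    Hypothetical rules for illustration only; a real catalog would
    also use column names, neighbouring values, and learned models.
    """
    if not (value.isdigit() and len(value) == 8):
        return []

    tags = []
    try:
        # Assumption: only the compact YYYYMMDD layout is considered.
        date(int(value[:4]), int(value[4:6]), int(value[6:8]))
        tags.append("date")
    except ValueError:
        pass  # not a valid calendar date in that layout

    # Length alone cannot rule these out, so both stay as candidates.
    tags.append("account_number")
    tags.append("partial_phone_number")
    return tags


print(candidate_tags("20180322"))
print(candidate_tags("99999999"))
```

The first value keeps all three candidate tags, while the second drops "date" because it is not a valid calendar date. Even in this toy version, the value alone rarely settles the question, which is why tagging benefits from extra context and human confirmation.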

Copy sprawl is another challenge. In a perfect world, organizations would have only one “golden copy” of their data, but the reality is that duplication is rampant in most organizations. Sales managers want customer data to populate their customer relationship management systems. Marketing wants it for a lead nurturing program. The support team wants it to build their service history database.

International Data Corp. has estimated that up to 60% of storage capacity in a typical enterprise consists of these kinds of copies, but fewer than 20% of organizations have copy-management standards. Gartner analyst Dave Russell estimates many companies keep between 30 and 40 copies of business data for purposes ranging from backups to regulatory compliance.

As each group gets its own extract of production data, the costs and risks grow. Updates to one copy aren’t reflected in the others, creating discontinuity. No one knows what the truth is, which makes analyzing data for critical business decisions a risky affair.

An enterprise data catalog brings order out of this chaos by “fingerprinting” data and tagging backups and extracts so that there’s never any confusion about which copies are valid. A catalog doesn’t prevent copies from being made, but it can designate ownership, flag data that’s been modified, and even specify rules about how those copies can be used.
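
The article does not say how fingerprinting is implemented. One common approach, sketched below as an assumption, is to hash a dataset's content so that identical copies produce the same digest and modified copies do not. The function names and the sample rows are hypothetical.

```python
import hashlib


def fingerprint(rows) -> str:
    """Content hash of a dataset that ignores row order (illustrative)."""
    digest = hashlib.sha256()
    # Serialize each row, then sort so reordered extracts hash the same.
    for line in sorted(",".join(map(str, row)) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def copy_status(copy_rows, golden_rows) -> str:
    """Label a copy by comparing its fingerprint with the golden copy's."""
    same = fingerprint(copy_rows) == fingerprint(golden_rows)
    return "valid copy" if same else "modified"


golden = [("c001", "Ada"), ("c002", "Grace")]
crm_extract = [("c002", "Grace"), ("c001", "Ada")]      # same content, reordered
stale_copy = [("c001", "Ada"), ("c002", "G. Hopper")]   # edited after extraction

print(copy_status(crm_extract, golden))
print(copy_status(stale_copy, golden))
```

An exact hash only detects identical content. Recognizing near-duplicates or partial extracts would need a fuzzier technique, which is beyond this sketch.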

Stricter Privacy Rules Make Data Catalogs Even More Important

The need for a data catalog will become even more pronounced as new privacy rules take effect in Europe and elsewhere. These regulations place strict limits on how personal data may be used for purposes like profiling and segmentation. Information may need to be anonymized or deleted depending on the permissions that have been granted by the subject individual. This directly impacts the types of data science applications that can be used.

For example, a marketing organization may want to target promotions at individual households. Residents who have given permission for such contact may receive customized offers, while those who haven’t may receive only general promotions or may not be contacted at all. A data catalog can specify at a fine level of granularity what kinds of information may be used for targeting, thereby avoiding large fines for the company. The data scientist is protected when legitimate usage is defined by the data catalog.
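
The household example can be sketched as a simple permission check. The record layout, field names, and the rule that non-consenting households receive a general promotion (rather than no contact) are assumptions for illustration, not a statement of what any regulation requires.

```python
from dataclasses import dataclass


@dataclass
class Household:
    household_id: str
    consented_to_targeting: bool  # hypothetical flag supplied by the catalog


def choose_offer(household: Household) -> str:
    """Pick the kind of promotion a household may receive."""
    if household.consented_to_targeting:
        return "customized offer"
    # Assumed policy: fall back to a non-personalized promotion.
    return "general promotion"


households = [Household("h1", True), Household("h2", False)]
for h in households:
    print(h.household_id, choose_offer(h))
```

Keeping the permission flag next to the data means the targeting code never has to guess whether a given use is allowed.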

Data catalogs set the ground rules for how data is stored and labeled across an organization. This is particularly useful for companies that have grown rapidly through mergers and acquisitions, a phenomenon that tends to stoke the data silo problem. Introducing a catalog gives those companies a chance to get a clean start with a unified view that applies to all data.

When you do the math, the benefits of a data catalog quickly exceed the costs. For example, if a catalog can save 30% of a data scientist’s time that’s currently wasted on searching and prepping, that’s roughly $40,000 per year (30% of the $130,000 average salary is $39,000). And that’s not even taking into account the business benefits of having that person working in a satisfying, challenging job doing what you hired him or her for.
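
A quick check of this arithmetic, using the $130,000 average salary cited at the start of the article. The function name is illustrative, and treating salary as the only cost is a simplification.

```python
def annual_savings(salary: int, time_saved_pct: int) -> float:
    """Salary value of the time a catalog frees up each year."""
    return salary * time_saved_pct / 100


# 30% of a $130,000 salary
print(annual_savings(130_000, 30))  # 39000.0
```

For a team, the same calculation scales linearly with headcount, so five data scientists at this salary would represent about $195,000 a year in recovered time.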

Andrew Ahn, March 23, 2018
Andrew Ahn is Vice President of Product Management for Waterline Data. He is an Apache Atlas committer and was the lead at Hortonworks for Hadoop governance strategy. Prior work includes product and governance responsibilities at ICE/NYSE Euronext, spanning 12 countries and 23 market centers.