Turbo-Charge Data Scientist Productivity with a Data Catalog


The average salary of a data scientist in the U.S. is nearly $130,000, a figure that’s bound to climb as the shortage of people with the requisite skills persists. With that kind of investment at stake, any company would want to maximize the return on it. Yet by most accounts, data scientists typically spend 80% of their time on the routine and monotonous tasks of finding and organizing data.

They have no choice. Corporations have adopted data lakes enthusiastically, but without good governance and quality control procedures, those data lakes quickly become data swamps. Duplication, inconsistency, omissions, data quality issues, format incompatibilities, acceptable use policies, and permission problems are just some of the obstacles data scientists must navigate to whip information into shape so they can do the analyses and find the insights that matter to the business.

And that’s if they can find the data in the first place. In many organizations, silos have grown up over the years that make important data difficult or impossible to track down. Even if data scientists can locate the right information, they may wait weeks for the owners to make it available. Then begins the laborious task of correcting errors, harmonizing formats, filling in gaps, and resolving conflicts. It’s not surprising that this grunt work can consume most of an expensive data scientist’s time.

Why Data Catalogs Are the Solution

Organizations that are serious about data science need to be serious about data catalogs. Today’s technology enables machines to discover and classify data wherever it lives in the organization. And machine learning technology makes catalogs smarter as they work. With occasional human help to resolve questions and inconsistencies, data catalogs quickly learn to make many of these decisions on their own.

A good rule of thumb is to assume that 80% of the effort is going to center around data-integration activities… A similar 80% of the effort within data integration is to identify and profile data sources.

— Boris Evelson, Forrester Research, March 25, 2015

Forrester Research: Boost Your Business Insights By Converging Big Data And BI

Data catalogs help data scientists in areas beyond information discovery. They’re one of the best ways to identify duplicate or inconsistent information, cutting down on a laborious human task. Tags applied automatically or by humans through crowdsourcing can help data scientists decide whether a given dataset is useful or extraneous without requiring them to dig into the data itself. The catalog can also indicate permissions and data governance standards that tell whether it’s OK to use a given set of records.
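To make this concrete, here is a deliberately simplified sketch of how tags and governance metadata in a catalog entry might let a data scientist filter datasets without opening them. The `CatalogEntry` structure, field names, and `usable_for` helper are all hypothetical, for illustration only; real catalogs track far richer metadata.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Hypothetical, minimal catalog record: a name plus crowdsourced or
    machine-applied tags and governance-approved uses."""
    name: str
    tags: set = field(default_factory=set)
    approved_uses: set = field(default_factory=set)

def usable_for(entries, purpose, required_tag):
    """Return datasets tagged as relevant AND approved for the given purpose."""
    return [e.name for e in entries
            if required_tag in e.tags and purpose in e.approved_uses]

entries = [
    CatalogEntry("crm_customers", {"customer", "pii"}, {"reporting"}),
    CatalogEntry("web_sessions", {"customer", "behavioral"}, {"reporting", "modeling"}),
]

# Only web_sessions is both customer-related and approved for modeling.
print(usable_for(entries, "modeling", "customer"))  # ['web_sessions']
```

The point is that the decision happens on metadata alone: the scientist never has to inspect the underlying records to rule a dataset in or out.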

How Catalogs Ease the Burden on Data Scientists

Data swamps present a formidable challenge to data scientists. Without a clear definition of data types, intended usage, and quality rating, scientists are left to make their best guess about what to use and what to disregard.

Unfortunately, poor data quality is a rampant problem. Experian’s 2017 Global Data Management Benchmark Report found that fewer than half of the organizations surveyed trust their data to make important business decisions. The most frequently cited cause of poor data quality is human error, such as sloppy data entry. Then there is poorly identified data. For example, a string of eight digits may be a partial phone number, a Social Security number, an account number, or a date. A smart data catalog can discover and tag the information that’s most relevant to the task, eliminating guesswork and the risk of bad decisions.
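The eight-digit example above can be sketched in code. This is a toy heuristic, not how any particular catalog product works: a real smart catalog would combine many signals (column names, value distributions, reference data) before tagging a field.

```python
import re
from datetime import datetime

def guess_meaning(digits: str):
    """Heuristic guesses for what an unlabeled eight-digit string might be.
    Illustrative only; returns all plausible interpretations."""
    guesses = []
    if re.fullmatch(r"\d{8}", digits):
        # It parses as a calendar date only for valid YYYYMMDD values.
        try:
            datetime.strptime(digits, "%Y%m%d")
            guesses.append("date (YYYYMMDD)")
        except ValueError:
            pass
        # Without more context, these remain possibilities as well.
        guesses.append("account number")
        guesses.append("partial phone number")
    return guesses

print(guess_meaning("20240315"))  # includes 'date (YYYYMMDD)'
print(guess_meaning("99999999"))  # not a valid date, so that guess is dropped
```

Even this crude check shows why context matters: the same eight digits can carry several meanings, and only additional metadata can disambiguate them.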

Copy sprawl is another challenge. In a perfect world, organizations would have only one “golden copy” of their data, but the reality is that duplication is rampant in most organizations. Sales managers want customer data to populate their customer relationship management systems. Marketing wants it for a lead nurturing program. The support team wants it to build their service history database.

International Data Corp. has estimated that up to 60% of storage capacity in a typical enterprise consists of these kinds of copies, but fewer than 20% of organizations have copy-management standards. Gartner analyst Dave Russell estimates many companies keep between 30 and 40 copies of business data for purposes ranging from backups to regulatory compliance.

As each group gets its own extract of production data, the costs and risks grow. Updates to one copy aren’t reflected in the others, creating discontinuity. No one knows what the truth is, which makes analyzing data for critical business decisions a risky affair.

An enterprise data catalog brings order out of this chaos by “fingerprinting” data and tagging backups and extracts so that there’s never any confusion about which copies are valid. A catalog doesn’t prevent copies from being made, but it can designate ownership, flag data that’s been modified, and even specify rules about how those copies can be used.
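The fingerprinting idea can be illustrated with a content hash. This sketch is an assumption about one possible approach, not a description of any vendor’s implementation; production catalogs typically fingerprint at the column level and use sampling or sketching for large datasets.

```python
import hashlib

def fingerprint(rows):
    """Order-insensitive content fingerprint: identical rows produce the
    same digest regardless of row order, so extracts match their source."""
    h = hashlib.sha256()
    for row in sorted(",".join(map(str, r)) for r in rows):
        h.update(row.encode())
    return h.hexdigest()

master  = [("alice", "a@x.com"), ("bob", "b@x.com")]
extract = [("bob", "b@x.com"), ("alice", "a@x.com")]  # same rows, reordered
edited  = [("alice", "a@y.com"), ("bob", "b@x.com")]  # one value modified

print(fingerprint(master) == fingerprint(extract))  # True: a faithful copy
print(fingerprint(master) == fingerprint(edited))   # False: flag as modified
```

Matching fingerprints let the catalog tag a new extract as a copy of a known golden source, while a mismatch flags data that has drifted and should not be trusted as the original.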

Stricter Privacy Rules Make Data Catalogs Even More Important

The need for a data catalog will become even more pronounced as new privacy rules take effect in Europe and elsewhere. These regulations place strict limits on how personal data may be used for purposes like profiling and segmentation. Information may need to be anonymized or deleted depending on the permissions that have been granted by the subject individual. This directly impacts the types of data science applications that can be used.

For example, a marketing organization may want to target promotions at individual households. Residents who have given permission for such contact may receive customized offers, while those who haven’t may receive only general promotions or may not be contacted at all. A data catalog can specify at a fine level of granularity what kinds of information may be used for targeting, helping the company avoid large fines. When the catalog defines legitimate usage up front, the data scientist is protected from inadvertently violating those rules.
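The household-targeting scenario might look like the following sketch, where consent recorded in catalog metadata gates which records enter an audience. The record layout and the `"custom_offers"` consent label are hypothetical.

```python
# Hypothetical consent metadata, as a catalog might record it per household.
households = [
    {"id": 1, "consent": {"custom_offers"}},
    {"id": 2, "consent": set()},                       # no marketing consent on file
    {"id": 3, "consent": {"custom_offers", "profiling"}},
]

def audience(records, required_consent):
    """Admit only households whose recorded consent covers the use case."""
    return [r["id"] for r in records if required_consent in r["consent"]]

# Household 2 is automatically excluded from customized offers.
print(audience(households, "custom_offers"))  # [1, 3]
```

Because the filter is driven by catalog metadata rather than the data scientist’s judgment, compliance becomes a property of the pipeline instead of a per-project decision.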

Data catalogs set the ground rules for how data is stored and labeled across an organization. This is particularly useful for companies that have grown rapidly through mergers and acquisitions, a phenomenon that tends to stoke the data silo problem. Introducing a catalog gives those companies a chance to get a clean start with a unified view that applies to all data.

When you do the math, the benefits of a data catalog quickly exceed the costs. For example, if a catalog can save 30% of a data scientist’s time that’s currently wasted on searching and prepping, that’s roughly $40,000 per year. And that’s not even taking into account the business benefits of having that person working in a satisfying, challenging job doing what you hired him or her for.
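The back-of-envelope math is simply the salary figure cited at the top of the article multiplied by the time recovered:

```python
salary = 130_000   # average U.S. data scientist salary cited above
time_saved = 0.30  # portion of total time a catalog might recover

savings = salary * time_saved
print(f"${savings:,.0f} per year")  # $39,000 per year, i.e. roughly $40,000
```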

Andrew Ahn is Vice President of Product Management for Waterline Data. He is an Apache Atlas committer and was the lead at Hortonworks for Hadoop governance strategy. Prior work includes product and governance responsibilities at ICE/NYSE Euronext, spanning 12 countries and 23 market centers.