What To Know About The Impact of Data Quality and Quantity In AI

The impact of data quality and quantity in artificial intelligence (AI) is important to monitor and consider. Here's what to know about it.

November 16, 2018

Believe it or not, there is such a thing as “good data” and “bad data,” especially when it comes to AI. To be more specific, just having data available isn’t enough: There’s a distinction worth making between “useful” and “not-so-useful” data. Sometimes data must be discarded on sight because of how or where it was collected, signs of inaccuracy or forgery, or other red flags. Other times, data can be processed first, then passed on for use in artificial intelligence development.

A closer look at this process reveals a symbiotic relationship between our ability to gather and process data and our ability to build ever-smarter artificial intelligence. Data and machine learning both power AI, and AI, in turn, delivers more sophisticated machine learning tools. It’s a self-reinforcing cycle that has implications for businesses of every type and size, not to mention statisticians and scientists.

Why “Bad Data” Exists and Quantity Isn’t Enough

Why is there even a question of quality when it comes to data for AI? Isn’t having access to huge amounts of data enough? The answer is no — it’s not enough. And it’s because of factors like:

  • Incredibly high volumes of data from many channels
  • The geographic significance of where the data was gathered
  • Multiple file types and structured and unstructured data
  • Data that is inadmissible, based on regional privacy restrictions
  • Potential counterfeit data purchased on marketplaces

Machine learning is one tool used in the process of developing AI. A layman’s description of machine learning involves collecting a huge amount of data and using it to “train” an artificial intelligence to observe and recognize patterns based on known examples. Until machine learning, most of us assumed true AI would only come about thanks to painstaking, line-by-line coding that foresaw, in advance, every potential eventuality. We see now this was an error, not least because nobody can anticipate every eventuality ahead of time.

And it brings us back to the idea that not every kind of data, and not every data source, is useful or of sufficiently high quality for the machine learning algorithms that power artificial intelligence development — no matter the ultimate purpose of that AI application. After all, you quickly reach diminishing returns when it comes to data quantity: A data set only needs to be so big before it’s truly representative of the whole. But figuring out what “the whole” is, in the first place, is what machine learning is for — and relying on huge troves of duplicated or inaccurate data is a poor way to build context and understanding.
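To make the diminishing-returns point concrete, here is a minimal sketch in Python. It assumes the simplest possible case, estimating an average from independent observations, where uncertainty shrinks with the square root of the sample size. The function names and the spread of 10.0 are hypothetical, chosen only for illustration.

```python
import math

def standard_error(std_dev: float, n: int) -> float:
    """Standard error of a sample mean for n independent observations."""
    if n <= 0:
        raise ValueError("n must be positive")
    return std_dev / math.sqrt(n)

def effective_size(records: list) -> int:
    """Number of distinct records; exact duplicates add no new information."""
    return len(set(records))

# Hypothetical spread of 10.0 units in the quantity being measured.
for n in (100, 10_000, 1_000_000):
    print(n, round(standard_error(10.0, n), 3))

# Three rows, but only two distinct observations.
print(effective_size(["a", "a", "b"]))
```

Under these assumptions, each 100-fold increase in data only cuts the uncertainty by a factor of 10, and duplicated rows don't count toward the total at all.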

According to experts, compiling a store of data that’s equal parts large and useful requires a lot of manual effort. Additional insight from the world of data science indicates poor data quality is a leading cause of wasted investments in IT departments and a significant source of lost trust in enterprise-level management tools that inform business decisions.

So the stakes are high. Let’s go into more detail about why AI and high data quality go hand in hand.

The Relationship Between Data Quality and AI Is Symbiotic

The users of nearly all product types are taking a keener interest than ever in how those products get made. It’s much the same for the users of automation software, business intelligence platforms, route planning, mapping and any other business-facing AI application. Users have certain expectations about how these tools are produced: namely, that the data powering these tools and insights is not:

  • Duplicated, counterfeit or stolen
  • Incomplete
  • Corrupted or broken
  • Inconsistent or incomprehensible

In other words, just as you can’t trust a car built with substandard materials, you can’t rely on the analytics and insights AI promises if that AI was built on substandard data.
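Some of the red flags listed above can be checked with a few lines of code before any data reaches a model. What follows is a rough sketch, not a complete validation tool: the order schema (order_id, sku, quantity) and the rules for what counts as “incomplete” or “inconsistent” are assumptions made up for this example.

```python
from collections import Counter

REQUIRED_FIELDS = ("order_id", "sku", "quantity")  # hypothetical schema

def quality_report(records: list[dict]) -> dict:
    """Count duplicated, incomplete and inconsistent records."""
    # Duplicated: the same order_id appears more than once.
    id_counts = Counter(
        r.get("order_id") for r in records if r.get("order_id") is not None
    )
    duplicated = sum(count - 1 for count in id_counts.values() if count > 1)

    # Incomplete: any required field is missing or empty.
    incomplete = sum(
        1 for r in records
        if any(r.get(field) in (None, "") for field in REQUIRED_FIELDS)
    )

    # Inconsistent: quantity is present but is not a non-negative integer.
    inconsistent = sum(
        1 for r in records
        if r.get("quantity") not in (None, "")
        and not (isinstance(r["quantity"], int) and r["quantity"] >= 0)
    )

    return {
        "total": len(records),
        "duplicated": duplicated,
        "incomplete": incomplete,
        "inconsistent": inconsistent,
    }

sample = [
    {"order_id": 1, "sku": "A-100", "quantity": 3},
    {"order_id": 1, "sku": "A-100", "quantity": 3},      # duplicate
    {"order_id": 2, "sku": "", "quantity": 1},           # incomplete
    {"order_id": 3, "sku": "B-200", "quantity": "two"},  # inconsistent
]
print(quality_report(sample))
```

Checks like these won’t catch counterfeit or stolen data, which usually requires knowing where the data came from, but they give a quick first read on how much cleanup a data set needs.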

So, the development of artificial intelligence platforms that deliver meaningful and actionable insights in real-world conditions requires high-quality data. The good news is, AI, in turn, helps us collect and store even more useful data over time.

To begin with, think about all the different types of data we’re collectively trafficking in now as a global business community. Your own company might trade in one, or more than one, of the following:

  • Data on the condition and location of physical assets
  • Data from sensors on production floors or other facilities
  • Historical and real-time sales data
  • Data on customer demographics and social tendencies
  • Geospatial and geographical data from site surveys and customer studies
  • Data from order tracking, re-ordering and monitoring supply levels

The point is, modern commerce requires an almost ludicrous amount of data. If it doesn’t already, competitiveness in your industry will soon depend on your ability to mobilize more advanced technologies that help you derive meaning, intent, direction and insight from the data types listed above.

So we’re back to the quality of your data. It informs the business decisions you’re already making, so it must also inform the analytics, automation and AI tools you’ll need to compete in a leaner and more global economy.

Examples to Bring the Point Home

One case study from the global retail market shows why data quality is essential to machine learning.

The ultimate goal of this retail company was to achieve cost reductions and bolster efficiency by better managing its product throughput and inventory data. But before that could happen, the company needed to know the data it would be relying on would suit its needs. So it used machine learning to look for errors, omissions, duplicates and outliers. The machine learning algorithm ended up making about 30 percent of the company’s data more accurate, and therefore more actionable and useful, just by making small corrections.
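The case study doesn’t say which algorithm the retailer used, so the snippet below is not a reconstruction of it. It’s a simple statistical stand-in for the “outliers” part of that job: a modified z-score built on the median, which is a common convention for flagging suspicious values. The sales figures and the function name are invented for illustration.

```python
import statistics

def flag_outliers(values: list[float], threshold: float = 3.5) -> list[int]:
    """Return the indexes of values whose modified z-score exceeds threshold."""
    median = statistics.median(values)
    # Median absolute deviation: a spread measure that outliers barely move.
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - median) / mad > threshold
    ]

# Hypothetical daily unit sales; 140 looks like a keying error for 14.
daily_units = [12, 14, 13, 15, 11, 140, 13, 12]
print(flag_outliers(daily_units))  # prints [5]
```

A flagged value isn’t necessarily wrong. It’s a candidate for a person or a downstream rule to review, which is where the “small corrections” described above would come in.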

There are examples of AI tools in science and academia benefiting from higher-quality data, too. In statistics, combing through sets of data for errors is a huge, expensive and labor-intensive process. But machine learning has demonstrated significantly better results than human statisticians working by hand when it comes to “cleansing” huge sets of data for disqualifying errors or incompleteness.

In other words, it’s not just enterprise and commerce that benefit from the way machine learning powers AI development through better data and improved data processing techniques. The worlds of scientific, social and demographic inquiry should also find themselves with better tools in time, all thanks to higher-quality data.