New Data Scientists Must Avoid These 4 Data Fallacies


March 25, 2019

There are countless applications of machine learning in 2019. The demand for machine learning developers is growing at a rapid pace. MIT recently announced that it is committing $1 billion to a new program to educate technology professionals about machine learning and artificial intelligence. New academic programs are likely to be launched to focus on this rapidly growing field.

Although there are many benefits of machine learning, there are also a lot of challenges. Developers must be aware of the numerous data fallacies that can tarnish the quality of their machine learning algorithms. Here are some of the most common, according to one company that offers machine learning services.

Cherry-picking updated algorithm changes while conducting manual edits

One of the main benefits of machine learning is that you can rely on your algorithms to adapt on their own over time. However, you will still need to update them manually from time to time. Part of this process entails reviewing the changes that the machine learning itself introduced.

You need to be careful about reversing these changes. Some of them will reflect the preferences of your users, which might not be consistent with your own perspective. You don’t want to eliminate such changes just to make the system match your own perception while keeping only the ones you happen to agree with. Always remember that your algorithms are supposed to reflect the needs and perspectives of your users. Substituting your own preferences for theirs is entirely counterproductive.

Using a small data threshold for machine learning decisions

When you are developing machine learning algorithms, it can be tempting to program them to formulate new insights from limited data sets. The motivation is understandable: you use smaller data thresholds because you want the application to modify itself more quickly to improve user performance and meet other expectations. Of course, you won’t realize that unreliable insights are the likely outcome until later on.

The problem is data dredging. When you search a small data set for patterns, the majority of the correlations you find are going to be due to chance. You need large amounts of data to separate real relationships from noise and draw accurate insights. Keep this in mind when defining the allowable limits for your machine learning algorithms.
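The effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names, the 200 random features, and the 0.5 correlation cutoff are all chosen for illustration and do not come from any particular system. Every feature is pure noise, so any "correlation" the code finds is due to chance.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_chance_correlations(n_rows, n_features=200, cutoff=0.5, seed=0):
    """Count unrelated random features whose |r| with a random target exceeds cutoff.

    Hypothetical helper: n_features and cutoff are illustrative assumptions.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    hits = 0
    for _ in range(n_features):
        # Each feature is generated independently of the target.
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        if abs(pearson(feature, target)) > cutoff:
            hits += 1
    return hits


small = count_chance_correlations(n_rows=10)
large = count_chance_correlations(n_rows=1000)
print(f"10 rows:   {small} of 200 noise features look correlated")
print(f"1000 rows: {large} of 200 noise features look correlated")
```

With only 10 rows, a noticeable share of the noise features clears the cutoff, while with 1,000 rows essentially none do. An algorithm that acts on the small sample would be adapting to patterns that are not really there.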

Defining machine learning algorithms without establishing the existence of the necessary data

Setting unacceptably low data thresholds is a problem, as stated above. However, it is also possible to use unrealistically high standards.

Before you begin setting the allowable limits for your machine learning applications, you need to make sure that collecting the necessary data is feasible in the first place. Establishing the availability of the data, and the realistic hurdles that you must overcome to collect it, must be a priority. If you are predicating your machine learning algorithms on data that is nearly impossible to accumulate, then you are going to need to redefine your limits.

Using data sets with dynamic column numbers and inconsistent structuring

Machine learning algorithms are going to need to assimilate data from a variety of sources. They are often going to need to collect data from .csv files and other sources that can hold a wealth of valuable information.

Although these data sources can be extremely useful for your algorithms, they are not without their drawbacks. One of the biggest concerns is that data might not be consistently formatted.

This is a frequent concern if you are trying to mine data from .csv files that numerous people have permission to edit. It is especially risky if they are posted on Google Docs or another cloud storage platform without any access controls.

Here is an example of a situation where this can be a problem. You are building a machine learning algorithm around a file with 17 columns. The first column references a user address, the second references the user’s first name, the third references the user’s last name and the fourth column corresponds to the date of their first purchase.

You develop a machine learning program that tries to reference their name and the date of purchase. However, in the process, somebody else who has access to the file decides to get rid of the column with their address in it. They assume that column is not relevant anymore. The issue is that this causes all the other columns to shift to the left. When you are trying to reference the last name, you are instead referencing the date of purchase until the algorithm is rewritten. Since machine learning algorithms keep learning from the data they are fed, this can have long-term consequences even after the original file is restored or the algorithm is rewritten.

The moral of the story is to make sure that you reference data sources that are consistently structured.
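One way to guard against the shifted-column problem described above is to reference columns by header name instead of by position, and to refuse to proceed when an expected column is missing. The following is a minimal sketch using Python's standard `csv` module. The column names (`first_name`, `last_name`, `first_purchase_date`), the `load_rows` helper, and the sample rows are hypothetical, and the sketch assumes the file has a header row.

```python
import csv
import io

# Hypothetical column names; the example above only describes what the columns hold.
REQUIRED_COLUMNS = {"first_name", "last_name", "first_purchase_date"}


def load_rows(csv_file):
    """Read rows by column name and fail loudly if a required column is missing."""
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Keep only the columns the model depends on, looked up by name.
    return [{name: row[name] for name in REQUIRED_COLUMNS} for row in reader]


# Made-up sample data: the same record before and after the address column is deleted.
original = (
    "address,first_name,last_name,first_purchase_date\n"
    "1 Main St,Ada,Lovelace,2019-01-15\n"
)
edited = (
    "first_name,last_name,first_purchase_date\n"
    "Ada,Lovelace,2019-01-15\n"
)

rows_original = load_rows(io.StringIO(original))
rows_edited = load_rows(io.StringIO(edited))
print(rows_original == rows_edited)
```

Because the lookup is by name, deleting the address column does not change what the program reads. Had someone deleted a column the program actually depends on, `load_rows` would raise an error instead of silently feeding the wrong values to the algorithm.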