SmartData Collective
© 2008-23 SmartData Collective. All Rights Reserved.

New Data Scientists Must Avoid these 4 Data Fallacies

Diana Hope
Last updated: 2020/11/21 at 11:11 PM

There are countless applications of machine learning in 2019. The demand for machine learning developers is growing at a rapid pace. MIT recently announced that it is committing $1 billion to a new program to educate technology professionals about machine learning and artificial intelligence. New academic programs are likely to be launched to focus on this rapidly growing field. Although there are many benefits of machine learning, there are also a lot of challenges. Developers must be aware of the numerous data fallacies that can tarnish the quality of their machine learning algorithms. Here are some of the most common, according to one company that offers machine learning services.

Contents
  • Cherry picking updated algorithm changes while conducting manual edits
  • Using a small data threshold for machine learning decisions
  • Defining machine learning algorithms without establishing the existence of the necessary data
  • Using data sets with dynamic column numbers and inconsistent structuring

Cherry picking updated algorithm changes while conducting manual edits

One of the main benefits of machine learning is that your algorithms can adapt on their own over time. You will still need to update them manually from time to time, and part of that process involves reviewing the changes that machine learning has already made. Be careful about which of those changes you keep and which you discard. Some of them reflect the preferences of your users, which might not be consistent with your own perspective. You don't want to eliminate those changes just so the system reflects your own perception. Always remember that your algorithms are supposed to reflect the needs and perspectives of your users. Substituting your preferences for theirs is entirely counterproductive.

Using a small data threshold for machine learning decisions

When you are developing machine learning algorithms, it can be tempting to program them to draw new insights from limited data sets. You use smaller data thresholds because you want the application to modify itself more quickly to bolster performance for users and meet other expectations. The consequences usually don't become apparent until later on. The problem is data dredging: when you search a small data set for patterns, the majority of the correlations you find are going to be due to chance. You need large amounts of data to draw accurate insights. Keep this in mind when defining the allowable limits for your machine learning algorithms.
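The effect of a small data threshold can be illustrated with a minimal sketch. It generates a random target and a set of purely random features, so any correlation it finds is due to chance by construction. The function names, the feature count, and the row counts are assumptions chosen for illustration, not values from any particular system.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_rows, n_features=200, seed=0):
    """Largest |r| between a random target and unrelated random features."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(feature, target)))
    return best


# Hypothetical thresholds: 10 rows versus 2,000 rows of the same noise.
small = strongest_chance_correlation(n_rows=10)
large = strongest_chance_correlation(n_rows=2000)
print(f"10 rows:   strongest chance correlation = {small:.2f}")
print(f"2000 rows: strongest chance correlation = {large:.2f}")
```

With only a handful of rows, at least one of the random features tends to look strongly related to the target, while the same search over thousands of rows finds only weak correlations. An algorithm allowed to act on the small sample would be acting on noise.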

Defining machine learning algorithms without establishing the existence of the necessary data

Setting unacceptably low data thresholds is a problem, as stated above. However, it is also possible to set unrealistically high ones. Before you begin setting the allowable limits for your machine learning applications, you need to make sure that collecting the necessary data is feasible in the first place. Establishing the availability of the data and the realistic hurdles that you will face in collecting it must be a priority. If you are predicating your machine learning algorithms on data that is nearly impossible to accumulate, then you are going to need to redefine your limits.
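One way to establish availability before committing to a limit is to count how many existing records actually contain every field the algorithm depends on. The sketch below is a minimal example under assumed names: `availability_report`, the field names, and the `min_complete_rows` threshold are all hypothetical and would need to match your own data.

```python
def availability_report(records, required_fields, min_complete_rows):
    """Count records that have a non-empty value for every required field."""
    complete = 0
    missing_by_field = {field: 0 for field in required_fields}
    for record in records:
        row_ok = True
        for field in required_fields:
            # Treat absent keys and empty strings as missing data.
            if record.get(field) in (None, ""):
                missing_by_field[field] += 1
                row_ok = False
        if row_ok:
            complete += 1
    return {
        "complete_rows": complete,
        "missing_by_field": missing_by_field,
        "feasible": complete >= min_complete_rows,
    }


# Hypothetical sample of collected records.
sample = [
    {"first_name": "Ada", "last_name": "Lovelace", "first_purchase": "2019-01-04"},
    {"first_name": "Alan", "last_name": "Turing", "first_purchase": ""},
    {"first_name": "Grace", "last_name": "Hopper"},
]
report = availability_report(
    sample,
    required_fields=["first_name", "last_name", "first_purchase"],
    min_complete_rows=2,
)
print(report)
```

If the report shows that too few complete records exist, or that one field is missing most of the time, that is a signal to redefine the limit before building the algorithm around it.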

Using data sets with dynamic column numbers and inconsistent structuring

Machine learning algorithms need to assimilate data from a variety of sources. They often collect data from .csv files and other sources that can be a wealth of valuable information. Although these data sources can be extremely useful for your algorithms, they are not without their drawbacks. One of the biggest concerns is that the data might not be consistently formatted. This is a frequent concern if you are trying to mine data from .csv files that numerous people have permission to edit. It is especially risky if the files are posted on Google Docs or another shared cloud storage platform without any access controls.

Here is an example of a situation where this can be a problem. You are building a machine learning algorithm around a file with 17 columns. The first column holds the user's address, the second holds the user's first name, the third holds the user's last name and the fourth holds the date of their first purchase. You develop a machine learning program that references the user's name and the date of purchase. Later, somebody else who has access to the file decides to get rid of the address column because they assume it is not relevant anymore. This causes all the other columns to shift to the left. When your program tries to reference the last name, it references the date of purchase instead, and it will keep doing so until the algorithm is rewritten.

Since machine learning algorithms learn from the data they see over time, this can have long-term consequences even after the original file is restored or the algorithm is rewritten. The moral of the story is to make sure that you reference data sources that are consistently structured.
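A common defense against shifting columns is to read fields by header name rather than by position and to refuse files that lack a required column. The sketch below uses Python's standard `csv.DictReader`; the column names, the `read_customers` helper, and the four-column sample files (a shortened stand-in for the 17-column file in the example) are assumptions for illustration.

```python
import csv
import io

# Hypothetical header names for the columns the program depends on.
REQUIRED = ("first_name", "last_name", "first_purchase")


def read_customers(csv_text):
    """Read required columns by name and fail loudly if any are missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED if col not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return [{col: row[col] for col in REQUIRED} for row in reader]


original = (
    "address,first_name,last_name,first_purchase\n"
    "1 Main St,Ada,Lovelace,2019-01-04\n"
)
# The same file after someone deleted the address column.
edited = "first_name,last_name,first_purchase\nAda,Lovelace,2019-01-04\n"

# Position-based access: index 2 was last_name, now it is the purchase date.
by_position = [line.split(",")[2] for line in edited.splitlines()[1:]]
print("third column by position:", by_position)

# Name-based access returns the same values before and after the edit.
print("by name:", read_customers(edited))
```

Reading by name does not help if someone renames or removes a column the program needs, which is why the helper raises an error in that case instead of silently training on the wrong field.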

TAGGED: data fallacies, Data Science, Data Scientist
Diana Hope March 22, 2019