SmartData Collective
© 2008-23 SmartData Collective. All Rights Reserved.
Analytics · Best Practices · Data Mining · Predictive Analytics · Statistics

Earthquake Prediction Through Sunspots, Part II: Common Data Mining Mistakes!

Cristian Mesiano
Last updated: 2012/04/04 at 7:54 PM
7 Min Read

While I was writing the last post, I wondered how long it would take my followers to notice the mistakes I introduced into the experiments.

Let's start the treasure hunt!

1. Don't always trust your data: often they are not homogeneous.
In that post I related the quakes in the time range [~1800, 1999] to the corresponding sunspot distribution.

A good data miner must always check his dataset! You should always ask yourself whether the data have been produced in a consistent way.


Consider our example: the right question to ask before any further analysis is: "Has quake magnitude been measured with the same kind of technology over the whole period?"

I would assume it has not, but how can we check whether our data were produced in a different way over time?

In this case I suspected that, in the past, the technology was not accurate enough to measure feeble quakes, so I grouped the quakes by year and by smallest magnitude: as you can see, it is crystal clear that the data collected before 1965 were recorded differently from those of the following period.

The picture highlights that only major quakes (with magnitude > 6.5) were recorded before 1965.
This is the reason for the apparent increase in quakes!
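The check described above can be sketched in a few lines: track the smallest magnitude recorded in each year and look for a jump. The data layout and values here are hypothetical toys, not the original catalog:

```python
# Sketch: detect a change in recording practice by tracking the smallest
# magnitude recorded each year. Data are hypothetical toy values.
from collections import defaultdict

def min_magnitude_by_year(quakes):
    """quakes: iterable of (year, magnitude) tuples."""
    smallest = defaultdict(lambda: float("inf"))
    for year, mag in quakes:
        smallest[year] = min(smallest[year], mag)
    return dict(smallest)

# Toy data mimicking the pattern described above: before 1965 only
# major quakes (magnitude > 6.5) appear in the record.
sample = [(1950, 7.1), (1950, 6.8), (1960, 6.9),
          (1970, 4.2), (1970, 7.3), (1980, 3.8)]
print(min_magnitude_by_year(sample))
```

If the per-year minimum drops sharply after some date, the catalog is not homogeneous and the raw counts cannot be compared across that boundary.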

… In the former post I left a clue in the caption of the "quakes distribution" graph 🙂

In this case the best way to clean up the dataset is to keep only quakes with magnitude greater than 6.5.
Let me show you a different way to display the filtered data: the bubble chart.
The size of each bubble represents the number of quakes.
I love the bubble chart because it is a really nice way to plot 3D data in 2D!!
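The aggregation behind such a bubble chart can be sketched as follows; the threshold matches the 6.5 cut above, while the data and binning are hypothetical:

```python
# Sketch of the aggregation behind the bubble chart: after filtering to
# magnitude > 6.5, count quakes per (year, rounded magnitude) cell; the
# count drives the bubble size. Data are hypothetical.
from collections import Counter

def bubble_data(quakes, threshold=6.5):
    """quakes: iterable of (year, magnitude); returns {(year, mag_bin): count}."""
    return Counter((year, round(mag, 1))
                   for year, mag in quakes if mag > threshold)

sample = [(1970, 6.9), (1970, 6.94), (1970, 7.3), (1971, 6.2)]
cells = bubble_data(sample)
# With matplotlib one would then plot something like:
#   plt.scatter(years, mags, s=[count * 40 for count in counts])
# so that each (year, magnitude) cell becomes a bubble sized by its count.
```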
 
2. Sampling the data: are you sampling your data correctly?
In the former post I considered only the quakes registered in the USA.

Is that representative for the experiment we are doing?

The sunspots should affect the entire Earth's surface, so this phenomenon should produce the same effects everywhere.

…But as everybody knows, some regions are much more exposed to quakes than others, where the likelihood of a quake is very low.

So the right way to relate the two phenomena is to consider the worldwide distribution of quakes.
 
3. Don't rely on good results on the training set.
This is maybe the worst joke I played in the post 🙂 I showed you very good results obtained with the support vector regression model.

…Unfortunately I used the entire dataset as the training set, and I didn't check the model on a new dataset!

In a real scenario, this kind of mistake often generates false expectations in your customer.

The trained model I proposed seemed very helpful for explaining the dataset, but, as expected, it is not able to predict well :(.

How can you avoid the overfitting problem? The solution is not so trivial, but in principle I think that cross-validation techniques are a safe way to mitigate it.
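The simplest form of this discipline is a hold-out split: fit on the early years, then score the model only on years it never saw. A minimal sketch, using a toy series and a plain least-squares line as a stand-in for the original regressor (both are assumptions of mine):

```python
# Sketch of hold-out validation: fit on early years, score only on later
# years the model never saw. The straight-line regressor is a stand-in
# for the original model.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Toy series: a linear trend plus a bounded, zero-mean deterministic wiggle.
years = list(range(1900, 2000))
counts = [20 + 0.1 * (y - 1900) + ((y * 7) % 5 - 2) for y in years]

# Train on the first 80 years only; the last 20 stay unseen.
slope, intercept = fit_line(years[:80], counts[:80])
preds = [slope * y + intercept for y in years[80:]]
mean_err = sum(abs(p - c) for p, c in zip(preds, counts[80:])) / 20
```

The error reported on the unseen tail is the honest one; the error on the training years alone would flatter the model, which is exactly the trap described above. K-fold cross-validation repeats this split several times and averages the out-of-sample errors.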
 
Here is the new model:
The left graph shows the training set (in blue the number of quakes per year, in red the forecasting model).
The graph on the right shows the behavior of the forecasting model over a temporal range never seen before by the system. The mean error is +/-17 quakes per year.

The magnitude forecasting
(on the left the training set, on the right the behavior of the forecasting model over the test set).
The mean error is around +/-1.5 degrees of magnitude.
Considering the complexity of the problem, I think the regressor found works pretty well.

 
Just to get a better feeling for how good the regressor is, I smoothed the data with a median filter:
Moving median filtering applied to the magnitude regressor.
Looking at the above graph, it seems that the regressor is able to follow the overall behavior.

As you can see, such filtering gives a better understanding of the "goodness" of your models when the function is quite complex.
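A moving median is easy to implement by hand: slide a small window over the series and take the median of each window, which suppresses isolated spikes while preserving the overall trend. A minimal sketch (window size 3 is my choice for illustration, not the one used in the plot):

```python
# Sketch of a moving (running) median filter. The window shrinks at the
# edges of the series; for even-length windows we take the upper median.
def moving_median(values, window=3):
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = sorted(values[max(0, i - half): i + half + 1])
        out.append(chunk[len(chunk) // 2])
    return out

print(moving_median([5, 1, 9, 2, 8]))  # isolated spikes are suppressed
```

Unlike a moving average, the median is robust to outliers: a single spike cannot drag the smoothed value toward it.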

4. You found a good regressor, so the phenomenon has been explained: FALSE.
You could find whatever "link" between totally independent phenomena … but such a link is just a relation between input and output. Nothing more, nothing less.

As you know this is not the place for theorems, but let me give you a sort of empirical rule:
"The dependency among variables is inversely proportional to the complexity of the regressor."
 
As usual stay tuned.
Cristian

 
