Data Mining: Tools and Certificates

July 10, 2012

As a member of many LinkedIn groups on data mining and text mining, I read many threads about certificates that are supposed to help with job hunting and with strengthening a CV, and many other threads about miraculous tools able to solve any problem.

Is being certified really worth it?
In my experience, a certificate in a specific data mining tool can be a plus on your CV, but it doesn’t really improve your knowledge of the field.
Let me explain better (which is not easy with my bad English): the certificate system is a market, and its aim is to generate profit or to promote products.

Data mining tools
My question is: do you really think a tool can exist that covers every aspect of data mining?

I guess the number of data mining problems is so high that maybe we could use Cantor’s diagonalization to prove that they are uncountable 🙂

In my opinion, the common belief that you can get tangible benefits from mining your data just by clicking here and there in a piece of software is too naive.

The term “data mining” was created by the marketing industry just to sum up in a buzzword the techniques of applied statistics and applied mathematics used on the data stored on your hard disk.
I don’t want to say that tools are useless, but it should be clear that tools are only a means of solving a problem, not the solution.

  • In the real world, problems are never standard, and only very seldom can you take an algorithm as is to solve them! …maybe I’m unlucky, but I have never solved a real problem with a standard method.
  • Tool X is able to load terabytes of data. So what? A good data miner should know that you cannot consider the entire population when analyzing a phenomenon; you should be able to sample your population so as to ensure the required margin of accuracy! … this technique is simply called statistics!
  • If you really want to claim “I know this approach very well”, you must be able to implement it yourself: only by implementing it yourself can you deeply understand in which context the algorithm works, under which conditions it performs better than other methods, and so on. Don’t rely on a single paper that compares a few techniques: if you change just one of the conditions, the results can be terribly different.
  • Without theory you cannot go deep: consider a tool such as Mathematica or R or … These tools give the user access to a large library of predefined algorithms and routines, they provide visualization functions to show results in a fancy way, and last but not least they provide a complete programming language to code whatever you want. I love them, but I couldn’t do anything without the theory behind the problem. Mathematica can give me the algorithm to cluster a data set with k-means: but how can I be sure that it is the right algorithm for my problem? (click here to have a demo).
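
To make the sampling point above a bit more concrete, here is a minimal sketch of how the sample size follows from the required margin of error. It is not taken from any particular tool: it assumes simple random sampling and the estimation of a proportion, and it uses the textbook formula n0 = z² · p · (1 − p) / e² with the optional finite population correction. The function name `sample_size` and its parameters are only illustrative.

```python
import math
from statistics import NormalDist


def sample_size(margin_of_error, confidence=0.95, proportion=0.5, population=None):
    """Sample size needed to estimate a proportion (illustrative helper).

    Assumes simple random sampling. Uses n0 = z^2 * p * (1 - p) / e^2 and,
    if a population size N is given, the finite population correction
    n = n0 / (1 + (n0 - 1) / N).
    """
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")

    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

    # proportion=0.5 is the worst case: it maximizes p * (1 - p).
    n0 = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2

    if population is not None:
        # Finite population correction: small populations need fewer records.
        n0 = n0 / (1 + (n0 - 1) / population)

    return math.ceil(n0)


if __name__ == "__main__":
    # Same 3% margin at 95% confidence, very different population sizes.
    for population in (10_000, 1_000_000, 1_000_000_000):
        print(population, sample_size(0.03, population=population))
```

Under these assumptions, a 3% margin at 95% confidence needs roughly a thousand records whether the population holds ten thousand rows or a billion, which is why the ability to load terabytes is not, by itself, an argument.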
Actually, I would rather attend a course to deepen some aspects of multivariate statistics, or seminars on new methodologies for solving some problem, than pay plenty of money to learn every single detail of a tool that may no longer be on the market in five years.
I know that companies often look for people certified on a famous tool just because they bought it and need to reduce the time it takes to “integrate” a new resource into a team. Fair enough! …but I think it is ridiculous to make certificates a strict requirement!
I’m really curious to know your experiences and opinions.
cristian