Introduction to Data Mining

Dear Data Mining Research readers, I wish you all an excellent year 2013! What better way to start the new year than with an introduction to data mining (for non-experts)? Enjoy!

Data alone is worth almost nothing. While the amount of data is increasing exponentially, people in some fields are “starving” for knowledge, and the gap between data and knowledge may be huge. These days, the word data is often confused with knowledge, yet knowledge is only obtained through the understanding of data. The amazing increase in data worldwide brings several challenges: the larger the amount of data, the more difficult it is to understand. It is sometimes assumed that knowledge increases in proportion to data. The reason for such an assertion might be a lack of appreciation of the difference between obtaining data and understanding it.

Data mining is the field concerned with understanding data. In other words, the aim is to look for patterns in data. As these patterns may be very difficult to find, data mining is sometimes compared to gold mining in rivers (see Figure): the gravel represents the enormous amount of data and the gold nuggets are the hidden patterns to find.

Data mining methods can be grouped into two main categories: supervised learning and unsupervised learning. Supervised learning can be seen as learning with a teacher who gives feedback on the learning task. This feedback takes the form of a training set, which consists of examples with both input and output values. It contrasts with the test set, which is the final set one wants to evaluate and which consists only of input values (the output is predicted). Patterns in data can be automatically identified, validated on existing data and then used for predictions on new data. In unsupervised learning, no feedback is given to the learning algorithm (i.e. there is no teacher). No output is known for the data set, so trends have to be inferred directly from the data itself.
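To make the difference between the two categories more concrete, here is a minimal sketch in plain Python. It is only an illustration: the data, the labels "small" and "large", and the function names nearest_neighbor_predict and two_means are all invented for this example, and the two methods (a nearest-neighbor rule and a simple two-group clustering) are just one possible choice among many.

```python
# Toy illustration only: data, labels and function names are hypothetical.

def nearest_neighbor_predict(training_set, x):
    """Supervised: predict the output for x from the closest labeled example."""
    _, closest_output = min(training_set, key=lambda pair: abs(pair[0] - x))
    return closest_output


def two_means(values, iterations=10):
    """Unsupervised: split 1-D values into two groups without any labels."""
    low, high = min(values), max(values)  # initial group centers
    if low == high:
        raise ValueError("need at least two distinct values")
    for _ in range(iterations):
        # Assign each value to the nearest center, then recompute the centers.
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return sorted(group_low), sorted(group_high)


# Training set: examples with both input and output values (the "teacher").
training_set = [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]

# Test set: input values only; the output has to be predicted.
test_set = [1.5, 8.5]

# Unlabeled data: no output is known, groups are inferred from the data.
unlabeled = [1.0, 2.0, 8.0, 9.0, 1.5, 8.5]

if __name__ == "__main__":
    print([nearest_neighbor_predict(training_set, x) for x in test_set])
    print(two_means(unlabeled))
```

In the supervised part, the predictions rely entirely on the outputs stored in the training set. In the unsupervised part, the two groups emerge from the values alone, and it is up to the analyst to decide what each group means.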
Several recent textbooks cover the data mining research area [1][2]. Data mining is usually applied to tasks such as the recognition of images, characters and speech. It has also been successfully applied in domains such as crime pattern detection, gene classification, email classification and collaborative filtering.

We would like to finish this article with a quote highlighting the bright future of data mining: “[…] as long as the world keeps producing data of all kinds […] at an ever increasing rate, the demand for data mining will continue to grow.” [3]

[1] Hand D., Mannila H. and Smyth P., Principles of Data Mining, MIT Press (2001)
[2] Tan P.-N., Steinbach M. and Kumar V., Introduction to Data Mining, Addison-Wesley (2006)
[3] Piatetsky-Shapiro G., Data mining and knowledge discovery 1996 to 2005: overcoming the hype and moving from “university” to “business” and “analytics”, Data Mining and Knowledge Discovery, 15(1):99-105 (2007)