Data Mining Fundamentals: Terms You Must Know

June 29, 2011

If anyone tells you there is a firm definition for the term “data mining,” that person is either misinformed or flat-out lying. While I’m pleased to report that I haven’t encountered many conscious liars in this realm, misinformation is very common, even among professionals who oughta know better. Just as common is the use of unnecessarily complex language, to the point that I have difficulty understanding many of my colleagues, so heaven help the novice. No wonder so many business people still think of analytics as fancy stuff they don’t need!

Analytics doesn’t have to be incomprehensible to be good. In fact, if you can’t understand it, it’s probably doing you no good at all. The best analysts communicate in plain business language!

Want a better understanding of data mining basics and a better ability to see through analytics mumbo jumbo? Start by getting a solid understanding of some of the most common terminology.

analysis

Any method, formal or informal, of summarizing data into a (comparatively) brief, informative form. (At least, it is informative in the view of the analyst.)

analytics

Another very general term, but unlike “analysis,” “analytics” always implies that there is math involved in the process. Some people use this term to refer to simple reports (usually data summarized as totals and averages of historical data), while others are referring strictly to more sophisticated analysis, such as inferential statistics.

data mining

This term came into widespread use in the 1990s to describe techniques and tools geared to enabling business users (people knowledgeable about their business, but not trained in statistical analysis) to independently identify meaningful patterns in data and develop predictive models, with only a moderate amount of training in the use of the data mining toolset. Distinctive elements associated with data mining include empowerment of business users, emphasis on visualization (graphs), and speed and simplicity of discovery and model development.

The vision of empowering business users has yet to become a widespread reality. Today, users of data mining tools usually have significant training or experience in traditional statistical analysis, and the tools have been expanded to offer a wide variety of sampling and traditional statistical modeling techniques.

Because of the rise in popularity of data mining, many vendors and analysts have taken to describing whatever they offer as “data mining.” What’s more, many analysts have taken to advising clients that data miners must have expertise in SQL, programming and/or a variety of other skills that are not necessary to obtaining useful business insight from data. If it’s not geared to speedy discovery of meaningful patterns and predictive models from business data, or if using the tool requires lengthy formal training in statistics, programming or anything else, it’s not data mining.

predictive analytics

Development and use of mathematical models that make predictions about specific events, such as whether an individual will buy a product or repay a loan. These predictions are usually in the form of probabilities. Both traditional statistics and data mining can be used for predictive analytics.
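To make “predictions in the form of probabilities” concrete, here is a minimal sketch of scoring a loan applicant with a logistic model. The function name, the two inputs and every coefficient are hypothetical, invented for illustration; a real model would estimate its coefficients from historical data.

```python
import math

# Hypothetical coefficients, invented for illustration only.
# A real model would estimate them from historical loan data.
INTERCEPT = -1.0
WEIGHT_INCOME = 0.04          # per thousand dollars of annual income
WEIGHT_LATE_PAYMENTS = -0.8   # per late payment on record


def repayment_probability(income_thousands, late_payments):
    """Return the modeled probability (between 0 and 1) that a loan is repaid."""
    score = (INTERCEPT
             + WEIGHT_INCOME * income_thousands
             + WEIGHT_LATE_PAYMENTS * late_payments)
    # The logistic function squeezes any score into the range 0 to 1.
    return 1.0 / (1.0 + math.exp(-score))


if __name__ == "__main__":
    for late in (0, 1, 3):
        p = repayment_probability(60, late)
        print(f"late payments={late}: probability of repayment={p:.2f}")
```

The point is the shape of the output: one number between 0 and 1 per individual, which the business can then act on.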

statistics

At the most basic level, “statistics” can refer to simple summaries, such as totals and averages. More sophisticated (and revealing) statistical analysis is based on testing hypotheses about data. This type of analysis is known as “inferential statistics” or “hypothesis testing.” Many formal procedures have been developed for inferential statistics to suit a wide variety of uses.

Libraries could be filled with the many and varied books written on statistical procedures. That said, most businesses that use inferential statistics make use of just a small selection of widely used procedures, and good explanations of these procedures can be found in most any current text used for introductory college statistics courses.
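The difference between simple summaries and inferential statistics can be sketched in a few lines. The order values below are made up, and the permutation test shown is just one of many possible procedures, chosen here because it needs nothing beyond the Python standard library. It asks: if the layout made no difference, how often would shuffling the group labels produce a gap in averages at least as large as the one observed?

```python
from itertools import combinations
from statistics import mean

# Made-up order values for two store layouts (illustrative data only).
layout_a = [20, 22, 19, 24, 25]
layout_b = [28, 30, 27, 26, 31]

# Descriptive statistics: simple summaries of what was observed.
print("total A:", sum(layout_a), "average A:", mean(layout_a))
print("total B:", sum(layout_b), "average B:", mean(layout_b))


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test for a difference in means."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = group_a + group_b
    total = sum(pooled)
    n_a, n_b = len(group_a), len(group_b)
    extreme = 0
    count = 0
    # Try every way of splitting the pooled values into two groups.
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for rounding
            extreme += 1
        count += 1
    return extreme / count


# Inferential statistics: test the hypothesis that layout makes no difference.
p_value = permutation_p_value(layout_a, layout_b)
print("p-value:", p_value)
```

A small p-value means a gap this large would be unusual if the layouts truly performed the same.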

text mining (or, text analysis)

This is data mining when the data is text, such as responses to open-ended survey questions, social media posts or comments on warranty claims. In this context, text is often described as “unstructured data.” Text mining is a developing field, not yet used by many businesses, and it is among the most challenging areas of data analysis today.

web analytics

Analytics based on data describing events occurring on the Internet, or some other similar network. In practice, most web analytics are simple summaries, counts of events such as page downloads, referrals from specific sites and so on. However, more sophisticated analytics can also be applied to web data. Inferential statistics applied to web page performance (in sales or some other desired behavior) is “A/B” or “multivariate” testing. Data mining techniques used to study user movement through a web site are “sequence analysis.”
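As a rough illustration of A/B testing, the sketch below compares the conversion rates of two page versions with a two-proportion z-test, one common choice among several. The function name and the visitor counts are hypothetical, and the normal approximation it relies on assumes reasonably large samples with at least some conversions.

```python
from math import sqrt
from statistics import NormalDist


def ab_test_p_value(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided two-proportion z-test using the normal approximation."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate: the conversion rate if both pages performed the same.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Made-up counts: page A converts 100 of 2000 visitors, page B 140 of 2000.
    p = ab_test_p_value(100, 2000, 140, 2000)
    print(f"p-value: {p:.4f}")
```

In plain terms: the smaller the p-value, the harder it is to explain the difference between the two pages as luck.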

Not fancy or incomprehensible, is it? And there is no reason why it should be.

©2011 Meta S. Brown