The Difference Between ‘Knowledge Discovery’ and ‘Data Mining’

July 29, 2017

KDD (knowledge discovery in databases) is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. It consists of nine steps, from developing an understanding of the application domain to acting on the discovered knowledge. Data mining is only one of those steps (the seventh): the search for patterns of interest in a particular representational form or a set of such representations.

Areas where KDD is used: 

1. Astronomy: SKICAT, a system used by astronomers to analyze images and to classify and catalog the sky objects they contain.

2. Marketing: analyzing customer databases to identify different groups of customers and predict their behavior.

3. Investment and fraud detection: expert systems, neural networks and genetic algorithms are used to manage portfolios. HNC Falcon and Nestor PRISM monitor credit card fraud, and FAIS is used to identify financial transactions that could indicate money laundering activities.

4. Manufacturing: CASSIOPEE was applied by three large European airlines to diagnose and predict problems in the Boeing 737, using clustering to derive families of failures.

5. Telecommunications: TASA (telecommunications alarm-sequence analyzer) locates frequently occurring alarm episodes in the alarm stream and presents them as rules; it also provides tools for pruning, grouping and sorting those rules.

6. Data Cleaning: MERGE-PURGE was applied to identify duplicate claims for social assistance. In a separate area, sports, ADVANCED SCOUT is a specialized data mining system that helps NBA coaches organize and interpret data from NBA games.

7. Internet: FIREFLY is a personal music recommendation agent, CRAYON lets users create their own free newspaper, and Farcast automatically searches a wide variety of sources for information on behalf of the user.

What is a Data Warehouse and How It Sets the Stage for KDD

Data warehousing is the popular trend of collecting and cleaning transactional data to make them available for online analysis and decision support. Data warehousing helps set the stage for KDD in two important ways:

1. Data Cleaning: As organizations are forced to think about a unified logical view of the wide variety of data and databases they hold, they have to address mapping the data to a single naming convention, representing and handling missing data uniformly and, where possible, handling noise and errors.

2. Data Access: Uniform, well-defined methods must be created for accessing the data and for providing access paths to data that have historically been difficult to obtain (e.g., data stored offline).
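
As a rough illustration of the data cleaning point, the sketch below maps records from two hypothetical source systems onto one naming convention and fills missing values in a uniform way. The field names, the alias table and the fill strategy (mean of the known values) are assumptions made for this example, not features of any particular warehouse product.

```python
from statistics import mean

# Hypothetical alias table: maps source-specific field names to one convention.
FIELD_ALIASES = {"cust_name": "customer", "CustomerName": "customer",
                 "amt": "amount", "Amount": "amount"}

def unify(record):
    """Rename fields to the shared convention and normalize name spelling."""
    out = {}
    for key, value in record.items():
        key = FIELD_ALIASES.get(key, key)
        if key == "customer" and isinstance(value, str):
            value = " ".join(value.split()).title()  # collapse spaces, fix case
        out[key] = value
    return out

def fill_missing(records, field):
    """Replace None in `field` with the mean of the known values."""
    known = [r[field] for r in records if r.get(field) is not None]
    default = mean(known) if known else None
    return [{**r, field: r[field] if r.get(field) is not None else default}
            for r in records]

# Records as they might arrive from two differently designed systems.
raw = [
    {"cust_name": "  ana   LOPEZ", "amt": 10.0},
    {"CustomerName": "Ana Lopez", "Amount": None},
    {"CustomerName": "bob ray", "Amount": 30.0},
]
clean = fill_missing([unify(r) for r in raw], "amount")
```

After this step every record uses the same field names, the two spellings of the same customer agree, and the missing amount has been replaced by the mean of the known amounts.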

Defining OLAP 

OLAP (online analytical processing) is a solution used in the field of Business Intelligence that consists of queries against multidimensional structures containing summarized data from large databases or transactional systems. OLAP tools focus on providing multidimensional data analysis, which is superior to SQL for computing summaries and breakdowns along multiple dimensions. OLAP tools are geared towards simplifying and supporting interactive data analysis, whereas the goal of KDD tools is to automate as much of the process as possible.
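
To make the idea of summaries along multiple dimensions concrete, here is a minimal sketch of a roll-up over a tiny fact table. The rows, the dimension names and the `roll_up` helper are all hypothetical; real OLAP tools precompute and index such aggregates rather than scanning rows.

```python
from collections import defaultdict

# Hypothetical transactional rows: (region, product, quarter, sales).
rows = [
    ("North", "A", "Q1", 100), ("North", "A", "Q2", 150),
    ("North", "B", "Q1", 80),  ("South", "A", "Q1", 120),
    ("South", "B", "Q2", 60),
]
DIMENSIONS = ("region", "product", "quarter")

def roll_up(rows, keep):
    """Sum the sales measure over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)

by_region = roll_up(rows, ["region"])
by_region_quarter = roll_up(rows, ["region", "quarter"])
grand_total = roll_up(rows, [])
```

Choosing a different `keep` list corresponds to viewing the same data along different dimensions, which is the kind of interactive analysis OLAP tools are designed to support.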

KDD Process Stages 

1. Developing an understanding of the application domain and the relevant prior knowledge, and identifying the goal of the KDD process from the customer's perspective.

2. Creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.

3. Data cleaning and preprocessing. Basic operations include removing noise if appropriate, collecting the information necessary to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.

4. Data reduction and projection: finding useful features to represent the data depending on the goal of the task. With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.

5. Matching the goals of the KDD process (step 1) to a particular data mining method, for example summarization, classification, regression or clustering.

6. Exploratory analysis and model and hypothesis selection: choosing the data mining algorithm(s) and selecting the method or methods to be used in the search for patterns in the data. This process includes deciding which models and parameters may be appropriate (e.g., models for categorical data differ from models on vectors over the reals) and matching a particular data mining method to the overall criteria of the KDD process (for example, the end user might be more interested in understanding the model than in its predictive capabilities).

7. Data mining: searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression and clustering. The user can significantly aid the data mining method by properly carrying out the preceding steps.

8. Interpreting the mined patterns, possibly returning to any of steps 1 through 7 for further iteration. This step can also involve visualization of the extracted patterns and models, or visualization of the data given the extracted models.

9. Acting on the discovered knowledge: using the knowledge directly, incorporating it into another system for further action, or simply documenting it and reporting it to stakeholders. This step also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.
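
The flow from target data set to interpreted pattern can be sketched as a chain of small functions. This is only a toy illustration: the records, the field names, the median fill and the "split at the mean" pattern are assumptions chosen to keep the example short, and steps 1, 5, 6 and 9 (which involve human decisions) are not modeled.

```python
from statistics import mean, median

# Hypothetical raw records; None marks a missing field.
records = [
    {"id": 1, "age": 25, "income": 30.0, "note": "x"},
    {"id": 2, "age": 47, "income": None, "note": "y"},
    {"id": 3, "age": 52, "income": 82.0, "note": "z"},
    {"id": 4, "age": 23, "income": 28.0, "note": "w"},
]

def select(records, fields):             # step 2: create the target data set
    return [{f: r[f] for f in fields} for r in records]

def clean(rows, field):                  # step 3: handle missing values
    fill = median(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def project(rows, field):                # step 4: keep one useful feature
    return [r[field] for r in rows]

def mine(values):                        # step 7: search for a simple pattern
    cut = mean(values)
    return {"cut": cut,
            "low": [v for v in values if v < cut],
            "high": [v for v in values if v >= cut]}

def interpret(pattern):                  # step 8: make the pattern readable
    return (f"{len(pattern['low'])} records below and "
            f"{len(pattern['high'])} at or above {pattern['cut']:.1f}")

target = select(records, ["age", "income"])
pattern = mine(project(clean(target, "income"), "income"))
report = interpret(pattern)
```

In practice the chain is rarely run once from start to finish; as step 8 notes, an unconvincing `report` would send the analyst back to an earlier function with different choices.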

What is Data Mining 

Data mining is the step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data. Note that the space of patterns is often infinite, and the enumeration of patterns involves some form of search in that space.

Two primary mathematical formalisms are used in model fitting:

Statistical: allows for non-deterministic effects in the model.
Logical: purely deterministic.

Data Mining Methods 

Classification is learning a function that maps (classifies) a data item into one of several predefined classes.
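
As a minimal sketch of such a learned function, the code below fits a one-variable threshold rule (a "decision stump") to labeled examples. The loan data and the function names are hypothetical, and a single threshold is about the simplest classifier one could pick, not a recommended method.

```python
def fit_stump(xs, labels):
    """Find the threshold on one variable that misclassifies the fewest items.

    The resulting rule predicts class 1 when x > threshold, otherwise class 0.
    """
    points = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]

    def errors(t):
        return sum((x > t) != bool(y) for x, y in zip(xs, labels))

    return min(candidates, key=errors)

def classify(x, threshold):
    """Map a data item to one of the two predefined classes."""
    return 1 if x > threshold else 0

# Hypothetical training data: loan amounts (in thousands) and default flags.
amounts = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
defaulted = [0, 0, 0, 1, 1, 1]
threshold = fit_stump(amounts, defaulted)
```

Here `classify(x, threshold)` is the learned function: it assigns any new loan amount to one of the two classes.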

Regression is learning a function that maps a data item to a real-valued prediction variable.
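
A minimal sketch, assuming a single input variable and a straight-line model fitted by ordinary least squares (the data values and function names are made up for the illustration):

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(x, slope, intercept):
    """Map a data item to a real-valued prediction."""
    return slope * x + intercept

# Hypothetical observations that happen to lie on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```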

Clustering is a common descriptive task in which one seeks to identify a finite set of categories or clusters to describe the data.
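
One widely used way to find such clusters is k-means. The sketch below runs it on a single variable with fixed starting centers so the result is deterministic; the data, the starting centers and the iteration count are assumptions for the example, and k-means is only one of many clustering techniques.

```python
def k_means_1d(values, centers, iterations=10):
    """Lloyd's algorithm on one variable, starting from the given centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend of six customers.
spend = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = k_means_1d(spend, centers=[1.0, 2.0])
```

Unlike classification, no class labels are supplied: the two groups emerge from the data alone.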

Summarization involves methods for finding a compact description for a data set.

Dependency modeling consists of finding a model that describes significant dependencies between variables. Dependency models exist at two levels:
The structural level of the model specifies (often in graphical form) which variables are locally dependent on each other.
The quantitative level of the model specifies the strengths of the dependencies using some numerical scale.
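
The two levels can be illustrated with a deliberately simple stand-in: Pearson correlation as the numerical scale, and a cutoff on its absolute value to decide which pairs of variables count as dependent. Both choices are assumptions for this sketch; correlation only captures linear association, and real dependency models (such as probabilistic graphical models) are considerably richer.

```python
from itertools import combinations
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def dependency_model(data, threshold=0.5):
    """Return (structure, strengths) for every pair of variables in `data`."""
    # Quantitative level: a strength for each pair of variables.
    strengths = {(a, b): pearson(data[a], data[b])
                 for a, b in combinations(data, 2)}
    # Structural level: which pairs are considered dependent at all.
    structure = [pair for pair, r in strengths.items() if abs(r) >= threshold]
    return structure, strengths

# Hypothetical variables measured over five periods.
data = {
    "ads":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "sales": [2.0, 4.0, 6.0, 8.0, 10.0],
    "rain":  [1.0, 3.0, 2.0, 3.0, 1.0],
}
structure, strengths = dependency_model(data)
```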

Change and deviation detection focuses on discovering the most significant changes in the data from previously measured or normative values.
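
A minimal sketch of deviation detection, assuming the "normative values" are summarized by the mean and standard deviation of earlier measurements and that anything further than `k` standard deviations away counts as significant. The readings and the choice `k = 3` are illustrative only.

```python
from statistics import mean, stdev

def deviations(baseline, new_values, k=3.0):
    """Return new values more than k standard deviations from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    return [v for v in new_values if abs(v - mu) > k * sigma]

# Hypothetical earlier measurements and a batch of new readings.
baseline = [10.0, 12.0, 11.0, 9.0, 10.0, 11.0, 10.0, 9.0]
flagged = deviations(baseline, [10.5, 25.0, 8.0, -3.0])
```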

Components of the data mining algorithm

Model representation is the language used to describe discoverable patterns.

Model evaluation criteria are quantitative statements (or fit functions) of how well a particular pattern (a model and its parameters) meets the goals of the KDD process.

The search method consists of two components:
a) Parameter search
b) Model search

Once the model representation (or family of representations) and the model evaluation criteria are fixed, the data mining problem is reduced to an optimization task: find the parameters and models from the selected family that optimize the evaluation criterion.
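
The three components can be seen side by side in a small sketch. Here the representation is either a constant or a straight line, the evaluation criterion is mean squared error on held-out data, parameter search is a closed-form least-squares fit within each family, and model search picks the family with the best score. The data and all names are assumptions for the illustration.

```python
def fit_constant(xs, ys):
    """Parameter search for the family y = c."""
    c = sum(ys) / len(ys)
    return lambda x: c

def fit_linear(xs, ys):
    """Parameter search for the family y = a*x + b (least squares)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return lambda x: mean_y + slope * (x - mean_x)

def mse(model, xs, ys):
    """Evaluation criterion: mean squared prediction error."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def model_search(families, train, held_out):
    """Fit every family, score it on held-out data, and return the best."""
    scores = {name: mse(fit(*train), *held_out)
              for name, fit in families.items()}
    return min(scores, key=scores.get), scores

# Hypothetical data: (inputs, targets) for fitting and for evaluation.
train = ([0.0, 1.0, 2.0, 3.0], [1.0, 3.1, 4.9, 7.0])
held_out = ([4.0, 5.0], [9.0, 11.1])
best, scores = model_search({"constant": fit_constant, "linear": fit_linear},
                            train, held_out)
```

Scoring on data that was not used for fitting is one common way to keep the criterion from simply rewarding the most flexible family.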

Examples of Data Mining Methods
1. Decision trees and rules that use univariate splits have a simple representational form, making the inferred model relatively easy for the user to understand. However, the restriction to a particular tree or rule representation can significantly restrict the functional form (and thus the approximation power) of the model. If one enlarges the model space to allow more general expressions (such as multivariate hyperplanes at arbitrary angles), then the model is more powerful for prediction but can be much more difficult to understand. To a large extent, these methods depend on likelihood-based model evaluation, with varying degrees of sophistication in penalizing model complexity.
2. Nonlinear regression and classification methods consist of a family of predictive techniques that fit linear and nonlinear combinations of basis functions (sigmoids, splines, polynomials) to combinations of the input variables.
3. Example-based methods use representative examples from the database to approximate a model; that is, predictions for new examples are derived from the properties of similar examples whose prediction is known. Techniques include nearest-neighbor classification and regression algorithms and case-based reasoning systems. A potential disadvantage of example-based methods (compared with tree-based methods) is that they require a well-defined distance metric for evaluating the distance between data points.
4. Probabilistic graphical dependency models specify probabilistic dependencies using a graph structure. In its simplest form, the model specifies which variables are directly dependent on each other.
5. Relational learning models: whereas the representation of decision trees and rules is restricted to propositional logic, relational learning (also known as inductive logic programming) uses the more flexible pattern language of first-order logic.
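
The example-based methods of item 3 can be illustrated with a one-nearest-neighbor classifier. The stored examples and labels below are hypothetical, and Euclidean distance (`math.dist`) is assumed as the metric; as the item notes, the method is only as good as that choice of distance.

```python
from math import dist

def nearest_neighbor(query, examples):
    """Return the label of the stored example closest to `query` (1-NN).

    `examples` is a list of (point, label) pairs; `math.dist` supplies the
    Euclidean distance between two points.
    """
    point, label = min(examples, key=lambda ex: dist(query, ex[0]))
    return label

# Hypothetical stored examples: two numeric features and a known label.
examples = [((1.0, 1.0), "low risk"), ((1.5, 2.0), "low risk"),
            ((8.0, 9.0), "high risk"), ((9.0, 8.5), "high risk")]

first = nearest_neighbor((2.0, 1.5), examples)
second = nearest_neighbor((7.5, 8.0), examples)
```

No explicit model is built: the stored examples themselves play that role, and every prediction is a lookup of the most similar known case.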

Criteria for Applying Data Mining

Practical criteria: for KDD projects these are similar to those for other applications of advanced technology, and include:
1. Potential impact of the application.
2. Absence of simpler alternative solutions.
3. Strong organizational support for the use of the technology.
4. For applications that handle personal information, legal and privacy issues should also be considered.

Technical criteria, including considerations such as:
1. Availability of sufficient data.
2. Relevance of attributes.
3. Low noise levels.
4. Prior knowledge about the domain.

Research and Application Challenges 

Large databases: Databases with hundreds of tables and fields and millions of records are common.

High dimensionality: Not only can the database contain a large number of records, it can also contain a large number of fields (attributes, variables), so that the dimensionality of the problem is high.

Overfitting: When the algorithm searches for the best parameters for a particular model using a limited set of data, it can model not only the general patterns in the data but also any noise specific to that data set, resulting in poor performance of the model on test data.

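Determination of statistical significance: A problem (related to overfitting) occurs when the system searches over many possible models: the more models are tested, the more likely it is that some appear significant purely by chance.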
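
A small calculation shows the scale of the problem. The sketch assumes that every candidate model is tested at the same significance level `alpha`, that the data contain no real pattern, and that the tests are independent; the function names are made up for the illustration.

```python
def expected_false_positives(n_models, alpha):
    """Expected number of models accepted as significant on pure noise."""
    return n_models * alpha

def prob_at_least_one(n_models, alpha):
    """Chance that at least one of n independent tests passes by accident."""
    return 1 - (1 - alpha) ** n_models

# Testing 1000 candidate models at the 0.001 level on random data.
expected = expected_false_positives(1000, 0.001)
chance = prob_at_least_one(1000, 0.001)
```

Under these assumptions about one spurious model is expected per thousand tested, and the chance of seeing at least one is roughly 63 percent, which is why significance thresholds need to be adjusted for the size of the search.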

Changing data and knowledge: Rapidly changing data can invalidate previously discovered patterns. In addition, the variables measured in a given application database can be modified, deleted or augmented with new measurements over time.

Missing and noisy data: This problem is especially acute in business databases. Important attributes may be missing if the database was not designed with discovery in mind.

Complex relationships between fields: Hierarchically structured attributes or values, relationships between attributes, and more sophisticated ways of representing knowledge about the contents of the database will require algorithms that can effectively use such information.

Understandability of patterns: In many applications it is important to make the discoveries more understandable to humans.

User interaction and prior knowledge: Many current KDD methods and tools are not truly interactive and cannot easily incorporate prior knowledge about the problem except in simple ways. The use of domain knowledge is important in all steps of the KDD process.

Integration with other systems: A stand-alone discovery system might not be very useful. Typical integration issues include integration with a database management system (e.g., through a query interface), integration with spreadsheets and visualization tools, and accommodation of real-time sensor readings.