Sunday, 8 December 2013

INTRODUCTION TO DATA MINING

1.1  Data Mining
The term data mining is often used to refer to two separate processes: knowledge discovery and prediction. Knowledge discovery provides explicit information about the characteristics of the collected data, using a number of techniques (e.g., association rule mining).
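As a concrete illustration of the knowledge-discovery side, the following minimal sketch mines simple association rules from a toy set of market-basket transactions. The transactions and the support and confidence thresholds are invented for illustration only and do not come from the original text.

# Minimal association-rule sketch on toy basket data (hypothetical
# transactions; thresholds chosen only for illustration).
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

min_support = 0.5      # fraction of transactions containing an itemset
min_confidence = 0.6   # estimated P(consequent | antecedent)

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

items = set().union(*transactions)
# frequent pairs and the rules {a} -> {b} they suggest
for a, b in combinations(sorted(items), 2):
    pair_support = support({a, b})
    if pair_support >= min_support:
        confidence = pair_support / support({a})
        if confidence >= min_confidence:
            print(f"{{{a}}} -> {{{b}}}  support={pair_support:.2f}  confidence={confidence:.2f}")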


Fig. 1.1 describes the architecture of a data mining system.
Forecasting and predictive modeling provide predictions of future events, and the underlying processes may range from the transparent (e.g., rule-based approaches) to the opaque (e.g., neural networks). Metadata (data about the characteristics of a data set) are often expressed in a condensed, data-minable format, or in one that facilitates the practice of data mining; common examples include executive summaries and scientific abstracts. Such metadata can be used by data mining applications, either in a design phase or as part of their on-line operations.

Data analysis procedures can be dichotomized as either exploratory or confirmatory, based on the availability of appropriate models for the data source, but a key element in both types of procedures (whether for hypothesis formation or decision-making) is the grouping, or classification, of measurements based on either (i) goodness of fit to a postulated model, or (ii) natural groupings (clustering) revealed through analysis. Cluster analysis is the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity. Intuitively, patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster.

Data mining is the process of extracting hidden patterns from data. As more data is gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool for transforming this data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.
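To make the idea of cluster analysis concrete, here is a minimal k-means sketch: each pattern is a point in a two-dimensional space, and points are grouped around the nearest centroid. The data points, the number of clusters and the iteration count are assumptions chosen only for illustration.

# Minimal k-means sketch: group 2-D patterns into clusters by distance
# to cluster centroids (toy data, for illustration only).
points = [(1.0, 1.2), (0.8, 1.1), (1.1, 0.9),   # one natural grouping
          (5.0, 5.1), (5.2, 4.9), (4.8, 5.3)]   # another natural grouping
k = 2
centroids = [points[0], points[3]]  # simple deterministic initialization

def dist2(p, q):
    # squared Euclidean distance between two points
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

for _ in range(10):  # a few refinement iterations suffice for toy data
    # assignment step: attach each point to its nearest centroid
    clusters = [[] for _ in range(k)]
    for p in points:
        nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
        clusters[nearest].append(p)
    # update step: move each centroid to the mean of its cluster
    centroids = [
        (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
        for c in clusters
    ]

print(clusters)  # the two natural groupings emerge as separate clusters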
The large amount of data usually present in data mining tasks makes it possible to split the data file into three groups: training cases, validation cases and test cases. Training cases are used to build a model and estimate the necessary parameters. The validation data help to see whether the model obtained from one chosen sample generalizes to other data; in particular, they help to avoid the phenomenon of overfitting, in which iterative methods tend to produce models that fit the training data too closely: the data at hand are described perfectly, but generalization to other data yields unsatisfactory results. Not only can different estimates yield different models; usually several statistical methods or techniques are available for a given task, and the choice of method is left to the user. Test data can then be used to assess the various methods and to pick the one that does the best job in the long run.

Data mining can also be defined as the automatic process of discovering patterns in data. It is estimated that the amount of data doubles each year; data mining is therefore a practical tool for learning from data, and it often relies on statistical learning. Statistical learning is machine learning that incorporates statistical analysis techniques, where "machine" refers to the computer. A computer can make intelligent predictions after being trained on, or learning from, data. Computational intelligence is the study of designing intelligent algorithms or agents that make a computer behave intelligently. The goal is to understand the principles that allow the computer to solve problems intelligently, based on the hypothesis that reasoning is a process of computation. Computational intelligence differs from artificial intelligence in that it is not considered merely artificial; it exists in the reasoning process during computation.
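A minimal sketch of the three-way split described above is given below: two candidate models are fitted on training data, the better one is chosen on validation data, and its error is reported on the held-out test data. The synthetic data, the two candidate models and the split sizes are assumptions made purely for illustration.

# Sketch of training / validation / test usage on synthetic data
# (all numbers and model choices are illustrative assumptions).
import random

random.seed(1)
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.5)) for x in range(30)]
random.shuffle(data)
train, valid, test = data[:18], data[18:24], data[24:]

def fit_mean(pairs):                      # candidate 1: constant prediction
    m = sum(y for _, y in pairs) / len(pairs)
    return lambda x: m

def fit_line(pairs):                      # candidate 2: least-squares line
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    b = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x, _ in pairs)
    a = my - b * mx
    return lambda x: a + b * x

def mse(model, pairs):
    # mean squared prediction error on a set of (x, y) pairs
    return sum((model(x) - y) ** 2 for x, y in pairs) / len(pairs)

candidates = {"mean": fit_mean(train), "line": fit_line(train)}
best = min(candidates, key=lambda name: mse(candidates[name], valid))
print("chosen on validation:", best, " test MSE:", round(mse(candidates[best], test), 3))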

A primary reason for using data mining is to assist in the analysis of collections of observations of behaviour. Such data are vulnerable to collinearity because of unknown interrelations between the variables. An unavoidable fact of data mining is that the (sub)set(s) of data being analysed may not be representative of the whole domain, and therefore may not contain examples of certain critical relationships and behaviours that exist across other parts of the domain. To address this sort of issue, the analysis may be augmented with experiment-based and other mining approaches, such as Choice Modelling for human-generated data. In these situations, inherent correlations can be either controlled for, or removed altogether, during the construction of the experimental design.
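One simple way to spot the kind of collinearity mentioned above is to compute the correlation between pairs of observed variables; the following sketch does this with a Pearson coefficient on toy values chosen only for illustration.

# Sketch of a basic collinearity check: a Pearson correlation near +/-1
# between two measured variables suggests they carry overlapping information
# (toy values, for illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]   # nearly a linear function of xs

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
print("Pearson r =", round(cov / (sx * sy), 3))   # close to 1 -> collinear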
