Data Mining: A Closer Look
Data mining is the process of extracting patterns from data. It is a branch of artificial intelligence and computer science that businesses use to gain an informational advantage by transforming data into business intelligence. Related terms such as data fishing, data dredging and data snooping describe the use of data mining techniques on samples of a larger population that may be too small to yield reliable statistics. These techniques can, however, be used to create new hypotheses to test against the larger population.
Pre-Processing
Before any data mining algorithm can be used, a target data set must be assembled. This data set must be large enough to contain the patterns to be uncovered while remaining concise enough to be mined in a reasonable timeframe. The target data is then cleaned, which removes noise and observations with missing data. The cleaned data is reduced to one feature vector per observation, and the feature vectors are divided into a training set and a test set.
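The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the observations, the field names (age, income, bought) and the 40 percent test fraction are all hypothetical, and the cleaning step handles missing values only, not noise.

```python
import random

# Hypothetical raw observations; None marks a missing value.
raw = [
    {"age": 34, "income": 52000, "bought": 1},
    {"age": None, "income": 48000, "bought": 0},
    {"age": 45, "income": 61000, "bought": 1},
    {"age": 29, "income": None, "bought": 0},
    {"age": 51, "income": 75000, "bought": 1},
    {"age": 38, "income": 43000, "bought": 0},
    {"age": 23, "income": 39000, "bought": 0},
]


def clean(observations):
    """Drop every observation that has at least one missing value."""
    return [o for o in observations if all(v is not None for v in o.values())]


def to_feature_vector(observation):
    """Turn one observation into a (features, label) pair."""
    return ([observation["age"], observation["income"]], observation["bought"])


def train_test_split(vectors, test_fraction=0.4, seed=0):
    """Shuffle a copy of the vectors and split it into (training, test)."""
    shuffled = vectors[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


vectors = [to_feature_vector(o) for o in clean(raw)]
train, test = train_test_split(vectors)
print(len(vectors), "clean observations:", len(train), "training,", len(test), "test")
```

Fixing the seed keeps the split reproducible, which makes later results easier to compare.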
Association Rule Learning
Association rule learning is the process of searching for relationships between variables. It is sometimes referred to as market basket analysis. For example, a grocer may use data on shoppers' purchases to find which items are commonly bought together. This information can then be used for marketing purposes, such as promoting those items together.
- Generalized Association Rule Mining
- Machine Learning and Data Mining: Association Rule Mining
- Fast Algorithms for Mining Association Rules
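As a rough illustration of the grocer example in this section, the sketch below counts how often pairs of items appear in the same basket and keeps the rules that pass two commonly used measures, support and confidence. The baskets, the thresholds and the function name pair_rules are assumptions made for the example. It looks only at pairs of items and is not one of the fast algorithms named in the list above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]


def pair_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for every rule lhs -> rhs that passes."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of all baskets that contain both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # share of lhs baskets that also hold rhs
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


for lhs, rhs, support, confidence in pair_rules(baskets):
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Raising either threshold yields fewer but stronger rules. Lowering them yields more rules, some of which may be coincidental.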
Classification
Classification is the process of assigning new data to a specific category based on its shared characteristics. For instance, an email program classifies incoming emails as either spam or legitimate. Common classification algorithms include nearest neighbor, neural networks and decision tree learning. Other common algorithms include support vector machines and naïve Bayesian classification.
- Clustering and Classification: Data Mining Approaches
- Data Mining: What is Data Mining?
- An Introduction to Data Mining
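The spam example in this section can be sketched with the nearest neighbor approach. The two features used here (the number of links and the number of times the word "free" appears) and the labeled emails are hypothetical, and a real spam filter would rely on far more data and features.

```python
import math
from collections import Counter

# Hypothetical labeled emails: (number of links, occurrences of "free") and a label.
training = [
    ((8, 5), "spam"),
    ((6, 7), "spam"),
    ((7, 4), "spam"),
    ((0, 0), "legitimate"),
    ((1, 1), "legitimate"),
    ((2, 0), "legitimate"),
]


def classify(features, training, k=3):
    """Label new data by majority vote among its k nearest training examples."""
    nearest = sorted(training, key=lambda example: math.dist(features, example[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(classify((7, 6), training))
print(classify((1, 0), training))
```

Using an odd value of k with two categories avoids tied votes.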
Clustering
Clustering is another of the four tasks involved in data mining. It involves examining data to discover groups and structures within it. The members of each discovered group must be similar to one another in some way, and the groups are found without using structures that are already known in the data. Clustering is also referred to as cluster analysis.
- An Overview of Data Mining Techniques
- Data Mining Cluster Analysis: Basic Concepts and Algorithms
- Clustering Algorithm
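One widely used clustering technique is k-means, shown here as a minimal sketch rather than as the method this article has in mind. The points are hypothetical, and taking the first k points as the starting centers is a simplifying assumption that keeps the example deterministic.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters; return (centers, clusters)."""
    # Assumption: the first k points act as the starting centers.
    centers = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster
            else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical two-dimensional observations with no labels attached.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centers, clusters = k_means(points, k=2)
for center, cluster in zip(centers, clusters):
    print(center, cluster)
```

No categories are supplied in advance. The groups emerge only from how close the points are to one another, which is what separates clustering from classification.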
Regression
Regression is a data mining task that attempts to find the function that models the data with the least error. It is also referred to as regression analysis and includes any technique used to model and analyze several variables. It helps show how the value of one variable changes when another variable is varied while the remaining variables are held constant, or fixed. Regression is commonly used for forecasting and prediction.
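As a small illustration, the sketch below fits a straight line to data by ordinary least squares, one common form of regression, and uses it to make a forecast. The data values and the names fit_line and squared_error are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Return the (slope, intercept) that minimize the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def squared_error(slope, intercept, xs, ys):
    """Sum of squared differences between observed and modeled values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical observations, for example monthly ad spend (xs) and sales (ys).
xs = [1, 2, 3, 4, 5]
ys = [2.9, 5.2, 6.8, 9.1, 11.0]

slope, intercept = fit_line(xs, ys)
forecast = slope * 6 + intercept  # prediction for an unseen x value
print(slope, intercept, forecast, squared_error(slope, intercept, xs, ys))
```

Any other line through the same data gives a larger squared error, which is the sense in which the fitted model has the least error.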
Results Validation
Results validation is the final step of discovering knowledge from data. During this step, all of the patterns produced by the data mining algorithms are verified. Not all of the patterns found are valid, and this step separates the valid ones from the invalid ones. If the patterns do not meet the required standards, the pre-processing and data mining steps need to be reevaluated and changed.
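One common way to carry out this check, assumed here rather than prescribed by the article, is to measure each pattern against the test set that was held back during pre-processing. In the sketch below, the data, the two candidate rules and the 0.8 accuracy standard are all hypothetical.

```python
def accuracy(rule, examples):
    """Share of examples for which the rule predicts the recorded label."""
    correct = sum(1 for features, label in examples if rule(features) == label)
    return correct / len(examples)


def validate(rule, train, test, min_accuracy=0.8):
    """Accept a pattern only if it holds up on data not used to find it."""
    test_accuracy = accuracy(rule, test)
    return {
        "train_accuracy": accuracy(rule, train),
        "test_accuracy": test_accuracy,
        "valid": test_accuracy >= min_accuracy,
    }


# Hypothetical emails described by a single feature: the number of links.
train = [((8,), "spam"), ((6,), "spam"), ((0,), "legitimate"), ((1,), "legitimate")]
test = [((7,), "spam"), ((2,), "legitimate"), ((9,), "spam"), ((0,), "legitimate")]


def general_rule(features):
    """A pattern that generalizes: many links suggests spam."""
    return "spam" if features[0] > 3 else "legitimate"


memorized = dict(train)


def memorizing_rule(features):
    """A pattern that only memorizes the training data."""
    return memorized.get(features, "legitimate")


print(validate(general_rule, train, test))
print(validate(memorizing_rule, train, test))
```

Both rules fit the training data perfectly, but only the first meets the standard on the test set. The second is the kind of invalid pattern this step is meant to catch.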
