Sunday, November 2, 2014

Methods in Data Mining

Classification


In data mining, classification is perhaps the most widely used method for problem solving. Classification studies the patterns in historical data (a set of information, such as characteristics, variables, or features, describing items that have already been labeled) in order to place new instances (objects) into groups or classes. For example, classification can be used to predict weather, fraud, communication outcomes, and other class-labeled conditions. However, when the value to be predicted is numerical, the task is called regression rather than classification.


Predicting class labels typically involves two common steps: model development and model testing. In the model development phase, a set of input data that includes the actual class labels is used to build the model. After the model is developed, it is tested on a separate sample of data before being put into use.



In most classification problems, the main source for estimating accuracy is the confusion matrix.





The numbers on the diagonal from top left to bottom right are the correct predictions, and the numbers outside this diagonal are the incorrect predictions.
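As an illustration, here is a minimal sketch in Python that computes accuracy from a small two-class confusion matrix; the counts are made up purely for demonstration, and the correct predictions lie on the main diagonal.

# Hypothetical 2x2 confusion matrix: rows = actual class, columns = predicted class
# (the counts below are invented purely for illustration)
confusion = [
    [50, 10],   # actual positive: 50 predicted positive, 10 predicted negative
    [5,  35],   # actual negative: 5 predicted positive, 35 predicted negative
]

correct = sum(confusion[i][i] for i in range(len(confusion)))   # diagonal entries
total = sum(sum(row) for row in confusion)                      # all predictions
accuracy = correct / total

print("Accuracy:", accuracy)   # 85 / 100 = 0.85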


There are several popular methodologies for building and estimating the accuracy of classification models.


Simple split divides the data into two mutually exclusive subsets called the training set and the test set. A common practice is to use two-thirds of the data as the training set and the remaining one-third as the test set. The training data is given to the 'inducer' (model builder), and the classifier that has been built is then tested on the test set.
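Below is a minimal sketch of a simple split in Python using scikit-learn. It assumes a feature matrix X and a label vector y; a toy dataset is loaded here only so the example is runnable.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy data just for illustration; in practice X and y come from your own dataset
X, y = load_iris(return_X_y=True)

# Two-thirds for training, one-third held out as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

# The 'inducer' builds the classifier on the training set only
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# The classifier is then evaluated on the unseen test set
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))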





K-fold cross-validation is a method in which the whole dataset is divided randomly into k folds. The classification model is then developed and tested k times: in each round, the model is trained on all folds except one, and the fold that was left out is used for testing. The cross-validation estimate of the overall accuracy of the model is calculated by averaging the k individual accuracy results, as shown in the formula below.

CVA = (1/k) * (A1 + A2 + ... + Ak)

where CVA is the cross-validation accuracy, k is the number of folds used, and Ai is the accuracy measured on the i-th fold.
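A minimal sketch of k-fold cross-validation in Python with scikit-learn, computing CVA as the mean of the per-fold accuracies. The dataset and the choice of k = 10 are assumptions made only for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)   # toy data for illustration
k = 10                              # number of folds (an assumption)

kf = KFold(n_splits=k, shuffle=True, random_state=42)
fold_accuracies = []

for train_idx, test_idx in kf.split(X):
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X[train_idx], y[train_idx])      # train on k-1 folds
    preds = model.predict(X[test_idx])         # test on the held-out fold
    fold_accuracies.append(accuracy_score(y[test_idx], preds))

# CVA = (1/k) * sum of the k individual accuracies
cva = sum(fold_accuracies) / k
print("Cross-validation accuracy:", cva)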



Decision tree

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents an outcome of the test, and each leaf node represents a class label (the decision taken after evaluating the attributes along the path). The paths from root to leaf represent classification rules.
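As an illustration, here is a small sketch in Python that fits a decision tree with scikit-learn and prints its root-to-leaf structure as text rules. The dataset is a stand-in, and export_text is available in recent scikit-learn versions.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()   # toy data just for illustration
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(data.data, data.target)

# Each printed path from the root to a leaf corresponds to one classification rule
print(export_text(tree, feature_names=list(data.feature_names)))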



When building a decision tree, the objective at each node is to determine the attribute, and the split point on that attribute, that divides the data in the way that best separates the training classes at that node. To evaluate how good a split is, two commonly used measures are the Gini index and information gain.



The Gini index is already used in economics to measure the inequality of a population. The same concept can be used to measure the purity of the classes produced by splitting on a particular attribute or variable. The best split is the one that most increases the purity of the subsets generated by the proposed split.

 


Here is the formula of the Gini index:

Gini(S) = 1 - (p1^2 + p2^2 + ... + pc^2)

where pj is the proportion of records in S that belong to class j, and c is the number of classes. A pure subset has Gini index 0.
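A short sketch in Python of the Gini index of a set of class labels, following the formula above; the example labels are made up.

from collections import Counter

def gini_index(labels):
    # Gini index: 1 minus the sum of squared class proportions
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_index(["yes"] * 10))               # pure subset -> 0.0
print(gini_index(["yes"] * 5 + ["no"] * 5))   # maximally mixed (2 classes) -> 0.5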



Information gain is the splitting measure used in what is probably the best-known decision tree algorithm. It uses 'entropy' in place of the Gini index. Entropy measures the degree of uncertainty or randomness in a dataset. If all the records in a subset belong to a single class, there is no uncertainty or randomness in that subset, so its entropy is zero. The aim of this approach is to build the sub-trees so that the entropy of each final subset is zero (or close to zero).





Here is the formula of information gain:

Entropy(S) = - sum over classes j of pj * log2(pj)

Gain(S, A) = Entropy(S) - sum over values v of A of (|Sv| / |S|) * Entropy(Sv)

where pj is the proportion of records in S that belong to class j, and Sv is the subset of S for which attribute A takes the value v.
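A short sketch in Python of entropy and information gain, following the formulas above; the attribute and label values are invented purely for illustration.

import math
from collections import Counter

def entropy(labels):
    # Entropy: minus the sum of p * log2(p) over the class proportions
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    # Reduction in entropy from splitting the labels by an attribute's values
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Made-up example: an 'outlook' attribute against a play/no-play label
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
play    = ["no",    "no",    "yes",  "yes",  "yes",      "yes"]
print(information_gain(play, outlook))   # all subsets are pure, so gain = entropy(play)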
