Wednesday, November 5, 2014

Processes in Data Mining


To carry out data mining projects systematically, a generally applicable process is usually followed. Based on best practice, DM practitioners and researchers have proposed several processes (workflows or simple step-by-step approaches) to increase the chances of success in carrying out DM projects. These efforts eventually produced several processes that serve as standards, and the most popular of them are discussed in this section.



One process that has come to be used as a standard, and is arguably the most popular, is the Cross-Industry Standard Process for Data Mining (CRISP-DM), proposed in the mid-1990s by a consortium of European companies as a non-proprietary standard methodology for DM (CRISP-DM, 2009). The figure below illustrates the proposed process: a sequence of six steps that starts with a solid understanding of the business and the need for the DM project, and ends with the deployment of a solution that satisfies the specific business need.


Although these steps are essentially sequential, in practice there is a great deal of backtracking. Because DM is driven by experience and experimentation, and depends on the situation at hand, the problem, and the knowledge and experience of the analyst, the whole process can be highly iterative and time-consuming.

Step 1: Business Understanding

A key element of any DM study is to know precisely what the study is for. To answer this question, you should start with a thorough understanding of the managerial need for new knowledge and an explicit specification of the business objectives of the study. At this very early stage, the budget to support the study should also be specified.

Step 2: Data Understanding

A DM study addresses a specific, well-defined business task, and different business tasks require different datasets. Once the business is understood, the next main activity of the DM process is to identify the relevant data from the various existing databases. Several key points must be considered in the data identification and selection phase. First and foremost, the analyst must be clear and concise about the description of the DM task so that the relevant data can be identified. Next, the analyst must build a deep understanding of the various data sources and of the variables they contain.

To better understand the data, the analyst often uses a variety of statistical and graphical techniques, such as simple statistical summaries; a small sketch of this is shown below. Careful identification and selection of the relevant data sources and variables makes it easier for the DM algorithms to quickly discover genuinely useful knowledge patterns. The data sources for the selection process can vary. Typically, data sources for business applications include demographic data, sociographic data, transaction data, and so on.
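As a rough illustration, the following sketch uses pandas to produce simple statistical summaries of a dataset. The file name customers.csv and its columns are hypothetical stand-ins, not from any particular project.

```python
# A minimal sketch of simple statistical summaries with pandas.
# "customers.csv" is a hypothetical file standing in for real project data.
import pandas as pd

df = pd.read_csv("customers.csv")

# Numeric variables: count, mean, standard deviation, min, quartiles, max.
print(df.describe())

# Categorical variables: number of unique values and the most frequent one.
print(df.describe(include="object"))
```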

Data can be categorized as quantitative or qualitative. Quantitative data is measured on numerical values. Qualitative data, also called categorical data, includes nominal and ordinal data.
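To make the distinction concrete, here is a small sketch of how these data types might be represented in pandas; the variable names are invented for illustration.

```python
# Quantitative vs. qualitative (categorical) data, sketched in pandas.
import pandas as pd

# Quantitative: measured on a numerical scale.
income = pd.Series([42000, 55000, 61000])

# Nominal categorical: labels with no inherent order.
region = pd.Series(pd.Categorical(["north", "south", "east"]))

# Ordinal categorical: labels with a meaningful order.
risk = pd.Series(pd.Categorical(
    ["low", "high", "medium"],
    categories=["low", "medium", "high"],
    ordered=True,
))
print(risk.min(), risk.max())  # -> low high
```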


Step 3: Data Preparation

The purpose of data preparation (better known as data pre-processing) is to take the data identified in the previous stage and prepare it for analysis with DM methods. Compared with the other phases of CRISP-DM, data pre-processing consumes the most time and effort; most practitioners believe this phase accounts for roughly 80 percent of the total time spent on a DM project.


The picture below shows the four main steps required to convert raw data into a dataset that can actually be mined.




In the first phase of data pre-processing, the relevant data is collected from the various sources identified earlier (this was accomplished in the previous Data Understanding step of CRISP-DM), the required records and variables are selected (based on a deep understanding of the data, the parts that are not needed are removed), and data coming from multiple sources is consolidated (again relying on a deep understanding of the data, so that synonyms and homonyms are handled correctly).
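As a sketch of what this consolidation can look like in practice, the fragment below merges two small, invented tables on a shared key; the table and column names are illustrative only.

```python
# A minimal sketch of data consolidation: joining records from two
# hypothetical sources on a shared customer key.
import pandas as pd

demographics = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 51, 27],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": [120.0, 80.5, 42.0],
})

# Aggregate transactions per customer, then join with the demographics.
spend = transactions.groupby("customer_id", as_index=False)["amount"].sum()
merged = demographics.merge(spend, on="customer_id", how="left")
print(merged)
```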

In the second phase of data pre-processing, the data is cleaned. At this stage, the values in the dataset are identified and examined. In some cases, empty or missing values are anomalies in the dataset and need to be filled in; in other cases, an empty or missing value is genuinely part of the dataset (for example, the 'household income' field is often deliberately left blank by people with high incomes). At this stage the analyst should also identify noisy or erroneous values.
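The fragment below sketches both situations under invented column names: imputing a missing value that is an anomaly, and flagging a missing value that carries meaning (such as the household-income example above).

```python
# A minimal sketch of data cleaning: inspecting and handling missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, np.nan, 27, 45],
    "household_income": [52000, 61000, np.nan, np.nan],
})

# How many values are missing per column?
print(df.isna().sum())

# Case 1: the gap is an anomaly, so fill it (here with the column median).
df["age"] = df["age"].fillna(df["age"].median())

# Case 2: the gap may be meaningful (income left blank on purpose),
# so keep it and record its presence as an extra indicator variable.
df["income_missing"] = df["household_income"].isna()
print(df)
```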

In the third phase of data pre-processing, the data is transformed for better processing. For example, in many cases the data is normalized between a minimum and a maximum value for all variables, in order to reduce the potential bias of one variable whose values are far larger than those of the other variables. Another common transformation is aggregation. In some cases, numerical variables are converted into categories (e.g., 'low', 'medium', 'high'); in other cases, the range of unique values of a nominal variable is reduced to a smaller set by using concept hierarchies, with the goal of producing a dataset that is more suitable for processing.
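A small sketch of the two transformations just mentioned, min-max normalization and discretization into categories, is given below; the cut-off values are invented for illustration.

```python
# A minimal sketch of data transformation: min-max normalization and
# discretizing a numeric variable into ordered categories.
import pandas as pd

df = pd.DataFrame({"income": [18000, 42000, 95000, 250000]})

# Min-max normalization: rescale to [0, 1] so a variable with large
# values cannot dominate variables with small values.
lo, hi = df["income"].min(), df["income"].max()
df["income_scaled"] = (df["income"] - lo) / (hi - lo)

# Discretization: map the numeric range onto 'low'/'medium'/'high'.
df["income_band"] = pd.cut(
    df["income"],
    bins=[0, 30000, 100000, float("inf")],
    labels=["low", "medium", "high"],
)
print(df)
```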

The last phase of data pre-processing is data reduction. Although data miners like to have large datasets, too much data can also be a problem. In the simplest sense, one can visualize the data used in a DM project as a flat file with two dimensions: variables (the number of columns) and cases or records (the number of rows).

In some cases the number of variables can be dauntingly large, and the analyst must reduce it to a manageable size. Because variables are treated as different dimensions that describe the phenomenon from different perspectives, this part of the DM process is usually called dimensionality reduction.

Regarding the other dimension (the number of cases or rows), some datasets can contain millions or billions of rows. Although computing power continues to grow exponentially, processing such a large number of rows is often impractical. In such cases, we can take a sample of the dataset for analysis. The assumption underlying sampling is that the subset of the data will contain all the relevant patterns of the complete dataset. In a homogeneous dataset this assumption holds well, but real data is never homogeneous. The analyst must be extra careful in selecting a subset that reflects the essence of the full dataset and is not specific to one subgroup or subcategory. Data is usually sorted on some variable, and taking a slice of rows from the top down can produce a dataset that is unbalanced with respect to the sorted variable; therefore, we should always try to select the rows of the sample at random, as the sketch below illustrates.
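In pandas, drawing such a random sample is a one-liner; the sketch below uses an invented million-row dataset and a fixed seed for reproducibility.

```python
# A minimal sketch of data reduction by random sampling of rows.
import pandas as pd

df = pd.DataFrame({"customer_id": range(1_000_000)})

# Random rows, not the top of the (possibly sorted) file; the fixed
# random_state makes the sample reproducible.
sample = df.sample(n=10_000, random_state=42)
print(len(sample))
```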

For skewed (unbalanced, non-linear) data, even straightforward random sampling may not be enough, and stratified sampling may have to be applied. With skewed data it is also good practice to balance the dataset, either by oversampling the under-represented class or by undersampling the over-represented one. Research shows that predictive models trained on balanced datasets tend to perform better than those trained on unbalanced ones.
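The sketch below illustrates both ideas on an invented 90/10 dataset: a stratified split that preserves class proportions, followed by naive oversampling of the minority class.

```python
# A minimal sketch of stratified sampling and naive oversampling.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 90 + [1] * 10,   # skewed: 90 percent vs. 10 percent
})

# stratify= keeps the 90/10 ratio in both the train and test partitions.
train, test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

# Naive oversampling: replicate minority rows until the classes balance.
majority = train[train["label"] == 0]
minority = train[train["label"] == 1]
balanced = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=42),
])
print(balanced["label"].value_counts())
```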


Step 4: Model Building

In this stage, various modeling techniques are selected and applied to the dataset that has been prepared, in order to address the specific business need. This stage also includes an assessment and comparative analysis of the various models that are built. Because no single model or algorithm is universally regarded as the best for every DM task, we should use several types of models, together with a well-defined experimentation and assessment strategy, to determine the method that best fits the stated purpose.
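As a rough sketch of this "try several models" idea, the fragment below fits three common model families on the same prepared data; a bundled scikit-learn dataset stands in for real project data.

```python
# A minimal sketch of model building: fit several candidate model
# families on the same prepared dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}
fitted = {name: m.fit(X_train, y_train) for name, m in candidates.items()}
print(list(fitted))
```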

Step 5: Testing and Evaluation

In step 5, the models that have been built are tested and evaluated for their accuracy and generality. This phase measures the extent to which the selected models meet the business objectives and, if several do, which meets them best (and whether more models need to be built and assessed). Another option is to test the models in a real-world scenario, if time and budget allow. Although the results of the models are expected to relate to the original objectives, findings unrelated to those objectives, yet revealing additional information or hints for the future, are also often discovered.
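A common way to estimate accuracy and generality is cross-validation; the sketch below scores two of the candidate models from the previous step on data they were not trained on. As before, the bundled dataset is only a stand-in for real project data.

```python
# A minimal sketch of testing and evaluation via 5-fold cross-validation,
# which gives a less optimistic accuracy estimate than scoring a model
# on its own training data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

for name, model in {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(random_state=42),
}.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```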

Step 6: Deployment

Building and assessing the models is not the end of the project. Even when the purpose of the model is simply to explore the data, the knowledge gained from that exploration needs to be organized and presented in a way that the end user can understand and benefit from. Depending on the requirements, the deployment stage can be as simple as generating a report or as complex as implementing a repeatable DM process across the enterprise.
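At the simple end of that spectrum, deployment may mean no more than persisting the trained model so another application can load it and score new records. The sketch below uses joblib, the serializer the scikit-learn documentation recommends for its estimators; the file name is invented.

```python
# A minimal sketch of a very simple deployment: persist a trained model
# so an end-user application can load it later and score new records.
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=42).fit(X, y)

joblib.dump(model, "final_model.joblib")    # shipped as an artifact
loaded = joblib.load("final_model.joblib")  # inside the application
print(loaded.predict(X[:5]))
```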

Some other standard processes and methodologies in DM

To be applied successfully, a DM study must be viewed as a process that follows a standard methodology, rather than as a set of automated techniques and software tools. Besides CRISP-DM, another well-known methodology was developed by the SAS Institute, called SEMMA. SEMMA stands for Sample, Explore, Modify, Model, and Assess.

Starting with a sample of data that is considered statistically representative, SEMMA makes it easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictor variables, model the variables to predict outcomes, and confirm a model's accuracy. A graphical presentation of SEMMA can be seen below:


By assessing the outcome of each stage in the SEMMA process, the modeler can determine how to model the new questions raised by the previous results, and thus return to the exploration phase for additional refinement of the data. Like CRISP-DM, SEMMA is driven by a highly iterative experimentation cycle. The main difference between CRISP-DM and SEMMA is that CRISP-DM approaches a DM project more comprehensively, including understanding the business and the relevant data, while SEMMA implicitly assumes that the goals and objectives of the DM project, and its data sources, have already been identified and understood.
