Decision Tree Applications

## 1. Introduction

The science of extracting useful information from large data sets or databases is known as data mining. Although data mining concepts have a long history, the term “data mining” is relatively new, having been introduced in the mid-1990s. Data mining covers areas of statistics, machine learning, data management and databases, pattern recognition, artificial intelligence, and other areas. All of these are concerned with certain aspects of data analysis, so they have much in common, but each also has its own distinct problems and types of solution. The fundamental motivation behind data mining is to extract useful information or knowledge autonomously from large data stores or sets. The goal of building computer systems that can adapt to special situations and learn from their experience has attracted researchers from many fields, including computer science, engineering, mathematics, physics, neuroscience and cognitive science.

As opposed to most of statistics, data mining typically deals with data that have already been collected for some purpose other than the data mining analysis. The majority of the applications presented in this chapter use data originally collected for other purposes. Data mining research has produced a wide variety of learning techniques that have the potential to transform many scientific and industrial fields.

This chapter surveys the development of data mining through a review and classification of journal articles published from 1996 to the present. This period was chosen because the comparatively new concept of data mining became widely accepted and used during it. The literature survey is based on keyword searches of the online journal databases Science Direct, EBSCO, IEEE, Taylor Francis, Thomson Gale, and Scopus. A total of 1218 articles were reviewed, and 174 of them were found to use data mining methodologies as their primary method. Some of the articles use more than one data mining methodology in conjunction.

The concept of data mining can be divided into two broad areas: predictive methods and descriptive methods. Predictive methods include classification, regression, and time series analysis. They aim to project future states before they occur.

Section 2 defines the predictive algorithms and presents the applications that use them. A discussion of trends over the last decade is also presented in that section. Section 3 introduces descriptive methods in four major parts: clustering, summarization, association rules and sequence discovery. The objective of descriptive methods is to describe phenomena, evaluate characteristics of the dataset, or summarize a series of data. The application areas of each algorithm are documented in that section, together with a discussion of trends in descriptive methods. Section 4 describes data warehouses and lists their applications involving data mining techniques. Section 5 summarizes the study, discusses future trends in data mining, and gives a brief conclusion.

## 2. Predictive methods and applications

A predictive model makes a prediction about values of data using known results found from different data sets. Predictive modeling may be based on other historical data. Predictive data mining tasks include classification, regression, time series analysis, and prediction (Dunham, 2003).

### 2.1. Classification methods

Classification maps data into predefined groups or classes. It is often referred to as supervised learning. Classification algorithms require that the classes be defined based on data attribute values. Pattern recognition is a type of classification where an input pattern is classified into one of several classes based on its similarity to these predefined classes (Dunham, 2003). This section considers applications of decision trees, neural networks, Bayesian classifiers and support vector machines.

#### 2.1.1. Decision trees

Decision trees can be constructed recursively. First, an attribute is selected to place at the root node, and one branch is made for each of its possible values. This splits the example set into subsets, one for every value of the attribute. The process is then repeated on each subset (Witten and Frank, 2000).

The basic principle of tree models is to partition the space spanned by the input variables so as to maximize a score of class purity, meaning that the majority of points in each cell of the partition belong to one class. Tree models are mappings of observations to conclusions (target values). Each inner node corresponds to a variable; an arc to a child represents a possible value of that variable. A leaf represents the predicted value of the target variable given the values of the variables represented by the path from the root (Menzies and Hu, 2003).

Information entropy is used to measure the amount of uncertainty or randomness in a set of data. The Gini index is also used to determine the best split for a decision tree.
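As a minimal sketch of how these two measures can drive the choice of a splitting attribute, the following Python code computes the entropy and the Gini impurity of a set of class labels and picks the attribute whose split leaves the lowest weighted impurity. The function names and the toy data are illustrative assumptions, not taken from any of the surveyed articles.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_score(rows, labels, attribute, impurity):
    """Weighted impurity of the subsets produced by one categorical attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    n = len(labels)
    return sum(len(g) / n * impurity(g) for g in groups.values())

def best_attribute(rows, labels, attributes, impurity=entropy):
    """Attribute whose split leaves the lowest weighted impurity."""
    return min(attributes, key=lambda a: split_score(rows, labels, a, impurity))

# Hypothetical toy data: the class depends only on "outlook".
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["yes", "yes", "no", "no"]
root = best_attribute(rows, labels, ["outlook", "windy"])
```

Applying `best_attribute` recursively to each resulting subset would grow the tree one level at a time; stopping rules and pruning are omitted here.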

Decision trees can be divided into two types: regression trees and classification trees. The trend is towards regression trees, as they provide real-valued functions rather than class labels. Applications include remote sensing, database theory, chemical engineering, mobile communications, image processing, soil map modeling, radiology, web traffic prediction, speech recognition, risk assessment, geoinformation, operations research, agriculture, computer organization, marketing and geographical information systems. Decision trees are becoming increasingly popular among methods of classifying data. The C5.0 algorithm by J. R. Quinlan is very commonly used in recent applications.

| Decision tree applications | Authors |
| --- | --- |
| 2006 – Geographical Information Systems | Baisen Zhang, Ian Valentine, Peter Kemp and Greg Lambert |
| 2005 – Marketing | Sven F. Crone, Stefan Lessmann and Robert Stahlbock |
| 2005 – Computer Organization | Xiao-Bai Li |
| 2005 – Agriculture | Baisen Zhang, Ian Valentine and Peter D. Kemp |
| 2004 – Operations Research | Nabil Belacel, Hiral Bhasker Raval and Abraham P. Punnen |
| 2004 – Geoinformation | Luis M. T. de Carvalho, Jan G. P. W. Clevers, Andrew K. Skidmore |
| 2004 – Risk assessment | Christophe Mues, Bart Baesens, Craig M. Files and Jan Vanthienen |
| 2003 – Speech Recognition | Oudeyer Pierre-Yves |
| 2003 – Web traffic prediction | Selwyn Piramuthu |
| 2002 – Radiology | Wen-Jia Kuo, Ruey-Feng Chang, Woo Kyung Moon, Cheng Chun Lee |
| 2002 – Soil map modelling | Christopher J. Moran and Elisabeth N. Bui |
| 2002 – Image processing | Petra Perner |
| 2001 – Mobile communications | Patrick Piras, Christian Roussel and Johanna Pierrot-Sanders |
| 2000 – Chemical engineering | Yoshiyuki Yamashita |
| 2000 – Geoscience | Simard, M.; Saatchi, S.S.; De Grandi |
| 2000 – Medical Systems | Zorman, M.; Podgorelec, V.; Kokol, P.; Peterson, M.; Lane, J |
| 1999 – Database Theory | Mauro Sérgio R. de Sousa, Marta Mattoso and Nelson F. F. Ebecken |
| 1999 – Speech Processing | Padmanabhan, M.; Bahl, L.R.; Nahamoo, D |
| 1998 – Remote Sensing | R. S. De Fries, M. Hansen, J. R. G. Townshend, R. Sohlberg |

#### 2.1.2. Neural networks

An artificial neural network is an interconnected group of artificial neurons that uses a mathematical or computational model for information processing based on a connectionist approach to computation (Freeman et al., 1991). Formally, the field started in 1943, when neurophysiologist Warren McCulloch and mathematician Walter Pitts wrote a paper on how neurons might work. They modeled a simple neural network using electrical circuits. In 1949, Donald Hebb pointed out that neural pathways are strengthened each time they are used, a concept fundamental to the ways in which humans learn. If two nerves fire at the same time, he argued, the connection between them is enhanced.

In 1982, interest in the field was renewed. John Hopfield of Caltech presented a paper to the National Academy of Sciences. His approach was to create more useful machines by using bidirectional lines. By 1986, multiple-layered neural networks had appeared, and the problem was how to extend the Widrow-Hoff rule to multiple layers. Three independent groups of researchers, one of which included David Rumelhart, a former member of Stanford’s psychology department, came up with similar ideas, which are now called back-propagation networks because they distribute pattern recognition errors throughout the network. Whereas hybrid networks used just two layers, back-propagation networks use many. Neural networks are applied to data mining in (Craven and Shavlik, 1997).

| Neural network applications | Authors |
| --- | --- |
| 2006 – Banking | Tian-Shyug Lee, Chih-Chou Chiu, Yu-Chao Chou and Chi-Jie Lu |
| 2005 – Stock market | J.V. Healy, M. Dixon, B.J. Read and F.F. Cai |
| 2005 – Financial Forecast | Kyoung-jae Kim |
| 2005 – Mobile Communications | Shin-Yuan Hung, David C. Yen and Hsiu-Yu Wang |
| 2005 – Oncology | Ta-Cheng Chen and Tung-Chou Hsu |
| 2005 – Credit risk assessment | Yueh-Min Huang, Chun-Min Hung and Hewijin Christine Jiau |
| 2005 – Environmental Modelling | Uwe Schlink, Olf Herbarth, Matthias Richter, Stephen Dorling |
| 2005 – Cybernetics | Jiang Chang; Yan Peng |
| 2004 – Biometrics | Marie-Noëlle Pons, Sébastien Le Bonté and Olivier Potier |
| 2004 – Heat Transfer Engineering | R. S. De Fries, M. Hansen, J. R. G. Townshend, R. Sohlberg |
| 2004 – Marketing | YongSeog Kim and W. Nick Street |
| 2004 – Industrial Processes | X. Shi, P. Schillings, D. Boyd |
| 2004 – Economics | Tae Yoon Kim, Kyong Joo Oh, Insuk Sohn and Changha Hwang |
| 2003 – Crime analysis | Giles C. Oatley and Brian W. Ewart |
| 2003 – Medicine | Álvaro Silva, Paulo Cortez, Manuel Filipe Santos, Lopes Gomes and José Neves |
| 2003 – Production economy | Paul F. Schikora and Michael R. Godfrey |
| 2001 – Image Recognition | Kondo, T.; Pandya, A.S |

Research on the theory has slowed down; however, applications continue to grow in popularity. Artificial neural networks are one of a class of highly parameterized statistical models that have attracted considerable attention in recent years. Because artificial neural networks are highly parameterized, they can easily model small irregularities in functions; however, this may lead to overfitting in some conditions. Applications of neural networks include production economy, medicine, crime analysis, economics, industrial processes, marketing, heat transfer engineering, biometrics, environmental modeling, credit risk assessment, oncology, mobile communications, financial forecasting, the stock market and banking.

#### 2.1.3. Bayesian classifiers

Bayesian classification is based on Bayes' theorem. In particular, naive Bayes is a special case of a Bayesian network, and learning the structure and parameters of an unrestricted Bayesian network would appear to be a logical means of improvement.

However, Friedman (1997) found that naive Bayes easily outperforms such unrestricted Bayesian network classifiers on a large sample of benchmark datasets. Bayesian classifiers are useful in predicting the probability that a sample belongs to a particular class or grouping. This technique tends to be highly accurate and fast, making it useful on large databases. The model is simple and intuitive. The error level is low when the assumptions of attribute independence and of the distribution model hold well. Some often perceived disadvantages of Bayesian analysis are really not problems in practice. Any ambiguities in choosing a prior are generally not serious, since the various possible convenient priors usually do not disagree strongly within the regions of interest. Bayesian analysis is not limited to what is traditionally considered statistical data, but can be applied to any space of models (Hanson, 1996).
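To make the class-probability prediction concrete, here is a minimal sketch of a naive Bayes classifier for categorical attributes. It multiplies the class prior by one conditional probability per attribute (the independence assumption) and normalizes the result. The add-one smoothing, the assumption of two possible values per attribute, and the toy data are choices made for this illustration only.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute) -> value counts
    for row, label in zip(rows, labels):
        for attribute, value in row.items():
            value_counts[(label, attribute)][value] += 1
    return class_counts, value_counts

def posterior(row, class_counts, value_counts):
    """Normalized P(class | row) under the naive independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for attribute, value in row.items():
            seen = value_counts[(label, attribute)]
            # Add-one smoothing; "+ 2" assumes two possible values per attribute.
            score *= (seen[value] + 1) / (count + 2)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical toy data with two binary attributes.
rows = [
    {"link": "yes", "caps": "yes"},
    {"link": "yes", "caps": "no"},
    {"link": "no", "caps": "no"},
    {"link": "no", "caps": "no"},
]
labels = ["spam", "spam", "ham", "ham"]
class_counts, value_counts = train(rows, labels)
probabilities = posterior({"link": "yes", "caps": "yes"}, class_counts, value_counts)
```

Training is a single counting pass over the data, which is one reason the technique scales well to large databases.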

Application areas include geographical information systems, database management, web services and neuroscience. The technique is useful in application areas in which large amounts of data need to be processed. The assumption that patterns are normally distributed is the most serious shortcoming of the model.

| Bayesian classifier applications | Authors |
| --- | --- |
| 2005 – Neuroscience | Pablo Valenti, Enrique Cazamajou, Marcelo Scarpettini |
| 2003 – Web services | Dunja Mladenić and Marko Grobelnik |
| 1999 – Database Management | S. Lavington, N. Dewhurst, E. Wilkins and A. Freitas |
| 1998 – Geographical Information Systems | A. Stassopoulou, M. Petrou and J. Kittler |

#### 2.1.4. Support Vector Machines

Support vector machines (SVMs) are a method for creating functions from a set of labeled training data. The original optimal hyperplane algorithm proposed by Vladimir Vapnik in 1963 was a linear classifier. However, in 1992, Boser, Guyon and Vapnik suggested a way to create non-linear classifiers by applying the kernel trick to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a non-linear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in the transformed feature space. The transformation may be non-linear and the transformed space high-dimensional; thus, although the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in the original input space.
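The replacement of dot products by a kernel function can be checked on a small example. The sketch below uses a homogeneous degree-2 polynomial kernel on two-dimensional inputs, chosen here purely for illustration, and shows that evaluating the kernel in the input space gives the same number as an ordinary dot product in an explicitly constructed three-dimensional feature space.

```python
def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(u, v):
    """Homogeneous degree-2 polynomial kernel: k(u, v) = (u . v)^2."""
    return dot(u, v) ** 2

def feature_map(u):
    """Explicit feature space for 2-D inputs that corresponds to poly_kernel."""
    x1, x2 = u
    return (x1 * x1, 2 ** 0.5 * x1 * x2, x2 * x2)

u, v = (1.0, 2.0), (3.0, -1.0)
in_input_space = poly_kernel(u, v)                       # no feature map needed
in_feature_space = dot(feature_map(u), feature_map(v))   # same value, computed explicitly
```

The practical benefit is that the kernel never has to build the transformed vectors, which matters when the feature space is very high-dimensional.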

In 1995, Cortes and Vapnik suggested a modified maximum-margin idea that allows for mislabeled examples. If there exists no hyperplane that can split the binary examples, the soft margin method will choose a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples.

A version of the SVM for regression was proposed in 1997 by Vapnik, Golowich, and Smola. This method is called support vector regression (SVR). The model produced by support vector classification depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by SVR depends only on a subset of the training data, because the cost function for building the model ignores any training data that are close (within a threshold ε) to the model prediction. The learned function can therefore be either a classification function or a general regression function. A detailed tutorial can be found in (Burges, 1998).
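The two cost functions can be written down in a few lines. The sketch below assumes the standard hinge loss for classification with labels in {-1, +1} and the ε-insensitive loss for regression; the default value of `epsilon` is an arbitrary illustrative choice.

```python
def hinge_loss(label, score):
    """Classification cost: zero for points on the correct side, beyond the margin."""
    return max(0.0, 1.0 - label * score)

def epsilon_insensitive_loss(target, prediction, epsilon=0.1):
    """Regression cost: zero when the prediction is within epsilon of the target."""
    return max(0.0, abs(target - prediction) - epsilon)

# A point well beyond the margin contributes nothing to the classification cost.
beyond_margin = hinge_loss(+1, 2.5)
# A prediction inside the epsilon tube contributes nothing to the regression cost.
inside_tube = epsilon_insensitive_loss(1.0, 1.05)
```

Training points with zero cost do not influence the fitted model, which is why only a subset of the data (the support vectors) determines it.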

For classification, SVMs operate by finding a hypersurface in the space of possible inputs. Applications of support vector machines include industrial engineering, medical informatics, genetics, medicine and marketing.

| Support vector machine applications | Authors |
| --- | --- |
| 2005 – Marketing | Sven F. Crone, Stefan Lessmann and Robert Stahlbock |
| 2004 – Medicine | Lihua Li, Hong Tang, Zuobao Wu, Jianli Gong, Michael Gruidl |
| 2004 – Genetics | Fei Pan, Baoying Wang, Xin Hu and William Perrizo |
| 2003 – Medical Informatics | I Kalatzis, D Pappas, N Piliouras, D Cavouras |
| 2002 – Industrial Engineering | Mehmed Kantardzic, Benjamin Djulbegovic and Hazem Hamdan |

### 2.2. Time series analysis and prediction applications

Time series clustering has been shown to be effective in providing useful information in various domains. There seems to be an increased interest in time series clustering as part of the effort in temporal data mining research (Liao, 2003). Univariate and multivariate time series are explained in turn below.

#### 2.2.1. Univariate time series

The expression “univariate time series” refers to a time series that consists of single observations recorded sequentially over equal time increments. Although a univariate time series data set is usually given as a single column of numbers, time is in fact an implicit variable in the time series. If the data are equi-spaced, the time variable, or index, does not need to be explicitly given. The time variable may sometimes be explicitly used for plotting the series. However, it is not used in the time series model itself. Triple exponential smoothing is an example of a univariate approach that accommodates trend and seasonality. Another example, called seasonal loess, is based on locally weighted least squares and is discussed by Cleveland (1993). Another approach, commonly used in scientific and engineering applications, is to analyze the series in the frequency domain. The spectral plot is the primary tool for the frequency analysis of time series. Application areas include financial forecasting, management, energy, economics, zoology, industrial engineering, emergency services, biomedicine and networks.
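As an illustration of a model in which time is only implicit, the sketch below implements simple (single) exponential smoothing, the one-parameter relative of the triple exponential smoothing mentioned above; it does not model trend or seasonality. The series is just a list, the position in the list plays the role of time, and the smoothing constant `alpha` is an assumed value.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing; time is implicit in the list order."""
    smoothed = [series[0]]  # initialize with the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

observations = [10.0, 12.0, 11.0, 13.0]  # hypothetical equi-spaced observations
smoothed = exponential_smoothing(observations, alpha=0.5)
```

Each smoothed value is a weighted average of the current observation and the previous smoothed value, so older observations receive exponentially decreasing weight.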

| Univariate time series applications | Authors |
| --- | --- |
| 2005 – Financial Forecasting | James W. Taylor and Roberto Buizza |
| 2005 – Environmental Management | Peter Romilly |
| 2004 – Energy | Jesús Crespo Cuaresma, Jaroslava Hlouskova, Stephan Kossmeier |
| 2003 – Crime Rates Forecasting | Wilpen Gorr, Andreas Olligschlaeger and Yvonne Thompson |
| 2002 – Financial Economics | Per Bjarte Solibakke |
| 2001 – Unemployment Rates | Bradley T. Ewing and Phanindra V. Wunnava |
| 2001 – Forecasting | Juha Junttila |
| 2000 – Zoology | Christian H. Reick and Bernd Page |
| 1998 – Industrial Engineering | Gerhard Thury and Stephen F. Witt |
| 1998 – Emergency Medicine | Kenneth E Bizovi, Jerrold B Leikin, Daniel O Hryhorczuk and Lawrence J Frateschi |
| 1998 – Biomedicine | R. E. Abdel-Aal and A. M. Mangoud |
| 1997 – Economics | Hahn Shik Lee and Pierre L. Siklos |
| 1996 – Sensors | Stefanos Manganaris |
| 1996 – Economics | Apostolos Serletis and David Krause |

#### 2.2.2. Multivariate time series

Multivariate time series may arise in a number of ways. For example, several time series that measure the same quantity, or that depend on some common underlying quantity, together form a multivariate series. The multivariate form of the Box-Jenkins univariate models is frequently used in applications; it is sometimes called the ARMAV model, for AutoRegressive Moving Average Vector, or simply the vector ARMA process. Friedman also introduced multivariate adaptive regression splines in 1991.
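For reference, one common way of writing a vector ARMA process of orders p and q is given below. The symbols are assumptions introduced for this illustration: $\mathbf{x}_t$ is the vector of observations at time $t$, $\Phi_i$ and $\Theta_j$ are coefficient matrices, and $\boldsymbol{\varepsilon}_t$ is a vector white-noise term. Some texts write the moving-average terms with a minus sign.

$$
\mathbf{x}_t = \sum_{i=1}^{p} \Phi_i \, \mathbf{x}_{t-i} + \boldsymbol{\varepsilon}_t + \sum_{j=1}^{q} \Theta_j \, \boldsymbol{\varepsilon}_{t-j}
$$

When every $\Phi_i$ and $\Theta_j$ is diagonal, the components do not interact and the model reduces to a set of separate univariate ARMA models.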

The application areas of the method include neurology, hydrology, finance, medicine, chemistry, environmental science, biology.

| Multivariate time series applications | Authors |
| --- | --- |
| 2006 – Neurology | Björn Schelter, Matthias Winterhalder, Bernhard Hellwig, Brigitte Guschlbauer |
| 2005 – Neurobiology | Ernesto Pereda, Rodrigo Quian Quiroga and Joydeep Bhattacharya |
| 2005 – Hydrology | R. Muñoz-Carpena, A. Ritter and Y.C. Li |
| 2005 – Neuroscience | Andy Müller, Hannes Osterhage, Robert Sowa |
| 2004 – Market analysis | Bernd Vindevogel, Dirk Van den Poel and Geert Wets |
| 2004 – Policy Modelling | Wankeun Oh and Kihoon Lee |
| 2004 – Medicine | Fumikazu Miwakeichi, Andreas Galka, Sunao Uchida, Hiroshi Arakaki |
| 2004 – Economics | Morten Ørregaard Nielsen |
| 2003 – Labor Force Forecasting | Edward W. Frees |
| 2003 – Statistical Planning | Hamparsum Bozdogan and Peter Bearse |
| 2002 – Chemistry | Jan H. Christensen |
| 2002 – Biomedicine | Stephen Swift and Xiaohui Liu |
| 2001 – Marine Sciences | Ransom A. Myers |
| 2001 – Reliability Engineering | S. Lu, H. Lu and W. J. Kolarik |
| 2000 – Environmental Science | Zuotao Li and Menas Kafatos |

### 2.3. Regression methods

Regression is generally used to predict future values based on past values by fitting a set of points to a curve. Linear regression assumes that a linear relationship exists between the input data and the output data. The common formula for a linear relationship is:

$$ y = c_0 + c_1 x_1 + \cdots + c_n x_n $$

Here there are n input variables, called predictors or regressors; one output variable, called the response; and n + 1 constants, which are chosen during the modeling process to match the input examples. This is sometimes called multiple linear regression because there is more than one predictor (Dunham, 2003).
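A minimal sketch of how the constants can be chosen is shown below for the simplest case of a single predictor (n = 1), using the closed-form least-squares estimates. The function name and the toy data are illustrative assumptions; with several predictors the same idea is usually expressed in matrix form.

```python
def fit_line(xs, ys):
    """Least-squares estimates of c0 (intercept) and c1 (slope) for y = c0 + c1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # hypothetical data lying exactly on y = 1 + 2x
c0, c1 = fit_line(xs, ys)
prediction = c0 + c1 * 5.0  # predicted response for a new input value
```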

The following subsections explain nonparametric, robust, ridge and nonlinear regression.

#### 2.3.1. Nonparametric regression

Nonparametric regression analysis is regression without an assumption of linearity. The scope of nonparametric regression is very broad, ranging from smoothing the relationship between two variables in a scatter plot to multiple-regression analysis and generalized regression models. Methods of nonparametric regression analysis have been rendered practical by advances in statistics and computing, and are now a serious alternative to more traditional parametric regression modeling. Nonparametric regression is a type of regression analysis in which the functional form of the relationship between the response variable and the associated predictor variables does not need to be specified in order to fit a model to a set of data. The applications are mostly in the fields of medicine and biology, although applications in economics and geography also exist.
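One widely used scatter-plot smoother is the Nadaraya-Watson kernel estimator, sketched below as an example of fitting without specifying a functional form. It is chosen here for illustration and is not attributed to any of the surveyed articles; the Gaussian kernel and the `bandwidth` value are assumptions.

```python
from math import exp

def kernel_smooth(xs, ys, x0, bandwidth):
    """Nadaraya-Watson estimate at x0: a Gaussian-weighted average of the responses."""
    weights = [exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0]   # hypothetical nonlinear data, y = x^2
estimate = kernel_smooth(xs, ys, x0=2.0, bandwidth=0.5)
```

The estimate at each point is a local average, so a smaller bandwidth follows the data more closely and a larger one gives a smoother curve.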

| Nonparametric regression applications | Authors |
| --- | --- |
| 2005 – Medicine | Hiroyuki Watanabe and Hiroyasu Miyazaki |
| 2005 – Veterinary | A.B. Lawson and H. Zhou |
| 2004 – Economics | Insik Min and Inchul Kim |
| 2004 – Geography | Caroline Rinaldi and Theodore M. Cole, III |
| 2004 – Biosystems | Sunyong Kim, Seiya Imoto and Satoru Miyano |
| 2003 – Surgery | David Wypij, Jane W. Newburger, Leonard A. Rappaport |
| 2002 – Environmental Science | Ronald C. Henry, Yu-Shuo Chang and Clifford H. Spiegelman |
| 2001 – Econometrics | Pedro L. Gozalo and Oliver B. Linton |
| 1999 – Econometrics | Yoon-Jae Whang and Oliver Linton |

#### 2.3.2. Robust regression

Robust regression is another approach, which uses a fitting criterion that is less vulnerable to outliers than that of other regression methods such as ordinary linear regression. Robust regression analysis provides an alternative to a least squares regression model when fundamental assumptions are unfulfilled by the nature of the data (Yaffee, 2002). When analysts estimate statistical regression models and test their assumptions, they frequently find that the assumptions are substantially violated. Sometimes the variables can be transformed to conform to those assumptions. Often, however, a transformation will not eliminate or attenuate the leverage of influential outliers that bias the prediction and distort the significance of parameter estimates. Under these circumstances, robust regression that is resistant to the influence of outliers may be the only reasonable recourse. The most common method is M-estimation, introduced by Huber in 1964. Application areas include epidemiology, remote sensing, biosystems, oceanology, computer vision and chemistry.
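The effect of M-estimation can be seen in the simplest setting, estimating a single location parameter. The sketch below uses Huber weights in an iteratively reweighted average; the tuning constant `k`, the starting value and the data are illustrative assumptions, and residuals are assumed to be on a unit scale already (a full implementation would also estimate the scale).

```python
def huber_weight(residual, k=1.345):
    """Weight 1 for small residuals and k/|r| for large ones (Huber's psi(r)/r)."""
    r = abs(residual)
    return 1.0 if r <= k else k / r

def huber_location(values, k=1.345, iterations=50):
    """Robust location estimate by iteratively reweighted averaging."""
    estimate = sorted(values)[len(values) // 2]   # start from a median-like value
    for _ in range(iterations):
        weights = [huber_weight(v - estimate, k) for v in values]
        estimate = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return estimate

data = [9.8, 10.1, 10.0, 9.9, 10.2, 50.0]   # hypothetical sample with one outlier
ordinary_mean = sum(data) / len(data)        # pulled strongly towards the outlier
robust_estimate = huber_location(data)       # stays close to the bulk of the data
```

In regression the same weights are applied to the residuals of the fitted line rather than to deviations from a single value.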

| Robust regression applications | Authors |
| --- | --- |
| 2005 – Epidemiology | Andy H. Lee, Michael Gracey, Kui Wang and Kelvin K.W. Yau |
| 2005 – Remote Sensing | Ian Olthof, Darren Pouliot, Richard Fernandes and Rasim Latifovic |
| 2004 – Biosystems | Federico Hahn |
| 2002 – Oceanology | C. Waelbroeck, L. Labeyrie, E. Michel, J. C. Duplessy, J. F. McManus |
| 2001 – Policy Modeling | Bradley J. Bowland and John C. Beghin |
| 1999 – Biochemistry | V. Diez, P. A. García and F. Fdz-Polanco |
| 1998 – Computer vision | Menashe Soffer and Nahum Kiryati |
| 1997 – Chemistry | Dragan A. Cirovic |

#### 2.3.3. Ridge Regression

Ridge regression, also known as Tikhonov regularization, is the most commonly used method of regularization of ill-posed problems. A frequent obstacle is that several of the explanatory variables will vary in rather similar ways. As a result, their collective power of explanation is considerably less than the sum of their individual powers. The phenomenon is known as near collinearity. Data mining applications of ridge regression are frequently related to chemistry and chemometrics. Applications in organizational studies and environmental science are also listed in the table.
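In its usual textbook form, ridge regression adds a penalty on the size of the coefficients to the least-squares criterion. The notation below is an assumption made for this illustration: $X$ is the matrix of explanatory variables, $y$ the response vector, $\beta$ the coefficient vector and $\lambda \ge 0$ the regularization parameter.

$$
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 = \left( X^{\top} X + \lambda I \right)^{-1} X^{\top} y
$$

Under near collinearity $X^{\top} X$ is close to singular, so its inverse is unstable; adding $\lambda I$ with $\lambda > 0$ makes the matrix well conditioned, at the cost of some bias in the estimates.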

| Ridge regression applications | Authors |
| --- | --- |
| 2005 – Atmospheric Environment | Steven Roberts and Michael Martin |
| 2005 – Epidemiology | L.M. Grosso, E.W. Triche, K. Belanger, N.L. Benowitz |
| 2004 – Chemical Engineering | Jeffrey Dean Kelly |
| 2002 – Chemistry | Marla L. Frank, Matthew D. Fulkerson, Bruce R. Patton and Prabir K. Dutta |
| 2002 – Chemometrics | J. Huang, D. Brennan, L. Sattler, J. Alderman |
| 2001 – Laboratory Chemometrics | Kwang-Su Park, Hyeseon Lee, Chi-Hyuck Jun, Kwang-Hyun Park, Jae-Won Jung |
| 2000 – Food Industry | Rolf Sundberg |
| 1996 – Organizational Behaviour | R. James Holzworth |

#### 2.3.4. Nonlinear regression

Nonlinear least squares regression extends linear least squares regression for use with a much larger and more general class of functions. Almost any function that can be written in closed form can be incorporated in a nonlinear regression model. Unlike linear regression, there are very few limitations on the way parameters can be used in the functional part of a nonlinear regression model. The way in which the unknown parameters in the function are estimated, however, is conceptually the same as it is in linear least squares regression. Application areas include chromatography, urology, ecology and chemistry.
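As a small sketch of nonlinear least squares, the code below estimates the single parameter b of the model y = exp(b·x) with Gauss-Newton iterations, one standard way of minimizing the sum of squared residuals. The model, the starting value and the noise-free data are assumptions made for this illustration; practical fitting routines add step-size control and convergence checks.

```python
from math import exp

def fit_exponential_rate(xs, ys, b, iterations=25):
    """Gauss-Newton estimate of b in the model y = exp(b * x), starting from b."""
    for _ in range(iterations):
        predictions = [exp(b * x) for x in xs]
        residuals = [y - p for y, p in zip(ys, predictions)]
        jacobian = [x * p for x, p in zip(xs, predictions)]  # d(prediction)/db
        # Least-squares solution of the linearized problem for the step in b.
        b += sum(r * j for r, j in zip(residuals, jacobian)) / sum(j * j for j in jacobian)
    return b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [exp(0.5 * x) for x in xs]   # hypothetical noise-free data with b = 0.5
b_hat = fit_exponential_rate(xs, ys, b=0.3)
```

Each iteration solves a linear least-squares problem for the update, which is the sense in which estimation is conceptually the same as in the linear case.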

| Nonlinear regression applications | Authors |
| --- | --- |
| 2005 – Chromatography | Fabrice Gritti and Georges Guiochon |
| 2005 – Urology | Alexander M. Truskinovsky, Alan W. Partin and Martin H. Kroll |
| 2005 – Ecology | Yonghe Wang, Frédéric Raulier and Chhun-Huor Ung |
| 2005 – Soil Research | M. Mohanty, D.K. Painuli, A.K. Misra, K.K. Bandyopadhyaya and P.K. Ghosh |
| 2005 – Chemical Engineering | Vadim Mamleev and Serge Bourbigot |
| 2004 – Dental Materials | Paul H. DeHoff and Kenneth J. Anusavice |
| 2004 – Metabolism Studies | Lars Erichsen, Olorunsola F. Agbaje, Stephen D. Luzio, David R. Owens |
| 2004 – Biology | David D’Haese, Karine Vandermeiren, Roland Julien Caubergs, Yves Guisez |
| 2003 – Hydrology | Xunhong Chen and Xi Chen |
| 2003 – Chemometrics | Igor G. Zenkevich and Balázs Kránicz |
| 2003 – Production Economics | Paul F. Schikora and Michael R. Godfrey |
| 2002 – Quality Management | Shueh-Chin Ting and Cheng-Nan Chen |
| 2001 – Agriculture | Eva Falge, Dennis Baldocchi, Richard Olson, Peter Anthoni |
| 2000 – Medicine | Marya G. Zlatnik, John A. Copland |
| 1999 – Pharmacology | Johan L. Gabrielsson and Daniel L. Weiner |

## 3. Descriptive methods and applications

The goal of a descriptive model is to describe all of the data (or the process generating the data). Examples of such descriptions include models for the overall probability distribution of the data (density estimation), partitioning of the p-dimensional space into groups (cluster analysis and segmentation), and models describing the relationship between variables (dependency modeling). In segmentation analysis, for example, the aim is to group together similar records, as in market segmentation of commercial databases (Hand et al., 2001).

### 3.1. Clustering methods and their applications

Clustering is similar to classification except that the groups are not predefined, but rather defined by the data alone. Clustering is alternatively referred to as unsupervised learning or segmentation. It can be thought of as partitioning or segmenting the data into groups that might or might not be disjoint. Clustering is usually accomplished by determining the similarity among the data on predefined attributes (Dunham, 2003).

#### 3.1.1. K-means clustering

The k-means algorithm (MacQueen, 1967) clusters objects into k partitions based on their attributes. It is a variant of the expectation-maximization algorithm in which the goal is to determine the k means of data generated from Gaussian distributions. K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem, and it assumes that the object attributes form a vector space. The procedure classifies a given data set into a number of clusters, k, that is fixed a priori.

The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different starting locations lead to different results; the better choice is to place them as far away from each other as possible. The next step is to take each point in the data set and associate it with the nearest centroid. When no point is pending, the first step is complete and an initial grouping is done. At this point, k new centroids are recalculated as the barycenters of the clusters resulting from the previous step. With these k new centroids, each data point is again bound to its nearest new centroid. This generates a loop, during which the k centroids change their location step by step until no more changes occur, that is, until the centroids no longer move.

Application areas include sensor networks, web technologies and cybernetics.
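The assignment and update loop of k-means can be sketched in a few lines of Python. This is a minimal illustration that assumes squared Euclidean distance, points given as tuples of floats, and initial centroids supplied by the caller; the function name and the toy data are hypothetical.

```python
def kmeans(points, centroids, max_iterations=100):
    """k-means on tuples of floats, starting from the given initial centroids."""
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean (barycenter) of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:   # centroids no longer move: stop
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
# Initial centroids chosen far apart from each other.
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (9.0, 8.0)])
```

Because the result depends on the starting centroids, the procedure is often run several times from different initial positions.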

| K-means clustering applications | Authors |
| --- | --- |
| 2005 – E-commerce | R. J. Kuo, J. L. Liao and C. Tu |
| 2005 – Text clustering | Shi Zhong |
| 2005 – Peer to peer data streams | Sanghamitra Bandyopadhyay, Chris Giannella, Ujjwal Maulik, Hillol Kargupta |
| 2005 – Bioscience | Wei Zhong; Altun, G.; Harrison, R |
| 2004 – Image Processing | Mantao Xu; Franti, P |
| 2003 – Cybernetics | Yu-Fang Zhang; Jia-Li Mao |
| 2000 – Adaptive Web | Mike Perkowitz and Oren Etzioni |

#### 3.1.2. Fuzzy c-means clustering

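Fuzzy c-means is a method of clustering which allows one piece of data to belong to two or more clusters. This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition. It is based on minimization of the following objective function:

$$ J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \lVert x_i - c_j \rVert^2, \qquad 1 \le m < \infty $$

where $u_{ij}$ is the degree of membership of data point $x_i$ in cluster $j$, $c_j$ is the center of cluster $j$, and $m$ is the fuzziness exponent. Application areas include ergonomics, acoustics, and manufacturing; the method is also widely used in image processing.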
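To show what belonging to more than one cluster means in practice, the sketch below computes the membership degrees of a single one-dimensional point from its distances to the cluster centers, using the standard fuzzy c-means membership formula. The fuzziness exponent `m = 2` and the example values are assumptions for this illustration; the alternating update of the centers is omitted.

```python
def memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one 1-D point in each cluster (values sum to 1)."""
    distances = [abs(point - c) for c in centers]
    zeros = distances.count(0.0)
    if zeros:   # point coincides with one or more centers: share full membership
        return [1.0 / zeros if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
        for d_j in distances
    ]

# A point at 1.0 between hypothetical centers at 0.0 and 4.0 belongs mostly,
# but not exclusively, to the nearer cluster.
u = memberships(1.0, centers=[0.0, 4.0])
```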

| Fuzzy c-means clustering applications | Authors |
| --- | --- |
| 2006 – Ergonomics | Stéphane Armand, Eric Watelain, Moïse Mercier, Ghislaine Lensel |
| 2005 – Neurocomputing | Antonino Staiano, Roberto Tagliaferri and Witold Pedrycz |
| 2004 – Acoustics | Nitanda, N.; Haseyama, M.; Kitajima, H |
| 2001 – Manufacturing | Y. M. Sebzalli and X. Z. Wang |
| 2000 – Image Processing | Rezaee, M.R.; van der Zwet, P.M.J.; Lelieveldt, B.P.E.; van der Geest, R.J.; Reiber, J.H.C |
| 2000 – Signal Processing | Zhe-Ming Lu; Jeng-Shyang Pan; Sheng-He Sun |
| 2000 – Remote Sensing | Chumsamrong, W.; Thitimajshima, P.; Rangsanseri, Y |
| 1999 – Machine Vision | Gil, M.; Sarabia, E.G.; Llata, J.R.; Oria, J.P |
| 1998 – Bioelectronics | Da-Chuan Cheng; Kuo-Sheng Cheng |

### 3.2. Summarization

Summarization involves methods for finding a compact description for a subset of data. A simple example would be tabulating the mean and standard deviations for all fields (Bao, 2000). More sophisticated methods involve the derivation of summary rules, multivariate visualization techniques, and the discovery of functional relationships between variables. Summarization techniques are often applied to interactive exploratory data analysis and automated report generation.
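The simple example mentioned above, tabulating the mean and standard deviation for all fields, can be sketched with the Python standard library. The record layout and field names are hypothetical, and the sample standard deviation is used.

```python
from statistics import mean, stdev

def summarize(records):
    """Tabulate the mean and sample standard deviation of every numeric field."""
    fields = records[0].keys()
    return {
        field: (mean([r[field] for r in records]), stdev([r[field] for r in records]))
        for field in fields
    }

records = [
    {"age": 20.0, "income": 100.0},
    {"age": 30.0, "income": 200.0},
    {"age": 40.0, "income": 300.0},
]
summary = summarize(records)
for field, (m, s) in summary.items():
    print(f"{field}: mean={m:.1f} sd={s:.1f}")
```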

| Summarization applications | Authors |
| --- | --- |
| 2005 – Genetics | Howard J. Hamilton, Liqiang Geng, Leah Findlater |
| 2005 – Linguistics | Janusz Kacprzyk and Sławomir Zadrożny |
| 2003 – Decision Support Systems | Dmitri Roussinov and J. Leon Zhao |

### 3.3. Association rules

Association rule mining searches for interesting relationships among items in a given data set. This section provides an introduction to association rule mining.

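Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items. Let $D$, the task-relevant data, be a set of database transactions where each transaction $T$ is a set of items such that $T \subseteq I$. An association rule is an implication of the form $A \Rightarrow B$, where $A \subset I$, $B \subset I$ and $A \cap B = \emptyset$. The support of the rule is the fraction of transactions in $D$ that contain $A \cup B$, and its confidence is the fraction of transactions containing $A$ that also contain $B$.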

Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong (Han and Kamber, 2000).
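A minimal sketch of how a rule can be tested against the two thresholds is given below. It assumes the usual definitions, with support as the fraction of transactions containing all items of the rule and confidence as the fraction of transactions containing the antecedent that also contain the consequent; the transactions and threshold values are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def is_strong(antecedent, consequent, transactions, min_sup, min_conf):
    """True when the rule meets both the support and the confidence threshold."""
    both = set(antecedent) | set(consequent)
    return (support(both, transactions) >= min_sup
            and confidence(antecedent, consequent, transactions) >= min_conf)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support({"bread", "milk"}, transactions)           # 2 of 4 transactions
rule_confidence = confidence({"bread"}, {"milk"}, transactions)   # 2 of the 3 with bread
```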

#### 3.3.1. The Apriori algorithm

Apriori employs breadth-first search and uses a hash tree structure to count candidate item sets efficiently. The algorithm generates candidate item sets (patterns) of length k from item sets of length k − 1. Then, the patterns which have an infrequent sub-pattern are pruned, because if an item set is frequent, then all of its subsets must also be frequent. This Apriori principle holds due to the following property of the support measure: the support of an item set never exceeds the support of its subsets.
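The level-wise generate-and-prune procedure can be sketched as follows. This is a simplified illustration, not a tuned implementation: candidates are counted with a plain dictionary scan rather than a hash tree, `min_sup` is an absolute count, and the transactions are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support count} for all item sets with count >= min_sup."""
    def count(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_sup}

    items = {item for t in transactions for item in t}
    frequent = count({frozenset([i]) for i in items})
    result, k = dict(frequent), 2
    while frequent:
        # Join step: build length-k candidates from frequent (k-1)-item sets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent_sets = apriori(transactions, min_sup=2)
```

In this example the three-item candidate is pruned without being counted, because its subset {milk, butter} is not frequent.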

Apriori algorithm Applications | Authors |

2004 – MIS | Ya-Han Hu and Yen-Liang Chen |

2004 – CRM | Tzung-Pei Hong, Chan-Sheng Kuo and Shyue-Liang Wang |

2004 – Banking | Nan-Chen Hsieh |

2004 – Methods Engineering | Shichao Zhang, Jingli Lu and Chengqi Zhang |

2002 – Thermodynamics | K. T. Andrews, K. L. Kuttler, M. Rochdi |

#### 3.3.2. Multidimensional association rules

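Multidimensional association rules involve more than one dimension (attribute) of the data. When the data are organized in several levels of abstraction, a top-down strategy is used for mining multi-level association rules: frequent item sets are found at the most general level first, and the search then proceeds to lower, more specific levels.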

Multi Dimensional Association Rule Applications | Authors |

2003 – Behavioral Science | Ronald R. Holden and Daryl G. Kroner |

2003 – Health Care | Joseph L. Breault, Colin R. Goodall and Peter J. Fos |

#### 3.3.3. Quantitative association rules

In practice most databases contain quantitative data and are not limited to categorical items only. Unfortunately, the definition of categorical association rules does not translate directly to the quantitative case. It is therefore necessary to provide a definition of association rules for the case of a database containing quantitative attributes. Srikant and Agrawal (1996) extended the categorical definition to include quantitative data. The basis for their definition is to map quantitative values into categorical events by considering intervals of the numeric values. Thus, each basic event is either a categorical item or a range of numerical values.
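The mapping of quantitative values to interval events can be illustrated with the short sketch below. The attribute names and the interval boundaries are hypothetical and fixed by hand; how the intervals should be chosen is a separate question that this sketch does not address.

```python
# Hypothetical interval boundaries (low inclusive, high exclusive).
BINS = {
    "age": [(0, 30), (30, 50), (50, 120)],
    "cars": [(0, 1), (1, 2), (2, 10)],
}

def to_item(attribute, value, bins):
    """Map a numeric value to an interval item such as 'age:[30,50)'."""
    for low, high in bins:
        if low <= value < high:
            return f"{attribute}:[{low},{high})"
    raise ValueError(f"{attribute}={value} falls outside all intervals")

def to_transaction(record, numeric_bins):
    """Turn a record into a set of basic events: categorical items or ranges."""
    items = set()
    for attribute, value in record.items():
        if attribute in numeric_bins:
            items.add(to_item(attribute, value, numeric_bins[attribute]))
        else:
            items.add(f"{attribute}={value}")
    return items

# Hypothetical records mixing quantitative and categorical attributes.
records = [
    {"age": 23, "married": "no", "cars": 1},
    {"age": 38, "married": "yes", "cars": 2},
]
transactions = [to_transaction(r, BINS) for r in records]
print(transactions)
```

Once every record is expressed as a set of such events, an ordinary categorical association rule algorithm can be applied to the resulting transactions.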

Quantitative Association Rule Applications | Authors |

2001 – Cybernetics | Ng, V.; Lee, J |

2001 – Database Management | Shragai, A.; Schneider, M |

2002 – Control and Automation | Tian Yongqing; Weng Yingjun |

#### 3.3.4. Distance-based association rules

Distance-based association rule mining can be applied to data mining and knowledge discovery in genetic, financial, retail and time sequence data, or in any domain in which distance information between items is important.

Distance Based Association Rule Applications | Authors |

2004 – Cardiology | Jeptha P. Curtis, Saif S. Rathore, Yongfei Wang and Harlan M. Krumholz |

2003 – Computational Statistics | Thomas Brendan Murphy |

1999 – Decision Support Systems | Daniel Boley, Maria Gini, Robert Gross |

### 3.4. Sequence discovery

Sequence discovery is similar to association analysis, except that the relationships among items are spread over time. In fact, most data mining products treat sequences simply as associations in which the events are linked by time (Edelstein, 1997). To find these sequences, not only must the details of each transaction be captured, but the actors involved must also be identified. Sequence discovery can also take advantage of the elapsed time between the transactions that make up the event.
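A minimal sketch of this idea is given below: events are grouped by actor, and a sequence "first, then second" is counted once per actor in whose history it occurs, optionally within a maximum elapsed time. The event log, the function name `sequence_support`, and the `max_gap_days` parameter are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical event log: (actor, date of transaction, item).
events = [
    ("c1", date(2003, 1, 5), "phone"),
    ("c1", date(2003, 1, 20), "charger"),
    ("c2", date(2003, 2, 1), "phone"),
    ("c2", date(2003, 2, 3), "charger"),
    ("c3", date(2003, 3, 1), "charger"),
    ("c3", date(2003, 3, 9), "phone"),
]

def sequence_support(events, first, second, max_gap_days=None):
    """Fraction of actors for whom `first` is later followed by `second`."""
    by_actor = defaultdict(list)
    for actor, when, item in events:
        by_actor[actor].append((when, item))
    if not by_actor:
        return 0.0

    def gap_ok(t1, t2):
        # Elapsed time between the two transactions, if a limit is given.
        return max_gap_days is None or (t2 - t1).days <= max_gap_days

    matches = 0
    for history in by_actor.values():
        firsts = [t for t, item in history if item == first]
        seconds = [t for t, item in history if item == second]
        if any(t1 < t2 and gap_ok(t1, t2) for t1 in firsts for t2 in seconds):
            matches += 1
    return matches / len(by_actor)

print(sequence_support(events, "phone", "charger"))
print(sequence_support(events, "phone", "charger", max_gap_days=7))
```

Without the actor identifier the two purchases could not be linked at all, which is why the actors must be captured along with the transaction details.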

Sequence Discovery Applications | Authors |

2003 – Cellular Networks | Haghighat, A.; Soleymani, M.R |

2003 – System Sciences | Ming-Yen Lin; Suh-Yin Lee |

2002 – Biochemistry | Gilles Labesse, Dominique Douguet, Liliane Assairi and Anne-Marie Gilles |

1998 – Chemical Biology | Molly B Schmid |

## 4. Data mining and data warehouses

A data warehouse is an integrated collection of data derived from operational data and primarily used in strategic decision making by means of online analytical processing techniques (Husemann et al., 2000). The data mining database may be a logical rather than a physical subset of the data warehouse, provided that the data warehouse DBMS can support the additional resource demands of data mining. If it cannot, a separate data mining database is the better choice.

Data warehouses emerged from the need to analyze large amounts of data together. In the 1990s, as large organizations began to need more timely data about their business, they found that traditional information systems technology was simply too cumbersome to provide relevant data efficiently and quickly. From this need, the data warehouse was born as a place where relevant data could be held for completing strategic reports for management. As with all technological development over the last half of the 20th century, the number and variety of databases increased. Many large businesses found themselves with data scattered across multiple platforms and variations of technology, making it almost impossible for any one individual to use data from multiple sources.

A key idea within data warehousing is to take data from multiple platforms and place them in a common location that uses a common querying tool. In this way operational databases can be held on whatever system is most efficient for the operational business, while the reporting and strategic information is held in a common location using a common language. Data warehouses take this a step further by giving the data itself commonality: each term is defined and its meaning is kept standard. All of this was designed to make decision support more readily available without affecting day-to-day operations.

One aspect of a data warehouse that should be stressed is that it is not a location for all of a business's data, but rather a location for data that is subject to research. In the last few years, corporate database producers have adopted data mining techniques for use on customer data, and this is an important part of CRM services today. Other application areas include production technologies, supply chain management, business management, computer integrated manufacturing, power engineering, web management, biology, oceanography, financial services, human resources, machinery fault diagnosis, biomonitoring, and banking.
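The idea of moving data from several platforms into a common location with standardized terms and a common query language can be sketched with an in-memory SQLite database. The two source systems, their field names, and the `revenue` table are hypothetical; a real warehouse would involve far more elaborate extraction and cleaning steps.

```python
import sqlite3

# Hypothetical operational sources with different field names and units.
sales_system = [
    {"cust": "A17", "amount_usd": 120.0},
    {"cust": "B02", "amount_usd": 80.0},
]
web_shop = [
    {"customer_id": "A17", "total_cents": 4550},
    {"customer_id": "C09", "total_cents": 1000},
]

# The warehouse defines one meaning per term: customer_id, amount in USD.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE revenue (customer_id TEXT, source TEXT, amount_usd REAL)"
)

# Each load step maps source-specific terms onto the warehouse definitions.
warehouse.executemany(
    "INSERT INTO revenue VALUES (?, ?, ?)",
    [(r["cust"], "sales_system", r["amount_usd"]) for r in sales_system],
)
warehouse.executemany(
    "INSERT INTO revenue VALUES (?, ?, ?)",
    [(r["customer_id"], "web_shop", r["total_cents"] / 100) for r in web_shop],
)

# One common query now spans both sources without touching the operational systems.
totals = dict(
    warehouse.execute(
        "SELECT customer_id, SUM(amount_usd) FROM revenue GROUP BY customer_id"
    )
)
print(totals)
```

The operational systems keep their own formats; only the copy in the warehouse is standardized, which is what keeps reporting from interfering with day-to-day operations.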

Data Mining in Data Warehouses | Authors |

2006 – Production Technologies | Pach, F.P.; Feil, B.; Nemeth, S.; Arva, P.; Abonyi, J. |

2005 – Supply Chain Management | Mu-Chen Chen and Hsiao-Pin Wu |

2005 – Business Management | Nenad Jukić and Svetlozar Nestorov |

2005 – CRM | Bart Larivière and Dirk Van den Poel |

2005 – Stock Market | Adam Fadlalla |

2005 – Computer Integrated Manufacturing | Ruey-Shun Chen; Ruey-Chyi Wu; Chang, C.C |

2005 – Customer Analysis | Wencai Liu; Yu Luo |

2005 – Power Engineering | Cheng-Lin Niu; Xi-Ning Yu; Jian-Qiang Li; Wei Sun |

2004 – Web Management | Sandro Araya, Mariano Silva and Richard Weber |

2004 – Biology | Junior Barrera, Roberto M. Cesar-Jr, João E. Ferreira and Marco D. Gubitoso |

2004 – Oil refinery | A. A. Musaev |

2004 – Real Estate | Wedyawati, W.; Lu, M. |

2004 – Electrical Insulation | Jian Ou; Cai-xin Sun; Bide Zhang |

2004 – Electrical Engineering | Wang, Z |

2003 – Management | Qi-Yuan Lin, Yen-Liang Chen, Jiah-Shing Chen and Yu-Chen Chen |

2003 – Corporate Databases | Nestorov, S.; Jukic, N. |

2002 – Oceanography | Nicolas Dittert, Lydie Corrin, Michael Diepenbroek, Hannes Grobe, Christoph Heinze and Olivier Ragueneau |

2002 – Financial Services | Zhongxing Ye; Xiaojun Liu; Yi Yao; Jun Wang; Xu Zhou; Peili Lu; Junmin Yao |

2002 – Corporate Databases | Hameurlain, A.; Morvan, F. |

2002 – Human Resources | Xiao Hairong; Zhang Huiying; Li Minqiang |

2002 – Medical Databases | Miquel, M.; Tchounikine, A |

2002 – Machinery Fault Diagnosis | Dong Jiang; Shi-Tao Huang; Wen-Ping Lei; Jin-Yan Shi |

2000 – Biomonitoring | A. Viarengo, B. Burlando, A. Giordana, C. Bolognesi and G. P. Gabrielides |

1999 – Banking | Gerritsen, R |

## 5. Conclusion

The purpose of data mining techniques is discovering meaningful correlations and formulations from previously collected data. Many different application areas utilize data mining as a means to achieve effective usage of internal information. Data mining is becoming progressively more widespread in both the private and public sectors. Industries such as banking, insurance, medicine, and retailing commonly use data mining to reduce costs, enhance research, and increase sales. In the public sector, data mining applications initially were used as a means to detect fraud and waste, but have grown to also be used for purposes such as measuring and improving program performance.

Some of the techniques, such as decision tree models, time series analysis and regression, were in use before the term data mining became popular in the computer science community. However, there are also techniques introduced by data mining practitioners in the last decade: Support Vector Machines, c-means clustering, the Apriori algorithm, etc.

Many application areas of predictive methods are related to medical fields and have become increasingly popular with the rise of biotechnology in the last decade. Most genetics research depends heavily on data mining technology; neural networks, classifiers and support vector machines will therefore continue to grow in popularity in the near future.

Descriptive methods are frequently used in finance, banking and the social sciences to describe a certain population, such as the clients of a bank or the respondents of a questionnaire. The most common descriptive technique is clustering; in the last decade the k-means method has lost popularity to the c-means algorithm. Another common method is association rule mining, where Apriori is by far the most preferred algorithm. With the increasing importance of corporate databases and information-centered production, the use of association rules continues to grow. Sequence discovery is also a growing field.

Another aspect discussed in this chapter was the use of data warehouses in conjunction with the techniques listed. Data warehousing and the use of data mining techniques are expected to become customary in the corporate world in the coming years. Data warehouses are regularly used by banks, financial institutions and large corporations. It would be unsurprising if they spread across industries and were adopted by medium-sized firms as well.