The K-Means Algorithm Evolution

Clustering is one of the main methods for gaining insight into the underlying nature and structure of data. The purpose of clustering is to organize a set of data into clusters such that the elements in each cluster are similar to each other and different from those in other clusters. One of the most widely used clustering algorithms at present is K-means, because its results are easy to interpret and it is easy to implement. The K-means clustering problem is NP-hard, which justifies the use of heuristic methods for its solution. To date, a large number of improvements to the algorithm have been proposed, of which the most relevant were selected using systematic review methodology. As a result, 1125 documents on improvements were retrieved, and 79 were left after applying inclusion and exclusion criteria. The improvements selected were classified and summarized according to the algorithm steps: initialization, classification, centroid calculation, and convergence. Notably, some of the most successful algorithm variants were identified. Some articles on recent trends were also included, concerning K-means improvements and the use of the algorithm in other areas. Finally, it is considered that the main improvements may inspire the development of new heuristics for K-means or other clustering algorithms.


Introduction
The accelerated progress of technology in recent times is fostering a significant increase in the amount of generated and stored data [1][2][3][4] in fields such as engineering, finance, education, medicine, and commerce, among others. Therefore, there is justified interest in extracting useful knowledge from those huge amounts of data, in order to help make better decisions and understand the nature of the data. Clustering is one of the fundamental techniques for gaining insight into the underlying nature and structure of data. The purpose of clustering is to organize a set of data into clusters whose elements are similar to each other and different from those in other clusters.
One of the most widely used clustering algorithms to date is K-means [5], because its results are easy to interpret and it is easy to implement. Another factor that has contributed to its use is the availability of implementations in the Weka and SPSS platforms and in open-source programming languages such as Python and R, among others.
It is convenient to point out that K-means is a family of algorithms that were developed in the 1950s as a result of independent investigations. These algorithms have in common four processing steps, with some differences in each step. It was in an article by MacQueen [6] where the name K-means was coined.
The K-means clustering problem has been proven to be NP-hard, which justifies the use of heuristic methods for its solution. According to the no free lunch theorem, no algorithm is superior to all other algorithms for all types of instances of such a problem. This has limited the design of a general algorithm for the clustering problem. For more than 60 years, a large number of variants of the algorithm have been proposed. Several surveys on K-means and its improvements exist. Among the classical surveys, [7] synthesizes K-means variants and their results, and [8] presents a historical review of continuous and discrete variants of the algorithm. In [9] several clustering methods and key aspects of clustering algorithm design are summarized, and a remarkable list of challenges and research directions on K-means is proposed. In [10] a review of theoretical aspects of K-means and its scalability for Big Data is presented. Unlike these surveys, this documentary research focuses on classifying the K-means improvements according to the algorithm steps. This classification is particularly useful for designing customized versions of K-means for solving certain types of problem instances. It also contributes to the knowledge of the most important improvements for each step and, in general, of the behavior of the algorithm.
For selecting the most relevant articles, systematic review methodology was used. The filters used and the analysis of results allowed finding some of the most successful and referenced algorithm variants for each step of the algorithm.
The chapter is organized as follows: Section 2 summarizes the pioneering works that originated the family of K-means-type algorithms; additionally, it describes the standard algorithm and the formulation of the clustering problem. Section 3 describes the application of systematic review methodology for retrieving the most important articles on K-means; it also includes the step or steps to which the improvements apply and tables that summarize the number of articles; lastly, it includes a subsection on the new trends on the use of K-means. Finally, Section 4 includes the conclusions, highlighting the most successful and referenced algorithm variants.

Origins of the family of K-means-type algorithms
During the decades of the 1950s and 1960s, several K-means-type algorithms were proposed. These proposals were developed independently by researchers from different disciplines [8]. These algorithms had in common a process that originated what is currently known as the K-means algorithm.

Lloyd (1957)
Stuart Lloyd, from Bell Laboratories, in the article titled "Least Squares Quantization in PCM" [12] approached the problem of transmitting a random signal X in a multidimensional space. Lloyd worked in the communications and electronics fields, and his algorithm is presented as a technique for pulse-code modulation.

MacQueen (1967)
James MacQueen, from the Department of Statistics of the University of California, in his article titled "Some Methods for Classification and Analysis of Multivariate Observations" [6], proposed an algorithm for partitioning an instance into a set of clusters such that the variance within each cluster is small. He coined the term K-means; the algorithm has also been known by other names: dynamic clustering method [13][14][15], iterative minimum-distance clustering [16], nearest centroid sorting [17], and h-means [18], among others.

Jancey (1966)
Jancey, from the Department of Botany, School of Biological Sciences, University of Sydney, in one of his articles titled "Multidimensional Group Analysis" [19], presented a clustering method for characterizing species Phyllota phylicoides. Jancey conducted his research in the field of taxonomy. There exists a variant of this method with similar characteristics, which was introduced by Forgy in the article "Cluster Analysis of Multivariate Data: Efficiency Versus Interpretability of Classification" [20]. The fundamental difference with respect to Jancey's work lies in the way in which the initial centroids are selected.
Because the results from Jancey's research will be used as reference for this chapter, his algorithm will be described in detail. The author stated that the similarity measures are based on the results published by the following authors: (a) Pearson in his article titled "On the Coefficient of Racial Likeness" [21], published in 1926; (b) Rao in the article titled "The Utilization of Multiple Measurements in Problems of Biological Classification" [22], published in 1948; and (c) Sokal in his article titled "Distance as a Measure of Taxonomic Similarity" [23], published in 1961.
Pearson [21] in his article "On the Coefficient of Racial Likeness," when studying craniology and physical anthropology, confronted the difficulty of comparing two types of races, in order to determine the membership of a limited number of individuals to one race or another or both. As a result, Pearson proposed a coefficient of racial likeness (CRL). For calculating this coefficient, it is necessary to obtain first the means and variances of each characteristic in each sample, since it is assumed that there is variability for each of the characteristics considered. This coefficient is used to measure the dispersion around the mean and the degree of association between two variables.
The article published by Radhakrishna Rao [22] in the Journal of the Royal Statistical Society, titled "The Utilization of Multiple Measurements in Problems of Biological Classification," aimed at presenting a statistical approach for two types of problems that arise in biological research. The first deals with the determination of an individual as member of one of the many groups to which he/she possibly might belong. The second problem deals with the classification of groups into a system based on the configuration of their different characteristics.
Sokal [23] published his article titled "Distance as a Measure of Taxonomic Similarity," which is based on the methods for quantifying the taxonomic classification process, and he points out the importance of having fast processing and data calculation methods. The purpose of his work is to evaluate the similarities among taxa based on observed characteristic values, instead of phylogenetic speculations and interpretations.
The similarity among objects is evaluated based on many attributes, and all the attributes are considered as equal taxonomic values; therefore, an attribute is not weighted more or less than any other.
For quantifying similarity, three types of coefficients are used: association, correlation, and distance, where the last one is of interest for this study. The distance coefficient is employed for determining the similarity between two objects by using a distance function in an n-dimensional space, in which the coordinates represent the attributes.
A measure of similarity between objects 1 and 2 based on two attributes would be the distance between the two objects in a two-dimensional space (i.e., a Cartesian plane). This distance $\delta_{1,2}$ can be easily calculated through the well-known formula from analytic geometry, Eq. (1):
$$\delta_{1,2} = \sqrt{(X_1 - X_2)^2 + (Y_1 - Y_2)^2} \qquad (1)$$
where $X_1$ and $Y_1$ are the coordinates of object 1 and $X_2$ and $Y_2$ are the coordinates of object 2.
Similarly, when three attributes are needed for two different objects, it is now necessary to carry out the distance calculation in a three-dimensional space so that the exact position of the two objects can be represented regarding the three attributes. For calculating the distance between these two objects, an extension to the three-dimensional space of the formula for $\delta_{1,2}$ can be applied. When more than three dimensions are needed for the objects, it is not possible to represent their positions using conventional geometry; therefore, it is necessary to resort to algebraic calculation of data. However, the formula for distance calculation from analytic geometry is equally valid for an n-dimensional space.
The general formula for calculating the distance between two objects with n attributes is shown in Eq. (2):
$$\delta_{1,2} = \sqrt{\sum_{i=1}^{n} (X_{i1} - X_{i2})^2} \qquad (2)$$
where $X_{ij}$ is the value of attribute i for object j (j = 1, 2). Once the object classification process is completed, the matrix of similarity coefficients obtained (based on object distances) can be used in the usual methods for clustering analysis.
Finally, it is important to emphasize the feasibility of calculating distance as the summation of the squared differences of the attribute values of objects of different kinds.
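The distance coefficient described above can be illustrated with a minimal Python sketch. The function name `euclidean_distance` and the sample coordinates are illustrative assumptions, not part of the cited works; the same function covers the two-attribute case of Eq. (1) and the n-attribute case of Eq. (2).

```python
import math

def euclidean_distance(a, b):
    """Distance between two objects given as equal-length attribute sequences."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    # Square root of the summed squared attribute differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two attributes (Cartesian plane) and four attributes use the same function.
print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))      # 5.0
print(euclidean_distance((1, 2, 3, 4), (1, 2, 3, 4)))  # 0.0
```

Because every attribute enters the sum with the same weight, the sketch also reflects the assumption that all attributes have equal taxonomic value.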
The clustering method proposed by Jancey consists of the following four steps:
1. Initialization. The initial centroids are selected.
2. Classification. Each object is assigned to the cluster whose centroid is closest to it.
3. Centroid calculation. New centroids are calculated using the mean value of the objects that belong to each cluster.
4. Convergence. The algorithm stops when equilibrium is reached, i.e., when there are no object migrations from one cluster to another. While no equilibrium is reached, the process is repeated from step 2.

K-means algorithm
K-means is an iterative method that consists of partitioning a set of n objects into k ≥ 2 clusters, such that the objects in a cluster are similar to each other and are different from those in other clusters. In the following paragraphs, the clustering problem related to K-means is formalized.
Let $N = \{x_1, \ldots, x_n\}$ be the set of n objects to be clustered by a similarity criterion, where $x_i \in \mathbb{R}^d$ for $i = 1, \ldots, n$ and $d \geq 1$ is the number of dimensions. Additionally, let $k \geq 2$ be an integer and $K = \{1, \ldots, k\}$. For a k-partition Ρ = $\{G(1), \ldots, G(k)\}$ of N, let $\mu_j$ denote the centroid of cluster G(j), for $j \in K$, and let $M = \{\mu_1, \ldots, \mu_k\}$ and $W = \{w_{11}, \ldots, w_{nk}\}$.
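Therefore, the clustering problem can be formulated as an optimization problem [24], which is described by Eq. (3):
$$P:\ \text{minimize } z(W, M) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij}\, d(x_i, \mu_j) \quad \text{subject to } \sum_{j=1}^{k} w_{ij} = 1,\ i = 1, \ldots, n; \quad w_{ij} \in \{0, 1\} \qquad (3)$$
where $w_{ij} = 1$ implies object $x_i$ belongs to cluster G(j) and $d(x_i, \mu_j)$ denotes the Euclidean distance between $x_i$ and $\mu_j$ for i = 1, …, n and j = 1, …, k.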
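As an illustration of the formulation, the following sketch evaluates the objective for a given assignment. It assumes that the objective is the sum, over all objects, of the Euclidean distance from each object to the centroid of its cluster, as the definitions of w_ij and d(x_i, μ_j) suggest. The function name `objective` and the encoding of W as a list of labels are assumptions made for the example.

```python
import math

def objective(objects, centroids, labels):
    """Sum of Euclidean distances from each object to its assigned centroid.

    labels[i] = j encodes w_ij = 1 (and w_il = 0 for every other l), so each
    object belongs to exactly one cluster by construction.
    """
    return sum(math.dist(x, centroids[j]) for x, j in zip(objects, labels))

# Hypothetical instance: four objects in two dimensions, two clusters.
objects = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
print(objective(objects, centroids, labels))  # 4.0
```

Storing one label per object instead of the full n × k matrix W is only a design choice for compactness; both representations describe the same partition.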
The standard version of the K-means algorithm consists of four steps, as shown in Figure 1.
The pseudocode of the standard K-means algorithm is shown in Algorithm 1.

Algorithm 1. Standard K-means (classification step)
# Classification:
For x_i ∈ N and μ_j ∈ M
    Calculate the Euclidean distance from each x_i to the k centroids;

Figure 1. Standard K-means algorithm.
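The four steps of the standard algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of any of the cited authors: the initial centroids are k objects drawn at random, an empty cluster keeps its previous centroid, and the names `kmeans`, `max_iter`, and `seed` are hypothetical.

```python
import math
import random

def kmeans(objects, k, max_iter=100, seed=0):
    """Batch K-means sketch: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialization: pick k distinct objects as initial centroids (an assumption).
    centroids = [tuple(x) for x in rng.sample(objects, k)]
    labels = [None] * len(objects)
    for _ in range(max_iter):
        # Classification: assign each object to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(x, centroids[j]))
            for x in objects
        ]
        # Convergence: stop when no object migrates between clusters.
        if new_labels == labels:
            break
        labels = new_labels
        # Centroid calculation: mean of the objects in each cluster.
        for j in range(k):
            members = [x for x, lab in zip(objects, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical instance with two well-separated groups.
centroids, labels = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(sorted(centroids))  # [(0.0, 0.5), (10.0, 10.5)]
```

Each pass of the loop computes all n × k object-centroid distances, which is the cost that the improvements to the classification step try to reduce.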
It has been shown that the clustering problem belongs to the NP-hard class for k ≥ 2 or d ≥ 2 [25,26]. Therefore, obtaining an optimal solution for an instance of moderate size is generally an intractable problem. Consequently, a variety of heuristic algorithms have been proposed for obtaining solutions as close as possible to the optimum of P, the most important of which are those designed as K-means-type algorithms [6].
It is important to emphasize that the establishment of useful gaps between the optimal solution of the problem P and the solution achieved by K-means remains an open research problem.
The computational complexity of K-means is O(nkdr), where r represents the number of iterations [8,9], which restricts its use for large instances, because each iteration involves the calculation of all the object-centroid distances. For reducing the complexity of K-means, numerous investigations have been carried out using different strategies for reducing the computational cost and minimizing the objective function.

Classification of articles on K-means improvements according to the algorithm steps
This section presents a classification of the most relevant articles on improvements to K-means regarding the algorithm steps. The articles were selected applying the systematic review methodology.

Systematic review process
The search for articles was carried out using four highly prestigious databases: Springer Link, ACM, IEEE Xplore, and ScienceDirect. Queries were issued to these databases using the following search string: As a result of the queries, 1125 articles were retrieved related to the K-means algorithm and its improvements. Next, inclusion and exclusion criteria were applied, which reduced the number of articles to 79. The remaining articles were classified according to the algorithm steps as shown in the following subsections.
The flow chart in Figure 2 shows the steps of the process carried out for selecting the articles. In step "a," the database queries were issued, and a document list was generated. In step "b," duplicate articles were identified and eliminated. In step "c," based on article titles, those irrelevant to this research were identified and discarded. In step "d," article abstracts were analyzed, and those with little affinity to the subject of study were excluded. In step "e," those documents written in languages other than English or Spanish were eliminated. In step "f," those articles that did not describe an improvement process were discarded. In step "g," the text of the articles was reviewed, and those with little affinity to the subject of study were excluded. In step "h," four articles were eliminated because of possible plagiarism. Finally, in step "i," articles with a small number of citations were discarded; specifically, those with citations below a threshold adjusted by year and category.
The K-Means Algorithm Evolution DOI: http://dx.doi.org/10.5772/intechopen.85447

Article classification
As a result of the analysis of the articles, those addressing an improvement to one of the algorithm steps were identified. However, several works were found that involved improvements to more than one step; therefore, the following groups or categories were defined: (a) initialization, (b) classification and centroid calculation, (c) convergence, (d) convergence and initialization, and (e) convergence and classification.
In Figure 3, the number of articles for each of the aforementioned groups is shown. Notice that the step with the most articles is initialization and the step with the least attention by researchers is convergence. In the following subsections, the most important articles in each group are briefly described.

Initialization
The initialization step has received the most attention from researchers, because the algorithm is sensitive to the initial position of the centroids; i.e., different initial centroids may yield different resulting clusters. Consequently, a good initial selection may lead to a better solution and reduce the number of iterations needed by the algorithm to converge.
For this step 38 documents were found about improvements proposed for generating better initial centroids. Table 1 summarizes information on the articles for this step. Column 1 shows the articles for this step. Columns 2 through 5 indicate the different strategies that researchers have used for obtaining improvements for this step. Finally, column 6 shows the number of articles for each of the strategies.
The second row shows articles on approaches that perform a preprocessing for generating the initial centroids by using particular algorithms or methods. The third row includes articles on methods based on information on data set. The fourth row shows articles on techniques that involve more effective data structures. Finally, the fifth row includes articles where the improvements use other strategies.

Introduction to Data Science and Machine Learning
In the rest of this subsection, some of the most important works on the initialization step are summarized. Several of these works mentioned can be used in other algorithms similar to K-means for selecting the initial centroids.
In [27] a clustering method is proposed, where the centroids of the final clusters are used as initial centroids for K-means. The main idea is to randomly select an object x, which is used as a first initial centroid; from this object, the following k − 1 centroids are selected considering a distance threshold set by the user.
In [28] a modification to Lloyd's work [12] is developed in the field of quantization. The main idea is that objects that are farther from each other have a larger probability of belonging to different clusters; therefore, the strategy proposed consists in choosing an object with the largest Euclidean distance to the rest of the objects for being the first centroid. The following i-th centroids (i = 2, 3, …, k) will be selected in decreasing order with respect to the largest distance to the first centroid.
In [29] two initialization methods are developed, which are aimed at being applied to large data sets. The proposed methods are based on the densities of the data space; specifically, they first need to divide the data space uniformly into M disjoint hypercubes and to randomly select $k N_m / N$ objects in hypercube m (m = 1, 2, …, M) for obtaining a total of k centroids.
In [30] a preprocessing is performed called refining. This method consists in using K-means for solving M samples of the original data set. The results of SSQ (sum of squared distances) are compared for each of the M solutions, and from the solution with the smallest value, the set of final centroids is extracted, which are used as the initial centroids for solving the entire instance using K-means.
In [31] a preprocessing method is proposed which uses a selection model based on statistics. In particular, it uses the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) for selecting the set of objects that will be used as initial centroids.
In [42] an algorithm is presented based on two main observations, according to which the more similar two objects are to each other, the greater the possibility that they end up assigned to the same cluster, so they are discarded from the selection of initial centroids. This method is based on density-based multiscale data condensation (DBMSDC) and allows identifying regions with large data densities; afterwards, a list is generated by sorting the objects by density. Next, an object is selected according to the sorted list as the first initial centroid, and all the objects that lie within a radius inversely proportional to the density of the selected object are discarded. Afterwards, the second centroid is selected as the next object in the list that has not been eliminated, and its surrounding objects are excluded. This process is repeated until all the initial centroids needed are obtained.
A variance-based method for finding the initial centroids is proposed in [44]. First, the method calculates the variances of the values for each dimension, and it selects the dimension with the largest variance. Next, it sorts the objects by the values on the dimension with the largest variance. Finally, it creates k groups with the same number of objects each, and for each group it calculates the median. The medians constitute the initial centroids.
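The variance-based procedure can be sketched as follows. The description does not say whether the median is taken only on the selected dimension or on every dimension; the sketch assumes a component-wise median over each group, and the function name `variance_based_init` is hypothetical. It also assumes at least k objects, so that no group is empty.

```python
from statistics import median, pvariance

def variance_based_init(objects, k):
    """Initial centroids from k equal-sized groups along the highest-variance dimension."""
    d = len(objects[0])
    # Dimension whose values have the largest variance.
    axis = max(range(d), key=lambda a: pvariance([x[a] for x in objects]))
    # Sort the objects by their value on that dimension.
    ordered = sorted(objects, key=lambda x: x[axis])
    n = len(ordered)
    centroids = []
    for g in range(k):
        # Split the sorted objects into k nearly equal contiguous groups.
        group = ordered[g * n // k:(g + 1) * n // k]
        # Component-wise median of the group (an assumption; see the lead-in).
        centroids.append(tuple(median(x[a] for x in group) for a in range(d)))
    return centroids

# Hypothetical data: the first dimension has the largest variance.
data = [(1, 0), (2, 0), (3, 0), (10, 1), (11, 1), (12, 1)]
print(variance_based_init(data, 2))  # [(2, 0), (11, 1)]
```

Unlike random selection, this initialization is deterministic, so repeated runs on the same data set start from the same centroids.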
Other researchers have focused their works on using information on data set, such as the distribution of objects and statistical information of them, among others.
In [45] a method is proposed for randomly generating the initial centroids as described next: the first centroid is randomly generated using a uniform probability distribution for the objects; subsequent centroids (i = 2, 3, …, k) are generated calculating a probability that is proportional to the square of their minimal distances to the set of previously selected centroids (1, …, i-1).
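The seeding procedure just described can be sketched with the standard library alone. The function name `kmeanspp_init` and the `seed` parameter are illustrative assumptions; the sketch also assumes that the data set contains at least k distinct objects, since otherwise all sampling weights become zero.

```python
import math
import random

def kmeanspp_init(objects, k, seed=0):
    """Pick k initial centroids by distance-proportional sampling."""
    rng = random.Random(seed)
    # First centroid: uniform probability over the objects.
    centroids = [rng.choice(objects)]
    while len(centroids) < k:
        # Squared distance from each object to its nearest chosen centroid.
        weights = [min(math.dist(x, c) ** 2 for c in centroids) for x in objects]
        # Next centroid: probability proportional to that squared distance.
        centroids.append(rng.choices(objects, weights=weights, k=1)[0])
    return centroids

# Hypothetical data: an already chosen object has weight zero, so it is never drawn again.
print(sorted(kmeanspp_init([(0, 0), (0, 0), (5, 5)], 2)))  # [(0, 0), (5, 5)]
```

Objects far from every chosen centroid receive large weights, which tends to spread the initial centroids across the data space.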
In [50] a method is proposed, which is based on a sample of the data set for which an average is calculated. Next, the objects whose distance is larger than the average are identified, and a distance-between-objects criterion is applied for selecting the objects that will constitute the initial objects. The authors claim that this method obtains good results regarding time and solution quality when solving large data sets.
In [55] a method is proposed for eliminating those objects that may cause noise, as well as outliers. The method determines the most dense region of the data space, from which it locates the best initial centroids.
By using particular data structures, in [59] a method is presented for estimating the data density in different locations of the space by using kd-tree type structures.
Other researchers [58,60] have used a combination of genetic algorithms and K-means for the initialization step; however, this method has high computational complexity, since it is necessary to execute the K-means algorithm on the entire set of objects of the population, for each of the generations of the genetic algorithm.

Classification
Of the four steps of the algorithm, classification is the most time-consuming, because it is necessary to calculate the distance from each object to each centroid.
The 33 articles that are the best proposals for this step are classified in Table 2. Column 1 shows the articles related to this step. Columns 2 through 6 indicate the different strategies that researchers have used for achieving improvements for this step. Finally, column 7 shows the number of articles for each of these strategies aiming at reducing the number of calculations of object-centroid distances.
The second row shows articles on approaches that use compression thresholds. The third row includes articles on methods that use information from the initialization step. The fourth row shows articles on techniques that involve more efficient data structures. The fifth row includes articles that present mathematical/statistical processes. Finally, the sixth row shows articles where the improvements use other strategies.
In [64] an improvement is proposed, which reduces the number of calculations of object-centroid distances. For this purpose, an exclusion criterion is defined based on the information of object-centroid distances in two successive iterations: i and i + 1. This criterion allows excluding an object x from the distance calculations to the rest of the centroids if its distance to its centroid in iteration i + 1 is smaller than in iteration i.
In [69] a heuristic is proposed, which reduces the number of objects considered in the calculations of object-centroid distances; i.e., the objects with small probability of changing cluster membership are excluded. The rationale behind this heuristic derives from the observation that objects closest to a centroid have a small probability of changing cluster membership, whereas those closer to the cluster border have a higher probability of migrating to another cluster. The heuristic determines a threshold for deciding which objects should be excluded. The threshold is defined as the sum of the two largest centroid shifts with respect to the previous iteration. Another work with a similar strategy is presented in [66].
In [72] an improvement is presented, which allows excluding from the calculation of object-centroid distances those objects in clusters that have not had object migrations in two successive iterations. This type of clusters is called stable, and the objects in such clusters keep their membership unchanged for the rest of the iterations of the algorithm. In the article [73], a similar strategy is presented.
In [74] an improvement to the fast global K-means algorithm is proposed, which is based on the cluster membership and the geometrical information of the objects. This work also includes a set of inequalities that are used to determine the initial centroids.
In [79] a heuristic is presented, which reduces the number of calculations of object-centroid distances. Specifically, it calculates the distances for each object only to those centroids of clusters that are neighbors of the cluster where the object belongs. This heuristic is based on the observation that objects can only migrate to neighboring clusters.
One of the most representative works for this step is presented in [82], where an improvement called the filtering algorithm (FA) is proposed, which uses a binary tree data structure called kd-tree. Each node of the tree is associated with a set of objects called a cell. An improvement of this work is described in [85], where the authors claim that it reduces execution time by 33.6% with respect to algorithm FA. Another remarkable improvement to the work in [82] is presented in [83].

Centroid calculation
Centroid calculation was defined as another step in this analysis, because there exist two variants for this step that differentiate two types of the K-means algorithm. In one of the types, centroid calculations are performed once all the objects have been assigned to one cluster. This type of calculation method is called batch and is used by [12,19], among others. The second type of calculation is performed each time an object changes cluster membership. This type of calculation is called linear K-means and was proposed by MacQueen. No documents were retrieved related to this step.
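The difference between the two variants can be illustrated with a sketch of the second one, in which a centroid is updated as soon as a single object joins or leaves its cluster, without recomputing the mean over all members. The function names are hypothetical, and the update rule shown is the standard incremental formula for a mean, used here as an assumption about how such an update could be carried out.

```python
def add_to_centroid(centroid, size, x):
    """Centroid and size after object x joins a cluster that had `size` objects."""
    new_size = size + 1
    # Move the mean toward x by 1/new_size of the difference.
    return tuple(c + (xi - c) / new_size for c, xi in zip(centroid, x)), new_size

def remove_from_centroid(centroid, size, x):
    """Centroid and size after object x leaves a cluster that had `size` > 1 objects."""
    new_size = size - 1
    # Move the mean away from x by 1/new_size of the difference.
    return tuple(c - (xi - c) / new_size for c, xi in zip(centroid, x)), new_size

# A cluster holding only (1.0, 1.0) receives (3.0, 5.0) and then loses it again.
c, n = add_to_centroid((1.0, 1.0), 1, (3.0, 5.0))
print(c, n)  # (2.0, 3.0) 2
c, n = remove_from_centroid(c, n, (3.0, 5.0))
print(c, n)  # (1.0, 1.0) 1
```

In the batch variant, by contrast, all centroids are recomputed as cluster means only after every object has been assigned.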

Convergence
Table 2. Summary of information on the articles for classification.
The convergence step of the algorithm has received little attention from researchers, as shown by the small number of papers on this subject. It is worth mentioning that in recent years, research on this step has produced very promising results concerning the reduction of algorithm complexity at the expense of a minimal reduction of solution quality. A pioneering work for this step was presented in [97]. The main contribution consisted in associating the values of the squared errors with the stop criterion of the algorithm. In particular, the proposed condition is to stop when, in two successive iterations, the value of the squared error in one iteration is larger than in the previous one, which guarantees that the algorithm stops at the first local optimum.
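The stop condition based on the squared error can be sketched as a check over the error values recorded in successive iterations. The function name `first_stop_iteration` and the sample values are hypothetical and serve only to show the rule.

```python
def first_stop_iteration(sse_per_iteration):
    """Index of the first iteration whose squared error exceeds the previous one.

    Returns None if the error never increases in the given sequence.
    """
    for i in range(1, len(sse_per_iteration)):
        if sse_per_iteration[i] > sse_per_iteration[i - 1]:
            return i
    return None

# Hypothetical squared-error values for five iterations.
print(first_stop_iteration([120.0, 80.0, 65.0, 66.5, 60.0]))  # 3
```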
Other articles for this step are [98,99], which use mathematical analysis to prove when the solution obtained reaches a global optimum.

Convergence and initialization
This subsection summarizes two works on improvements for the convergence and initialization steps.
In [100] the stop criterion is associated to the number k of clusters; i.e., in each iteration a new initial centroid is generated for creating a new cluster. This stop criterion is called incremental.
In [101] convergence is reached in two ways. In the first condition, the algorithm stops when it reaches a predefined number of iterations. In the second, the algorithm stops when there is no region with a density value larger than a predefined threshold. It is important to mention that in each iteration, the algorithm creates a new cluster guided by the density value in a region.

Convergence and classification
In this subsection two works are summarized, which present improvements for the convergence and classification steps.
In [102] a stop criterion is proposed, which stops the algorithm when, in ten consecutive iterations, the difference of the squared errors, between iterations i and i + 1, does not exceed a predefined threshold.
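A stop criterion of this kind can be sketched as follows. The function name `stop_after_plateau`, the `patience` parameter (ten by default, following the description), and the sample values are assumptions made for the example.

```python
def stop_after_plateau(sse_per_iteration, threshold, patience=10):
    """Iteration index at which `patience` consecutive error differences stayed within threshold.

    Returns None if no such run of iterations occurs.
    """
    run = 0
    for i in range(1, len(sse_per_iteration)):
        if abs(sse_per_iteration[i] - sse_per_iteration[i - 1]) <= threshold:
            run += 1
            if run == patience:
                return i
        else:
            run = 0  # a large change resets the count
    return None

# Hypothetical error values: one large drop followed by ten unchanged values.
sse = [100.0, 50.0] + [50.0] * 10
print(stop_after_plateau(sse, threshold=0.5))  # 11
```

Requiring several consecutive small differences, rather than a single one, avoids stopping on an isolated iteration in which the error happens to change little.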
The work presented in [103] proposes an optimization that integrates the core (classification) of the K-means algorithm with multiple kernel learning using least squares support-vector machines (LS-SVM). By using the Rayleigh coefficient, it optimizes the separation between groups. This algorithm reaches local convergence by obtaining the maximal separation among the centroids.

Trends
Preceding sections include articles published from the origins of the algorithm up to 2016. This section includes three types of articles: recently published articles on important improvements to K-means, articles that propose improvements to the algorithm implementation using parallel and distributed computing, and articles for new applications of the algorithm.
Regarding improvements to the steps of K-means, several recently published articles are summarized next. In [104] two algorithm improvements are proposed: one deals with the problems of outliers and noisy data, and the other deals with the selection of initial centroids. In [105] two problems are dealt with: the selection of initial centroids and the determination of the number k of clusters. In [106] a new measure of distance or similarity between objects is proposed. In [107] an improvement to the work in [69] is proposed, which defines a new criterion for reducing the processing time of the assignment of objects to clusters. This approach is particularly useful for large instances. In [108] a new stop criterion is proposed, which reduces the number of iterations. In [109] an improvement is proposed for the convergence step of the algorithm, aimed at solving large instances. The improvement consists of a new criterion that balances processing time and the quality of the solution. The main idea is to stop the algorithm when the number of objects that change membership is smaller than a threshold.
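The last criterion, stopping when few objects change membership, can be sketched by comparing the cluster labels of two successive iterations. The function names `migrations` and `should_stop`, as well as the sample labels and thresholds, are hypothetical.

```python
def migrations(prev_labels, curr_labels):
    """Number of objects whose cluster label changed between two iterations."""
    return sum(1 for a, b in zip(prev_labels, curr_labels) if a != b)

def should_stop(prev_labels, curr_labels, threshold):
    """True when fewer than `threshold` objects changed cluster membership."""
    return migrations(prev_labels, curr_labels) < threshold

# Hypothetical labels for four objects in two successive iterations.
prev, curr = [0, 0, 1, 1], [0, 1, 1, 1]
print(migrations(prev, curr))      # 1
print(should_stop(prev, curr, 2))  # True
```

With a threshold of 1 the rule coincides with the standard criterion of no migrations; larger thresholds trade some solution quality for fewer iterations.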
Recently, in the specialized literature, the parallelization of K-means has been proposed using the MapReduce paradigm [110,111], which makes it possible to process large instances efficiently. In [112] a method is proposed for parallelizing the K-means++ algorithm [45], which has shown good results for obtaining the initial centroids.
In recent years, a trend has been observed for modifying K-means oriented to new applications, in particular, its application to natural language and text processing. In this regard, one of the remarkable works is presented in [113], in which a modification to K-means is proposed for grouping bibliographic citations. Later, in [114] an improvement is proposed to the algorithm in [113], in order to solve recommendation problems. In [115,116] an improvement to K-means is proposed for the field of natural language.

Conclusions
This chapter presents three aspects of the K-means algorithm: (a) the works that originated the family of K-means algorithms, (b) a systematic review on the algorithm improvements up to 2016, and (c) some of the most recent publications that describe the prospective uses of the algorithm.
Regarding the origin of K-means, it is worth mentioning that it is not only an algorithm but a family of algorithms with the same purpose, which were developed independently in the decades of the 1950s and 1960s.
The systematic review process involved accessing four large databases, from which 1125 documents were retrieved. After applying inclusion and exclusion criteria, 79 documents remained.
Next, we mention the most important observations, organized by subject:
1. Initialization. Of the four steps of the algorithm, initialization is the step on which the largest number of investigations has focused. The reason for this interest is that the algorithm is highly sensitive to the initial positions of the centroids. Some of the most cited publications are [45,59].
2. Classification. Most of the works related to this step aim at reducing the number of calculations of object-centroid distances by applying heuristic methods. Some of the most cited works are [73,82]. Some promising and recently published articles are [72,107].
3. Centroid calculation. No documents were retrieved related to this step.
4. Convergence. It is remarkable that for this step the number of articles is very small. However, some recently published articles present very promising results by reducing the algorithm complexity without significantly decreasing the solution quality.