Open access peer-reviewed chapter

Partitional Clustering

By Uğurhan Kutbay

Submitted: October 27th 2017Reviewed: February 22nd 2018Published: August 1st 2018

DOI: 10.5772/intechopen.75836

Downloaded: 275

Abstract

People are living in a world full of data. Humans are collecting data from many measurements and observations in their daily works. The sorting of these numerous data is important and necessary in terms of analyzing, reasoning, and decision-making processes. For this reason, clustering has been used in many areas and has become very important in recent years. Feature selection and classifying the data in subsets can be changed data to data. As a result of these feature selection methods, some clustering methods have been revealed. Hierarchical clustering, partitional clustering, artificial system clustering, kernel-based clustering, and sequential data clustering are determined for different clustering strategies. This chapter examines some popular partitional clustering techniques and algorithms. Partitional clustering assigns a set of data points into k-clusters by using iterative processes. The predefined criterion function (J) assigns the datum into kth number set. As a result of this criterion function value in k sets (maximization and minimization calculation), clustering can be done. This chapter starts with criterion function for clustering process. In addition, some applications will be done for each algorithm in this chapter.

Keywords

  • partitional clustering
  • K-means
  • fuzzy clustering
  • C-FCM
  • genetic algorithm

1. Introduction

The choice of feature types and measurement levels depends on data type. For this reason, many clustering methods have been developed. According to clustering strategies, these methods can be classified as hierarchical clustering [1, 2, 3], partitional clustering [4, 5], artificial system clustering [6], kernel-based clustering and sequential data clustering. This chapter examines some popular partitional clustering techniques and algorithms.

In contrast to hierarchical clustering methods, partitional clustering aims successive clusters using some iterative processes. Partitional clustering assigns a set of data points into k-clusters by using iterative processes. In these processes, n data are classified into k-clusters. The predefined criterion function J assigns the datum into kth number set according to the maximization and minimization calculation in k sets.

Figure 1 represents the hierarchical clustering and partitional clustering. In addition, hierarchical clustering, all sub-clusters defined in another sub-cluster shown in Figure 1. Figure 1a represents the raw data, Figure 1b shows the partitional clustering and Figure 1c represents the hierarchical clustering. In hierarchical clustering, raw data are firstly clustered in some subgroups (three-clustered shape). After that procedure, subgroups hierarchically defined in two green clusters. Last procedure includes all these clusters which are defined in the union set.

Figure 1.

Clustering techniques: (a) data set; (b) partitional clustering; and (c) hierarchical clustering.

This chapter starts with an introduction to clustering criteria and continues with K-Means algorithm, different fuzzy clustering techniques and genetic algorithm-based clustering.

2. Clustering criteria

Clustering aims to seek a partition of the data in the same homogenous clusters [7]. This homogeneity and finding the exact clustering are evaluated only by using criterion functions. One of the most popular techniques of the criterion function is the summing of squared-error (SSE) criterion and similar methods are used such as mean-square-error (MSE), normalized mean-square-error (NMSE), and so on. Sum of squared error criterion can be defined as:

JΓΜ=i=1Kj=1Nγijxjmi2=i=1Kj=1NγijxjmiTxjmiE1

where Γis a partition matrix of γijand defined as;

γij=1ifxjclusteri0otherwiseE2

M is the cluster prototype or centroid matrix and midefined as;

mi=1Nii=1Nγijxjis the sample mean for the ith number cluster corresponding to Niobjects.

The partition minimizes the sum-square-error (SSE). When the SSE regarded is minimum, minimum variance partition will be achieved. As a result of this calculation, optimum cluster is determined.

3. K-means algorithm

The K-Means clusters were first developed by Mac Queen [8]. In the K-Means clusters, clusters are formed using Euclidean distance. In the K-Means algorithm, unsupervised learning is used and k classes are created which minimize the error function [9].

In K-Means clustering, k cluster centers are created from the selected data set. It is then placed at the nearest cluster using Euclidean distance. New cluster centers are formed according to the results of the clustering. From the calculations of the clustering, the cluster center is recalculated. The arithmetic average is used as the calculation method, and the new cluster center is determined. All samples are reclassified according to the new center. This process is repeated until it is determined that the samples in the set have not passed to another set.

The partitioning of the k pieces of data x is represented by the minimization of the J parameter as in (3).

J=minkxckwxdistxokE3

If the data are classified in a cluster near the center of the nearest cluster, the J value will be the minimum. If the data x are classified in the kth number cluster, the value can be optimized by changing the weighting value of wxto obtain the minimum J value. distxokis the notation which represents the distance function. In this formula, x represents the pixel data, okis the center of the cluster. k sets are shown as in (4).

ck=ck,1ck,2ck,nE4

The distance function can be calculated as in Euclidean distance for k clusters in the next formula (5).

distxjiok=xj,1iok,12+xj,1iok,22++xj,niok,n2E5

The iterative operations are repeated as in the flow diagram in Figure 2. This flowchart represents the iterative distance controlled operations to minimize the J parameter.

Figure 2.

Flow chart of the K-Means algorithm.

When this algorithm is applied for three-cluster for the original image shown in Figure 3a, this image can be represented in three clusters as in Figure 3b–d, respectively. Cluster-1 shows the sea for the given landscape, cluster-2 shows the the green areas in the landscape and cluster-3 shows the roads for the given landscape.

Figure 3.

K-means algorithm results. (a) Original image; (b) K-Means Cluster No:1; (c) K-Means Cluster No:2; and (d) K-Means Cluster No:3.

4. Fuzzy clustering

Fuzzy theory is firstly developed by Zadeh [10] for defining adjustable degrees of memberships. Fuzzy theory creates intermediate sets rather than classical sets. In classical sets, each data item is assigned into only one cluster. In contrast, data in fuzzy clusters can be represented in multiple clusters. This multiset assignment can belong to all the clusters with a certain degree of membership defined by Bezdek [11]. This one item in multiset representation can be useful for sharply separated cluster boundaries.

The fuzzy C-Mean algorithm (FCM) is frequently used because of its ease of operation and reliability in many applications [12, 13, 14, 15]. Conventional Fuzzy C-Mean (FCM) works with the principle of minimizing the objective function [16] shown in the following formula (6):

JFCM=i=1ck=1nuikmdik2E6

where uikmis the membership function in the range [0,1]. This membership represents the membership degree of xk for the kth pixel. k is defined in the range of k1nfor the ith numbered cluster. Total cluster size (c) given in the range of 1c. In the formula (7), dik2represents the distance between xk and ith-cluster center (vi).

dik2=xkviE7

The fuzzy set theory aims that the membership function is i=1cuik=1for each pixel. Using the FCM membership function uikmand cluster center vi, FCM targets to reach local minimums by using the equivalence (8) and (9), respectively.

uik=j=1cdikdjk2/m11i1candk1nE8
vi=k=1nuikmxkk=1nuikmi1cE9

FCM algorithm flow chart is shown in Figure 4. This algorithm will be done in the given iteration step specified in this traditional clustering method. The cluster centers are updated in each iteration step to calculate the membership function.

Figure 4.

Flow chart of the FCM algorithm.

When the FCM algorithm is applied for the given original image shown in Figure 5a for three clusters, this image can be represented in three clusters as in Figure 5b–d, respectively. This algorithm can only be applied for gray-scaled images. There is no crisp clustering for the given landscape image. This image includes many objects in different colors, and these colors are generally indented in these objects.

Figure 5.

FCM algorithm results. (a) Original image; (b) FCM Cluster No:1; (c) FCM Cluster No:2; and (d) FCM Cluster No:3.

FCM algorithm includes iterative procecesses. For the given image at the end of the 55 iteration steps, clustering processes are completed for the given threshold value which is 10−5 for FCM distance change.

5. Colored FCM

Colored Image Fuzzy C-Mean (C-FCM) involves color-based clustering using fuzzy sets. This 3D method is firstly given by Kutbay and Hardalaç [17] as Robust Colored Image FCM (RCI-FCM), but this presented method is different from RCI-FCM. In RCI-FCM, distances are calculated for each R, G and B channels, but in this method, the mean distance is calculated for RGB color spaces. This method represents the RGB color formed images in FCM, which uses FCM-based algorithm in colored images. The membership function calculates the centroids of the clusters for each R, G and B color spaces. After calculation of Euclidian distance for each channel, the mean distance could be calculated shown in equivalence (10).

dikRGB2=meanxkRviRxkGviGxkBviBE10

Membership for each cluster can be calculated in the following formula (11).

uikRGB=j=1cdikRGBdjkRGB2/m11i1candk1nE11

where uikRGBrepresents the membership degree for mean distance for each color. In this representation for each item, cluster’s membership can be calculated and statistically membership could be defined for each cluster.

New cluster center can be calculated in the following equivalence (12) for each cluster:

viRGB=k=1nuikRGBmxkRGBk=1nuikRGBmi1cE12

where uikRGBis the membership function of each RGB color pixel. This function calculates for each R, G and B colors, for the each RGB space representation. c denotes the clusters for each n-pixel, and dikRGBexplains the distance between pixel xkand the centroid vifor ith number cluster for each color domain. viRGBrepresents the centers of the clusters for each RGB pixel value into the c-cluster. C-FCM algorithm’s flow chart is shown in Figure 6. The proposed method aims to create c3 cluster for the given stopping criteria. New cluster centroids are calculated from the results of the membership function for all RGB colors.

Figure 6.

Flow chart of the C-FCM algorithm.

C-FCM algorithm flow chart is shown in Figure 6. The flowchart shows the C-FCM algorithm for colored images. For each iteration step, RGB distances are calculated and cluster centers are calculated. After updating the cluster centers in the given threshold value, each item will be assigned into the given cluster.

Figure 7 represents the C-FCM algorithm, which is applied for the given original image shown in Figure 7a. For each color, pixels assigned into two cluster. For RGB color space 23 cluster are created for this image. These clusters can be given in eight clusters as in Figure 4b–i, respectively.

Figure 7.

C-FCM algorithm results. (a)Original image; (b)–(i). C-FCM Cluster No:1–8.

C-FCM algorithm includes iterative procecesses. For the given image at the end of the 26 iteration steps, clustering processes are completed for the given threshold value which is 10−5 for C-FCM distance change.

6. Genetic clustering

Genetic algorithm is very popular method in evolutionary computation processes. This method is firstly developed by Holland in 1975 [18]. This algorithm includes natural evolutionary processes. This method optimizes a population of the structure by using a set of evolutionary operators.

This method maintains a population of structures and these structures consisting of individuals. Each individual is evaluated by a function named as fitness function. These processes includes selection, recombination and mutation processes.

In genetic algorithms (GAs), each individual represents a candidate solution in binary form. This individual called as chromosome. After an initial population is generated, randomly crossover and mutation processes are executed for each iteration step.

For genetic algorithm examination, the following terms are useful for describing the concept of genetic algorithms. These are gene, chromosome, population (mass), reproduction process, and conformity value.

Gen is a unit that carries partial information. By bringing together these units, the chromosomal sequence that forms the solution cluster comes into play; for this reason, the genome decides well how to code it.

Chromosomes are structures that contain the information of the problem solving. Population is formed by the combination of chromosomes. At the initiative of the designer, what information is to be found on the chromosome.

The population is called the heap of possible solutions. In the GA process, the population size remains constant, but the bad chromosomes separate from the stack. The size of the heap is a very important concept, which must be well established, as the overcrowded heap increases the time of possible heuristic approach, while the small heap may cause no possible solution at all.

The reproduction process is the process of selecting the sequences to be transferred from the current stack to the next stack. The sequences carried are genetically the most appropriate sequences. The requirement for transition is whether the level of conformity specified has been achieved.

The fitness value, in genetic algorithms, which specifies which index will transfer the next generation, which index will be lost. Conformity value reflects the purpose of the problem.

Bandyopadhyay and Maulik [19] attempted to use GA to automatically determine the number of clusters K in 2002. The GA clustering aims assigning the data into kth cluster using genetic processes.

Showing the basic concept of the GA-based clustering, genetic guided algorithm [20] is used. Fitness value is represented in fuzzy clusters as shown in equivalent (13);

jM=j=1Ni=1KDij1/1m1mE13

where Diji1Kandj1N. Dijrepresent the distance between the ith cluster prototype vector and the data object of the number of j. m represents the fuzzification coefficient. Calculation process of the genetic algorithm is shown in Figure 8.

Figure 8.

Flow chart of the genetic clustering.

In this genetic parameter optimized algorithm, firstly P individuals are initialized, and each individual represents K x d prototype matrix encoded as gray codes. After this prototype matrix representation, fitness values are calculated. After this calculated fitness value, tournament selection is used for parental member reproduction. For generating new parents, two-point crossover and bitwise mutation are done. For these new individuals, the highest fitness values are obtained using this fitness equation. All these processes are applied until termination condition (Dij < δ) is satisfied.

Chromosomes are randomly selected, and best parent are selected in tournament selection process. After tournament selection process, two-point crosover process and bitwise mutuation are done, respectively. These processes are shown in Figures 9 and 10, respectively.

Figure 9.

Two-point crossover process.

Figure 10.

Mutation process.

Figure 9 represents crossover process. In two-point crossover process, bitwise crossover points are determined. After determination process, crossover process will be done.

There are some crossover processes are used in different works [21, 22, 23]. These types are 1-point crossover, K-point crossover, shuffle crosover, and so on [24].

Mutation process helps the genetically variations from parent chromosomes to child chromosomes. There are many mutation operators defined in the literature. These are bit-flip-mutation, swap-mutation and inversion-mutation [25]. In mutation process, mutation points are given. In this example, three mutation points are determined shown in Figure 10. After mutation process, mutation points will be changed. If the mutation rate is high, the main generation may be lost. In general terms, the rate of mutation in GA is between given 0.05 and 0.15% of the parent.

GA-based fuzzy clustering process in four clusters shown in Figure 11. GA is applied for the given original image shown in Figure 11a for four clusters. This image can be represented in four clusters as in Figure 11b–e, respectively. This algorithm can only be applied for gray-scaled images. Cluster-1 shows the roads for the given landscape, cluster-2 shows the green areas in the landscape, cluster-3 shows the sea for the given landscape and cluster-4 represents the air for the given landscape.

Figure 11.

Genetic algorithm results. (a) Original image, (b) Cluster No:1, (c) Cluster No:2, (d) Cluster No:3, and (e) Cluster No:4.

7. Conclusion

In this part, we discussed the fundamental partitional clustering algorithms and its applications. Partitional clustering produces a partition using K clusters but do not use hierarchical structures. This type of clustering process uses criterion function in a range of error such as mean square error, sum of squared error, and so on. For this reason, partitional clustering is an iterative process.

Some search techniques which include evolutionary process, provide seeking the global or local optima. Search techniques introduce more parameters than K-Means clustering process. There is no theoretical approval for selecting of the best approach. All the given methods can give the best results for different data.

The partitional clustering algorithms need high computational requirements. High computational requirements means that some complex algorithms need advanced operations. These requirements limit their applications in large data. Overcoming this situation, high speed and high-capacity memory-sized computers can be used for clustering processes.

Another method for overcoming high computational requirements is the genetic optimization. For large datasets, genetic optimization should be used with fuzzy clustering. In this technique, all individuals or pixels are genetically optimized so as to fuzzy sets.

Clusters cannot be always sharply separated in some datasets. For this type of data sets, fuzzy-based clustering overcomes this crisp clustering. Fuzzy clustering allows to find the nearest optimum cluster to assign into the clusters. This technique provides more information about the data structure.

The most important point of the search techniques of the partitional clustering is the optimum parameter selection. Parameter selection is an optimization problem. Overcoming this optimization problem, parameter selection can be done by using genetic algorithms. Genetic algorithms can be useful solution for very-large scale data sets.

In partitional clustering, determination of cluster size is important. This selection differs from the data sets to data sets. If data set includes more features to classify in a cluster, more clusters will be needed. This cluster number unfortunately is not known for many clustering problems. Generally experiences give the cluster number. Estimation of the cluster number is one of the major problems for validation.

As a result, by using partitional clustering techniques, image understanding can be done by using the given techniques. For many applications such as biomedical image understanding, robotic applications and security applications, these techniques can be useful for pre-processing of some algorithms.

Acknowledgments

This study was performed at Gazi University, Engineering Faculty, Electrical & Electronics Engineering Department. MATLAB® 2016b software platform was used for this study. The study was performed in a computer with 12 GB RAM, Intel i5 processor at 2.6 Ghz.

Competing interests

The author declares having no conflicts of interests.

Thanks

I would like to thank my wife Hande Kutbay who has given me a lot of support in my life. I thank my daughter, Gökçe Kutbay, who makes life difficult but makes sense. Special thanks to my mother Cemile KUTBAY and my father Hür Uğur KUTBAY for being in my life. In addition, I would like to thank Assoc. Prof. Dr. Fırat Hardalaç due to academic help.

© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Link to this chapter Copy to clipboard

Cite this chapter Copy to clipboard

Uğurhan Kutbay (August 1st 2018). Partitional Clustering, Recent Applications in Data Clustering, Harun Pirim, IntechOpen, DOI: 10.5772/intechopen.75836. Available from:

chapter statistics

275total chapter downloads

1Crossref citations

More statistics for editors and authors

Login to your personal dashboard for more detailed statistics on your publications.

Access personal reporting

Related Content

This Book

Next chapter

Incorporating Local Data and KL Membership Divergence into Hard C-Means Clustering for Fuzzy and Noise-Robust Data Segmentation

By Reda R. Gharieb

Related Book

First chapter

Concepts, Historical Milestones and the Central Place of Bioinformatics in Modern Biology: A European Perspective

By T.K. Attwood, A. Gisel, N-E. Eriksson and E. Bongcam-Rudloff

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

More About Us