Open access

Ensemble Clustering for Biological Datasets

Written By

Harun Pirim and Şadi Evren Şeker

Published: 28 November 2012

DOI: 10.5772/49956

From the Edited Volume

Bioinformatics

Edited by Horacio Pérez-Sánchez

1. Introduction

Recent technologies and tools have generated massive amounts of data in the bioinformatics domain. For example, microarrays measure the expression levels of tens of thousands of genes simultaneously on a single chip. The measurements are relative expression values for each gene, obtained through an image processing task.

Biological data require both low- and high-level analysis to reveal significant information that can shed light on biological questions such as disease prediction and the annotation of gene function, and can guide new experiments. In that sense, researchers look for the effect of a treatment or of a change over a time course. For example, they may design a microarray experiment in which a biological organism is treated with a chemical substance, and compare the gene expression values with those measured before treatment. Such a treatment or change focuses attention on groups of genes or other biological molecules that have significant relationships with each other under similar conditions. Gene class labels are usually unknown, since little information is available about the data. Hence, data analysis using an unsupervised learning technique is required. Clustering is an unsupervised learning technique used in diverse domains, including bioinformatics. Clustering assigns similar objects to the same cluster based on a cluster definition or criterion, typically the similarity between the objects. The idea is that one needs to find the most important groups among the many present in the data. Therefore, clustering is widely used to obtain biologically meaningful partitions. However, there is no single best clustering approach for the problem at hand, and clustering algorithms are biased towards certain criteria. In other words, a particular clustering approach has its own objective and its own assumptions about the data.

The diversity of clustering algorithms can be exploited by merging individually generated partitions. Ensemble clustering provides a framework to merge individual partitions from different clustering algorithms, and may generate more accurate clusters than the individual clustering approaches. Here, an ensemble clustering framework is implemented as described in [10] to aggregate results from the K-means, hierarchical clustering, and C-means algorithms. We employ C-means instead of the spectral clustering used in [10], and we also use different datasets: two different biological datasets are used with each algorithm. A comparison of the results is presented. In order to evaluate the performance of the ensemble clustering approach, one internal and one external cluster validation index are used: silhouette (S) [31] is the internal validation index and C-rand [23] is the external one. The chapter reviews some clustering algorithms and ensemble clustering methods, and then presents implementation and conclusion sections.

2. Clustering algorithms

Clustering biological data is very important for the identification of co-expressed genes, which facilitates functional annotation and the elucidation of biological pathways. Accurate predictions can serve as a guide for targeting further experiments and generating additional hypotheses. Furthermore, accurate predictions can facilitate the identification of disease markers and targets for drug design [4]; clustering can also be used to determine whether certain patterns exist near viral integration sites [16].

Current algorithms used in gene clustering have some drawbacks. For example, the K-means algorithm is sensitive to the noise inherent in gene expression data. In addition, the solution (i.e., the final clustering) that the K-means algorithm finds may not be a global optimum, since it depends on the randomly chosen initial objects. Nevertheless, K-means-based methods are prevalent in the literature, such as [12, 17, 33]. K-means works on randomly chosen centroid points that represent the clusters. The objects are assigned to the closest clusters based on their distance to the centroid points. For example, the dataset illustrated in Figure 1 is assigned two centroids.

Figure 1.

The dataset and two centroid points

The distance from each object in the dataset to both centroid points is calculated, and the objects are assigned to the cluster represented by the closest centroid point, as seen in Figure 2. Then new centroid points are calculated for the clusters, and objects are reassigned to the closest clusters based on their distance to the new centroid points. The recalculation of centroids and the reassignment of objects continue until the centroid points remain the same, as in Figure 3.
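
To make the iteration concrete, the following is a minimal Java sketch of the K-means loop: an assignment step followed by a centroid-update step, repeated until no object changes cluster. The toy data, the choice of k = 2, and the random initialization are illustrative assumptions; the base partitions in Section 4 are actually produced with R.

    import java.util.Arrays;
    import java.util.Random;

    public class KMeansSketch {

        public static void main(String[] args) {
            // Toy two-dimensional dataset; assumed for illustration only.
            double[][] data = {{1.0, 2.0}, {1.5, 1.8}, {5.0, 8.0}, {8.0, 8.0}, {1.0, 0.6}, {9.0, 11.0}};
            int k = 2;
            Random rnd = new Random(42);

            // Step 1: pick k randomly chosen objects as the initial centroids.
            double[][] centroids = new double[k][];
            for (int c = 0; c < k; c++) centroids[c] = data[rnd.nextInt(data.length)].clone();

            int[] labels = new int[data.length];
            boolean changed = true;
            while (changed) {   // iterate until the assignments stop changing
                changed = false;
                // Assignment step: each object joins the cluster of its closest centroid.
                for (int i = 0; i < data.length; i++) {
                    int best = 0;
                    for (int c = 1; c < k; c++)
                        if (dist(data[i], centroids[c]) < dist(data[i], centroids[best])) best = c;
                    if (labels[i] != best) { labels[i] = best; changed = true; }
                }
                // Update step: recompute each centroid as the mean of its members.
                for (int c = 0; c < k; c++) {
                    double[] sum = new double[data[0].length];
                    int count = 0;
                    for (int i = 0; i < data.length; i++) {
                        if (labels[i] != c) continue;
                        count++;
                        for (int d = 0; d < sum.length; d++) sum[d] += data[i][d];
                    }
                    if (count > 0)
                        for (int d = 0; d < sum.length; d++) centroids[c][d] = sum[d] / count;
                }
            }
            System.out.println(Arrays.toString(labels));
        }

        static double dist(double[] a, double[] b) {
            double s = 0;
            for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
            return Math.sqrt(s);
        }
    }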

Another method, the self-organizing map (SOM), is a machine-learning technique widely used in gene clustering; a recent study is [14]. However, SOM requires a grid structure for the map to be specified as input, which limits its effectiveness.

Figure 2.

Initial clusters

Figure 3.

Iteration of K-means

Hierarchical clustering (HC) algorithms are also widely used and are of two types: agglomerative and divisive. In the agglomerative approach, every object starts in its own cluster, and clusters are merged until all objects are in the same cluster, as seen in Figure 4. Two important drawbacks of HC algorithms are that they are not robust and that they have high computational complexity. HC algorithms are "greedy", which often means that the final solution is suboptimal: locally optimal choices made in the initial steps turn out to be poor choices with respect to the global solution. A recent study is [26].
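
As an illustration, here is a minimal Java sketch of agglomerative clustering with complete linkage (the linkage later used in Section 4): starting from singleton clusters, the two closest clusters are repeatedly merged and each merge height is printed. The small distance matrix is an illustrative assumption.

    import java.util.ArrayList;
    import java.util.List;

    public class AgglomerativeSketch {

        public static void main(String[] args) {
            // Precomputed pairwise distances for four objects; assumed for illustration only.
            double[][] d = {
                {0.0, 1.0, 4.0, 5.0},
                {1.0, 0.0, 3.0, 4.0},
                {4.0, 3.0, 0.0, 2.0},
                {5.0, 4.0, 2.0, 0.0}};

            // Every object starts in its own cluster.
            List<List<Integer>> clusters = new ArrayList<>();
            for (int i = 0; i < d.length; i++) {
                List<Integer> c = new ArrayList<>();
                c.add(i);
                clusters.add(c);
            }

            // Merge the two closest clusters until only one cluster remains.
            while (clusters.size() > 1) {
                int bestA = 0, bestB = 1;
                double bestDist = Double.MAX_VALUE;
                for (int a = 0; a < clusters.size(); a++)
                    for (int b = a + 1; b < clusters.size(); b++) {
                        // Complete linkage: cluster distance is the maximum pairwise distance.
                        double link = 0;
                        for (int i : clusters.get(a))
                            for (int j : clusters.get(b))
                                link = Math.max(link, d[i][j]);
                        if (link < bestDist) { bestDist = link; bestA = a; bestB = b; }
                    }
                System.out.println("merge " + clusters.get(bestA) + " + " + clusters.get(bestB)
                        + " at height " + bestDist);
                clusters.get(bestA).addAll(clusters.get(bestB));
                clusters.remove(bestB);
            }
        }
    }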

Graph-theoretical clustering techniques also exist, in which the genomic data are represented by the nodes and edges of a graph. Network methods have been applied to identify and characterize various biological interactions [13]. Identification of clusters using networks is often intractable; that is, finding an optimal partition of a graph is an NP-hard problem [1]. NP-hard problems are those that are at least as hard as NP-complete problems; NP-complete problems are those in NP to which every problem in NP can be reduced in polynomial time. Some examples of graph theory-based clustering approaches are [30] and [24].

Figure 4.

Agglomerative approach

Model-based clustering approaches use probability distributions to model the distribution of gene expression data. However, gene expression data do not follow a single known distribution. Some examples are given in [19] and [34].

Sub-space clustering (biclustering) methods, which employ the reasoning that a gene may belong to multiple pathways or to none, are also used in the literature, as in [28]. There are also optimization-based algorithms as in [15], spectral algorithms as in [25], fuzzy algorithms as in [32], and meta-heuristics as in [18] used for clustering genomic data.

3. Ensemble clustering

Combining diverse partitions from different clustering algorithms may yield high-quality and robust clusters, since ensemble approaches such as bagging and boosting have proven effective for classification problems [22]. The fact that objects have various features makes it difficult to find an optimal clustering of similar objects; in other words, objects may be grouped based on different features such as size, color, and age. In that sense, ensemble clustering is a promising heuristic for combining results based on different features.

Figure 5 represents a clustering ensemble framework: the CAs are clustering algorithms, the Ps are the partitions they generate, N is the number of clustering algorithms (and hence of partitions), FC is the consensus function, and CP is the consensus partition.

Figure 5.

Ensemble clustering framework

Ensemble clustering requires the following tasks [2]:

  1. selection of base clustering algorithms

  2. definition of a consensus function

  3. merging of individual partitions by the chosen consensus function

[2] apply an ensemble approach to clustering scale-free graphs. They use metrics based on the neighborhood (which uses the adjacency list of each node and treats nodes with several common neighbors as similar), the clustering coefficient, and the shortest-path betweenness of nodes in the network. The scale-free graph used in the study comes from a budding yeast PPI network containing 15,147 interactions among 4,741 proteins. Based on their preliminary results, it is reported that ensemble clustering can improve cluster quality for scale-free graphs. [3] propose an ensemble clustering framework to extract biologically relevant functional modules from protein-protein interaction (PPI) networks. Their method attempts to handle the noisy false-positive interactions and the specific topological structure of the network. The method uses graph clustering algorithms (repeated bisections, direct k-way partitioning, and multilevel k-way partitioning) to obtain the base partitions, and utilizes two topological distance matrices: one based on the clustering coefficient [36], and the other generated using the betweenness measure [29]. The study demonstrates a soft ensemble method in which proteins are allowed to be assigned to more than one cluster. Empirical evaluation of the different ensemble methods in the study shows the superior performance of the proposed ensemble framework.

Fuzzy clustering algorithms are widely used, with well-understood properties and benefits, in various applications. Nevertheless, there has been very little analysis of fuzzy clustering algorithms as generators of the base partitions in cluster ensembles. [35] compares the hard and fuzzy C-means [7] algorithms in the well-known evidence-accumulation framework of cluster ensembles. The study observes that the fuzzy C-means approach requires far fewer base partitions for the cluster ensemble to converge, and is more tolerant of outliers in the data.
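
For reference, fuzzy C-means [7] replaces hard assignments with membership degrees. The sketch below shows the standard membership and center updates for fuzzifier m = 2; the toy data, initial centers, and fixed iteration count are illustrative assumptions (the C-means runs in Section 4 use the R e1071 package).

    public class FuzzyCMeansSketch {

        public static void main(String[] args) {
            // Toy data and initial centers; assumed for illustration only.
            double[][] x = {{1.0, 1.0}, {2.0, 1.0}, {8.0, 8.0}, {9.0, 9.0}};
            double[][] v = {{1.5, 1.5}, {8.5, 8.5}};  // initial cluster centers
            double m = 2.0;                            // fuzzifier (m > 1)
            double[][] u = new double[v.length][x.length];

            for (int iter = 0; iter < 20; iter++) {
                // Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2 / (m - 1)).
                for (int k = 0; k < x.length; k++)
                    for (int i = 0; i < v.length; i++) {
                        double sum = 0;
                        for (int j = 0; j < v.length; j++)
                            sum += Math.pow(dist(x[k], v[i]) / dist(x[k], v[j]), 2.0 / (m - 1));
                        u[i][k] = 1.0 / sum;
                    }
                // Center update: v_i = sum_k u_ik^m x_k / sum_k u_ik^m.
                for (int i = 0; i < v.length; i++) {
                    double[] num = new double[x[0].length];
                    double den = 0;
                    for (int k = 0; k < x.length; k++) {
                        double w = Math.pow(u[i][k], m);
                        den += w;
                        for (int d = 0; d < num.length; d++) num[d] += w * x[k][d];
                    }
                    for (int d = 0; d < num.length; d++) v[i][d] = num[d] / den;
                }
            }
            for (double[] row : u) System.out.println(java.util.Arrays.toString(row));
        }

        static double dist(double[] a, double[] b) {
            double s = 0;
            for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
            return Math.sqrt(s) + 1e-12;  // epsilon guards against division by zero at a center
        }
    }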

[5] propose a fuzzy ensemble clustering approach to address the unclear boundaries between clusters in biological and biomedical gene expression data analysis, taking the inherent fuzziness of the clusters into account. The goal of the study is to improve the accuracy and robustness of clustering results. After applying random projections to obtain lower-dimensional gene expression data, the method applies the fuzzy K-means algorithm to the low-dimensional data to generate multiple fuzzy base clusters. The fuzzy clusters are then combined using a similarity matrix whose elements are generated by a fuzzy t-norm, and finally the fuzzy K-means algorithm is applied to the rows of the similarity matrix to obtain the consensus clustering. The proposed ensemble approach is demonstrated to be competitive with other ensemble methods.

High throughput data may be generated by microarray experiments. If the dataset is very large, it is possible to generate an ensemble of clustering solutions, or partition the data so that clustering may be performed on tractable-sized disjoint subsets [20]. The data can then be distributed at different sites, for which a distributed clustering solution with a final merging of partitions is a natural fit. [20] introduce two new approaches to combining partitions represented by sets of cluster centers. It is stated that these approaches provide a final partition of data that is comparable to the best existing approaches and that the approaches can be 100,000 times faster while using much less memory. The new algorithms are compared with the best existing cluster ensemble approaches that cluster all of the data at once, and a clustering algorithm designed for very large datasets. Fuzzy and hard K-means based clustering algorithms are used for the comparison. It is demonstrated that the centroid-based ensemble merging algorithms presented in the study generated partitions which are as good as the best label vector method, or the method of clustering all the data at once. The proposed algorithms are also more efficient in terms of speed.

[11] propose evidence accumulation clustering based on dual-rooted Prim tree cuts (EAC-DC). The proposed algorithm computes the co-association matrix using a forward procedure that repeatedly adds edges of Prim's minimum spanning tree (MST) to identify clusters until a stopping criterion is met. A consensus clustering is then generated from the co-association matrix using spectral partitioning. Here, an MST is a connected sub-graph spanning all nodes with no cycles, and a dual-rooted tree is obtained as the union of two such sub-trees. They test their approach on the Iris dataset [8], the Wisconsin breast cancer dataset [27] (both obtained from [9]), and synthetic datasets, and present a comparison of their results with other existing ensemble clustering methods.

[22] use a cluster ensemble in gene expression analysis. In the proposed ensemble framework, the partitions generated by each individual clustering algorithm are converted into a distance matrix. The distance matrices are then combined to construct a weighted graph. A graph partitioning approach is then used to generate the final set of clusters. It is reported that the ensemble approach yields better results than the best individual approach on both synthetic and yeast gene expression datasets.

[10] merge multiple partitions using evidence accumulation. Each partition generated by a clustering algorithm is treated as a new piece of evidence that helps uncover the relationships between objects. For this chapter, we adopt their ensemble approach. The core idea is to construct a co-association matrix through a voting mechanism over the partitions generated by the individual clustering algorithms. The co-association matrix C is constructed by the formulation below, where n_ij is the number of times the object pair (i, j) is assigned to the same cluster among the N different partitions:

C(i, j) = n_ij / N        (1)

After constructing the co-association matrix, [10] use single-linkage hierarchical clustering to obtain a new cluster tree (dendrogram), and then cut the tree at the value corresponding to the maximum lifetime (the largest gap between consecutive merge levels, where the tree branches).
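
The maximum-lifetime criterion can be implemented directly from the dendrogram's merge heights: find the largest gap between consecutive heights and cut inside it. Below is a minimal Java sketch under that reading; the height values and the choice to cut at the midpoint of the gap are illustrative assumptions.

    import java.util.Arrays;

    public class MaxLifetimeCut {

        // Place the cut threshold in the middle of the largest gap ("maximum lifetime")
        // between consecutive merge heights of the dendrogram.
        static double maxLifetimeCut(double[] mergeHeights) {
            double[] h = mergeHeights.clone();
            Arrays.sort(h);
            double bestGap = -1.0, cut = h[h.length - 1];
            for (int i = 0; i + 1 < h.length; i++) {
                double gap = h[i + 1] - h[i];
                if (gap > bestGap) { bestGap = gap; cut = (h[i] + h[i + 1]) / 2.0; }
            }
            return cut;
        }

        public static void main(String[] args) {
            // Heights of successive merges in a small dendrogram; illustrative values.
            double[] heights = {0.05, 0.10, 0.12, 0.60, 0.65};
            System.out.println("cut at " + maxLifetimeCut(heights)); // prints: cut at 0.36
        }
    }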

They also apply the same ensemble framework to K-means partitions generated with different parameters. They test their algorithms on ten different datasets, comparing the results with other ensemble clustering methods. They report that their ensemble approach can identify clusters with arbitrary shapes and sizes, and performs better than the other combination methods.

4. Implementation

We employ the ensemble approach described in [10]. A different set of base clustering algorithms is chosen and applied to protein and lymphoma datasets.

The protein dataset consists of 698 objects (proteins) with 125 attributes each. The real clusters correspond to the four protein fold classes: α, β, α/β, and α+β. DLBCL-B is a 2-channel custom cDNA microarray dataset; it is a B-cell lymphoma dataset with three predefined subtypes [21].

The ensemble clustering algorithm uses an array of vectors, one per input file, in order to use dynamic memory allocation, and starts by loading the file contents into the vectors. The algorithm then processes the vectors and generates two temporary matrices whose dimension is the maximum vector length. The ensemble clustering algorithm steps are as follows:
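
The listing below is a sketch of those steps, reconstructed to match the line references in the following paragraph; line 7 takes the distance as the complement of the normalized co-association value from Equation (1).

    1: for i = 1 to max(V[n])
    2:     for j = 1 to max(V[n])
    3:         for k = 1 to n
    4:             if V[k][i] = V[k][j]
    5:                 C[i][j] = C[i][j] + 1
    6:         end for
    7:         D[i][j] = 1 - C[i][j] / n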

Here, n is the number of files, and V[n] are the vectors holding the content of each file. max(V[n]) is the length of the longest vector, C[i][j] is the co-association matrix, and D[i][j] is the distance matrix. The algorithm iterates through the two-dimensional matrix via the i and j loop variables inside a nested loop at lines 1 and 2, and for each member of the matrix, all the vectors are processed via the k loop variable at line 3. Equality of the selected vector entries at positions i and j causes an increase in the co-association matrix element at lines 4 and 5. Finally, the distance matrix is calculated at line 7. After obtaining the distance matrix, hierarchical clustering with complete linkage is used to generate the dendrogram. The dendrogram is cut at a certain level to obtain the consensus partition.
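
For concreteness, here is a minimal Java sketch of the co-association and distance computation described above. It is a reimplementation under the stated definitions, not the authors' application itself (which is available upon request); the two hard-coded partitions are the example partitions revisited later in this section.

    import java.util.Arrays;

    public class CoAssociationSketch {

        public static void main(String[] args) {
            // Each row is one base partition: V[k][i] is the cluster label of object i
            // in partition k.
            int[][] V = {
                {1, 1, 2, 1, 3, 3},
                {2, 2, 2, 1, 3, 3}};
            int n = V.length;          // number of partitions (input files)
            int m = V[0].length;       // number of objects
            double[][] C = new double[m][m];   // co-association (voting) matrix
            double[][] D = new double[m][m];   // distance matrix

            for (int i = 0; i < m; i++)
                for (int j = 0; j < m; j++) {
                    for (int k = 0; k < n; k++)
                        if (V[k][i] == V[k][j]) C[i][j]++;  // pair (i, j) co-clustered in partition k
                    D[i][j] = 1.0 - C[i][j] / n;            // distance is the complement of co-association
                }

            for (double[] row : D) System.out.println(Arrays.toString(row));
        }
    }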

The ensemble approach is coded as a Java application, which is available upon request. The software allows the addition of multiple partitions to generate the distance matrix of the corresponding ensemble. Files containing the partitions can be added by clicking the "Add File" button, as seen in Figure 6. The distance matrix of the ensemble is generated with the "Calculate" button.

Figure 6.

File input interface

The output is displayed on a separate screen, as demonstrated in Figure 7. The output, in CSV format, can be written to a file by clicking the "Output CSV" button.

Figure 7.

Example clusters

Consider two different partitions of a dataset with six objects, (1, 1, 2, 1, 3, 3) and (2, 2, 2, 1, 3, 3). The algorithm's output is the distance matrix below (following Equation (1) with N = 2, each entry is 1 minus the fraction of partitions in which the object pair is co-clustered):

    0.0  0.0  0.5  0.5  1.0  1.0
    0.0  0.0  0.5  0.5  1.0  1.0
    0.5  0.5  0.0  1.0  1.0  1.0
    0.5  0.5  1.0  0.0  1.0  1.0
    1.0  1.0  1.0  1.0  0.0  0.0
    1.0  1.0  1.0  1.0  0.0  0.0

This distance matrix is used in hierarchical clustering with complete linkage, and the dendrogram in Figure 8 is generated. The dendrogram is cut at a level that gives three clusters. The corresponding partition is (1, 1, 1, 2, 3, 3), which is the same as the second input partition (2, 2, 2, 1, 3, 3) up to a relabeling of the clusters.

Figure 8.

Example clusters

We employ hierarchical clustering, K-means, and C-means to obtain the base partitions. K-means and hierarchical clustering are run using the R base package; C-means is run using the R e1071 package. The silhouette and C-rand indices are utilized to evaluate the performance of the individual and ensemble algorithms. Silhouette and C-rand values are calculated using the R clusterSim and flexclust packages, respectively. Silhouette is an internal measure of the compactness and separation of clusters [6]; its values lie between -1 and 1, representing the worst and best values, respectively. C-rand is an external measure of agreement between two partitions; it has a maximum value of 1 and can take negative values. The silhouette and C-rand values found by the base and ensemble algorithms are given in Table 1. The ensemble approach improves the clustering results for both the protein and DLBCL-B datasets. For the protein dataset, the ensemble approach finds a better C-rand value (0.157) than K-means and C-means (0.127). For DLBCL-B, the ensemble approach also finds the best C-rand value (0.135) compared to the values generated by the individual clustering algorithms (0.021, 0.063, 0.098). However, the ensemble approach worsens the S values in most cases.

Dataset    Method      Num. of clusters    S value    C-rand value
Protein    HC          4                    0.344      0.199
Protein    K-means     4                    0.379      0.127
Protein    C-means     4                    0.379      0.127
Protein    Ensemble    4                    0.078      0.157
DLBCL-B    HC          3                   -0.034      0.021
DLBCL-B    K-means     3                   -0.015      0.063
DLBCL-B    C-means     2                   -0.005      0.098
DLBCL-B    Ensemble    3                   -0.017      0.135

Table 1.

Index values for base and ensemble algorithms
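
As a reminder of what the S value measures, the following is a minimal Java sketch of the silhouette computation: for each object, a is its mean distance to the other members of its own cluster, b is its mean distance to the members of the nearest other cluster, and s = (b - a) / max(a, b). This is an independent illustration of the index definition [31], not the clusterSim code used for Table 1; the main method reuses the distance matrix and consensus partition from the worked example above.

    public class SilhouetteSketch {

        // Mean silhouette over all objects, given a distance matrix and zero-based labels.
        static double silhouette(double[][] d, int[] labels, int k) {
            double total = 0;
            for (int i = 0; i < d.length; i++) {
                double[] meanDist = new double[k]; // mean distance from object i to each cluster
                int[] count = new int[k];
                for (int j = 0; j < d.length; j++) {
                    if (j == i) continue;
                    meanDist[labels[j]] += d[i][j];
                    count[labels[j]]++;
                }
                for (int c = 0; c < k; c++) if (count[c] > 0) meanDist[c] /= count[c];
                if (count[labels[i]] == 0) continue;  // convention: s(i) = 0 for singleton clusters
                double a = meanDist[labels[i]];       // cohesion: own-cluster mean distance
                double b = Double.MAX_VALUE;          // separation: nearest other cluster
                for (int c = 0; c < k; c++)
                    if (c != labels[i] && count[c] > 0) b = Math.min(b, meanDist[c]);
                total += (b - a) / Math.max(a, b);
            }
            return total / d.length;
        }

        public static void main(String[] args) {
            // Distance matrix and consensus partition from the worked example in Section 4.
            double[][] d = {
                {0.0, 0.0, 0.5, 0.5, 1.0, 1.0},
                {0.0, 0.0, 0.5, 0.5, 1.0, 1.0},
                {0.5, 0.5, 0.0, 1.0, 1.0, 1.0},
                {0.5, 0.5, 1.0, 0.0, 1.0, 1.0},
                {1.0, 1.0, 1.0, 1.0, 0.0, 0.0},
                {1.0, 1.0, 1.0, 1.0, 0.0, 0.0}};
            int[] labels = {0, 0, 0, 1, 2, 2};
            System.out.println(silhouette(d, labels, 3));
        }
    }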

5. Conclusion

Clustering groups objects such that similar ones are placed in the same cluster. Its application to biological datasets is very important in that it can help identify natural groups of biological entities that might give insight into biomarkers. In this chapter, we reviewed some clustering algorithms applied to biological data, as well as ensemble clustering approaches for biological data. The implementation of the K-means, C-means, and HC algorithms, and the merging of their partitions using an ensemble framework, were presented on two different datasets: protein and DLBCL-B. Two different cluster validation indices, adjusted Rand and silhouette, were used to compare the partitions from the individual algorithms and from ensemble clustering. Examining Table 1, we conclude that merging individual partitions improves C-rand values, meaning that the ensemble approach finds partitions closer to the real partitions. The ensemble approach is coded as a Java application and is available upon request.

Acknowledgement

The authors thank Dilip Gautam for his contribution to this chapter. The work of Şadi Evren Şeker was supported by the Scientific Research Projects Coordination Unit of Istanbul University, project number YADOP-16728.

References

  1. Arora, S., Rao, S., Vazirani, U. (2009). Expander flows, geometric embeddings and graph partitioning. Journal of the ACM 56(2): 1-37. URL: http://doi.acm.org/10.1145/1502793.1502794
  2. Asur, S., Parthasarathy, S., Ucar, D. (2006). An ensemble approach for clustering scale-free graphs. KDD-2006 Workshop on Link Analysis, 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  3. Asur, S., Ucar, D., Parthasarathy, S. (2007). An ensemble framework for clustering protein-protein interaction networks. Bioinformatics 23(13): 29-40. URL: http://dx.doi.org/10.1093/bioinformatics/btm212
  4. Asyali, M. H., Colak, D., Demirkaya, O., Inan, M. S. (2006). Gene expression profile classification: a review. Current Bioinformatics 1: 55-73. URL: http://dx.doi.org/10.2174/157489306775330615
  5. Avogadri, R., Valentini, G. (2009). Fuzzy ensemble clustering based on random projections for DNA microarray data analysis. Artificial Intelligence in Medicine 45(2-3): 173-183. URL: http://dx.doi.org/10.1016/j.artmed.2008.07.014
  6. Bandyopadhyay, S., Mukhopadhyay, A., Maulik, U. (2007). An improved algorithm for clustering gene expression data. Bioinformatics 23(21): 2859-2865.
  7. Bezdek, J. C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Norwell, MA, USA.
  8. Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7: 179-188.
  9. Frank, A., Asuncion, A. (2010). UCI Machine Learning Repository. URL: http://archive.ics.uci.edu/ml
  10. Fred, A. L. N., Jain, A. K. (2005). Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(6): 835-850.
  11. Galluccio, L., Michel, O. J. J., Comon, P., Hero, A. O., Kliger, M. (2009). Combining multiple partitions created with a graph-based construction for data clustering. Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing, Grenoble, France.
  12. Geraci, F., Leoncini, M., Montangero, M., Pellegrini, M., Renda, M. E. (2009). K-Boost: a scalable algorithm for high-quality clustering of microarray gene expression data. Journal of Computational Biology 16(6): 859-873.
  13. Ghazalpour, A., Doss, S., Zhang, B., Wang, S., Plaisier, C., Castellanos, R., Brozell, A., Schadt, E. E., Drake, T. A., Lusis, A. J., Horvath, S. (2006). Integrating genetic and network analysis to characterize genes related to mouse weight. PLoS Genetics 2(8). URL: http://dx.plos.org/10.1371
  14. Ghouila, A., Yahia, S. B., Malouche, D., Jmel, H., Laouini, D., Guerfali, F. Z., Abdelhak, S. (2009). Application of multi-SOM clustering approach to macrophage gene expression analysis. Infection, Genetics and Evolution 9(3): 328-336.
  15. Glover, F. W., Kochenberger, G. (2006). New optimization models for data mining. International Journal of Information Technology and Decision Making 5(4): 605-609.
  16. Gumus, E., Kursun, O., Sertbas, A., Ustek, D. (2012). Application of canonical correlation analysis for identifying viral integration preferences. Bioinformatics 28(5): 651-655.
  17. Gungor, Z., Unler, A. (2008). K-harmonic means data clustering with tabu-search method. Applied Mathematical Modelling 32(6): 1115-1125.
  18. He, Y., Hui, S. C. (2009). Exploring ant-based algorithms for gene expression data analysis. Artificial Intelligence in Medicine 47(2): 105-119.
  19. Heath, J. W., Fu, M. C., Jank, W. (2009). New global optimization algorithms for model-based clustering. Computational Statistics and Data Analysis 53(12): 3999-4017.
  20. Hore, P., Hall, L. O., Goldgof, D. B. (2009). A scalable framework for cluster ensembles. Pattern Recognition 42(5): 676-688. URL: http://dx.doi.org/10.1016/j.patcog.2008.09.027
  21. Hoshida, Y., Brunet, J. P., Tamayo, P., Golub, T. R., Mesirov, J. P. (2007). Subclass mapping: identifying common subtypes in independent disease data sets. PLoS ONE 2(11): e1195. URL: http://dx.plos.org/10.1371
  22. Hu, X., Yoo, I. (2004). Cluster ensemble and its applications in gene expression analysis. Proceedings of the 2nd Conference on Asia-Pacific Bioinformatics (APBC'04), Australian Computer Society, Darlinghurst, Australia, pp. 297-302.
  23. Hubert, L., Arabie, P. (1985). Comparing partitions. Journal of Classification 2: 193-218. URL: http://dx.doi.org/10.1007/BF01908075
  24. Huttenhower, C., Flamholz, A. I., Landis, J. N., Sahi, S., Myers, C. L., Olszewski, K. L., Hibbs, M. A., Siemers, N. O., Troyanskaya, O. G., Coller, H. A. (2007). Nearest neighbor networks: clustering expression data based on gene neighborhoods. BMC Bioinformatics 8: 250.
  25. Kim, J., Choi, S. (2006). Semidefinite spectral clustering. Pattern Recognition 39: 2025-2035.
  26. Langfelder, P., Zhang, B., Horvath, S. (2008). Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 24(5): 719-720.
  27. Mangasarian, O. L., Wolberg, W. H. (1990). Cancer diagnosis via linear programming. SIAM News 23(5): 1, 18.
  28. Mitra, S., Das, R., Banka, H., Mukhopadhyay, S. (2009). Gene interaction: an evolutionary biclustering approach. Information Fusion 10: 242-249.
  29. Newman, M. E. J., Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E 69: 026113.
  30. Phan, V., George, E. O., Tran, Q. T., Goodwin, S. (2009). Analyzing microarray data with transitive directed acyclic graphs. Journal of Bioinformatics and Computational Biology 7(1): 135-156.
  31. Rousseeuw, P. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20: 53-65. URL: http://dx.doi.org/10.1016/0377-0427(87)90125-7
  32. Saha, S., Bandyopadhyay, S. (2009). A new point symmetry based fuzzy genetic clustering technique for automatic evolution of clusters. Information Sciences 179(19): 3230-3246.
  33. Tseng, G. C. (2007). Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data. Bioinformatics 23(17): 2247-2255.
  34. Wang, S., Zhu, J. (2008). Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics 64(2): 440-448.
  35. Wang, T. (2009). Comparing hard and fuzzy c-means for evidence-accumulation clustering. Proceedings of the 18th International Conference on Fuzzy Systems (FUZZ-IEEE'09), IEEE Press, Piscataway, NJ, USA, pp. 468-473. URL: http://dl.acm.org/citation.cfm?id=1717561.1717643
  36. Watts, D. J., Strogatz, S. H. (1998). Collective dynamics of 'small-world' networks. Nature 393(6684): 440-442. URL: http://dx.doi.org/10.1038/30918
