Open access peer-reviewed chapter - ONLINE FIRST

Current State-of-the-Art of Clustering Methods for Gene Expression Data with RNA-Seq

By Ismail Jamail and Ahmed Moussa

Submitted: May 31st 2020Reviewed: September 16th 2020Published: December 23rd 2020

DOI: 10.5772/intechopen.94069

Downloaded: 125


Latest developments in high-throughput cDNA sequencing (RNA-seq) have revolutionized gene expression profiling. This analysis aims to compare the expression levels of multiple genes between two or more samples, under specific circumstances or in a specific cell to give a global picture of cellular function. Thanks to these advances, gene expression data are being generated in large throughput. One of the primary data analysis tasks for gene expression studies involves data-mining techniques such as clustering and classification. Clustering, which is an unsupervised learning technique, has been widely used as a computational tool to facilitate our understanding of gene functions and regulations involved in a biological process. Cluster analysis aims to group the large number of genes present in a sample of gene expression profile data, such that similar or related genes are in same clusters, and different or unrelated genes are in distinct ones. Classification on the other hand can be used for grouping samples based on their expression profile. There are many clustering and classification algorithms that can be applied in gene expression experiments, the most widely used are hierarchical clustering, k-means clustering and model-based clustering that depend on a model to sort out the number of clusters. Depending on the data structure, a fitting clustering method must be used. In this chapter, we present a state of art of clustering algorithms and statistical approaches for grouping similar gene expression profiles that can be applied to RNA-seq data analysis and software tools dedicated to these methods. In addition, we discuss challenges in cluster analysis, and compare the performance of height commonly used clustering methods on four different public datasets from recount2.


  • clustering
  • classification
  • RNA-seq
  • gene expression
  • adjusted Rand index
  • machine learning
  • deep learning

1. Introduction

In recent years, RNA-seq based on Next generation Sequencing has become an attractive alternative for conducting quantitative analysis of gene expression. This approach offers a number of advantages compared to microarray analysis such as the discovery of novel RNA species (RNA-seq is not limited by prior knowledge of the genome of the organism, it can be used for the detection of novel transcripts), the higher sensitivity for genes expressed either at low or very high level and the unbiased approach compared to microarrays that are subject to cross-hybridization bias. Overall, RNA-seq is a better technique for many applications such as novel gene identification, differential gene expression, and splicing analysis.

The principle of RNA-seq is based on high-throughput next generation sequencing (NGS) technologies. The first step in the technique involves converting the population of RNA to be sequenced into cDNA fragments with adaptors attached to one or both ends, each molecule is then sequenced to obtain either single end short sequence reads or paired end reads [1]. These reads are stored in fastq files formats and consist of raw data for many analysis pipelines (Figure 1).

Figure 1.

RNA sequencing.

The primary objective of this chapter is to present algorithms for clustering gene expression data from RNA-seq. Therefore, in the first section, we will describe the different steps of the gene expression analysis workflow from preprocessing the raw reads to gene expression clustering and classification. In the second part of the chapter we will describe traditional, model-based and machine learning clustering methods for gene expression data, then we will conclude this chapter with a study for clustering samples of four public datasets from recount2, using different clustering methods and also evaluating the performance of each one using the adjusted rand index (RDI) and accuracy.

2. RNA-seq data analysis

RNA-seq has become a common tool for scientists to study the transcriptome complexity, and a convenient method for the analysis of differential gene expression. A typical RNA-seq data analysis workflow starts by preprocessing raw reads for contamination removal and quality control checks. The following step is to align the reads to a reference genome, or to make a de novo assembly if there is not any. Following the alignment, the quantification step aims to quantify aligned reads to produce a count matrix to use as entry data for Differential Expression (DE) analysis. Normalization and DE analysis normally go together as most of the methods have built-in normalization and accept only raw count matrix. For this study, we are more interested in the clustering step, we will perform Normalization of the raw counts separately and do the clustering without going through differential gene expression analysis. In the following section we describe with more details each step of the pipeline (Figure 2).

Figure 2.

RNA-seq data analysis workflow.

2.1 Preprocessing

Preprocessing raw reads consist of checking the quality of the reads, adapters trimming, removal of short reads and filtering bad quality bases. Tools like FastQC can generate a report summarizing the overall quality of the sequence information [2]. Based on this report we can determine how the quality trimming should be set up. Trimmomatic is one of many tools used to clean up the raw data. It can be used to remove adapters from the reads, trim off any low-quality bases at the ends of reads, and filter short reads that can align to multiple locations on the reference genome. Once the trimming step is done, it is a good practice to recheck the quality of the reads by rerunning FastQC.

2.2 Alignment

Now that we have explored the quality of our raw reads, we can move on to read alignment. Read alignment is one of the first steps required for many different types of analysis. It aims to map the huge number of short RNA sequences generated by NGS instruments (reads) to a reference genome in order to identify the correct genomic loci from which the read originated. In RNA-seq, alignment is a major step for the calculation of transcript or gene expression levels; several splice-aware alignment methods have been developed for RNA-seq experiments such as STAR, HISAT2 or TopHat. These aligners are designed to specifically address many of the challenges of RNA-seq data mapping using a strategy to account for spliced alignments [3, 4, 5].

2.3 Quantification

Quantification of gene expression is to count the number of reads that map to each gene using methods such as HTSeq-count, FeatureCounts or kallisto [6, 7, 8]. This step is crucial if we want to do a gene differential expression analysis, which means to identify genes (or transcripts), if any, that have a statistically significant difference in abundance across the experimental groups or conditions.

2.4 Normalization

The read counts generated in the quantification step need to be normalized to make accurate comparisons of gene expression between samples or when doing an exploratory data analysis. Several normalization methods are used for this purpose such as CPM (counts per million), TPM (transcripts per kilobase million), RPKM/FPKM (reads/fragments per kilobase of exon per million reads/fragments mapped), DESeq2’s median of ratios and EdgeR’s trimmed mean of M values (TMM) [9].

2.5 Clustering

Cluster analysis techniques have proven to be helpful to understand gene expression data by uncovering unknown relationships among genes and unveiling different subtypes of diseases when it comes to clustering biological samples [10]. In the following section, we present methods for sample-based and gene-based clustering, starting with traditional methods used after data transformation then model-based clustering for data generated using a combination of probability distributions (Figure 3).

Figure 3.

Cluster analysis of RNA-sequencing data.

3. Clustering methods for gene expression data

3.1 Data transformation methods

Traditional clustering algorithms like hierarchical clustering and k-means cannot be directly applied to RNA-seq count data, to apply these methods for cluster analysis of RNA-seq data, that tend to follow an over-dispersed Poisson or negative binomial distribution, we need to transform the data in order to have a distribution closer to the normal distribution. In the following section, we present popular methods for data transformation:

  • Logarithmic, widely used method to deal with skewed data in many research domains, often used to reduce the variability of the data and make the data conform more closely to the normal distribution. However, it was demonstrated in [11], that in most circumstances the log transformation does not help make data less variable or more normal and may, in some circumstances, make data more variable and more skewed.

  • Variance stabilizing transformation: This method was used to transform microarray data to stabilize the asymptotic variance over the full range of the data [12].

  • Eight data transformations (r, r2, rv, rv2, l, l2, lv, and lv2) for RNA-seq data analysis were proposed in [13], these methods deal with the two common properties when it come to the count matrix generated in the quantification step, Sparsity and Skewness; Sparsity means that many counts in the count matrix are zero. Skewness means that the histogram of all counts in the count matrix is usually skewed.

3.2 Clustering methods based on normal distribution

3.2.1 Hierarchical methods

Hierarchical clustering method is the most popular method for gene expression data analysis. In hierarchical clustering, genes with similar expression patterns are grouped together and are connected by a series of branches (clustering tree or dendrogram). Experiments with similar expression profiles can also be grouped together using the same method. This clustering technique is divided into two types: agglomerative and divisive. In an agglomerative or bottom-up clustering method each observation is assigned to its own cluster. In a comparative study on Cancer data [14], three variants of Hierarchical Clustering Algorithms (HCAs): Single-Linkage (SL), Average-Linkage (AL) and Complete-Linkage (CL) with 12 distance measure have been used to cluster RNA-seq Samples. The same methods will be used in our study along with hierarchical clustering with Poisson distribution [15].

3.2.2 k-medoids

K-medoids is a partitional clustering algorithm proposed in 1987 by Kaufman and Rousseeuw. It is a variant of the K-means algorithm that is less sensitive to noise and outliers because it uses medoids as cluster centers instead of means that are easily influenced by extreme values. Medoids are the most centrally located objects of the clusters, with a minimum sum of distances to other points. After searching for k representative objects in a data set, the algorithm which is called Partitioning Around Medoids (PAM) assigns each object to the closest medoid in order to create clusters. Like in k-means the number of classes to be generated needs to be specified.

3.3 Model-based clustering

Yaqing Si et al. described a number of Model based clustering methods for RNA-seq data in their paper [16], these methods assume that data are generated by a mixture of probability distributions: Poisson distribution when only technical replicates are used and Negative binomial distribution when working with biological replicates. The first method they proposed is a model-based clustering method with the expectation-maximization algorithm (MB-EM) for clustering RNA-seq gene expression profile. The expectation-maximization algorithm is widely used in many computational biology applications, the authors in [17] explain how this algorithm works and when it is used. The second method is an initialization algorithm for cluster centers, the idea behind this method is to randomly choose one cluster center and then gradually add centers by selecting genes based on the distance between each gene and each of the selected centers. Two other stochastic algorithms have been proposed in this paper, a stochastic version of the expectation-maximization algorithm and a classification expectation maximization algorithm with simulated annealing. The last method in this paper is a model-Based Hybrid-Hierarchical Clustering Algorithm, it does not require to pre-specify the number of clusters to be generated as it is required by the previous methods. The authors propose to use agglomerative clustering starting with k0 clusters to speed up the calculation, then, it repeatedly identifies the two clusters that are closest together and merges the two most similar clusters. This method was called hybrid because it combines two steps: Obtaining the initial K0 clusters using one of the previous described algorithms then agglomerative clustering to build the hierarchical tree.

3.4 Classification and clustering algorithms of machine learning for RNA-seq data

Classification in machine learning is a supervised learning approach in which the algorithm learns from the data given to it and makes new observations, then applies the conclusions to new data. Clustering on the other hand is an unsupervised learning problem for grouping unlabeled features. The learning algorithm that learns the model from the training data and maps the input data to a specific class is called classifier, in the following section, we briefly present three widely used classifiers for grouping RNA-seq data.

  • Random forests (RF): an ensemble method that trains a large number of individual decision trees, each tree gives a class prediction, the category that wins the majority votes is used as the final decision of the random forest model. The algorithm can perform both classification and regression tasks and has better accuracy among current algorithms.

  • Support Vector Machine (SVM): one of the most popular supervised learning models, used for both classification and regression, the data points are separated using an optimal hyperplane or a set of hyperplanes in a multidimensional space with the maximum possible margin between support vectors.

  • Poisson linear discriminant analysis: an approach used for the classification and clustering of RNA-seq data using a Poisson log linear model [15].

To test these algorithms, we used MLSeq (Machine learning interface for RNA-sequencing data) which is an R package including more than 80 machine learning algorithms and a pipeline to classify RNA-seq data including normalization, filtering and transformation steps [18].

3.5 Clustering with deep learning

Deep learning is also a technique that can be used to learn better data representation of high-dimensional data. The two recently published surveys [19, 20] present a taxonomy of existing deep clustering algorithms, by describing the different Neural Network Architecture that exists for feature representation, clustering loss function and Performance Evaluation Metrics for Deep Clustering. In [20], the authors categorize current deep clustering models into following three categories:

  • Auto-Encoders Based Deep Clustering

  • CDNN-Based Deep Clustering (feed-forward networks trained only by specific clustering)

  • Generative Adversarial Network (GAN)

These approaches are already used in the analysis of RNA-seq data, for example, an unsupervised deep embedding algorithm that clusters single cell (scRNA-seq) data was proposed in [21], another paper use a Lasso model and a multilayer feed-forward artificial neural network to analyze RNA-Seq gene expression data [22]. In [23], the authors used a Deep Neural Network model from the R package h2o for cancer data classification and in [24], ladder networks were used for gene expression classification.

4. Clustering algorithms and software packages/tools corresponding to the algorithms

Clustering algorithms and software packages corresponding to the algorithms are shown in Table 1.

MethodsImplementation in R
Hierarchical clusteringhclust() function in “stats”
k-means“cluster”, “factoextra”
k-medoids“cluster”, “factoextra”
Model-based clustering with the expectation-maximization algorithm (MB-EM).
Stochastic version of the
expectation-maximization algorithms (Deterministic annealing (DA) algorithm).
Classification expectation maximization (CEM) algorithm with simulated annealing (SA).
Machine learning algorithms“MLSeq”

Table 1.

Clustering algorithms and software packages corresponding to the algorithms.

5. Clustering of public RNA-seq data from recount2

Recount2 is a multi-experiment resource of analysis-ready RNA-seq gene and exon count datasets. It contains 2041 different studies and over 70,000 human RNA-seq [25]. We selected for our study four different datasets based on the number of samples and the number of classes. We then performed sample-based clustering on each dataset and compared the results to the classes in the phenotype table in recount2 to evaluate the performance of each method. The methods used to classify the data are 3 subtypes of the hierarchical clustering with the Euclidean distance, hierarchical clustering with a Poisson model and k-medoids.

5.1 Datasets

Description of the four datasets from recount2 is shown in Table 2.

Dataset (accession)Number of samplesNumber of classesClasses
SRP03278920417 breast tumor samples of three different subtypes:
  • TNBC.

  • Non-TNBC.

  • HER2-positive.

SRP0490975443 subtypes of Leiomyosarcoma:
  • 8 LMS cases from subtype I

  • 6 cases from subtype II

  • 3 cases from subtype III

  • 7 cases of normal tissues

  • 28 breast cancer cell lines.

  • 42 Triple Negative Breast Cancer (TNBC) primary tumors.

  • 42 Estrogen Receptor Positive (ER+) and HER2 Negative Breast Cancer primary tumors.

  • 30 uninvolved breast tissue samples that were adjacent to ER+ primary tumors.

  • 5 breast tissue samples from reduction mammoplasty procedures performed on patients with no known cancer.

  • 21 uninvolved breast tissue samples that were adjacent to TNBC primary tumors.

  • 39 contrast-enhancing glioma core samples.

  • 36 non-enhancing FLAIR glioma margin samples.

  • 17 non-neoplastic brain tissue samples.

Table 2.

Description of the four datasets from recount2.

5.2 Adjusted Rand Index

There are several similarity measures for cluster evaluation, we chose to work with the adjusted Rand index which is the corrected-for-chance version of the Rand index. It is a measure used in data clustering to evaluate the performance of a clustering method, by comparing the results of a clustering algorithm against known classes from external criteria [26]. In our study, we performed different sample-based classification method on four different datasets, after that, we compared the results to the class labels we associated to each sample based on the field “characterization of the samples” in the phenotype table in recount2, then we used the ARI for cluster validation.

5.3 Standard deviation

Figures 46 are examples to show the standard deviation of the transformed data, across samples, against the mean, using the shifted logarithm transformation, the regularized log transformation and the variance stabilizing transformation.

Figure 4.

Standard deviation of the transformed data using the shifted logarithm transformation.

Figure 5.

Standard deviation of the transformed data using the regularized log transformation.

Figure 6.

Standard deviation of the transformed data using the variance stabilizing transformation.

5.4 Machine learning classification

Three widely used machine learning algorithms were used for the classification of the four datasets, Random forests, support vector machine and Poisson linear discriminant analysis. To perform this analysis, we first split the data into two parts as training and test sets, with 70% of samples for the training dataset, and the remaining 30% samples for the testing dataset, the training set is used to fit the parameters of the model, that is used thereafter to predict the responses for the observations in the test dataset. Normalization was applied with Deseq median ratio method and the variance stabilizing transformation was applied for the normalization of the dataset. The model was trained using 5-fold cross validation repeated 2 times. The number of levels for tuning parameters is set to 10.

5.5 Results

5.6 Computational time

All experiments are performed on a machine with 16 GB RAM, 1024 GB hard disk running with a windows operating system and MLSeq R package.

6. Discussion and conclusion

Clustering the samples of the three datasets to the sub-classes defined in the phenotype table of recounts2 was not easy. We first tried to visualize the separation between the subtypes using principal component analysis (Figures 7 and 810), then using 4 variants of the hierarchical clustering and k-medoids we classified the samples of each dataset (Figures 11 and 12 show the hierarchical clustering plots of the dataset SRP032789). The performance of the 5 methods was different depending on the dataset (Tables 35), making it impossible to make a general system of recommendation. However, we can see that the k-medoid method has relatively better performance than the other methods for all the datasets. In the second part of the study, we compared a few machine learning methods used for the classification of RNA-seq data. The performance of the models surpasses the classical methods used before, also RF and PLDA performed better than SVM which does not perform very well when the data set is large and has noise. Note that the model accuracies given in this study should not be considered as a generalization. The results can depend on several criteria: normalization and transformation methods, gene-wise overdispersions, outliers, number of classes etc. (Tables 611).

Figure 7.

PCA of data from the study SRP032789.

Figure 8.

PCA of the data from the study SRP049097.

Figure 9.

PCA of the data from the study SRP042620.

Figure 10.

PCA of the data from the study SRP044668.

Figure 11.

Dendrograms obtained for the dataset from SRP032789 study using three variants of the hierarchical clustering method with the Euclidean distance.

Figure 12.

Dendrograms obtained for the dataset from the study SRP032789 using the hierarchical clustering method with the Poisson distance.

hclus (complete)hclust (single)hclust (average)hclust (complete)k-medoids
EuclideanEuclideanEuclideanPoisson distanceEuclidean

Table 3.

Performance of clustering methods (SRP032789).

hclus (complete)hclust (single)hclust (average)hclust (complete)k-medoids
EuclideanEuclideanEuclideanPoisson distanceEuclidean

Table 4.

Performance of clustering methods (SRP049097).

hclus (complete)hclust (single)hclust (average)hclust (complete)k-medoids
EuclideanEuclideanEuclideanPoisson distanceEuclidean

Table 5.

Performance of clustering methods (SRP042620).

hclus (complete)hclust (single)hclust (average)hclust (complete)k-medoids
EuclideanEuclideanEuclideanPoisson distanceEuclidean

Table 6.

Performance of clustering methods (SRP044668).


Table 7.

Classification results for SRP032789 data.


Table 8.

Classification results for SRP049097 data.


Table 9.

Classification results for SRP042620 data.


Table 10.

Classification results for SRP044668 data.


Table 11.

Computational time in seconds.

Conflict of interest

The authors declare no conflict of interest.

Download for free

chapter PDF

© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Link to this chapter Copy to clipboard

Cite this chapter Copy to clipboard

Ismail Jamail and Ahmed Moussa (December 23rd 2020). Current State-of-the-Art of Clustering Methods for Gene Expression Data with RNA-Seq [Online First], IntechOpen, DOI: 10.5772/intechopen.94069. Available from:

chapter statistics

125total chapter downloads

More statistics for editors and authors

Login to your personal dashboard for more detailed statistics on your publications.

Access personal reporting

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

More About Us