Clustering algorithms and software packages corresponding to the algorithms.
Abstract
Latest developments in highthroughput cDNA sequencing (RNAseq) have revolutionized gene expression profiling. This analysis aims to compare the expression levels of multiple genes between two or more samples, under specific circumstances or in a specific cell to give a global picture of cellular function. Thanks to these advances, gene expression data are being generated in large throughput. One of the primary data analysis tasks for gene expression studies involves datamining techniques such as clustering and classification. Clustering, which is an unsupervised learning technique, has been widely used as a computational tool to facilitate our understanding of gene functions and regulations involved in a biological process. Cluster analysis aims to group the large number of genes present in a sample of gene expression profile data, such that similar or related genes are in same clusters, and different or unrelated genes are in distinct ones. Classification on the other hand can be used for grouping samples based on their expression profile. There are many clustering and classification algorithms that can be applied in gene expression experiments, the most widely used are hierarchical clustering, kmeans clustering and modelbased clustering that depend on a model to sort out the number of clusters. Depending on the data structure, a fitting clustering method must be used. In this chapter, we present a state of art of clustering algorithms and statistical approaches for grouping similar gene expression profiles that can be applied to RNAseq data analysis and software tools dedicated to these methods. In addition, we discuss challenges in cluster analysis, and compare the performance of height commonly used clustering methods on four different public datasets from recount2.
Keywords
 clustering
 classification
 RNAseq
 gene expression
 adjusted Rand index
 machine learning
 deep learning
1. Introduction
In recent years, RNAseq based on Next generation Sequencing has become an attractive alternative for conducting quantitative analysis of gene expression. This approach offers a number of advantages compared to microarray analysis such as the discovery of novel RNA species (RNAseq is not limited by prior knowledge of the genome of the organism, it can be used for the detection of novel transcripts), the higher sensitivity for genes expressed either at low or very high level and the unbiased approach compared to microarrays that are subject to crosshybridization bias. Overall, RNAseq is a better technique for many applications such as novel gene identification, differential gene expression, and splicing analysis.
The principle of RNAseq is based on highthroughput next generation sequencing (NGS) technologies. The first step in the technique involves converting the population of RNA to be sequenced into cDNA fragments with adaptors attached to one or both ends, each molecule is then sequenced to obtain either single end short sequence reads or paired end reads [1]. These reads are stored in fastq files formats and consist of raw data for many analysis pipelines (Figure 1).
The primary objective of this chapter is to present algorithms for clustering gene expression data from RNAseq. Therefore, in the first section, we will describe the different steps of the gene expression analysis workflow from preprocessing the raw reads to gene expression clustering and classification. In the second part of the chapter we will describe traditional, modelbased and machine learning clustering methods for gene expression data, then we will conclude this chapter with a study for clustering samples of four public datasets from recount2, using different clustering methods and also evaluating the performance of each one using the adjusted rand index (RDI) and accuracy.
2. RNAseq data analysis
RNAseq has become a common tool for scientists to study the transcriptome complexity, and a convenient method for the analysis of differential gene expression. A typical RNAseq data analysis workflow starts by preprocessing raw reads for contamination removal and quality control checks. The following step is to align the reads to a reference genome, or to make a de novo assembly if there is not any. Following the alignment, the quantification step aims to quantify aligned reads to produce a count matrix to use as entry data for Differential Expression (DE) analysis. Normalization and DE analysis normally go together as most of the methods have builtin normalization and accept only raw count matrix. For this study, we are more interested in the clustering step, we will perform Normalization of the raw counts separately and do the clustering without going through differential gene expression analysis. In the following section we describe with more details each step of the pipeline (Figure 2).
2.1 Preprocessing
Preprocessing raw reads consist of checking the quality of the reads, adapters trimming, removal of short reads and filtering bad quality bases. Tools like FastQC can generate a report summarizing the overall quality of the sequence information [2]. Based on this report we can determine how the quality trimming should be set up. Trimmomatic is one of many tools used to clean up the raw data. It can be used to remove adapters from the reads, trim off any lowquality bases at the ends of reads, and filter short reads that can align to multiple locations on the reference genome. Once the trimming step is done, it is a good practice to recheck the quality of the reads by rerunning FastQC.
2.2 Alignment
Now that we have explored the quality of our raw reads, we can move on to read alignment. Read alignment is one of the first steps required for many different types of analysis. It aims to map the huge number of short RNA sequences generated by NGS instruments (reads) to a reference genome in order to identify the correct genomic loci from which the read originated. In RNAseq, alignment is a major step for the calculation of transcript or gene expression levels; several spliceaware alignment methods have been developed for RNAseq experiments such as STAR, HISAT2 or TopHat. These aligners are designed to specifically address many of the challenges of RNAseq data mapping using a strategy to account for spliced alignments [3, 4, 5].
2.3 Quantification
Quantification of gene expression is to count the number of reads that map to each gene using methods such as HTSeqcount, FeatureCounts or kallisto [6, 7, 8]. This step is crucial if we want to do a gene differential expression analysis, which means to identify genes (or transcripts), if any, that have a statistically significant difference in abundance across the experimental groups or conditions.
2.4 Normalization
The read counts generated in the quantification step need to be normalized to make accurate comparisons of gene expression between samples or when doing an exploratory data analysis. Several normalization methods are used for this purpose such as CPM (counts per million), TPM (transcripts per kilobase million), RPKM/FPKM (reads/fragments per kilobase of exon per million reads/fragments mapped), DESeq2’s median of ratios and EdgeR’s trimmed mean of M values (TMM) [9].
2.5 Clustering
Cluster analysis techniques have proven to be helpful to understand gene expression data by uncovering unknown relationships among genes and unveiling different subtypes of diseases when it comes to clustering biological samples [10]. In the following section, we present methods for samplebased and genebased clustering, starting with traditional methods used after data transformation then modelbased clustering for data generated using a combination of probability distributions (Figure 3).
3. Clustering methods for gene expression data
3.1 Data transformation methods
Traditional clustering algorithms like hierarchical clustering and kmeans cannot be directly applied to RNAseq count data, to apply these methods for cluster analysis of RNAseq data, that tend to follow an overdispersed Poisson or negative binomial distribution, we need to transform the data in order to have a distribution closer to the normal distribution. In the following section, we present popular methods for data transformation:
Logarithmic, widely used method to deal with skewed data in many research domains, often used to reduce the variability of the data and make the data conform more closely to the normal distribution. However, it was demonstrated in [11], that in most circumstances the log transformation does not help make data less variable or more normal and may, in some circumstances, make data more variable and more skewed.
Variance stabilizing transformation: This method was used to transform microarray data to stabilize the asymptotic variance over the full range of the data [12].
Eight data transformations (r, r2, rv, rv2, l, l2, lv, and lv2) for RNAseq data analysis were proposed in [13], these methods deal with the two common properties when it come to the count matrix generated in the quantification step, Sparsity and Skewness; Sparsity means that many counts in the count matrix are zero. Skewness means that the histogram of all counts in the count matrix is usually skewed.
3.2 Clustering methods based on normal distribution
3.2.1 Hierarchical methods
Hierarchical clustering method is the most popular method for gene expression data analysis. In hierarchical clustering, genes with similar expression patterns are grouped together and are connected by a series of branches (clustering tree or dendrogram). Experiments with similar expression profiles can also be grouped together using the same method. This clustering technique is divided into two types: agglomerative and divisive. In an agglomerative or bottomup clustering method each observation is assigned to its own cluster. In a comparative study on Cancer data [14], three variants of Hierarchical Clustering Algorithms (HCAs): SingleLinkage (SL), AverageLinkage (AL) and CompleteLinkage (CL) with 12 distance measure have been used to cluster RNAseq Samples. The same methods will be used in our study along with hierarchical clustering with Poisson distribution [15].
3.2.2 kmedoids
Kmedoids is a partitional clustering algorithm proposed in 1987 by Kaufman and Rousseeuw. It is a variant of the Kmeans algorithm that is less sensitive to noise and outliers because it uses medoids as cluster centers instead of means that are easily influenced by extreme values. Medoids are the most centrally located objects of the clusters, with a minimum sum of distances to other points. After searching for k representative objects in a data set, the algorithm which is called Partitioning Around Medoids (PAM) assigns each object to the closest medoid in order to create clusters. Like in kmeans the number of classes to be generated needs to be specified.
3.3 Modelbased clustering
Yaqing Si et al. described a number of Model based clustering methods for RNAseq data in their paper [16], these methods assume that data are generated by a mixture of probability distributions: Poisson distribution when only technical replicates are used and Negative binomial distribution when working with biological replicates. The first method they proposed is a modelbased clustering method with the expectationmaximization algorithm (MBEM) for clustering RNAseq gene expression profile. The expectationmaximization algorithm is widely used in many computational biology applications, the authors in [17] explain how this algorithm works and when it is used. The second method is an initialization algorithm for cluster centers, the idea behind this method is to randomly choose one cluster center and then gradually add centers by selecting genes based on the distance between each gene and each of the selected centers. Two other stochastic algorithms have been proposed in this paper, a stochastic version of the expectationmaximization algorithm and a classification expectation maximization algorithm with simulated annealing. The last method in this paper is a modelBased HybridHierarchical Clustering Algorithm, it does not require to prespecify the number of clusters to be generated as it is required by the previous methods. The authors propose to use agglomerative clustering starting with k0 clusters to speed up the calculation, then, it repeatedly identifies the two clusters that are closest together and merges the two most similar clusters. This method was called hybrid because it combines two steps: Obtaining the initial K0 clusters using one of the previous described algorithms then agglomerative clustering to build the hierarchical tree.
3.4 Classification and clustering algorithms of machine learning for RNAseq data
Classification in machine learning is a supervised learning approach in which the algorithm learns from the data given to it and makes new observations, then applies the conclusions to new data. Clustering on the other hand is an unsupervised learning problem for grouping unlabeled features. The learning algorithm that learns the model from the training data and maps the input data to a specific class is called classifier, in the following section, we briefly present three widely used classifiers for grouping RNAseq data.
Random forests (RF): an ensemble method that trains a large number of individual decision trees, each tree gives a class prediction, the category that wins the majority votes is used as the final decision of the random forest model. The algorithm can perform both classification and regression tasks and has better accuracy among current algorithms.
Support Vector Machine (SVM): one of the most popular supervised learning models, used for both classification and regression, the data points are separated using an optimal hyperplane or a set of hyperplanes in a multidimensional space with the maximum possible margin between support vectors.
Poisson linear discriminant analysis: an approach used for the classification and clustering of RNAseq data using a Poisson log linear model [15].
To test these algorithms, we used MLSeq (Machine learning interface for RNAsequencing data) which is an R package including more than 80 machine learning algorithms and a pipeline to classify RNAseq data including normalization, filtering and transformation steps [18].
3.5 Clustering with deep learning
Deep learning is also a technique that can be used to learn better data representation of highdimensional data. The two recently published surveys [19, 20] present a taxonomy of existing deep clustering algorithms, by describing the different Neural Network Architecture that exists for feature representation, clustering loss function and Performance Evaluation Metrics for Deep Clustering. In [20], the authors categorize current deep clustering models into following three categories:
AutoEncoders Based Deep Clustering
CDNNBased Deep Clustering (feedforward networks trained only by specific clustering)
Generative Adversarial Network (GAN)
These approaches are already used in the analysis of RNAseq data, for example, an unsupervised deep embedding algorithm that clusters single cell (scRNAseq) data was proposed in [21], another paper use a Lasso model and a multilayer feedforward artificial neural network to analyze RNASeq gene expression data [22]. In [23], the authors used a Deep Neural Network model from the R package h2o for cancer data classification and in [24], ladder networks were used for gene expression classification.
4. Clustering algorithms and software packages/tools corresponding to the algorithms
Clustering algorithms and software packages corresponding to the algorithms are shown in Table 1.
Methods  Implementation in R 

Hierarchical clustering  hclust() function in “stats” 
kmeans  “cluster”, “factoextra” 
kmedoids  “cluster”, “factoextra” 
SOM  “kohonen” 
Modelbased clustering with the expectationmaximization algorithm (MBEM). Stochastic version of the expectationmaximization algorithms (Deterministic annealing (DA) algorithm). Classification expectation maximization (CEM) algorithm with simulated annealing (SA).  “MBCluster.Seq” 
Machine learning algorithms  “MLSeq” 
5. Clustering of public RNAseq data from recount2
Recount2 is a multiexperiment resource of analysisready RNAseq gene and exon count datasets. It contains 2041 different studies and over 70,000 human RNAseq [25]. We selected for our study four different datasets based on the number of samples and the number of classes. We then performed samplebased clustering on each dataset and compared the results to the classes in the phenotype table in recount2 to evaluate the performance of each method. The methods used to classify the data are 3 subtypes of the hierarchical clustering with the Euclidean distance, hierarchical clustering with a Poisson model and kmedoids.
5.1 Datasets
Description of the four datasets from recount2 is shown in Table 2.
Dataset (accession)  Number of samples  Number of classes  Classes 

SRP032789  20  4  17 breast tumor samples of three different subtypes:

SRP049097  54  4  3 subtypes of Leiomyosarcoma:

SRP042620  168  6 

SRP044668  94  3 

5.2 Adjusted Rand Index
There are several similarity measures for cluster evaluation, we chose to work with the adjusted Rand index which is the correctedforchance version of the Rand index. It is a measure used in data clustering to evaluate the performance of a clustering method, by comparing the results of a clustering algorithm against known classes from external criteria [26]. In our study, we performed different samplebased classification method on four different datasets, after that, we compared the results to the class labels we associated to each sample based on the field “characterization of the samples” in the phenotype table in recount2, then we used the ARI for cluster validation.
5.3 Standard deviation
Figures 4–6 are examples to show the standard deviation of the transformed data, across samples, against the mean, using the shifted logarithm transformation, the regularized log transformation and the variance stabilizing transformation.
5.4 Machine learning classification
Three widely used machine learning algorithms were used for the classification of the four datasets, Random forests, support vector machine and Poisson linear discriminant analysis. To perform this analysis, we first split the data into two parts as training and test sets, with 70% of samples for the training dataset, and the remaining 30% samples for the testing dataset, the training set is used to fit the parameters of the model, that is used thereafter to predict the responses for the observations in the test dataset. Normalization was applied with Deseq median ratio method and the variance stabilizing transformation was applied for the normalization of the dataset. The model was trained using 5fold cross validation repeated 2 times. The number of levels for tuning parameters is set to 10.
5.5 Results
5.6 Computational time
All experiments are performed on a machine with 16 GB RAM, 1024 GB hard disk running with a windows operating system and MLSeq R package.
6. Discussion and conclusion
Clustering the samples of the three datasets to the subclasses defined in the phenotype table of recounts2 was not easy. We first tried to visualize the separation between the subtypes using principal component analysis (Figures 7 and 8–10), then using 4 variants of the hierarchical clustering and kmedoids we classified the samples of each dataset (Figures 11 and 12 show the hierarchical clustering plots of the dataset SRP032789). The performance of the 5 methods was different depending on the dataset (Tables 3–5), making it impossible to make a general system of recommendation. However, we can see that the kmedoid method has relatively better performance than the other methods for all the datasets. In the second part of the study, we compared a few machine learning methods used for the classification of RNAseq data. The performance of the models surpasses the classical methods used before, also RF and PLDA performed better than SVM which does not perform very well when the data set is large and has noise. Note that the model accuracies given in this study should not be considered as a generalization. The results can depend on several criteria: normalization and transformation methods, genewise overdispersions, outliers, number of classes etc. (Tables 6–11).
hclus (complete)  hclust (single)  hclust (average)  hclust (complete)  kmedoids 

Euclidean  Euclidean  Euclidean  Poisson distance  Euclidean 
0.4146015  0.3818763  0.4146015  0.4146015  0.6798897 
hclus (complete)  hclust (single)  hclust (average)  hclust (complete)  kmedoids 

Euclidean  Euclidean  Euclidean  Poisson distance  Euclidean 
0.02880412  −0.003409256  0.0005777741  0.1874828  0.2791547 
hclus (complete)  hclust (single)  hclust (average)  hclust (complete)  kmedoids 

Euclidean  Euclidean  Euclidean  Poisson distance  Euclidean 
0.1944569  0.005551586  0.1285448  0.1468464  0.2579758 
hclus (complete)  hclust (single)  hclust (average)  hclust (complete)  kmedoids 

Euclidean  Euclidean  Euclidean  Poisson distance  Euclidean 
0.2379903  −0.007755123  0.399417  0.2657942  0.3771837 
Classifier  Accuracy  Kappa 

rf  1  1 
SVM  0.6667  0.5 
PLDA  1  1 
Classifier  Accuracy  Kappa 

rf  0.8235  0.765 
SVM  0.7647  0.6909 
PLDA  0.7647  0.6866 
Classifier  Accuracy  Kappa 

rf  0.9412  0.9249 
SVM  0.5882  0.4685 
PLDA  0.7843  0.7267 
Classifier  Accuracy  Kappa 

rf  0.8214  0.7271 
SVM  0.6786  0.5218 
PLDA  0.7143  0.5573 
SRP032789  SRP049097  SRP042620  SRP044668  

rf  176.67  781.31  4234.89  1412.19 
SVM  1080.92  2333.52  6645.21  1597.89 
PLDA  31.45  60.93  234.98  72.66 
Conflict of interest
The authors declare no conflict of interest.