CoClust: An R Package for Copula-Based Cluster Analysis

The aim of this chapter is to present and describe the R package CoClust, which enables implementing a clustering algorithm based on the copula function. The copula-based clustering algorithm, called CoClust, was introduced by Di Lascio and Giannerini in 2012 (Journal of Classification, 29(1):50–75), improved in 2016 (Statistical Papers, p.1–17, DOI 10.1007/s00362-016-0822-3), and is able to find clusters according to the complex multivariate dependence structure of the data-generating process. Hence, among other advantages, the CoClust overcomes the limitations of classic approaches that only deal with linear bivariate relationships. The first part of the chapter briefly describes the clustering algorithm. The second part illustrates the clustering procedure through the R package CoClust and presents numerical examples showing how the main R commands can be used to perform a fully developed clustering of multivariate dependent data.


Introduction
Cluster analysis is an unsupervised classification method that aims to detect a structure within data by assigning a set of objects (observations or variables) into groups, called clusters, whereby objects in the same cluster are in some sense more closely related to each other than objects assigned to different clusters.
Literature on clustering methods is very extensive and different criteria are taken into account to organize and present these methods. One such criterion is the mathematical object underlying the clustering method: a distance or dissimilarity measure versus a probability model. In this chapter, we consider model-based clustering methods, which assume the data matrix is generated according to a specific data generating process (henceforth DGP). The classic model-based clustering method in [1,2] is based on a mixture of multivariate probability distributions, such as the multivariate normal. However, this approach only accounts for the linear dependence between objects, so it inherits all the limitations of the linear correlation coefficient. Hence, we here focus on clustering methods that assume the data matrix is generated by a K-dimensional copula [3] such that each of the K clusters is represented by a (continuous) univariate density function and the complex multivariate relationship among clusters is expressed by the copula and its dependence parameter. Specifically, this chapter aims to describe the copula-based clustering algorithm first introduced by [4] and improved by [5], presenting in detail its implementation in the R package called CoClust. The CoClust approach inspired the work of [6], while different copula-based clustering approaches can be found in [7][8][9], [10][11][12][13], [14], and in [15,16]. To the best of our knowledge, none of these methods have been implemented in software available to the scientific community.
Most clustering algorithms take as input some parameters, such as the number of clusters, the distance or density of clusters, or the number of points in a cluster, and a starting classification. Some important benefits of the R function CoClust, which implements the copula-based clustering algorithm in [5], are that (i) the user can simultaneously test multiple numbers of clusters in a single function call, (ii) there is no need for a starting classification, and (iii) the algorithm can nonparametrically estimate the density of the clusters, which are thus distribution-free. The package is available from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/web/packages/CoClust/index.html.
The chapter is organized as follows. Section 2 presents the theoretical tools of copula theory essential to understanding the copula-based clustering algorithm introduced and described in Section 3. Section 4 describes in detail the R implementation of the CoClust algorithm and illustrates its use on simulated DGPs. In Section 5, an application to a real dataset is presented, while a brief conclusion follows in Section 6.

Copula theory
The copula function [17][18][19][20] was born in the probabilistic metric space with Sklar's theorem [3], which states that every joint distribution function F(·) can be expressed in terms of the K marginal distribution functions F_k and the copula distribution function C as follows:

F(x_1, …, x_k, …, x_K) = C(F_1(x_1), …, F_k(x_k), …, F_K(x_K))

for all (x_1, …, x_k, …, x_K) ∈ R̄^K (where R̄ denotes the extended real line). According to this theorem, any joint density function f(·) can be split into the margins f_k(·) and a copula density c(·):

f(x_1, …, x_k, …, x_K) = c(F_1(x_1), …, F_K(x_K)) · ∏_{k=1}^{K} f_k(x_k)

so that the copula represents the association among variables, that is, the multivariate dependence structure of a joint density function [17][18][19][20]. Note that the scaling factor n/(n + 1) in Eq. (6) is typically introduced in the nonparametric computation of the margins to avoid numerical problems at the boundary of [0, 1]^K.
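Sklar's decomposition can be checked numerically. The following sketch (not part of the chapter; it uses the copula package, which the CoClust package itself relies on) builds a bivariate joint distribution from a Gaussian copula with normal and exponential margins and verifies that the joint cdf equals the copula evaluated at the marginal cdfs:

```r
# Illustrative sketch: verifying Sklar's theorem numerically with the
# 'copula' package. Parameter values are arbitrary choices for illustration.
library(copula)

cop <- normalCopula(param = 0.5, dim = 2)
joint <- mvdc(cop, margins = c("norm", "exp"),
              paramMargins = list(list(mean = 0, sd = 1), list(rate = 1)))

x <- c(0.3, 1.2)
lhs <- pMvdc(x, joint)                          # joint cdf F(x1, x2)
rhs <- pCopula(c(pnorm(0.3), pexp(1.2)), cop)   # C(F1(x1), F2(x2))
all.equal(lhs, rhs)                             # the two sides coincide
```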

Copula models
While many different copula models are available in the literature (see [18,19] for details), the Elliptical and Archimedean families have proved the most useful in empirical modeling. The Elliptical family includes the Gaussian copula and the t-copula: both are symmetric, exhibit the strongest dependence in the middle of the distribution, and can take into account both positive and negative dependence, since −1 ≤ θ ≤ 1. The Archimedean family enables describing left asymmetry, right asymmetry, and weak symmetry among the margins through the Clayton, Gumbel, and Frank models, respectively. Clayton's copula has parameter θ ∈ (0, ∞) and, as θ approaches zero, the margins become independent. The dependence parameter θ of a Gumbel model is restricted to the interval [1, +∞), where the value 1 means independence. Finally, the dependence parameter θ of a Frank copula may assume any real value and, as θ approaches zero, the marginal distributions become independent. According to the type of copula model, the value of θ has a specific meaning, and the magnitudes of the dependence parameter are not comparable across copulas. It is always true that the greater the value of the dependence parameter, the stronger the association among the margins; since the relationship between θ and the concordance measures is well known, it is standard to convert θ into one of these, for example, Kendall's τ correlation coefficient. The families of copula models considered here are described in Table 1 and shown in Figure 1 in their bivariate version. Note that only single-parameter copula models are considered here. The copula model selection task is still an open research field. Although various statistical tests enable evaluating whether a specific model is plausible or not, no tool has thus far been recognized as the best. In the copula-based clustering context, this issue can be overcome, since the choice of the type of model would seem less important in terms of the goodness of the final clustering (see [5], Section 4.3), and a classic information criterion can be used, such as the Bayesian or the Akaike information criterion. This topic is discussed in detail in Section 3.

Table 1. Some standard single-parameter bivariate copulas with the range of the dependence parameter θ and its relation to Kendall's τ. u_k with k = 1, 2 are uniformly distributed variates so that x_k = F_k^{-1}(u_k) ~ F_k. Φ is the cumulative distribution function (cdf) of the standard normal distribution, Φ_G(u_1, u_2; θ) is the standard bivariate normal distribution, t_{2,ν}(·, ·; θ) denotes the standard bivariate Student-t distribution with ν degrees of freedom, and t_ν^{-1} the inverse univariate Student-t distribution function. D_1(x) denotes the Debye function (1/x) ∫_0^x t/(exp(t) − 1) dt.
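The θ–τ relations in Table 1 are implemented in the copula package, so dependence strength can be compared across families on the common τ scale. A brief sketch (illustrative, not from the chapter):

```r
# Converting between theta and Kendall's tau with the 'copula' package.
library(copula)

tau(claytonCopula(param = 2))   # Clayton: tau = theta/(theta + 2) = 0.5
tau(gumbelCopula(param = 2))    # Gumbel:  tau = 1 - 1/theta       = 0.5
iTau(frankCopula(), tau = 0.5)  # Frank theta implied by tau = 0.5
```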

CoClust algorithm
Di Lascio and Giannerini [4] proposed a clustering algorithm called CoClust that is able to cluster multivariate observations with a complex dependence structure. The basic underlying concept of CoClust is clustering multivariate dependent observations based on the likelihood copula fit estimated on the previously allocated observations. To do so, the CoClust assumes that the data are generated by a multivariate copula function whose arguments represent the clusters, and each cluster is thus generated by a (marginal) univariate density function. The type and strength of multivariate dependence across clusters are modeled through a copula function and its dependence parameter, respectively. Being copula-based, CoClust inherits all the advantages of copula theory, and the multivariate complex dependence structure of the DGP can be taken into account to perform the cluster analysis. However, CoClust in its first version had some significant limitations. For example, it automatically allocated all the observations to the clusters without discarding potentially irrelevant observations, implying a high computational burden. Di Lascio and Giannerini [5] proposed a new version of the CoClust algorithm that satisfactorily overcomes these limitations. This section hereafter describes the latest version of the CoClust algorithm, which is implemented in the R package CoClust.
The starting point of the CoClust algorithm is the standard (n × p) data matrix X in Eq. (8), in which n p-dimensional objects have to be grouped into K groups. CoClust can be applied either to the rows or to the columns of the data matrix according to the purpose of the analysis. In both cases, CoClust works with vectors and treats each row (column) of the data matrix X as a single element to be allocated to a cluster. The values within a row (or column) vector are treated as independent realizations of the same density function, so the observations within each cluster come from the same distribution. Here, CoClust is described as applied to the rows of the data matrix.

The basic procedure and selection of the number of clusters
The basic idea behind the CoClust consists in a forward procedure that allocates one K-plet of rows of the data matrix at a time, that is, one p-dimensional vector for each cluster at a time; the decision on the allocation of each K-plet of rows is based on the value of the log-likelihood of the copula fit. This likelihood is computed by using the K-plets already allocated and one allocation candidate, say x_{i′}, by varying the permutations of the observations in x_{i′} in order to find, if it exists, the combination that maximizes the copula fit. If the log-likelihood of the copula fitted on the observations already allocated plus the permutation of the selected K-plet increases, then the candidate K-plet is allocated to the clusters; otherwise, it is discarded, since theoretically it could either be independent from the identified DGP or derive from another DGP.
Before describing the clustering algorithm procedure, two aspects of the CoClust merit discussion: the construction of the K-plet candidate for allocation and the selection of the number of clusters K. The K-plet of rows that is candidate for allocation is constructed on the basis of the function H(·) in Eq. (9), a sort of multivariate measure of association based on the pairwise Spearman's r correlation coefficient, where Λ_1 is the subset of row index vectors already selected to compose a K-plet, Λ_2 is the subset of the remaining candidate row index vectors to complete it, and ψ is an aggregation function, for instance, the mean, median, or maximum.
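To fix ideas, the measure H(·) could be sketched in base R as follows. The function name and its exact form are hypothetical (the package's internal implementation may differ); the sketch only illustrates aggregating pairwise Spearman correlations between already selected rows and a candidate row through ψ:

```r
# Hypothetical sketch of the H(.) measure: aggregate the pairwise Spearman
# correlations between the rows already selected (Lambda1) and a candidate
# row, via an aggregation function psi (mean, median, or max).
H_measure <- function(X, selected, candidate, psi = mean) {
  rhos <- sapply(selected, function(i)
    cor(X[i, ], X[candidate, ], method = "spearman"))
  psi(abs(rhos))   # absolute values: an assumption made here for illustration
}

set.seed(1)
X <- matrix(rnorm(5 * 20), nrow = 5)
H_measure(X, selected = c(1, 2), candidate = 3, psi = median)
```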
As for the selection of K, one of the advantages of CoClust with respect to classic clustering techniques is its ability to automatically choose the number of clusters. Indeed, CoClust explores all the possibilities among those given by the user and selects the K on the basis of the log-likelihood of the copula estimated on the subsets of k-plets allocated up to the user's predefined step. The technical details are given below.
The main steps of the CoClust algorithm to cluster the n rows of the data matrix are the following:

1. By varying the number of clusters k in the set of possibilities defined by the user, with 2 ≤ k ≤ n:
   a. select a subset of n_k k-plets of rows of the data matrix in Eq. (8) on the basis of the measure in Eq. (9);
   b. fit the copula model on the n_k k-plets of rows through the semiparametric estimation method described in Section 2.
2. Select the subset of n_k k-plets of rows, say n_K K-plets, that maximizes the log-likelihood of the copula; hence, the number of clusters K, that is, the dimension of the copula, is automatically chosen and n_K K-plets are already allocated.
3. Select a K-plet of rows among those remaining by using the measure in Eq. (9) and estimate K! copulas by using the observations already clustered and a permutation of the candidate for allocation.
4. Allocate the permutation of the selected K-plet to the clustering by assigning each row to the corresponding cluster only if it increases the log-likelihood of the copula fit; otherwise, drop the entire K-plet of rows.
5. Repeat steps 3 and 4 until all the observations are evaluated, that is, either allocated or discarded.
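Steps 3 and 4 above can be sketched as follows. This is a simplified illustration, not the package's internal code: the function names are hypothetical, each candidate element is reduced to a single observation (p = 1) for brevity, and the copula fit uses the copula package:

```r
# Hypothetical sketch of the allocate-or-discard step: a candidate K-plet is
# allocated only if some permutation of it increases the copula log-likelihood.
library(copula)

perms <- function(v) {                       # all permutations of a vector
  if (length(v) == 1) return(list(v))
  out <- list()
  for (i in seq_along(v))
    for (p in perms(v[-i])) out <- c(out, list(c(v[i], p)))
  out
}

cluster_loglik <- function(clusters, cop) {
  U <- pobs(do.call(cbind, clusters))        # pseudo-observations
  fit <- fitCopula(cop, U, method = "mpl")
  as.numeric(logLik(fit))
}

allocate_step <- function(clusters, candidate, cop) {
  K <- length(clusters)
  base_ll <- cluster_loglik(clusters, cop)
  best <- list(ll = base_ll, perm = NULL)
  for (p in perms(seq_len(K))) {             # try all K! orderings
    trial <- Map(c, clusters, candidate[p])
    ll <- cluster_loglik(trial, cop)
    if (ll > best$ll) best <- list(ll = ll, perm = p)
  }
  if (is.null(best$perm)) return(clusters)   # no improvement: discard K-plet
  Map(c, clusters, candidate[best$perm])     # allocate the best permutation
}

set.seed(3)
U  <- rCopula(41, frankCopula(5))
cl <- list(qnorm(U[1:40, 1]), qexp(U[1:40, 2]))    # two clusters, 40 obs each
cand <- list(qnorm(U[41, 1]), qexp(U[41, 2]))      # candidate 2-plet
cl <- allocate_step(cl, cand, frankCopula(dim = 2))
```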
At the end of the procedure, we obtain a clustering of K clusters, each containing at most (n/K) · p independent observations, such that the multivariate dependence relationship across clusters can be revealed. Hence, the focus is on recovering the multivariate relationships across clusters rather than the within-cluster relationships typical of classic clustering methods. A picture of the final CoClust clustering is given in Figure 2: each cluster is a set of independent and identically distributed realizations from the same marginal distribution, while observations across clusters share the same multivariate dependence structure.
Note that since in each step of the procedure non-nested models are compared, that is, copula models with a single dependence parameter, the described log-likelihood-based criterion is equivalent to the well-known Bayesian and Akaike information criteria. Finally, note that in the current CoClust version, the selection of the number of clusters K is based on a representative subset of n_k observations. Hence, the algorithm chooses the number of clusters K by estimating the ∑_{k=K_min}^{K_max} n_k k fits required, where [K_min, K_max] is the range of the number of clusters predefined by the user with K_min ≥ 2, and n_k is chosen by the user with n_k ≪ p. This keeps the computational complexity under control, since it does not depend on the sample size.

Selecting the copula model
The CoClust algorithm has not been implemented to automatically perform the copula model selection task and requires employing an information criterion a posteriori. The Bayesian information criterion (henceforth BIC) for a K-dimensional copula model m is expressed as follows:

BIC(m) = −2 ℓ_m(θ̂) + s log(n.obs)

where ℓ_m(θ̂) is the copula log-likelihood of model m evaluated at the estimate θ̂, which is as in Eq. (5) or Eq. (7) with the summation over the number of allocated observations n.obs, equal to at most (n/K) · p (i.e., n/K p-dimensional vectors), and s is the number of parameters. According to [23], we select the copula model that minimizes the BIC. Similarly, the Akaike information criterion (henceforth AIC) results in:

AIC(m) = −2 ℓ_m(θ̂) + 2s

and can also be used to select the copula model.

Figure 2. The basic concept underlying the CoClust algorithm. Each element in a cluster is a row of the data matrix with p elements.
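A posteriori model selection of this kind might be carried out as in the following sketch (an illustration using the copula package; the chapter's own computations may differ), which fits three candidate single-parameter copulas to the same pseudo-observations and compares their BIC and AIC:

```r
# Sketch: comparing candidate copula models via BIC/AIC computed from the
# maximized log-likelihood of each fit. Data are simulated for illustration.
library(copula)

set.seed(2)
U <- pobs(rCopula(200, frankCopula(5, dim = 3)))   # pseudo-observations

ic <- sapply(list(frank   = frankCopula(2, dim = 3),
                  clayton = claytonCopula(2, dim = 3),
                  gumbel  = gumbelCopula(2, dim = 3)),
             function(cop) {
  fit <- fitCopula(cop, U, method = "mpl")
  ll  <- as.numeric(logLik(fit))
  s   <- length(coef(fit))                 # number of copula parameters (1 here)
  c(BIC = -2 * ll + s * log(nrow(U)),
    AIC = -2 * ll + 2 * s)
})
ic   # choose the model minimizing BIC (or AIC)
```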

Assessing the CoClust performance
The goodness of the CoClust algorithm in finding the true multivariate clustering structure underlying the data has been extensively investigated. Specifically, the first version of CoClust [4] was tested on simulated data under different scenarios and compared with model-based clustering [1,2]. These experiments show that, both when the DGP is a copula and when it is misspecified, CoClust is able to identify both the true number of clusters and their size in most situations. Moreover, in comparison with model-based clustering, CoClust appears better suited to clustering dependent data. In [5], a more sophisticated Monte Carlo study was carried out, investigating the new features of the current version of the CoClust algorithm. Here, the current version clearly outperforms the previous version by [4], and CoClust's ability to find the correct number of clusters and to reconstruct the true k-plets by varying the dimension of the copula, the aggregation function ψ, and the copula model appears to be very satisfactory. Furthermore, [5] also obtained good results in assessing CoClust's ability to drop observations that are independent of the true DGP from the clustering, as well as to distinguish two different DGPs in the same dataset.
As for real data applications, the CoClust algorithm has been successfully applied to several datasets. In relation to biomedical applications, [4] apply the CoClust to microarray data to formulate hypotheses on the possible co-regulation and functional relations between genes, [5] use the copula-based clustering method to identify biologically and clinically relevant groups of tumor samples, and [24] attempt to identify organ type from tumor-derived cancer cell lines. Applications in other fields include [25], where the purpose of the analysis is investigating changes in EU country diets in accordance with common European policies and guidelines on healthy diets, and [24], where CoClust is used to investigate the geographic distribution of (annual maxima) rainfall measurements.

The R implementation of the CoClust algorithm
The copula-based clustering algorithm procedure is implemented in the R package CoClust [26]. It must be installed in the usual way, that is:

R> install.packages("CoClust")
and then it must be loaded through the usual code:

R> library("CoClust")
The code of the CoClust package is written entirely in R, in order to take advantage of an easily accessible open-source system and its input/output facilities.

List of functions and subroutines
The main R function is CoClust(), which performs the copula-based clustering; the remaining functions in the package are auxiliary routines called internally by CoClust().

The CoClust function
The main function of the package CoClust is the R function CoClust(), which performs copula-based clustering as described in Section 3. Several options are available, which mainly allow the user to:

• fit a variety of copula models (argument copula) with different estimation procedures for the margins and for the copula (arguments method.ma and method.c, respectively); specifically, all the copula models belonging to the Elliptical and Archimedean families described in Section 2 can be estimated through the estimation methods implemented in the R package copula [27][28][29][30], that is, maximum pseudo-likelihood estimators based on two different variance estimators, the inversion of Kendall's τ estimator, and the inversion of Spearman's r estimator; as for the margins, two estimation methods have been implemented, one parametric and one nonparametric: the maximum likelihood method as in Eq. (4) and the empirical cumulative distribution function in Eq. (6);

• set the range or set of dimensions of the copula model, that is, the number of clusters, for which the function tries the clustering (argument dimset);

• set the dimension of the sample of units used for selecting the number of clusters (argument noc);

• select the aggregation function of the pairwise Spearman's r used to select the k-plets, among the mean, the median, and the maximum (argument fun), as defined in Eq. (9);

• specify the likelihood criterion used for selecting the number of clusters among the AIC, the BIC (as defined in Eqs. (10) and (11)), and the log-likelihood without penalty terms (argument penalty).
The argument copula allows specifying a copula model among those described in Section 2.1.
As for the selection of the "best" model, CoClust can be run by varying the type of models of interest and selecting the one that fits best a posteriori using one of the criteria introduced in Section 3.2.
The main output of the function CoClust is an object of the S4 class "CoClust", a list whose elements include:

1. Number.of.Clusters: the number K of selected and identified clusters;

2. Index.Matrix: an n.obs × (K + 1) matrix, where n.obs is the number of observations put into each cluster; the matrix contains the row indexes of the observations of the data matrix (Eq. (8)) and, in the last column, the log-likelihood of the copula fit;

9. Index.dimset: a list that, for each k in dimset, contains the index matrix of the initial set of n_k observations used to select the number of clusters, together with the associated maximized log-likelihood of the copula fit.

Simulated examples
This section shows how to use the CoClust package on data simulated from different DGPs.
In the first example, the data are drawn from a joint density function with different margins, whereas in the second example, a misspecified DGP is used.
In these examples, we focus only on the semiparametric approach described in Section 2 due to its theoretical and computational advantages with respect to the fully parametric approach. Moreover, the latter has only been implemented for Gaussian margins.

Example 1
In this example, we build a 3-variate joint density function through a 3-dimensional Frank copula with dependence parameter such that Kendall's τ = 0.7 and three different margins: a Gaussian with parameters μ = 7 and σ = 2, a Gamma with shape and rate set to 3 and 4, respectively, and a Beta with parameters α = 2 and β = 1. To do so, we employ the function mvdc of the copula package [27][28][29][30]. Next, we generate a data matrix X with 15 rows and 21 columns and build the matrix of the true cluster indexes. Finally, we apply the function CoClust to the rows of X, recover the multivariate dependence structure of the data, and compare the obtained clustering with the true one.
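The DGP just described might be simulated as in the following sketch. The seed and the particular arrangement of the simulated values into a 15 × 21 matrix are illustrative assumptions, not necessarily those used in the chapter:

```r
# Sketch of the DGP: a 3-dimensional Frank copula with Kendall's tau = 0.7
# and Gaussian(7, 2), Gamma(3, 4), and Beta(2, 1) margins.
library(copula)

theta <- iTau(frankCopula(dim = 3), tau = 0.7)  # Frank parameter for tau = 0.7
dgp <- mvdc(frankCopula(theta, dim = 3),
            margins = c("norm", "gamma", "beta"),
            paramMargins = list(list(mean = 7, sd = 2),
                                list(shape = 3, rate = 4),
                                list(shape1 = 2, shape2 = 1)))

set.seed(123)
Y <- rMvdc(5 * 21, dgp)   # 105 realizations of the 3 dependent margins
# One possible layout of the true 3-plets as a 15 x 21 matrix:
# rows 1-5 from margin 1, rows 6-10 from margin 2, rows 11-15 from margin 3.
X <- rbind(matrix(Y[, 1], nrow = 5),
           matrix(Y[, 2], nrow = 5),
           matrix(Y[, 3], nrow = 5))
dim(X)   # 15 x 21
```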
We apply the CoClust to the 15 × 21 data matrix X using the maximum likelihood estimation method for the copula, the empirical cumulative distribution function for the three margins, and leaving the remaining arguments at their default values:

R> clust <- CoClust(X, dimset = 2:4, noc = 2, copula = "frank",
+                   method.ma = "empirical", method.c = "ml", writeout = 1)

The output is as follows:

R> clust

To look at specific objects of the resulting list, it is possible to select, among others, the following information:

R> clust@"Number.of.Clusters"   # Selected number of clusters
R> clust@"Dependence"$Param     # Estimated copula parameter
R> clust@"Data.Clusters"        # Clustered data

To compare the obtained clustering with the true clustering, we can input the matrix of true cluster indexes built above. The obtained clustering is perfect: CoClust is able to recognize the exact structure underlying the data. Note that the label of each cluster, that is, the order of the margins, is not relevant. The only important aspect is the composition of each cluster, that is, the row indexes in each column of index.clust and their order, which has to be such that it reconstructs the exact 3-plets across the columns.

Application to wine dataset
In this section, an application of the CoClust package to a real dataset is shown. [32] analyze a set of Italian wines by observing the chemical properties of 178 specimens of three types of wines (Barolo, Grignolino, and Barbera) produced in the Piedmont region in Italy. The data are available in the package sn under the name wines.
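The analysis might be set up as in the following sketch. The argument values mirror those of Example 1 and are illustrative assumptions, not necessarily the settings used in the chapter:

```r
# Sketch: running CoClust on the wine data from the 'sn' package.
library(CoClust)
data(wines, package = "sn")

W <- as.matrix(wines[, sapply(wines, is.numeric)])  # chemical measurements only
clustF <- CoClust(W, dimset = 2:4, noc = 2, copula = "frank",
                  method.ma = "empirical", method.c = "ml")
clustF@"Number.of.Clusters"   # selected number of clusters
```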
Similarly, the other two copula models can be used, as in clustC and clustG above. The results appear not to be affected by the type of model used, even though, based on the log-likelihood of the copula fitted on the final clustering, the most appropriate model appears to be the Gumbel, with a log-likelihood of 527.3022 (compared with 500.8835 for the Frank copula and 429.184 for the Clayton copula).

Conclusion
In this chapter, we describe a copula-based clustering algorithm and its implementation in the R package CoClust. One major advantage of this new package is that it provides an algorithm that is able to cluster multivariate observations by taking into account their underlying complex multivariate dependence structure. Being copula-based, the CoClust algorithm inherits the benefits of the copula. Thus, potentially any type of multivariate dependence structure can be handled and the most appropriate method can be employed to estimate both a probability model for each cluster/margin and the copula model.
The current version of the R package implements the clustering algorithm procedure in the main function CoClust. It enables the user to simultaneously choose the copula model, the estimation method for the margins and for the copula, and the aggregation function used to construct the k-plets of observations that are candidates for allocation. Moreover, the range (or set) of the number of clusters from among which the procedure automatically selects the best one, and the sample size used to select it, can both be varied.
As with many other software packages, the CoClust package is continually being augmented and improved. We are currently investigating possible graphical solutions for the final clustering and implementing some measures to validate the clustering solution. Another future direction is expanding the functionality of the CoClust package to allow comparing its solution with those of other clustering algorithms, such as mixture-based clustering and hierarchical clustering methods.