Some standard single parameter bivariate copulas with the range of the dependence parameter and its relation with Kendall’s . with are uniformly distributed variates so that . is the cumulative distribution function (cdf) of the standard normal distribution, is the standard bivariate normal distribution, denotes the standard bivariate student-distribution with degrees of freedom, and the inverse univariate student-distribution function. denotes the “Debye” function .
The aim of this chapter is to present and describe the R package CoClust, which enables implementing a clustering algorithm based on the copula function. The copula-based clustering algorithm, called CoClust, was introduced by Di Lascio and Giannerini in 2012 (Journal of Classification, 29(1):50–75), improved in 2016 (Statistical Papers, p.1–17, DOI 10.1007/s00362-016-0822-3), and is able to find clusters according to the complex multivariate dependence structure of the data-generating process. Hence, among other advantages, the CoClust overcomes the limitations of classic approaches that only deal with linear bivariate relationships. The first part of the chapter briefly describes the clustering algorithm. The second part illustrates the clustering procedure through the R package CoClust and presents numerical examples showing how the main R commands can be used to perform a fully developed clustering of multivariate dependent data.
- clustering algorithm
- copula function
- multivariate dependence structure
- R package
Cluster analysis is an unsupervised classification method that aims to detect a structure within data by assigning a set of objects (observations or variables) into groups, called clusters, whereby objects in the same cluster are in some sense more closely related to each other than objects assigned to different clusters.
Literature on clustering methods is very extensive and different criteria are taken into account to organize and present these methods. One such criterion is the mathematical object of the clustering methods: a distance or dissimilarity measure versus a probability model. In this chapter, we consider the model-based clustering methods that assume the data matrix is generated according to a specific data generating process (henceforth DGP). The classic model-based clustering method in [1, 2] is based on a mixture of multivariate probability distributions, such as the multivariate normal. However, this approach only accounts for the linear dependence between objects so that it inherits all the limitations of the linear correlation coefficient. Hence, we here focus on clustering methods that assume the data matrix is generated by a -dimensional copula  such that each of the clusters is represented by a (continuous) univariate density function and the complex multivariate relationship among clusters is expressed by the copula and its dependence parameter. Specifically, this chapter aims to describe the copula-based clustering algorithm first introduced by  and improved by , presenting in detail its implementation in the
Most clustering algorithms take as input some parameters, such as number of clusters, the distance or density of clusters, or the number of points in a cluster, and a starting classification. Some important benefits of the
The chapter is organized as follows. Section 2 presents the theoretical tools of copula theory essential to understanding the copula-based clustering algorithm introduced and described in Section 3. Section 4 describes in detail the R implementation of the CoClust algorithm and illustrates its use on simulated DGPs. In Section 5, an application to a real dataset is presented, while a brief conclusion follows in Section 6.
2. Copula theory
The copula function [17, 18, 19, 20] was born in the probabilistic metric space with Sklar’s theorem  stating that every joint distribution function can be expressed in terms of marginal distribution function and the copula distribution function as follows:
for all (where denotes the extended real line). According to this theorem, any joint probability function can be split into the margins and a copula , so that the latter represents the association among variables, that is, the multivariate dependence structure of a joint density function [17, 18, 19, 20]:
Such separation determines the modeling flexibility of copulas, since it enables (i) freely choosing the distribution of the margins and, separately, that of the copula, (ii) decomposing the estimation problem into two steps: in the first step, the margins are estimated, in the second step, the copula model is estimated, and (iii) combining different estimation methods or approaches.
The log-likelihood function of is composed of two positive terms as follows:
where the first term involves the copula density and its parameter , and the second involves marginal densities and their parameters , and the whole set of parameters to be estimated is . Thus, it is possible to estimate by exploiting the decomposition into two terms of Eq. (2) [20, Chapter 4] using a sequential two-step maximum likelihood method, called inference for margins (henceforth IFM) . This method estimates the marginal parameters in the first step and uses them to estimate the parameter of the copula function in the second step, in either a full or semi-parametric approach.
A full parametric approach for the IFM method is based on the estimation of the marginal parameters in the first step by the maximum likelihood estimation for each margin:
where each marginal distribution has its own parameters . In the second step, the dependence parameter given for is estimated by:
using the maximum likelihood estimation method.
The IFM method can also be used in a semi-parametric approach  where the margins are modeled without assumptions on their parametric form, that is, through the following empirical cumulative distribution function :
where is computed from with and is the sample size, while the copula parameter is estimated by using the following maximum log-likelihood function:
Note that the scaling factor in Eq. (6) is typically introduced in the nonparametric computation of the margins to avoid numerical problems at the boundary of .
2.1. Copula models
While many different copula models are available in literature (see [18, 19] for details), the Elliptical and Archimedean families are shown to be the most useful in empirical modeling. The Elliptical family includes the Gaussian copula and the t-copula: both are symmetric, exhibit the strongest dependence in the middle of the distribution, and can take into account both positive and negative dependence, since . The Archimedean family enables describing both left and right asymmetry as well as weak symmetry among the margins using the Clayton, Gumbel, and Frank models, respectively. Clayton’s copula has the parameter and as approaches zero, the margins become independent. The dependence parameter of a Gumbel model is restricted to the interval where the value means independence. Finally, the dependence parameter of a Frank copula may assume any real value and as approaches zero, the marginal distributions become independent. According to the type of copula model, the value of has a specific meaning and the magnitudes of the dependence parameter are not comparable across copulas. It is always true that the greater the value of the dependence parameter, the stronger the association among the margins, but since the relationship between and the concordance measures is well known, it is standard to convert to these, for example, to the Kendall’s correlation coefficient. The families of copula models considered here are described in Table 1 and shown in Figure 1 in their bivariate version. Note that here only single parameter copula models are considered.
The copula model selection task is still an open research field. Although various statistical tests enable evaluating whether a specific model is plausible or not, no tool has thus far been recognized as the best. In the copula-based clustering context, this issue can be overcome, since the choice of the type of model would seem less important in terms of the goodness of the final clustering (see , Section 4.3), and a classic information criterion can be used, such as the Bayesian or the Akaike information criterion. This topic is discussed in detail in Section 3.
3. CoClust algorithm
Di Lascio and Giannerini  proposed a clustering algorithm called CoClust that is able to cluster multivariate observations with a complex dependence structure. The basic underlying concept of CoClust is clustering multivariate dependent observations based on the likelihood copula fit estimated on the previously allocated observations. To do so, the CoClust assumes that the data are generated by a multivariate copula function whose arguments represent the clusters, and each cluster is thus generated by a (marginal) univariate density function. The type and strength of multivariate dependence across clusters are modeled through a copula function and its dependence parameter, respectively. Being copula-based, CoClust inherits all the advantages of copula theory, and the multivariate complex dependence structure of the DGP can be taken into account to perform the cluster analysis. However, CoClust in its first version had some significant limitations. For example, it automatically allocated all the observations to the clusters without discarding potentially irrelevant observations, implying a high computational burden. Di Lascio and Giannerini  proposed a new version of the CoClust algorithm that satisfactorily overcomes these limitations. This section hereafter describes the latest version of the CoClust algorithm, which is implemented in the
The starting point of the CoClust algorithm is the standard data matrix :
in which -dimensional objects have to be grouped in groups. CoClust can be applied either to the row or to the column data matrix according to the purpose of the analysis. In both cases, CoClust works with vectors and treats each row (column) of the data matrix as a single element to be allocated to a cluster. The values within a row (or column) vector are treated as independent realizations of the same density function, thus observations for each cluster from the same distribution. Here, CoClust is described as applied to the rows of the data matrix.
3.1. The basic procedure and selection of the number of clusters
The basic idea behind the CoClust consists in a forward procedure that allocates a -plet of the row data matrix at a time, that is, a -dimensional vector for each cluster at a time, and the decision on the allocation of each -plet of rows is based on the value of the log-likelihood of the copula fit. This likelihood is computed by using the -plets already allocated and one allocation candidate, say , by varying the permutations of observations in in order to find, if it exists, the combination that maximizes the copula fit. If the log-likelihood of the copula fitted on the observations already allocated plus the permutation of the selected -plet increases, then the candidate -plet is allocated to the clusters; otherwise, it is discarded, since theoretically it could be either independent from the identified DGP or derive from another DGP.
Before describing the clustering algorithm procedure, two aspects of the CoClust merit a discussion: the construction of the -plet candidate to the allocation and the selection of number of clusters . The -plet of rows candidate to the allocation is constructed based on the following function , which is a sort of multivariate measure of association based on the pairwise Spearman’s correlation coefficient:
where is the subset of row index vectors already selected to compose a -plet, is the subset of the remaining candidate row index vectors to complete it, and is an aggregation function, for instance, the mean, median, or maximum.
As for the selection of , one of the advantages of CoClust with respect to classic clustering techniques is its ability to automatically choose the number of clusters. Indeed, CoClust explores all the possibilities among those given by the user and selects the on the basis of the log-likelihood of the copula estimated on the subsets of -plets allocated up to the user’s predefined step. The technical details are given below.
The main steps of the CoClust algorithm to cluster the row data matrix are described in the following:
by varying the number of clusters in the set of possibilities defined by the user and such that ,
select the subset of -plets of rows, say -plets, that maximizes the log-likelihood of the copula; hence, the number of clusters , that is, the dimension of the copula, is automatically chosen and -plets are already allocated;
select a -plet of rows among those remaining by using the measure in Eq. (9) and estimate copulas by using the observations already clustered and a permutation of the candidate to the allocation;
allocate the permutation of the selected -plet to the clustering by assigning each row to the corresponding cluster only if it increases the log-likelihood of the copula fit, otherwise drop the entire -plet of rows;
repeat steps 3 and 4 until all the observations are evaluated, that is, either allocated or discarded.
At the end of the procedure, we obtain a clustering of clusters each containing a maximum independent observations such that the multivariate dependence relationship across clusters can be revealed. Hence, attention in recovering the multivariate relationships does not rely on the within-cluster relationships, typical of classic clustering methods. A picture of the final CoClust clustering is given in Figure 2. Each cluster is a set of independent and identically distributed realizations from the same marginal distribution while observations across clusters share the same multivariate dependence structure.
Note that since in each step of the procedure non-nested models are compared, that is, copula models with a single dependence parameter, the described log-likelihood based criterion is equivalent to the well-known Bayes information criterion and Akaike information criterion. Finally, note that in the current CoClust version, the selection of the number of clusters is based on a representative subset of observations. Hence, the algorithm chooses the number of clusters by estimating the fits required, where is the range of the number of clusters predefined by the user with and is chosen by the user with . This allows keeping the computational complexity under control since it does not depend on sample size.
3.2. Selecting the copula model
The CoClust algorithm has not been implemented to automatically perform the selection copula model task and requires employing an information criterion a posteriori. The Bayesian information criterion (henceforth BIC) is expressed as follows for a -dimensional copula model :
where is as in Eq. (5) or Eq. (7) with the summation over the number of allocated observations, which equals maximum (i.e., -dimensional vectors) and is the number of parameters. According to , we select the copula model that minimizes the BIC. Similarly, the Akaike information criterion (henceforth AIC) results in:
and can also be used to select the copula model.
3.3. Assessing the CoClust performance
The goodness of the CoClust algorithm in finding the true multivariate clustering structure underlying the data has been extensively investigated. Specifically, the first version of CoClust  was tested on simulated data for different scenarios and compared with model-based clustering [1, 2]. This shows that, both when the DGP is a copula and when it is misspecified, CoClust appears to be able to identify both the true number of clusters and their size in most situations. Moreover, in comparing model-based clustering, CoClust appears better suited to clustering dependent data. In , a more sophisticated Monte Carlo study was carried out, investigating the new features of the current version of the CoClust algorithm. Here, the current version of the algorithm clearly outperforms the previous version by  and CoClust’s ability to find the correct number of clusters and to reconstruct the true -plets by varying the dimension of the copula, the aggregation function , and the copula model appears to be very satisfactory. Furthermore,  also obtained good results in assessing CoClust’s ability to drop from the clustering observations that are independent of the true DGP as well as distinguishing two different DGPs in the same dataset.
As for real data applications, the CoClust algorithm has been successfully applied to several datasets. In relation to biomedical applications,  apply the CoClust to microarray data to formulate hypotheses on the possible co-regulation and functional relations between genes,  use the copula-based clustering method to identify biologically and clinically relevant groups of tumor samples, and  attempt to identify organ type from cancer cell lines from tumors. Applications in other fields include , where the purpose of the analysis is investigating changes in EU country diets in accordance with common European policies and guidelines on healthy diets, and , where CoClust is used to investigate the geographic distribution of (annual maxima) rainfall measurements.
4. The R implementation of the CoClust algorithm
The copula-based clustering algorithm procedure is implemented in the
and then it must be loaded through the usual code:
The code of the
4.1. List of functions and subroutines
The main R function is
4.2. The CoClust function
The main function of the package
fit a variety of copula models (by setting the argument
copula) with different types of estimation procedures for margins and for copulas (arguments method.maand method.c,respectively); specifically, all the copula models belonging to the Elliptical and the Archimedean family described in Section 2 can be estimated through the estimation methods implemented in the Rpackage copula[27, 28, 29, 30] that are maximum pseudo-likelihood estimators based on two different variance estimators, the inversion of Kendall’s estimator and the inversion of Spearman’s estimator; as for the margins, two different estimation methods have been implemented, one parametric and one nonparametric: the maximum likelihood method as in Eq. (4) and the empirical cumulative distribution function in Eq. (6);
set the range or set of dimensions for the copula model, that is, number of clusters, for which the function tries the clustering (argument
set the dimension of the sample units used for selecting the number of clusters (argument
select the combination function of the pairwise Spearman’s used to select the -plets among the mean, the median, or the maximum (argument
fun) as defined in Eq. (9);
The typical use of the function
The main output of the function
Number of Clusters: the number of selected and identified clusters;
Index Matrix: a matrix where n.obs is the number of observations put into each cluster; the matrix contains the row indexes of the observations of the data matrix
m(Eq. (8)) and in the last column the log-likelihood of the copula fit;
Data Clusters: the data matrix of the final clustering; each column contains the observations allocated in a cluster;
Dependence: a list containing:
Model: the copula model used for the clustering; Param: the estimated dependence parameter between/among clusters; Std.Err: the standard error of Param; P.val: the p-value associated to the null hypothesis ;
LogLik: the maximized log-likelihood copula fit;
Est.Method: the estimation method used for the copula fit;
Opt.Method: the optimization method used for the copula fit;
LLC: the value of the log-likelihood criterion for each in
Index.dimset: a list that, for each in
dimset, contains the index matrix of the initial set of observations used to select the number of clusters, together with the associated maximized log-likelihood copula fit.
4.3. Simulated examples
This section shows how to use the
In this example, we build a 3-variate joint density function through a 3-dimensional Frank copula with dependence parameter such that the Kendall’s and three different margins: a Gaussian with parameters , a Gamma with shape and rate, respectively, set to 3 and 4, and a Beta with parameters . To do so, we employ the function
Code to generate the example dataset is given in the following. We first define the DGP:
and then generate the data and save the true clustering in index.true:
Note that n is the number of observations for each margin.
We apply the
The output is as follows:
To look at specific objects of the resulting list, it is possible to select, among others, the following information:
To compare the obtained clustering with the true clustering we can input:
to obtain as follows:
The obtained clustering is perfect, CoClust is able to recognize the exact structure underlying the data. Note that the label of each cluster, that is, the order of the margins, is not relevant. The only important aspect is the composition of each cluster, that is, the row indexes in each column of
To apply the
In this example, we use a different DGP from the copula, thus showing the use of CoClust in the misspecification case. Specifically, a data matrix is drawn from a three-dimensional skew-normal distribution through the R package
On the console, it is possible to monitor the number of observations already allocated (argument writeout). Indeed, while CoClust runs, the following information appears on the console:
To look at the obtained clustering and its details, one has to input:
Note that when the number of -plets to be allocated is not small, the goodness of the obtained clustering is difficult to determine. Hence, for example, two functions can be exploited to assess the quality of the final clustering:
and the two functions
The obtained values inform us that of -plets deriving from the true DGP are correctly allocated and of -plets in the final clustering are correctly allocated.
5. Application to wine dataset
In this section, an application of the
A subset of randomly selected wines has been analyzed through CoClust by varying the number of clusters from 2 to 7 and the copula model among the three models of the Archimedean family. Since Grignolino is a type of wine with characteristics between those of Barolo and Barbera, we work with a sample of only these two last types of wines. Code is as follows:
To evaluate the final clustering obtained with a specific copula model, say the Frank model, and to compare it with the true classification of the 12 selected wines, the code is as follows:
CoClust selects 6 clusters and allocates to each cluster the two types of wines. Thus, across clusters, we can perfectly recognize the two types of Italian wines and in each cluster we have different wines with different (e.g., independent) chemical characteristics.
Similarly, the other two copula models can be used as in
In this chapter, we describe a copula-based clustering algorithm and its implementation in the
The current version of the R package implements the clustering algorithm procedure in the main function
As with many other software packages,
The author acknowledges the support of the Free University of Bozen-Bolzano, Faculty of Economics and Management, via the project “Aggregation functions for Innovation and Data Analysis (AIDA)” and Professor Simone Giannerini, University of Bologna, with whom the first version of the package was developed.
The following two functions are useful to evaluate the goodness of the final clustering obtained through the CoClust algorithm when true clustering or benchmark clustering is available. The arguments of these two functions are