Many real-world datasets can be naturally described by multiple views. Due to this, multi-view learning has drawn much attention from both academia and industry. Compared to single-view learning, multi-view learning has demonstrated plenty of advantages. Clustering has long been serving as a critical technique in data mining and machine learning. Recently, multi-view clustering has achieved great success in various applications. To provide a comprehensive review of the typical multi-view clustering methods and their corresponding recent developments, this chapter summarizes five kinds of popular clustering methods and their multi-view learning versions, which include k-means, spectral clustering, matrix factorization, tensor decomposition, and deep learning. These clustering methods are the most widely employed algorithms for single-view data, and lots of efforts have been devoted to extending them for multi-view clustering. Besides, many other multi-view clustering methods can be unified into the frameworks of these five methods. To promote further research and development of multi-view clustering, some popular and open datasets are summarized in two categories. Furthermore, several open issues that deserve more exploration are pointed out in the end.
- multi-view clustering
- multi-view k-means
- multi-view spectral clustering
- multi-view matrix factorization
- tensor decomposition
- deep learning
Clustering is one of the most critical unsupervised learning techniques, which has been widely applied for data analysis, such as social network analysis, gene expression analysis, heterogeneous data analysis, and market analysis. The goal of clustering is to partition a dataset into several groups such that data samples in the same group are more similar than those in different groups. Clustering plays an important role in mining the hidden patterns. However, most of the existing clustering algorithms are designed for single-view data.
With the rapid development of Internet and communication technology (ICT), the accesses to extract data are dramatically extended. That is, data can be collected from multiple sources or multiple facets. In such setting, each datum is associated with much richer information, which results in the requirement that to mine the intrinsic and valuable patterns hidden in the data, it is a necessity to take full advantage of the information contained in multiple sources. This issue is formally referred to as multi-view learning. To be more specific, each view corresponds to one source of information. For example, web pages can be described by both the page-contents (one view) and the hyperlink information (another view). Besides, different facets of a datum can also be treated as different views. For instance, an image can be characterized by its shape, color, and location.
Obviously, integrating the information contained in multiple views can bring great benefits for data clustering. The most straightforward way to utilize the information of all views is to concatenate the data features of each view together and then perform the traditional clustering methods such as -means. However, such a method lacks the ability to distinguish the different significance of different views. That is, the important views and less important views are treated equally, which may degrade the ultimate performance severely. To take better advantage of the multi-view information, the ideal approach is to simultaneously perform the clustering using each view of data features and integrate their results based on their importance to the clustering task. Formally, this approach is known as multi-view clustering.
As an emerging and effective paradigm in data mining and machine learning, multi-view clustering refers to the clustering of the same class of data samples with multi-view representations, either from various information sources or from different feature generators. It is clear that if the clustering method cannot cope appropriately with multi-views, these views may even degrade the performance of multi-view clustering. To make use of multi-view information to improve clustering results, there are two main challenges to overcome. The first one is how to naturally ensemble the multiple clustering results of all the views. The second one is how to learn the importance of different views to the clustering task. In addition, these two issues should be figured out simultaneously. Thus, to achieve these goals, new clustering objective function should be designed, followed by the new solving method.
Multi-view clustering was first studied by Bickel and Scheffer  in 2004. They extended the classic -means and expectation maximization (EM) clustering methods to the multi-view environment to deal with text data with two conditionally independent views. Based on this seminal work, a variety of multi-view clustering methods have been proposed over the past two decades [2, 3, 4]. Since covering all the proposed methods in one chapter is hard, to provide a comprehensive review of the typical multi-view clustering methods and their corresponding recent developments, we summarize five kinds of popular clustering methods and their multi-view learning versions, which include -means, spectral clustering, matrix factorization, tensor decomposition, and deep learning. This is based on the consideration that these clustering methods are the most widely employed algorithms for single-view data, and lots of efforts have been devoted to extending them for multi-view clustering. Besides, many other multi-view clustering methods can be unified into the frameworks of these five methods. Therefore, when readers become familiar with these five multi-view clustering methods, they can capture the core ideas of other multi-view clustering methods easily. This chapter is self-contained, which follows a line of introduction from the preliminaries of these clustering methods for single-view data to their variant forms for multi-view clustering.
The remainder of this chapter is organized as follows. Section 2 describes the benefits of multi-view clustering. Section 3 details the aforementioned five multi-view clustering methods. Section 4 summarizes two kinds of popular open datasets. Several open issues are illustrated in Section 5. Section 6 concludes this chapter.
2. Benefits of multi-view clustering
Compared with the clustering methods that are implemented on single-view data, multi-view clustering is expected to obtain more robust and novel partitioning results by exploiting the redundant and complementary information in different views , as stated in the following sections.
2.1. Benefit one: accurate description of data
It is obvious that single-view data may contain incomplete knowledge, while multi-view data usually contains complementary and redundant information, which results in a more accurate description of the data. For example, it may fail to identify the intrinsic community structures of a social network via just leveraging the friendships. However, if more information such as users’ demographics can be obtained, it is more inclined to find out the implicit relationships between users.
2.2. Benefit two: reducing noises of data
Even when the information contained in single-view data is complete, there may exist some unavoidable noises. It is apparent that data cleaning is one critical issue in data analysis, which can tremendously affect the performance of clustering algorithms. It is quite hard and costly to remove all the noises of data, and thus single-view noisy data usually leads to unsatisfactory clustering results. On the other hand, multi-view clustering is able to circumvent the side effect of noises or corrupted data in each view and emphasize the common patterns shared by multi-view data.
2.3. Benefit three: wider range of applications
There is no doubt that all the multi-view clustering methods can be applied to single-view data. However, many clustering tasks are impossible to implement by single-view clustering due to its limitations. For example, data with multiple modalities is becoming more and more common and heterogeneous information networks are gaining increasing popularity as well. These types of data naturally fit into multi-view learning, while cannot be settled by single-view learning methods appropriately. In all, the complementary property among multi-view data can overcome the limitations of single-view data and expand their application areas.
3. Multi-view clustering methods
Due to the widespread use of multi-view datasets in practice, many realistic applications are accomplished by multi-view learning methods, such as community detection in social networks, image annotation in computer vision, and cross-domain user modeling in recommendation systems . Meanwhile, based on the seminal work of Bickel and Scheffer , plenty of multi-view clustering methods have been proposed [2, 3, 5]. As explained in Section 1, this chapter seeks to review five kinds of typical clustering methods and their multi-view versions, which include -means, spectral clustering, matrix factorization, tensor decomposition, and deep learning. All of these five methods are popular methods for single-view clustering. Although there are some other multi-view clustering methods not contained in this chapter, such as the canonical correlation analysis (CCA)-based multi-view clustering methods , the DBSCAN-based multi-view clustering methods , and the lower dimensional subspace-based multi-view clustering methods , most of them can be unified into the frameworks of these five involved methods. For instance, a pair-wise sparse subspace representation model for multi-view clustering proposed in  can be unified into the framework of matrix factorization.
3.1. Multi-view clustering via -means
-means is one of the most popular clustering algorithms with a history of more than 50 years . Except for its simplicity, -means has a good potential to deal with large-scale datasets. Owing to these properties, -means has been successfully used in various topics, including computer vision, social network analysis, and market segmentation, to name but a few. Although it has been studied deeply over the past few decades, many variants of -means are put forward continuously [12, 13, 14, 15].
3.1.1. Preliminaries of -means
As a classic clustering algorithm, -means employs prototype vectors (i.e., centers or centroids of the clusters) to characterize each data sample and minimizes a sum of squared loss function to find these prototypes. Consider a dataset denoted by , where represents the attribute (feature) vector of the -th data sample . In order to partition the dataset into disjoint clusters, denoted by , -means tries to optimize the following objective function:
where is an indicator variable with if and 0 otherwise and is the -th prototype vector, i.e., the -th cluster center.
As can be seen, Eq. (1) adopts the Euclidean distance to measure the similarities between data samples. However, there are many data structures or data distributions in real world. Thus, it is not always suitable to apply this basic form of -means to accurately identify the hidden patterns of datasets. What is more, some datasets may be not separable in the low-dimensional space. Recently, kernel method has been of wide concern in the field of machine learning. By introducing a kernel function, the original nonlinear datasets are mapped to a higher dimensional reproducing kernel Hilbert space. In the new space, the datasets become linearly separable. For this reason, the kernel -means algorithm [16, 17] has been proposed. It is just a generalization of the standard -means algorithm and has the following objective function:
where is a nonlinear transformation function. Define a kernel function with . Then, Eq. (2) can be rewritten into the kernel form as below:
With the aid of the kernel function, there is no need to explicitly provide the transformation function . This is because, for certain kernel function, the corresponding transformation function is intractable. However, the inner products in the kernel space can be easily obtained according to the kernel function.
3.1.2. Basic form of multi-view -means
Both the -means and the kernel -means described above are designed for single-view data. To solve the multi-view clustering problem, some new objective functions should be developed. Assume that there are views in total. Let denote the data of all the views. It is obvious that different views should have different contributions according to their conveyed information. To achieve this goal, it is straightforward to modify the standard -means to make it applicable in the multi-view environment with a new objective function as follows:
where is the weight factor for the -th view, is a parameter used to control the weight distribution, and corresponds to the objective function (i.e., loss function) of the -th view:
Similarly, the objective function of the multi-view kernel -means can be obtained, which is omitted here. Note that finding the optimal solution of Eq. (4) is an NP-hard problem; thus, some iterative algorithms are developed according to the greedy strategy. One basic iterative algorithm works in a two-stage manner: (1) updating the clustering for given weights and (2) updating the weights for given clusters; see  for details.
Denote as the Frobenius norm of a given matrix , i.e., . Then, Eq. (4) can be easily transformed into a matrix form as shown in the following:
where denotes the centroid matrix for the -th view and denotes the clustering indicator matrix with the element being . Note that all the views share a common clustering indicator matrix .
3.1.3. Variants of multi-view -means
The basic formulations of multi-view -means shown in Eqs. (4) and (6) do have some drawbacks. For example, it assumes that all the views are sharing a common clustering indicator matrix . However, the structure information contained may be very limited or even lost in some views. In such case, the performance will be severely affected if all the views share a common clustering indicator matrix. To tackle the issues, many variants of multi-view -means clustering algorithms have been proposed in recent years. Instead of the -norm, the structured sparsity-inducing norm, i.e., the -norm, is adopted to strengthen the basic multi-view -means, in the hope that the effect of outlier data samples will be reduced . In , a -means-based dual-regularized multi-view outlier detection method (DMOD) is proposed to identify the cluster outliers and the attribute outliers simultaneously, which is based on a novel cross-view outlier measurement criterion. Moreover, in the DMOD model, each view is associated with a particular clustering indicator matrix, and another alignment matrix is introduced to enforce the consistency between different views. An automated two-level variable weighting clustering algorithm, called TW--means, is developed in . TW--means is able to compute weights for each view and each individual attribute simultaneously. More specifically, in this algorithm, to identify the compactness of the view, a view weight is assigned to each view, and an attribute weight is assigned to each attribute in the view to identify the importance of the attribute. Both view weights and attribute weights are employed in the distance function to determine the cluster structures of data samples. Similar strategies have also been taken in [22, 23] to learn more robust multi-view -means models.
As aforementioned, it is NP-hard to find the optimal solution of the multi-view -means clustering problem. The greedy iterative algorithm has a high risk of getting stuck in local optima during the optimization. Recently, the self-paced learning has been used to alleviate this problem. The general self-paced learning model consists of a weighted loss function on all data samples and a regularizer term imposed on the weights of data samples. By gradually increasing the penalty on the regularizer, more data samples are automatically added into consideration from “easy” to “complex” via a pure self-paced approach. In this, Xu et al.  present a new multi-view self-paced learning (MSPL) algorithm for clustering based on multi-view -means. MSPL learns the multi-view model by not only progressing from “easy” to “complex” data samples but also from “easy” to “complex” views. The objective function of MSPL is quite succinct, which is shown in Eq. (7).
where denotes the weights of data samples in the -th view, , and denotes the regularization term on demand.
3.2. Multi-view clustering via spectral clustering
Spectral clustering is built upon the spectral graph theory. In recent years, spectral clustering has become one of the most popular clustering algorithms and shown its effectiveness in various real-world applications ranging from statistics, computer sciences to bioinformatics. Due to its adaptation in data distribution, spectral clustering often outperforms traditional clustering algorithms such as -means. In addition, spectral clustering is simple to implement and can be solved efficiently by standard linear algebra.
3.2.1. Preliminaries of spectral clustering
Spectral clustering is closely related to the minimum cut problem of graphs. It first performs dimensionality reduction on the original data space by leveraging the spectrum of the similarity matrix of data samples and then performs -means on the low-dimensional space to partition data into different clusters. Therefore, for a set of data samples, a similarity matrix should be constructed at first. Typically, each data sample is treated as a node of a graph and each relationship between data samples is regarded as an edge in the graph. Besides, each edge is associated with a weight. It is obvious that the value of the edge weight between two far-away data samples should be low and the value between two close data samples should be high. For a given dataset , let be the generated undirected weighted graph, where denotes the set of nodes representing the data samples and denotes the set of edges representing the relationships between data samples. The similarity matrix is a symmetric matrix with each element representing the similarity between and . There are three popular approaches to construct graph , that is, the -neighborhood graph, the -nearest neighbor graph, and the fully connected graph (see details in ). To partition into disjoint subgraphs (clusters), the minimum cut problem requires that the edge weights across different clusters are as small as possible, while the total edge weights within each cluster are as high as possible.
According to the above graph cut theory, two popular versions of spectral clustering are developed, i.e., the ratio cut (RatioCut) and the normalized cut (Ncut). The classical relaxed form of the RatioCut  is shown as below:
where computes the trace of a matrix, is the clustering indicator matrix, is an identity matrix, and is the graph Laplacian matrix defined as . Here, is a diagonal matrix with . The objective function of Ncut  is similar to Eq. (8) by replacing by the normalized Laplacian matrix . Both RatioCut and NCut can be solved efficiently by the eigenvalue decomposition (EVD) of or .
3.2.2. Basic form of multi-view spectral clustering
Multi-view spectral clustering is able to learn the latent cluster structures by fusing the information contained in multiple graphs. Similar to multi-view -means, it is not hard to extend the basic spectral clustering to the multi-view environment. Given a dataset with views, graphs and the corresponding Laplacian matrices can be constructed.
Kumar et al.  firstly present a multi-view spectral clustering approach, which has a flavor of co-training idea widely used in semi-supervised learning. It follows the consistency of multi-view learning that each view gives the same labels for all data samples. So it can use the eigenvector of one view to “label” another view and vice versa. For example, via computing two views’ eigenvectors, say and , the clustering result of can be used to modify the graph similarity matrix , and then the clustering result of can be used to modify the graph similarity matrix . For more than two views, the same strategy can be applied. Kumar et al.  further propose a multi-view spectral clustering approach using co-regularization idea that makes the clustering results of different views agree with each other. The co-regularization form is stated as the disagreement between clustering results of two views: . Then the goal is to minimize the disagreement to achieve the consistency between views with the following objective function:
where represents the degree of disagreement between the -th view and the -th view. From another perspective, all the views sharing a common indicator matrix is also rational according to the consistency requirement. So the model in Eq. (9) can be rewritten as
where controls the degree of disagreement between and .
3.2.3. Variants of multi-view spectral clustering
The basic form of multi-view spectral clustering achieves the basic goals of multi-view learning. However, some issues have not yet been considered. For instance, the weight parameter in Eq. (9) needs to be set manually. To make up this issue, it is necessary to adaptively compute the weight of each view. Xia et al.  assume that each view has a weight representing its importance and the weight distribution should be sufficiently smooth. They further consider a unified indicator matrix across all views, which can be fulfilled via exploring the complementary property of different views. To this end, they develop a novel model as follows:
The model above needs a manually specified parameter to adjust the weights of different views, which is sometimes intractable. Thus, Nie et al.  propose a parameter-free auto-weighted multiple graph learning method (AMGL), wherein the weight parameter is replaced by . Thus, AMGL does not require additional parameters, and can be self-updated. To avoid the considerable noise in each view which often degrades the performance severely, Xia et al.  propose a robust multi-view spectral clustering (RMSC) method via low-rank and sparse decomposition. In RMSC, a novel Markov chain is designed for dealing with the noise. First, the similarity matrix and the corresponding transition probability matrix are computed. Then, the row-rank latent transition probability matrix and the deviation error matrix are constructed via low-rank and sparse decomposition. Finally, based on the transition probability matrix , the standard Markov chain method is applied for partitioning data into clusters. Note that the methods above have a high cost in optimization computation. There are numerous variables that need to be updated and the derivation process is also extremely complex during the optimization. To overcome this limitation, Chen et al.  present a novel variant of the Laplacian matrix named block intra-normalized Laplacian defined as follows, without the linear combination of multiple Laplacian matrices.
where denotes the within Laplacian matrix of views and denotes the across Laplacian matrix between different views. Based on , the block intra-normalized Laplacian matrix is then defined as , where is a block diagonal matrix with the -th block being . By proving that the multiplicity of the zero eigenvalue of the constructed block Laplacian matrix is equal to the number of clusters , the eigenvectors of the block Laplacian matrix can be used for clustering via the classical form of spectral clustering. At the end, the lower and upper bounds of the optimal solution are also established. See  for more details.
3.3. Multi-view clustering via matrix factorization
In the fields of data mining and machine learning, matrix factorization (MF) is an effective latent factor learning model. Given a data matrix , MF tries to find two low-rank factor matrices and whose multiplication can well approximate it, i.e., . MF has shown many promising applications in real world, such as information retrieval, recommendation system, signal processing, document analysis, and so on. Usually, the nonnegativity constraints are enforced to the factor matrices to promote the interpretability of the MF models. Therefore, in this part, we focus on the introduction of the nonnegative MF (NMF)-related clustering models. For a comprehensive review of NMF-based models and applications, please refer to .
3.3.1. Preliminaries of matrix factorization
As is well known, there are many matrix factorization models, including the singular value decomposition, Cholesky decomposition, LU decomposition, QR decomposition, and Schur decomposition. These factorization models either have too strict restrictions on the factor matrices or lack the ability to be applied to data analysis. Due to the wide applications of NMF in recommending systems, NMF has drawn much attention in both academia and industry. In fact, NMF can be regarded as an extension of the standard -means algorithm by relaxing the constraints imposed on the clustering indicator matrix. For a given dataset , NMF seeks to learn a basis matrix and a coefficient matrix via optimizing the following objective function:
where can be considered as the cluster centroid matrix and can be treated as a “soft” clustering indicator matrix. The objective function above is not convex in and ; therefore, it is impractical to find the global optima. Typically, there are two methods to solve Eq. (13). The first one is the gradient descent method . The other one is the multiplicative method  where the iterative updating rules are as follows:
where and denote the element-wise multiplication and division, respectively. It is noteworthy that there are many other criterions to measure the difference between and , such as the -norm, the -norm, and the Kullback-Leibler divergence (a.k.a. relative entropy). For these criterions, the updating rules can be derived similarly.
3.3.2. Basic form of multi-view matrix factorization
The hypothesis behind multi-view clustering is that different views should admit the same underlying clustering structures of the datasets. That is, the coefficient matrices learned from different views should be as consistent as possible. To this end, a soft regularization term is introduced to enforce the coefficient matrices of different views toward a common consensus . For a given dataset with views, the following objective function can be derived to partition into clusters:
where is a consensus matrix that characterizes the intrinsic clustering structures of datasets among all views and is the parameter used to tune both the relative importance of different views and the contribution between the first reconstruction error term and the second disagreement term. Note that Eq. (15) does not require that all the views share a common ; thus, this model is more robust to low-quality views, i.e., the effect of low-quality views is reduced by setting the corresponding to be small enough.
Instead of enforcing a rigid common consensus constraint on all the views as in Eq. (15), another form of basic multi-view NMF for clustering is the pair-wise CoNMF model , which imposes similarity constraints on each pair of views. Through the pair-wise co-regularization, it is expected that the coefficient matrices learned from two views can complement with each other during the factorization process. And therefore, high-quality clustering results can be yielded. The co-regularization objective function of the pair-wise CoNMF model is defined intuitively as follows:
where is the parameter employed to combine the factorization of different views and is the parameter used to denote the weight of similarity constraint on and . As the column vector of the coefficient matrix represents a cluster, when adopting the vector-based -norm, each element of gives the cosine similarity between two clusters. Obviously, in the multi-view environment, the cluster similarity between different views should also be consistent, which results in the cluster-wise CoNMF model. Cluster-wise CoNMF replaces the pair-wise regularization term in Eq. (16) by the following cluster-wise regularization term:
Similar to the optimization of the standard single-view NMF model, all the three basic multi-view NMF clustering models can be optimized via the multiplicative updating rules.
3.3.3. Variants of multi-view matrix factorization
As the locality preserving learning and the manifold learning have been shown very important to promote the performance of clustering algorithms, Cai et al.  propose a graph (or manifold) regularized NMF model GNMF for single-view clustering with satisfying performance. Note that the aforementioned multi-view NMF models cannot preserve the local geometrical structures of the samples. To overcome this limitation, a multi-manifold regularized NMF model (MMNMF) is proposed in . MMNMF incorporates consensus manifold and consensus coefficient matrix with multi-manifold regularization to preserve the local geometrical structures of the multi-view data space. The multi-manifold regularization has also been considered in . Moreover, the correntropy-induced metric (CIM) is adapted to measure the reconstruction error, since CIM has achieved excellent performance in many applications. CIM is also insensitive to large errors that are mainly introduced from heavy noises. A much simpler formulation of the manifold regularized multi-view NMF model is developed in . Without the explicit constraint that enforces a rigid common manifold consensus, an auxiliary matrix is involved to add constraints on the column sums of the basis matrix such that the coefficient matrix is comparable. A weighted extension of multi-view NMF is presented in  to address the image annotation problem. In this model, two weight matrices are introduced. One weight matrix is used to bias the factorization toward improved reconstruction for rare tags. The other weight matrix gives more weight to images containing rare tags and is applied to all views. A weighted extension of the pair-wise CoNMF model has also been developed in  to handle those attributes that are unobserved in each data sample so as to resolve the sparseness problem in all views’ matrices. For the realistic cases that many views suffer from missing of some data samples resulting in many partial examples, Li et al.  firstly devise a partial multi-view clustering method to handle this problem. A multi-incomplete-view clustering method MIC  is also designed to deal with the incompleteness of the views. MIC is built upon the weighted NMF model with a -norm regularization. Zhang et al.  further propose a constrained multi-view clustering algorithm for unmapped data in the framework of NMF. The proposed algorithm uses inter-view constraints to establish the connections between different views.
Due to its great interpretability and high efficacy, NMF has been widely employed for graph clustering . In such setting, the data matrix is replaced by the adjacency matrix . In many applications, graph data may be collected from heterogeneous domains or sources. Integrating multiple graphs has been shown to be a promising approach to improve the graph clustering accuracy. Clearly, multi-view NMF is suitable for multiple graph processing. In , a flexible and robust NMF-based framework, named co-regularized graph clustering (CGC), is developed to address the multi-domain graph clustering problem. CGC supports many-to-many cross-domain node relationships, and it also incorporates weights on cross-domain relationships. Besides, CGC allows partial cross-domain mapping so that graphs in different domains may have different sizes. Considering the fact that in many real-world applications, different graphs have different node distributions, the assumption that the multiple graphs share a common clustering structure does not hold. Given this, Ni et al.  develop a novel two-phase clustering method NoNClus, based on the NMF framework. At first, a main graph is constructed via modeling the similarity between different domains. Then, the main graph is utilized to regularize the clustering structures in different domain-specific graphs. In the NonClus model, multiple underlying clustering structures can co-exist among domain-specific graphs, while for similar domains, the corresponding clustering structures should be as consistent as possible.
3.4. Multi-view clustering via tensor decomposition
In this part, we analyze multi-view clustering from a multilinear algebra perspective and present several novel multi-view clustering algorithms (note that the notations used in this part are self-contained). Tensor is known as a multidimensional matrix or multiway array . In multi-view research field, data can be naturally modeled as a third-order tensor with objects, features, and view dimensions. An intuitive way is to compact different views along the view dimension of the tensor (see Figure 1). Another widely adopted way is to transform each feature matrix to a similarity matrix before compacting them.
3.4.1. Preliminaries of tensor decomposition
In the field of data mining and machine learning, tensor decomposition is an emerging and effective tool for processing multi-view data. In this section, some basic knowledge on tensors and tensor decomposition methods is provided. We refer the readers to [51, 52] for a comprehensive understanding of these topics.
Let be an -order tensor of size . The mode-matricization of is denoted as an matrix , which is obtained by arranging the mode-fibers to be the columns of the matrix . The -mode multiplication can be manipulated as matrix multiplication , where and . The Frobenius norm of a tensor is the sum of the squares of all its elements . The tensor is a rank-one tensor if it can be written as the outer product of vectors, i.e., , where represents the vector outer product.
184.108.40.206. CP decomposition
The idea of expressing tensor as the sum of a number of rank-one tensors comes from the study of Hitchcock . Then, Cattell  proposed the idea of parallel proportional analysis. The popular CP decomposition comes from the ideas of Carroll and Chang  (canonical decomposition) and Harshman  (parallel factors). Taking a third-order tensor as an example, the CP decomposition tries to approximate tensor with components of rank-one tensor, i.e.,
where , , and . For simplicity, we denote , , , and as the CP decomposition of .
220.127.116.11. Tucker decomposition
The idea of Tucker decomposition is introduced by Tucker . The Tucker decomposition is a form of higher-order singular value decomposition (HOSVD) . It decomposes a tensor into a core tensor multiplied by several orthogonal matrices along each mode, i.e.,
The cutting-edge technique for calculating the factor matrices is proposed in .
3.4.2. Tensor decomposition-based multi-view clustering
In multi-view clustering, the goal is to find out some meaningful group of objects from the data. The above CP decomposition naturally divides the multi-view data into several components, which can be seen as the clusters. Thus, it can be directly applied to solve multi-view clustering problems. For a given dataset with views, where of each view takes value from , can be formulated as a third-order tensor . In this part, a variant of CP decomposition is introduced first, which is quite straightforward. Then we shed light on the relations between several classic multi-view spectral clustering methods and the Tucker decomposition.
18.104.22.168. Total variation based CP (TVCP)
In some clustering problems, a consecutive range of time points is non-negligible. For example, in the dataset with authors, publications, and a sequence of time points, we are interested in figuring out which group of authors work in the same topics during a period of time. Chen et al.  propose a total variation based tensor decomposition method (TVCP) for the constraint on a period of consecutive time points. The total variation regularizes the time factor to obtain a piece-wise constant function w.r.t. time points. Owing to the piece-wise constant function, the decomposition can be relatively consistent in a cluster and separated between clusters. The TVCP model is formulated as follows:
where is the first-order difference matrix such that and for , and the other elements are zeros, is a positive regularization parameter, and denotes the -norm. The first term corresponds to the CP decomposition of , and the second term constrains the time mode () to be a piece-wise constant function.
22.214.171.124. Relations between Tucker decomposition and spectral clustering
Liu et al.  propose a framework of multi-view clustering via tensor decomposition, mainly the Tucker decomposition. According to the framework, the common type of multi-view spectral clustering is equivalent to a Tucker decomposition problem as follows:
Another form of multi-view spectral clustering can also be written as a Tucker problem:
With this framework, variety of spectral clustering problems can be solved by a tensor decomposition algorithm. We can see the strong connection between them as well as the strong capability of tensor methodology.
Canonical correlation analysis is designed to inspect the linear relationship between two sets of variables . In multi-view learning, a typical approach is to maximize the sum of pair-wise correlations between different views . Without loss of high-order correlations, Luo et al.  propose a tensor canonical correlation analysis (TCCA), which is equivalent to CP decomposition of the correlation tensor. Khan et al.  propose a Bayesian extension of CP decomposition for multiple coupled tensors sharing common latent factors.
3.5. Multi-view clustering via deep learning
With the third wave of artificial intelligence, deep learning is gaining increasing popularity in recent years. Deep learning has demonstrated excellent performance in many real-world applications, such as face recognition, image annotation, natural language processing, object detection, customer relationship management, and mobile advertising. Typically, deep learning models are composed of multiple nonlinear transformations and thus can learn a better feature representation than traditional shallow models . However, deep learning requires labeled training data to learn the models, which limits its application in data clustering for the reason that training data with cluster labels are not available in many cases. Despite the hardness, there are some works devoted to adjusting shallow clustering models for deep learning. Here, we introduce two popular deep clustering models and their extensions to the multi-view environment.
3.5.1. Deep auto-encoder
An auto-encoder  is an artificial neural network adopted for unsupervised learning, the goal of which is to learn a representation for each data sample. An auto-encoder always consists of two parts: the encoder and the decoder. The encoder plays the role of a nonlinear mapping function that can map each data sample to a representation space. The decoder demands accurate data reconstruction from the representation generated by the encoder. Auto-encoder has been shown to be similar to spectral clustering in theory; however, it is more efficient and flexible in practice. The auto-encoder can be easily deepened via adding more encoder layers and corresponding decoder layers. Figure 2 (a) gives an example of the framework of the deep auto-encoder.
Although auto-encoder can learn a compact representation for each data sample, it contributes little to clustering since it does not require that the representation vectors of similar data samples should also be similar. To make the learned feature representation better capture the cluster structures, many variants of deep auto-encoder models have been proposed. In , a novel regularization term that is similar to the objective function of -means is introduced to guide the learning of the mapping function. In this way, the learned feature representation is more stable and suitable for clustering. In , a deep embedded clustering method is proposed to simultaneously learn feature representations and cluster assignments using deep auto-encoders. These deep clustering models are designed for single-view data. For deep multi-view clustering, the learned feature representations should not only capture the cluster structure of each single view but also implement a consensus between different views. To this end, a common encoder is utilized to extract the shared feature representation for all views, and different decoders are used to reconstruct view-specific input data samples . In , an extension of CCA based on deep neural networks is proposed to learn a shared representation of two views. In fact, the feature representations of the two views are not exactly the same, but their correlations are maximized. Following this line, the deep canonically correlated auto-encoder (DCCAE) is developed in . DCCAE simultaneously optimizes the canonical correlation between the learned feature representations and the reconstruction errors of the auto-encoders. Benton et al.  further extend the deep CCA model for multiple views.
3.5.2. Deep matrix factorization
Another line of developing deep clustering models is deepening the MF models. As shown earlier, MF, especially NMF, has demonstrated outstanding performance in many applications. Thus, it is worth building a deep structure for MF in the hope that better feature representations can be obtained to facilitate clustering. Figure 2(b) illustrates an example of the framework of the deep MF models. Compared to the deep auto-encoders, both deep MF and deep auto-encoders are trying to minimize the reconstruction errors. However, unlike deep auto-encoders, the mapping function of deep MF is linear.
The first nonnegative deep network based on NMF is proposed in  for speech separation. This architecture can be discriminatively trained for optimal separation performance. Then Li et al.  propose a novel weakly supervised deep MF model to uncover the latent image representations and tag representations embedded in the latent subspace by collaboratively exploring the weakly supervised tagging information, the visual structure, and the semantic structure. In , a deep semi-NMF model is further developed for learning latent attribute representations. Semi-NMF is a popular variant of NMF by relaxing the factorized basis matrix to be real-valued. This practice makes semi-NMF have much wider applications than NMF since the datasets in real world may contain complex information, for instance, the attributes may be mix-signed. Considering the fact that these deep MF models are trying to factorize the basis matrix hierarchically alone, Qiu et al.  further propose a deep orthogonal NMF model which can decompose the coefficient matrix hierarchically. This model is able to learn higher-level representations for clusters. These deep MF models have achieved great success in data clustering for single-view data. However, they are seldom utilized for multi-view clustering. A recent work  attempts to extend the deep semi-NMF model for multi-view clustering, which can dissemble unimportant factors layer by layer and generate an effective consensus representation in the last layer. Another work  proposes to address the incomplete multi-view clustering problem via deep semantic mapping. The proposed model first projects all incomplete multi-view data to a unified representation in a common subspace, which is further executed by standard shallow NMF for clustering.
4. Open datasets
No one can make bricks without straw. In this section we will first list two kinds of open datasets that can be used in multi-view clustering, i.e., feature-based and graph-based datasets. Then we will discuss the performance of multi-view clustering on them briefly.
4.1. Feature-based datasets
Audio genre  consists of 1886 audio tracks classified into 9 music genres, which are Blues, Electronic, Jazz, Pop, Rap/HipHop, Rock, Folk/Country, Alternative, and Funk/Soul. Forty-nine low-level audio features have been extracted and they are grouped into 15 vector spaces.
NUS-WIDE  is a web image dataset composed of 269,648 images, 5018 related tags, and 81 ground-truth concepts. Six types of low-level features have been extracted: 64-D color histogram, 144-D color correlogram, 73-D edge direction histogram, 128-D wavelet texture, 225-D block-wise color moments extracted over 5 5 fixed grid partitions, and 500-D bag of words based on SIFT descriptions.
UCF101  consists of 101 human action classes. These actions can be divided into five types: human-object interaction, body-motion only, human-human interaction, playing musical instruments, and sports. There are over 13,000 clips and 27 hours of video data in it.
Handwritten numerals  is composed of 2000 handwritten digits which are divided into 10 classes. Four types of feature sets have been extracted: Zernike moments, Karhunen-Loeve features, Fourier descriptors, and image vectors. For Zernike set, it has 47 rotation invariant Zernike moments and 6 morphological features. For Fourier set, it has 76 two-dimensional shape descriptors. Both Zernike and Fourier feature sets are rotation invariant. For Karhunen-Loeve set, it has 64 Karhunen-Loeve transform which corresponds to the projection of images onto the eigenvectors of a covariance matrix.
4.2. Graph-based datasets
DBLP coauthorship  is a coauthorship network composed of 10,305 authors. There are 617 layers in it, each layer representing different publication categories.
Facebook  is a three-layer social network composed of 1640 users with multiple types of ties. The first layer shows whether two users are friends. The second layer shows whether users are in a same group. The third layer shows whether users are in the same photos uploaded by users.
CiteSeer  consists of 3312 scientific publications classified into 6 classes, which are Agents, AI, DB, IR, ML, and HCI. It can be represented as an annotated network, where nodes represent scientific publications and links represent the citation relationships. For each node, there is a 3703-dimensional one-hot encoding vector representing the absence/presence of key words.
Enron e-mail  consists of 184 users and 44 layers. Although it is a temporal network, it can be considered as a multi-layer network. Each layer represents communication in different months.
4.3. Performance on different datasets
For feature-based datasets, when confronted with the situation where we need to reconstruct the views, the performance of classical methods, like deep learning, is not promising. But multi-view clustering can give satisfactory results under this condition. In some cases, classical methods can also give good performance for feature-based datasets where all features are descriptions of the same object from different perspectives. For graph-based datasets, multi-view clustering naturally fits into them since different graphs can be processed by different views.
For both feature-based and graph-based datasets, when the scale of datasets becomes significantly large, most multi-view clustering methods have the potential to outperform other clustering methods on speed. For example, multi-view matrix factorization is quite suitable to parallel process.
5. Open issues
Although multi-view clustering has demonstrated its superiority over single-view clustering in many applications, there are still many open issues deserving much more attention from both academia and industry. Several vital open issues are summarized in this part.
5.1. View construction
Although there are many typical methods to construct views, they all have their own drawbacks. It is well known that if we cannot extract valuable information from the original data and put it into different views appropriately, the performance will be highly limited no matter how delicate the algorithm is. So it is important to find efficient ways of constructing and evaluating multiple views.
5.2. Incomplete view
When constructing different views, we may find that for some views, the information is not complete. In other words, even though we know how to construct views appropriately, we do not have enough information to do it, which is very common in practical problems. In real world, it is very difficult to ensure the completeness of data. This unbalanced relationship between complete views and incomplete views could cause huge problems. Moreover, these incomplete views may influence views with complete information. To solve it, one possible way is to construct these lost information from other views.
5.3. Single-view to multi-view
In multi-view learning, sometimes researchers will convert single-view data into multiple views and apply relevant algorithms on them. In practice, it may give good performance, but there are few theoretical researches on the proof of its reliability. Since the original data is single view, it is important to make it clear: is it necessary to complicate a simple task? We should not only focus on the final performance, the trade-off between cost and benefit is also important.
5.4. Deep leaning in multi-view
Deep learning has shown remarkable performance in many fields. One common way to deal with data composed of different types of sources is to combine them together and then feed them into a deep learning model. It often works well. Although multi-view learning seems to be a more reasonable way to deal with data composed of different types of sources, there is no evidence showing that multi-view learning has an obvious advantage over deep learning. Another issue is that when using deep learning in multi-view learning, we need to train different neural networks for different views separately. This method has two drawbacks. One is that the number of neural networks depends on the number of views. When there are many views, the calculation is huge. The other is that it fails to unify different views during training.
Multi-view clustering has demonstrated variety of real-world applications, such as community detection in social networks, image annotation in computer vision, cross-domain user modeling in recommendation systems, and protein interaction analysis in bioinformatics. This chapter provides a comprehensive review of the typical multi-view clustering methods and their corresponding recent developments by focusing on five most typical and popular clustering methods, which include -means, spectral clustering, matrix factorization, tensor decomposition, and deep learning. The basic forms of these five clustering methods are introduced in detail, followed by a substantial overview of their recent developments. Several open datasets and open issues are discussed in the end, which deserves more attention to facilitate the future research of multi-view clustering.
In the field of multi-view clustering, there are many algorithms whose source codes are exposed by their authors. For example, the co-training1 and co-regularization2 methods of classical multi-view spectral clustering are open in GitHub with MATLAB. The variants MSE3 and AMGL4 are also implemented by MATLAB.
- http://www.escience.cn/people/fpnie/index.html;jsessionid = 253C211B5AEDB8C09865FFEAEAACFB73-n1