A Semi-Supervised Clustering Method Based on Graph Contraction and Spectral Graph Theory



Introduction
Semi-supervised learning is a machine learning framework in which learning is conducted by utilizing a small amount of labeled data together with a large amount of unlabeled data (Chapelle et al., 2006). It has recently been studied intensively in the data mining and machine learning communities. One reason is that it can alleviate the time-consuming effort of collecting "ground truth" labeled data while sustaining relatively high performance by exploiting a large amount of unlabeled data. Blum & Mitchell (1998) showed the PAC learnability of semi-supervised learning, especially for classification problems.
On the other hand, data clustering, also called unsupervised learning, is a method of creating groups of objects, or clusters, in such a way that objects in one cluster are very similar and objects in different clusters are quite distinct. Clustering is one of the most frequently performed analyses (Jain et al., 1999). For example, in web activity logs, clusters can indicate navigation patterns of different user groups. Another direct application is the clustering of gene expression data so that genes within the same group exhibit similar behavior.
Although labeled data is not required in clustering, constraints on data assignment are sometimes available as domain knowledge about the data to be clustered. In such a situation, it is desirable to utilize the available constraints as semi-supervised information to improve the performance of clustering (Basu et al., 2008). By regarding constraints on data assignment as supervised information, various research efforts have been conducted on semi-supervised clustering (Basu et al., 2004; 2008; Li et al., 2008; Tang et al., 2007; Xing et al., 2003). Although various forms of constraints can be considered, based on the previous work (Li et al., 2008; Tang et al., 2007; Wagstaff et al., 2001; Xing et al., 2003), we deal with the following two kinds of pairwise constraints in this chapter: must-link constraints and cannot-link constraints. The former are also called must-links, and the latter cannot-links.
When similarities among data instances are specified, the entire set of data instances can be represented as an edge-weighted graph by connecting each pair of instances with an edge weighted by the corresponding similarity. In this chapter we present our semi-supervised clustering method based on graph contraction in general graph theory and graph Laplacian in spectral graph theory. The graph representation makes it possible to deal with the two kinds of pairwise constraints as well as pairwise similarities over a unified representation. The graph is then modified by contraction in graph theory (Diestel, 2006) and graph Laplacian in spectral graph theory (Chung, 1997; von Luxburg, 2007) to reflect the pairwise constraints.
Representing the relations (both pairwise constraints and similarities) among instances as an edge-weighted graph, and modifying the graph structure based on the specified constraints, makes it possible to enhance semi-supervised clustering. In our approach, the data instances are projected onto a subspace which is constructed with respect to the modified graph structure, and clustering is conducted over the projected representation of the instances. Although our approach utilizes the graph Laplacian as in (Belkin & Niyogi, 2002), it differs from previous ones in that pairwise constraints for semi-supervised clustering are also utilized for constructing the projected data representation (Yoshida, 2010; Yoshida & Okatani, 2010).
We report the performance evaluation of our approach and compare it with other state-of-the-art semi-supervised clustering methods in terms of accuracy and running time. Extensive experiments are conducted over real-world datasets. The results are encouraging and indicate the effectiveness of our approach. In particular, our approach can leverage a small number of pairwise constraints to increase the performance. We believe that this is a good property in the semi-supervised learning setting.
The rest of this chapter is organized as follows. Section 2 explains the framework of semi-supervised clustering. Section 3 explains the details of our approach for clustering under pairwise constraints. Section 4 reports the performance evaluation over various document datasets. Section 5 discusses the effectiveness of our approach. Section 6 summarizes our contributions and suggests future directions.

Preliminaries
Let $X$ be a set of instances. For a set $X$, $|X|$ represents its cardinality. A graph $G = (V, E)$ consists of a finite set of vertices $V$ and a set of edges $E$ over $V \times V$. The set $E$ can be interpreted as representing a binary relation over $V$. A pair of vertices $(v_i, v_j)$ is in the binary relation defined by a graph $G = (V, E)$ if and only if $(v_i, v_j) \in E$.
An edge-weighted graph $G = (V, E, W)$ is defined as a graph $G = (V, E)$ with a weight on each edge in $E$. When $|V| = n$, i.e., the number of vertices in the graph is $n$, the weights in $W$ can be represented as an $n \times n$ matrix $W$, where $w_{ij}$ stands for the weight on the edge for the pair $(v_i, v_j) \in E$. $W_{ij}$ also stands for the element $w_{ij}$ of the matrix. We set $w_{ij} = 0$ for pairs $(v_i, v_j) \notin E$. In addition, we assume that $G = (V, E, W)$ is an undirected, simple graph without self-loops. Thus, the weight matrix $W$ is symmetric and its diagonal elements are zeros.

Clustering
In general, clustering methods can be divided into two approaches: hierarchical methods and partitioning methods (Jain et al., 1999). Hierarchical methods construct a cluster hierarchy, or a tree of clusters (called a dendrogram), whose leaves are the data points and whose internal nodes represent nested clusters of various sizes (Guha et al., 1998). Hierarchical methods can be further subdivided into agglomerative and divisive ones. On the other hand, partitioning methods return a single partition of the entire data under fixed parameters (number of clusters, thresholds, etc.). Each cluster can be represented by its centroid (k-means algorithms (Hartigan & Wong, 1979)), or by one of its instances located near its center (k-medoid algorithms (Ng & Han, 2002)). For an overview of various clustering methods, please refer to (Jain et al., 1999).
When pairwise similarities among instances are specified, the entire data can be represented as an edge-weighted graph. Various graph-theoretic clustering approaches have been proposed to find subsets of vertices in a graph based on the edges among the vertices. Several methods utilize graph coloring techniques (Guénoche et al., 1991; Yoshida & Ogino, 2011). Other methods are based on the flow or cut in a graph, such as spectral clustering (von Luxburg, 2007). A graph-based spectral approach is also utilized in information-theoretic clustering (Yoshida, 2011).

Semi-supervised clustering
When the auxiliary or side information for data assignment in clustering is represented as a set of constraints, the semi-supervised clustering problem is (informally) described as follows.
Problem 1 (Semi-Supervised Clustering). For a given set of data $X$ and specified constraints, find a partition (a set of clusters) $T = \{t_1, \ldots, t_k\}$ which satisfies the specified constraints.
There can be various forms of constraints. Based on the previous work (Li et al., 2008; Tang et al., 2007; Wagstaff et al., 2001; Xing et al., 2003), we consider the following two kinds of constraints defined in (Wagstaff et al., 2001):
Definition 1 (Pairwise Constraints). For given data instances $X$ and a partition (a set of clusters) $C = \{c_1, \ldots, c_k\}$, must-link constraints $C_{ML}$ and cannot-link constraints $C_{CL}$ are sets of pairs such that:

$$(x_i, x_j) \in C_{ML} \Rightarrow \exists c \in C, \; (x_i \in c \wedge x_j \in c) \tag{1}$$

$$(x_i, x_j) \in C_{CL} \Rightarrow \exists c_a, c_b \in C, \; c_a \neq c_b, \; (x_i \in c_a \wedge x_j \in c_b) \tag{2}$$

Intuitively, must-link constraints (also called must-links in this chapter) specify the pairs of instances that belong to the same cluster, and cannot-link constraints (also called cannot-links) specify the pairs of instances that belong to different clusters.
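To make the two kinds of pairwise constraints concrete, the following minimal sketch checks whether a given cluster assignment respects a set of must-links and cannot-links. The function name `satisfies_constraints` and the dictionary-based assignment are illustrative assumptions, not part of the method described in this chapter.

```python
def satisfies_constraints(assignment, must_links, cannot_links):
    """Return True if the assignment respects every pairwise constraint.

    assignment: dict mapping each instance to its cluster label (hypothetical format).
    must_links / cannot_links: iterables of instance pairs.
    """
    # Every must-link pair has to share a cluster.
    same = all(assignment[a] == assignment[b] for a, b in must_links)
    # Every cannot-link pair has to be split across clusters.
    diff = all(assignment[a] != assignment[b] for a, b in cannot_links)
    return same and diff


assignment = {"x1": 0, "x2": 0, "x3": 1, "x4": 1}
ok = satisfies_constraints(assignment, [("x1", "x2")], [("x2", "x3")])
bad = satisfies_constraints(assignment, [("x1", "x3")], [])
print(ok, bad)
```

Here the first call is consistent with both constraints, while the second violates a must-link because `x1` and `x3` sit in different clusters.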

Graph-based semi-supervised clustering

A graph-based approach
Assuming that some similarity measure for the pairs of instances is specified, we have proposed a graph-based approach for the constrained clustering problem (Yoshida, 2010; Yoshida & Okatani, 2010). Based on the similarities, the entire set of data instances $X$ can be represented as an edge-weighted graph $G = (V, E, W)$ where $w_{ij}$ represents the similarity between a pair $(x_i, x_j)$. In our approach, each data instance $x \in X$ corresponds to a vertex $v \in V$ in $G$. Thus, we abuse the symbol $X$ to denote the set of vertices in $G$ in the rest of the chapter. Also, we assume that all $w_{ij}$ are non-negative.
Definition 1 specifies two kinds of constraints. For must-link constraints, our approach utilizes a method based on graph contraction in general graph theory (Diestel, 2006) and treats them as hard constraints (Section 3.2); for cannot-link constraints, our approach utilizes a method based on graph Laplacian in spectral graph theory (Chung, 1997; von Luxburg, 2007) and treats them as soft constraints under the optimization framework (Section 3.3). The overview of our approach is illustrated in Fig. 1.

Graph contraction for must-link constraints
When must-link constraints are treated as hard constraints, the transitive law holds among the constraints. This means that, for any two pairs $(x_i, x_j)$ and $(x_j, x_l) \in C_{ML}$, $x_i$ and $x_l$ should also be in the same cluster (however, the cluster label is not known). In order to enforce the transitive law in must-links, we utilize graph contraction in general graph theory (Diestel, 2006) and modify the graph $G$ for a data set $X$ based on the specified must-links.
Definition 2 (Contraction). Let $e = (x_i, x_j)$ be an edge of a graph $G = (X, E)$. By $G/e$, we denote the graph $(X', E')$ obtained from $G$ by contracting the edge $e$ into a new vertex $x_e$, where:

$$X' = (X \setminus \{x_i, x_j\}) \cup \{x_e\} \tag{3}$$

$$E' = \{(u, v) \in E \mid \{u, v\} \cap \{x_i, x_j\} = \emptyset\} \cup \{(x_e, u) \mid (x_i, u) \in E \setminus \{e\} \text{ or } (x_j, u) \in E \setminus \{e\}\} \tag{4}$$

By contracting an edge $e$ into a new vertex $x_e$, the newly created vertex $x_e$ becomes adjacent to all the former neighbors of $x_i$ and $x_j$. Repeated application of contraction for all the edges (pairs of instances) in must-links guarantees that the transitive law in must-links is sustained in the cluster assignment.
As described above, the entire dataset $X$ is represented as an edge-weighted graph $G$ in our approach. Thus, after contracting an edge $e = (x_i, x_j) \in C_{ML}$ into the newly created vertex $x_e$, it is necessary to define the weights in the contracted graph $G/e$. The weights in $G$ represent the similarities among vertices. The original similarities should at least be sustained after contracting an edge in $C_{ML}$, since must-link constraints are meant to enforce the similarities, not to reduce them.
Based on the above observation, we define the weights in the contracted graph $G/e$ as:

$$w(x_e, u)' = \max\bigl(w(x_i, u), w(x_j, u)\bigr) \quad \text{if } (x_i, u) \in E \text{ or } (x_j, u) \in E \tag{5}$$

$$w(u, v)' = w(u, v) \quad \text{otherwise} \tag{6}$$

where $w(\cdot, \cdot)'$ stands for the weight in the contracted graph $G/e$. In eq.(5), the function max realizes the above requirement, and guarantees the non-decreasing property of similarities (weights) after contraction of an edge. On the other hand, the original weight is preserved in eq.(6).
For each pair in must-links, we apply graph contraction and define the weights in the contracted graph based on eq.(5) and eq.(6). This results in modifying the original graph $G$ into another graph $G' = (X', E', W')$. Note that the originally specified cannot-links also need to be modified during graph contraction with respect to must-links. The updated set of cannot-links over the created graph $G'$ is denoted as $C'_{CL}$.
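The contraction step can be sketched as follows, under a few stated assumptions. The sketch merges all instances connected by must-links with a union-find structure (which yields the same groups as contracting the must-link pairs one by one), keeps the maximum similarity between any two merged groups, and maps cannot-links onto the merged vertices. The edge representation (a dictionary keyed by `frozenset` pairs), the function names, and the choice to raise an error when a cannot-link falls inside a merged group are all illustrative assumptions; the chapter does not specify how such conflicts are handled.

```python
def find(parent, x):
    """Union-find lookup with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def contract_must_links(vertices, weights, must_links, cannot_links):
    """Contract all must-links; merged weights follow the max rule.

    weights: dict mapping frozenset({u, v}) -> non-negative similarity.
    Returns (groups, new_weights, new_cannot_links); each contracted vertex
    is represented by the frozenset of original vertices merged into it.
    """
    parent = {v: v for v in vertices}
    for a, b in must_links:                       # transitive closure of must-links
        parent[find(parent, a)] = find(parent, b)
    members = {}
    for v in vertices:
        members.setdefault(find(parent, v), set()).add(v)
    group_of = {v: frozenset(members[find(parent, v)]) for v in vertices}

    new_weights = {}
    for edge, w in weights.items():
        u, v = tuple(edge)
        gu, gv = group_of[u], group_of[v]
        if gu == gv:                              # edge disappears inside a merged vertex
            continue
        key = frozenset({gu, gv})
        new_weights[key] = max(w, new_weights.get(key, 0.0))

    new_cannot = set()
    for a, b in cannot_links:
        ga, gb = group_of[a], group_of[b]
        if ga == gb:                              # assumption: treat as inconsistent input
            raise ValueError("cannot-link contradicts the must-links")
        new_cannot.add(frozenset({ga, gb}))
    return set(group_of.values()), new_weights, new_cannot


weights = {
    frozenset({"a", "b"}): 0.2, frozenset({"a", "c"}): 0.9,
    frozenset({"b", "c"}): 0.4, frozenset({"c", "d"}): 0.5,
    frozenset({"b", "d"}): 0.7,
}
groups, new_w, new_cl = contract_must_links(
    ["a", "b", "c", "d"], weights, must_links=[("a", "b")], cannot_links=[("a", "d")]
)
ab, c, d = frozenset({"a", "b"}), frozenset({"c"}), frozenset({"d"})
print(new_w[frozenset({ab, c})], new_w[frozenset({ab, d})])
```

In this toy graph the merged vertex for `a` and `b` inherits the larger of the two similarities to `c` (0.9 rather than 0.4), so similarities never decrease through contraction.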

Spectral clustering
The objective of clustering is to assign similar instances to the same cluster and dissimilar ones to different clusters. To realize this, we utilize spectral clustering, which is based on the minimum cut of a graph. In spectral clustering (Ng et al., 2001; von Luxburg, 2007), data clustering is realized by seeking a function $f : X \to \mathbb{R}$ over the dataset $X$ such that the learned function assigns similar values to similar instances and vice versa. The values assigned for the entire dataset can be represented as a vector. By denoting the assigned value for the $i$-th data instance as $f_i$, data clustering can be formalized as an optimization problem to find the vector $f$ which minimizes the following objective function:

$$J_0 = \sum_{i,j} w_{ij} (f_i - f_j)^2 = 2 f^t L f \tag{7}$$

where $f^t$ is the transpose of the vector $f$, and the matrix $L$ is defined as:

$$D = \mathrm{diag}(d_1, \ldots, d_n), \quad d_i = \sum_j w_{ij} \tag{8}$$

$$L = D - W \tag{9}$$

where diag() in eq.(8) represents a diagonal matrix with the specified diagonal elements. The matrix $D$ in eq.(8) is the degree matrix of a graph, and is calculated based on the weights in the graph. The matrix $L$ in eq.(9) is called the graph Laplacian (Chung, 1997; Ng et al., 2001; von Luxburg, 2007). Some clustering method, such as kmeans (Hartigan & Wong, 1979) or spherical kmeans (skmeans) (Dhillon & Modha, 2001)², is applied to the constructed data representation of instances (Ng et al., 2001; von Luxburg, 2007).
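As a numerical illustration of the graph Laplacian, the sketch below builds $L = D - W$ for a small symmetric weight matrix and checks the standard identity $\sum_{i,j} w_{ij}(f_i - f_j)^2 = 2 f^t L f$, which links the pairwise smoothness objective to a quadratic form. The matrix values and the vector `f` are arbitrary toy data chosen for this example.

```python
import numpy as np


def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix."""
    D = np.diag(W.sum(axis=1))     # degree matrix: row sums on the diagonal
    return D - W


W = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.3],
              [0.1, 0.3, 0.0]])
f = np.array([1.0, 0.5, -2.0])
L = graph_laplacian(W)

# Pairwise form: sum over all ordered pairs (i, j).
pairwise = sum(W[i, j] * (f[i] - f[j]) ** 2 for i in range(3) for j in range(3))
# Quadratic form with the Laplacian.
quadratic = 2.0 * (f @ L @ f)
print(np.isclose(pairwise, quadratic))
```

Because each row of $L$ sums to zero, a constant vector gives an objective value of zero, which is why the trivial eigenvector is excluded when the embedding is constructed.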

Graph Laplacian for cannot-link constraints
We utilize the framework of spectral clustering described in Section 3.3.1. Furthermore, to reflect cannot-link constraints in the clustering process, we formalize the clustering under constraints as an optimization problem, and consider the minimization of the objective function in eq.(10), where $i$ and $j$ sum over the vertices in the contracted graph $G'$, and $C'_{CL}$ stands for the cannot-link constraints over $G'$. $\lambda \in [0, 1]$ is a hyper-parameter in our approach. The first term of the objective corresponds to the smoothness of the assigned values in spectral graph theory, and the second term represents the influence of cannot-links in the optimization. Note that by setting $\lambda \in [0, 1]$, the objective function in eq.(10) is guaranteed to be a convex function.

² skmeans is a standard clustering algorithm for high-dimensional sparse data.
From the objective function in eq.(10), we can derive the unnormalized graph Laplacian $L''$ which incorporates cannot-links (eq.(11)). The matrix $L''$ is defined based on a set of matrices, the first of which (eq.(12)) is defined piecewise and takes the value 0 outside its stated case; in these definitions, $\odot$ stands for the Hadamard product (element-wise multiplication) of two matrices.
The above process amounts to modifying the representation of the contracted graph $G'$ into another graph $G''$, with the modified weights $W''$ in eq.(13). Thus, as illustrated in Fig. 1, our approach modifies the original graph $G$ into the contracted graph $G'$ with must-link constraints, and then into another graph $G''$ with cannot-link constraints and similarities.
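The weight modification for cannot-links can be illustrated with a small sketch. It assumes one particular form that is consistent with the description here, namely $W'' = W' - \lambda (C' \odot W')$, where $C'$ is a 0/1 indicator matrix of the cannot-link pairs; with $\lambda \in [0, 1]$ all modified weights stay non-negative, so the resulting Laplacian remains positive semi-definite. This exact form, the name `cannot_link_weights`, and the integer vertex indices are assumptions made for illustration and should be checked against eq.(10) to eq.(13).

```python
import numpy as np


def cannot_link_weights(W, cannot_links, lam=0.5):
    """Assumed form: W'' = W' - lam * (C' * W'), C' = cannot-link indicator."""
    C = np.zeros_like(W)
    for u, v in cannot_links:
        C[u, v] = C[v, u] = 1.0          # symmetric 0/1 indicator of cannot-links
    return W - lam * (C * W)             # '*' on arrays is the Hadamard product


W1 = np.array([[0.0, 0.6, 0.4],
               [0.6, 0.0, 0.2],
               [0.4, 0.2, 0.0]])
W2 = cannot_link_weights(W1, cannot_links=[(0, 1)], lam=0.5)
print(W2)
```

Under this assumed form, the similarity on a cannot-link pair is scaled by $(1 - \lambda)$ while all other weights are left unchanged, so a larger $\lambda$ weakens the connection between instances that should be separated.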
It is known that some form of "balancing" among clusters needs to be considered for obtaining meaningful results (von Luxburg, 2007). Based on eq.(14) and eq.(16), we utilize the normalized objective function $J_{sym}$ in eq.(17) over the graph $G''$. Minimizing $J_{sym}$ in eq.(17) amounts to solving the generalized eigen-problem $L'' f = \alpha D'' f$, where $\alpha$ corresponds to an eigenvalue and $f$ corresponds to the generalized eigenvector with that eigenvalue.
Furthermore, the number of generalized eigenvectors can be extended to more than one. In that case, the generalized eigenvectors with positive eigenvalues are selected in ascending order of eigenvalues. The generalized eigenvectors with respect to the modified graph correspond to the embedded representation of the whole data instances.

Algorithm 1 GBSSC
[...] based on eq.(12) to eq.(15).
6: [...], with the smallest non-zero eigenvalues.
7: Conduct clustering of data which are represented as $F$ and construct clusters.
8: return clusters
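The generalized eigen-problem $L'' f = \alpha D'' f$ can be solved with a symmetric eigen-solver by a standard change of variables: if $u$ is an eigenvector of $D''^{-1/2} L'' D''^{-1/2}$ with eigenvalue $\alpha$, then $f = D''^{-1/2} u$ satisfies the generalized problem. The sketch below is a minimal illustration of this, assuming every vertex has positive degree; the function name `spectral_embedding` and the tolerance `tol` used to discard zero eigenvalues are hypothetical.

```python
import numpy as np


def spectral_embedding(W, l, tol=1e-10):
    """Return the l generalized eigenpairs of L f = alpha D f with smallest positive alpha."""
    d = W.sum(axis=1)
    L = np.diag(d) - W
    d_inv_sqrt = 1.0 / np.sqrt(d)                    # assumes all degrees are positive
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    alphas, U = np.linalg.eigh(L_sym)                # eigenvalues in ascending order
    keep = np.where(alphas > tol)[0][:l]             # skip (numerically) zero eigenvalues
    F = d_inv_sqrt[:, None] * U[:, keep]             # map back: f = D^{-1/2} u
    return alphas[keep], F


# Two tightly connected pairs {0, 1} and {2, 3} with weak links between them.
W = np.array([[0.0, 1.0, 0.1, 0.1],
              [1.0, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.0, 1.0],
              [0.1, 0.1, 1.0, 0.0]])
alphas, F = spectral_embedding(W, l=1)
d = W.sum(axis=1)
residual = (np.diag(d) - W) @ F - (d[:, None] * F) * alphas
print(np.allclose(residual, 0.0))
```

In this toy graph the selected eigenvector takes one sign on vertices 0 and 1 and the opposite sign on vertices 2 and 3, so the one-dimensional embedding already separates the two groups.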

Algorithm
The graph-based semi-supervised clustering method (called GBSSC) is summarized in Algorithm 1. The contracted graph $G'$ is constructed in lines 1 to 3 based on the specified must-links. Lines 4 to 6 conduct the minimization of $J_{sym}$ in eq.(17), which is represented as the normalized graph Laplacian $L''_{sym}$ at line 5.
These correspond to the spectral embedding of the entire set of data instances $X$ onto the subspace spanned by $F = \{f_1, \ldots, f_l\}$ (Belkin & Niyogi, 2002). Note that pairwise constraints for semi-supervised clustering are also utilized in the construction of the embedded representation in our approach, which thus differs from (Belkin & Niyogi, 2002). Some clustering method is applied to the data at line 7 and the constructed clusters are returned. Currently spherical kmeans (skmeans) (Dhillon & Modha, 2001) is utilized at line 7.
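For readers unfamiliar with spherical kmeans, the following is a minimal sketch of the idea: rows are projected onto the unit sphere, each row is assigned to the centroid with the largest cosine similarity, and centroids are re-normalized after each update. It is not the R implementation used in the experiments; the random initialization, the `seed` and `n_iter` parameters, and the handling of empty clusters are assumptions made for this illustration.

```python
import numpy as np


def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Cluster the rows of X by cosine similarity to unit-length centroids."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)        # rows onto the unit sphere
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        new_labels = np.argmax(X @ centroids.T, axis=1)      # most similar centroid
        if np.array_equal(new_labels, labels):               # assignments are stable
            break
        labels = new_labels
        for c in range(k):
            members = X[labels == c]
            if len(members):                                 # keep old centroid if empty
                s = members.sum(axis=0)
                norm = np.linalg.norm(s)
                if norm > 0:
                    centroids[c] = s / norm
    return labels


X = np.array([[1.0, 0.1], [1.0, 0.2], [0.1, 1.0], [0.2, 1.0]])
labels = spherical_kmeans(X, k=2)
print(labels)
```

In GBSSC the rows of `X` would be the embedded instances, i.e., the rows of the matrix whose columns are the selected generalized eigenvectors.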

Datasets
Based on the previous work (Dhillon et al., 2003; Tang et al., 2007), we evaluated our approach on the 20 Newsgroup dataset (hereafter called 20NG) and the TREC datasets. Clustering of these datasets corresponds to document clustering, and each document is represented in the standard vector space model based on the occurrences of terms. Since the number of terms is huge in general, these are high-dimensional sparse datasets. Please note that our approach is generic and not specific to document clustering.
As in (Dhillon et al., 2003; Tang et al., 2007), 50 documents were sampled from each group (cluster) in order to create a sample for one dataset, and 10 samples were created for each dataset. For each sample, we conducted stemming using the Porter stemmer and MontyTagger, removed stop words, and selected 2,000 words in descending order of mutual information (Cover & Thomas, 2006).
For the TREC datasets, we utilized the 9 datasets listed in Table 2. We followed the same procedure as for 20NG and created 10 samples for each dataset. Since these datasets are already preprocessed and represented as count data, we did not conduct stemming or tagging.

Evaluation measures
For each dataset, the cluster assignment was evaluated with respect to Normalized Mutual Information (NMI) (Strehl & Ghosh, 2002; Tang et al., 2007). Let $C$ and $\hat{C}$ stand for random variables over the true and assigned clusters. NMI is defined in terms of $H(\cdot)$, the Shannon entropy, and $I(\cdot; \cdot)$, the mutual information between the random variables $C$ and $\hat{C}$. NMI corresponds to the accuracy of the assignment. Thus, the larger NMI is, the better the cluster assignment is with respect to the "ground-truth" labels in each dataset.
All the compared methods first construct the representation for clustering and then apply some clustering method (e.g., skmeans). The running time (CPU time in seconds) for representation construction was measured on a computer with Debian GNU/Linux, an Intel Xeon W5590, and 36 GB of memory. All the methods were implemented in the R language with R packages.
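The following sketch computes NMI from two label sequences using empirical entropies and mutual information. Several normalizations of mutual information are in use; this example assumes the arithmetic-mean variant, $I(C; \hat{C}) / ((H(C) + H(\hat{C}))/2)$, which may differ from the exact definition used in the evaluation. The function names and the convention of returning 1.0 when both partitions consist of a single cluster are also assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Empirical mutual information between two label sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())


def nmi(true_labels, assigned):
    """NMI with arithmetic-mean normalization (an assumed variant)."""
    denom = (entropy(true_labels) + entropy(assigned)) / 2.0
    if denom <= 0:                      # both partitions are a single cluster
        return 1.0
    return mutual_information(true_labels, assigned) / denom


same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
print(same_up_to_renaming, independent)
```

NMI is invariant to the naming of clusters, so an assignment that matches the ground truth up to a relabeling scores close to 1, while an assignment that is statistically independent of the true labels scores 0.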

Comparison
We compared our approach with SCREEN (Tang et al., 2007) and PCP (Li et al., 2008) (details are described in Section 5.2). Since all the compared methods are partitioning-based clustering methods, we assume that the number of clusters $k$ in each dataset is available.
SCREEN (Tang et al., 2007) conducts semi-supervised clustering by projecting the given data instances onto the subspace where the covariance with respect to the given data representation is maximized. To realize this, the covariance matrix with respect to the original data representation is constructed and its eigenvectors are utilized for projection. For high-dimensional data such as documents, this process is rather expensive, since the number of attributes (e.g., terms) becomes large. To alleviate this problem, PCA (Principal Component Analysis) was first utilized as pre-processing to reduce the number of dimensions in the data representation. We followed the same process as in (Tang et al., 2007) and pre-processed the data by PCA using 100 eigenvectors, and SCREEN was applied to the pre-processed data as in (Tang et al., 2007).
PCP (Li et al., 2008) first conducts metric learning based on semi-definite programming, and then kernel k-means clustering is conducted over the learned metric. Some package (e.g., Csdp) is utilized to solve the semi-definite program based on the specified pairwise constraints and similarities.

Parameters
The parameters under the pairwise constraints in Definition 1 are:
1) the number of constraints
2) the pairs of instances for constraints
As for 2), pairs of instances were randomly sampled from each dataset to generate the constraints. Thus, the main parameter is 1), the number of constraints, for must-links and cannot-links. We set the numbers of these two types of constraints to be the same, and varied the number of constraints.
Each data instance $x$ in a dataset was normalized such that $x^t x = 1$, and Euclidean distance was utilized for SCREEN as in (Tang et al., 2007). With this normalization, cosine similarity, which is widely utilized as the standard similarity measure in document processing, was utilized for GBSSC and PCP, and the initial edge-weighted graph for each dataset was constructed with the similarities. The number of generalized eigenvectors $l$ was set to the number of clusters $k$. In addition, following the procedure in (Li et al., 2008), an $m$-nearest neighbor graph was constructed for PCP ($m$ was set to 10 in the experiment). The hyper-parameter $\lambda$ in eq.(10) was set to 0.5, since GBSSC is robust to this value as reported in Section 4.2.
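The construction of the initial edge-weighted graph from normalized instances can be sketched as follows. With $x^t x = 1$, the cosine similarity of two instances is simply their inner product, and the squared Euclidean distance equals $2 - 2 x^t y$, so both measures induce the same ordering of neighbors. The function name `cosine_weight_matrix` and the toy term-count matrix are assumptions for this example.

```python
import numpy as np


def cosine_weight_matrix(X):
    """Cosine similarities between the rows of X, with a zero diagonal (no self-loops)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)    # enforce x^t x = 1
    W = Xn @ Xn.T                                        # inner products of unit vectors
    np.fill_diagonal(W, 0.0)
    return W


# Toy term-count vectors for three documents.
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 3.0, 0.0]])
W = cosine_weight_matrix(X)

Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
sq_dist = np.sum((Xn[0] - Xn[1]) ** 2)
print(np.isclose(sq_dist, 2.0 - 2.0 * W[0, 1]))
```

For non-negative count data the resulting weights lie in $[0, 1]$, which matches the assumption that all $w_{ij}$ are non-negative.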

Evaluation procedure
For each number of constraints, the pairwise constraints (must-links and cannot-links) were generated randomly based on the ground-truth labels in the datasets, and clustering was conducted with the generated constraints. Clustering with the same set of constraints was repeated 10 times with different initial configurations in clustering. In addition, the above process was also repeated 10 times for each number of constraints. Thus, for each dataset and each number of constraints, 100 runs were conducted. Furthermore, this process was repeated over the 10 samples for each dataset. Thus, the average of 1,000 runs is reported for each dataset.
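Generating constraints from ground-truth labels can be sketched as below: random pairs are drawn, pairs with equal labels become must-links and pairs with different labels become cannot-links, until the requested number of each type is reached. The function name `sample_constraints`, the `seed` parameter, and the rejection-sampling strategy are assumptions; the sketch also assumes that enough same-label and different-label pairs exist, otherwise the loop would not terminate.

```python
import random


def sample_constraints(labels, n_each, seed=0):
    """Draw n_each must-links and n_each cannot-links from ground-truth labels."""
    rng = random.Random(seed)
    n = len(labels)
    must, cannot = set(), set()
    while len(must) < n_each or len(cannot) < n_each:
        i, j = rng.sample(range(n), 2)          # two distinct instance indices
        pair = (min(i, j), max(i, j))           # canonical order avoids duplicates
        if labels[i] == labels[j]:
            if len(must) < n_each:
                must.add(pair)
        elif len(cannot) < n_each:
            cannot.add(pair)
    return sorted(must), sorted(cannot)


labels = [0, 0, 0, 1, 1, 1]
must_links, cannot_links = sample_constraints(labels, n_each=2)
print(must_links, cannot_links)
```

Changing `seed` yields a different constraint set of the same size, which corresponds to repeating the experiment for a fixed number of constraints.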

Evaluation of graph-based approach
Our approach modifies the data representation in a dataset according to the specified constraints. In particular, the similarities among instances (weights in a graph) are modified. Another possible approach would be to set the weights (similarities) as follows:
i) each pair $(x_i, x_j) \in C_{ML}$ to the maximum similarity
ii) each pair $(x_i, x_j) \in C_{CL}$ to the minimum similarity
First, we compared our approach for the handling of must-links in Section 3.2 with the above approaches on the Multi10 and Multi15 datasets. The results are summarized in Fig. 3. In Fig. 3, the horizontal axis corresponds to the number of constraints and the vertical axis corresponds to NMI. In the legend, max (black lines with boxes) stands for i), min (blue dotted lines with circles) stands for ii), and max&min (green dashed lines with crosses) stands for the case where both i) and ii) are employed. GBSSC (red solid lines with circles) stands for our approach.
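The baseline weight settings i) and ii) can be sketched as follows. The sketch takes the maximum and minimum over the off-diagonal entries of the original weight matrix and overwrites the entries for constrained pairs; whether the extremes are taken over all pairs or only over existing edges is not stated in the text, so this choice, together with the function name `baseline_weights` and its flags, is an assumption.

```python
import numpy as np


def baseline_weights(W, must_links, cannot_links, use_max=True, use_min=True):
    """Overwrite constrained entries of W with the extreme similarities in W."""
    off_diag = W[~np.eye(len(W), dtype=bool)]       # ignore the zero diagonal
    w_max, w_min = off_diag.max(), off_diag.min()
    out = W.copy()
    if use_max:                                     # i) must-links get the maximum
        for i, j in must_links:
            out[i, j] = out[j, i] = w_max
    if use_min:                                     # ii) cannot-links get the minimum
        for i, j in cannot_links:
            out[i, j] = out[j, i] = w_min
    return out


W = np.array([[0.0, 0.9, 0.2],
              [0.9, 0.0, 0.5],
              [0.2, 0.5, 0.0]])
W_base = baseline_weights(W, must_links=[(0, 2)], cannot_links=[(0, 1)])
print(W_base)
```

Unlike contraction, this baseline changes only the entries of the constrained pairs themselves and leaves the weights to their neighbors untouched.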
The results in Fig. 3 show that GBSSC outperformed the others and that it is effective in terms of the weight modification in a graph. One of the reasons for the results in Fig. 3 is that, when i) (max) is utilized, only the instances connected with must-links are affected, and thus they tend to be collected into a smaller "isolated" cluster. Creating rather small clusters degrades the performance. On the other hand, in our approach, instances adjacent to must-links are also affected via contraction.
As for ii) (min), the instances connected with cannot-links are by definition dissimilar to each other and their weights would be small in the original representation. Thus, setting the weights over cannot-links to the minimal value in the dataset does not affect the overall performance so much. These are illustrated in Fig. 5 and Fig. 6.
Next, we evaluated the handling of cannot-links in Section 3.3. We varied the value of the hyper-parameter $\lambda$ in eq.(10) and analyzed its influence. The results are summarized in Fig. 4. In Fig. 4, the horizontal axis corresponds to the value of $\lambda$, and the values in the legend correspond to the number of pairwise constraints (e.g., 10 corresponds to the situation where the number of pairwise constraints is 10). The performance of GBSSC was not much affected by the value of $\lambda$. Thus, our approach can be said to be relatively robust with respect to this parameter. In addition, the accuracy (NMI) increased monotonically as the number of constraints increased. Thus, it can be concluded that GBSSC reflects the pairwise constraints and improves the performance based on semi-supervised information.

Evaluation on real world datasets
We report the comparison of our approach with the other methods. In the reported figures, the horizontal axis corresponds to the number of constraints and the vertical axis corresponds to either NMI or CPU time (in sec.).
In the legend in the figures, red lines correspond to our GBSSC, black dotted lines to SCREEN, and green lines to PCP. Also, +PCA stands for the case where the dataset was first pre-processed by PCA (using 100 eigenvectors as in (Tang et al., 2007)) and then the corresponding method was applied. GBSSC+PCP (with purple lines) corresponds to the situation where must-links were handled by contraction in Section 3.2 and cannot-links by PCP.

20 Newsgroup datasets
The results for the 20NG dataset are summarized in Fig. 7. These are the averages over 10 samples for each set of groups (i.e., the average of 1,000 runs). The results indicate that our approach outperformed the other methods with respect to NMI (Fig. 7) when $l = k$. For Multi5, although the performance of PCP got close to that of GBSSC as the number of constraints increased, GBSSC was faster by more than two orders of magnitude (over 100 times faster). Likewise, GBSSC+PCP and PCP were almost the same with respect to NMI, but the former was faster by more than one order of magnitude (over 10 times faster). Although SCREEN+PCA was two to five times faster than GBSSC, it was inferior with respect to NMI. Utilization of PCA as pre-processing enables this speed-up for SCREEN, at the expense of accuracy (NMI).
Dimensionality reduction with PCA was effective for the speed-up of SCREEN, but it was not for GBSSC. On the other hand, it deteriorated their performance with respect to NMI. Thus, it is not necessary to utilize pre-processing such as PCA for GBSSC, and our approach still showed better performance.

TREC datasets
The results for the TREC datasets are summarized in Fig. 8 and Fig. 9. As shown in Table 2, the number of dimensions (attributes) is huge in the TREC datasets. Since calculating the eigenvalues of the covariance matrix with a large number of attributes takes too much time, SCREEN was too slow when it was applied to data that had not been pre-processed with PCA. Thus, SCREEN was applied only to the pre-processed data in the TREC datasets (shown as SCREEN+PCA).
On the whole, the results were quite similar to those for 20NG. Our approach outperformed SCREEN (in the TREC datasets, SCREEN+PCA) with respect to NMI. It also outperformed PCP in most datasets; however, as the number of constraints increased, the latter showed better performance for the review and sports datasets. In addition, PCP seems to improve its performance as the number of constraints increases. When GBSSC was utilized with PCP (denoted as GBSSC+PCP in the figure), it showed almost equivalent performance to PCP with respect to NMI, but GBSSC+PCP was faster by more than one order of magnitude.

Effectiveness
The reported results show that our approach is effective in terms of the accuracy of cluster assignment (NMI). GBSSC outperformed SCREEN in all the datasets. Although it did not outperform PCP in some TREC datasets with respect to NMI, it was faster by more than two orders of magnitude. Utilization of PCA as data pre-processing for dimensionality reduction enables the speed-up of SCREEN, at the expense of the accuracy of cluster assignment. On the other hand, PCP showed better performance in some datasets with respect to the accuracy of cluster assignment, at the expense of running time. Besides, since SCREEN originally conducts linear dimensionality reduction based on constraints, utilization of another linear dimensionality reduction (such as PCA) as pre-processing might obscure its effect.
From these results, our approach can be said to be effective in terms of the balance between the accuracy of cluster assignment and running time. In particular, it can leverage a small number of pairwise constraints to increase the performance. We believe that this is a good property in the semi-supervised learning setting.

Related work
Various approaches to semi-supervised clustering have been proposed. Among them are constraint-based, distance-based, and hybrid approaches (Tang et al., 2007). The constraint-based approach tries to guide the clustering process with the specified pairwise instance constraints (Wagstaff et al., 2001). The distance-based approach utilizes metric learning techniques to acquire the distance measure during the clustering process based on the specified pairwise instance constraints (Li et al., 2008; Xing et al., 2003). The hybrid approach combines these two approaches under a probabilistic framework (Basu et al., 2004).
New Frontiers in Graph Theory, www.intechopen.com
As for the semi-supervised clustering problem, Wagstaff et al. (2001) proposed a clustering algorithm called COP-kmeans based on the well-known kmeans algorithm. When assigning each data item to the cluster with minimum distance as in kmeans, COP-kmeans checks the constraint satisfaction and assigns each data item only to an admissible cluster (one which does not violate the constraints).
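The admissibility check performed by COP-kmeans can be sketched as follows: before an item is placed into a candidate cluster, its must-link partners that are already assigned must be in that cluster and its cannot-link partners must not be. This is an illustrative sketch of the idea, not the original implementation; the function names `partner` and `violates_constraints` and the dictionary-based assignment are assumptions.

```python
def partner(pair, item):
    """Return the other endpoint of pair if item is one endpoint, else None."""
    a, b = pair
    if a == item:
        return b
    if b == item:
        return a
    return None


def violates_constraints(item, cluster, assignment, must_links, cannot_links):
    """True if assigning item to cluster conflicts with already-assigned items."""
    for pair in must_links:
        other = partner(pair, item)
        # A must-link partner placed elsewhere makes this cluster inadmissible.
        if other in assignment and assignment[other] != cluster:
            return True
    for pair in cannot_links:
        other = partner(pair, item)
        # A cannot-link partner already in this cluster makes it inadmissible.
        if other in assignment and assignment[other] == cluster:
            return True
    return False


assignment = {"a": 0, "b": 1}
must_links, cannot_links = [("a", "c")], [("b", "c")]
admissible_0 = not violates_constraints("c", 0, assignment, must_links, cannot_links)
admissible_1 = not violates_constraints("c", 1, assignment, must_links, cannot_links)
print(admissible_0, admissible_1)
```

Here item `c` can only go to cluster 0, since cluster 1 would separate it from its must-link partner `a` and join it with its cannot-link partner `b`.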
SCREEN (Tang et al., 2007) first converts the data representation based on must-link constraints and removes the constraints. This process corresponds to contraction in our approach, but the weight definition is different. After that, based on cannot-link constraints, it finds the linear mapping (linear projection) to a subspace where the variance among the data is maximized. Finally, clustering of the mapped data is conducted on the subspace.
PCP (Li et al., 2008) deals with the semi-supervised clustering problem by finding a mapping onto a space where the specified constraints are reflected. Using the specified constraints, it conducts metric learning based on semi-definite programming and learns the kernel matrix on the mapped space. Although the explicit representation of the mapping or the data representation on the mapped space is not learned, kernel k-means clustering (Girolami, 2002) is conducted over the learned metric.

Conclusion
In this chapter we presented our semi-supervised clustering method based on graph contraction in general graph theory and graph Laplacian in spectral graph theory. Our approach can exploit a small number of pairwise constraints as well as pairwise relations (similarities) among the data instances. The graph representation of instances makes it possible to deal with the pairwise constraints as well as pairwise similarities over a unified representation. In order to reflect the pairwise constraints in the clustering process, the graph structure for the entire set of data instances is modified by graph contraction in general graph theory (Diestel, 2006) and graph Laplacian in spectral graph theory (Chung, 1997; von Luxburg, 2007).
We reported the performance of our approach over two real-world datasets with respect to the type of constraints as well as the number of constraints. We also compared it with other state-of-the-art semi-supervised clustering methods in terms of accuracy of cluster assignment and running time. The experimental results indicate that our approach is effective in terms of the balance between the accuracy of cluster assignment and running time. In particular, it could leverage a small number of pairwise constraints to improve the clustering performance. We plan to continue this line of research and to improve the presented approach in the future.