Spectral Clustering and Its Application in Machine Failure Prognosis


Introduction
Machine fault prognosis and health management have been studied intensively for several decades, and various approaches have been taken, such as statistical signal processing, time-frequency analysis, wavelet analysis, and neural networks. Among them, pattern recognition methods provide a systematic approach to acquiring knowledge from fault samples. In fact, mechanical fault diagnosis is essentially a problem of pattern classification.
Many pattern recognition methods have been studied and applied in machine condition monitoring and fault prognosis. Campbell and Bennett (2001) proposed a linear programming approach to engine failure detection. Ypma (2001) applied different learning methods, such as Independent Component Analysis, Self-Organising Maps, and Hidden Markov Models, to fault feature extraction, novelty detection and dynamic fault recognition. Ge et al. (2004) proposed a support vector machine based method for sheet metal stamping monitoring. Harkat et al. (2007) applied non-linear principal component analysis to sensor fault detection and isolation. Lei and Zuo (2009) implemented the weighted k-nearest-neighbour algorithm to identify the gear crack level.
However, the information carried by an incipient machine fault is usually weak and contaminated by strong noise, and there are rarely enough fault samples to train the learning machine. The key issue is therefore how to select sensitive features from the dataset for incipient fault prognosis. This is a problem of feature selection and dimension reduction, and solving it is very useful for fault classification.
In most medical and clinical applications, when the dimensionality of the data is high, techniques that project or embed the data into a lower-dimensional space while retaining as much information as possible may be used to reduce computational complexity.
Classical linear examples are Principal Component Analysis (PCA) (Jolliffe, 2002) and Multi-Dimensional Scaling (MDS) (Cox & Cox, 2001). The coordinates of the data points in the lower-dimensional space can be used as features or simply as a means to visualize the data.
However, in common PHM (Prognostics and Health Management) applications, the dimensionality of the data is not as high as in medical research, and mapping techniques are mainly applied to reveal the correlation between features so as to increase the accuracy of fault detection and identification. Feature selection can also avoid installing unnecessary sensors for machine monitoring, which matters given the high cost of maintenance. Nomikos and MacGregor (1994) first presented a PCA approach for monitoring batch processes, in which the historical information was linearly projected onto a low-dimensional space that summarized the key characteristics of normal behaviour in terms of both the variables and their time histories. Because the minor components discarded in PCA may contain important information on nonlinearities, many nonlinear methods, such as Kernel PCA (Schölkopf, 1998), have been presented for process monitoring and chemical process modelling (Dong & McAvoy, 1996; Kaspar & Ray, 1992; Sang et al., 2005).
Non-linear dimensionality mapping methods are more commonly known as non-linear manifold learning methods. Manifold learning is the process of estimating a low-dimensional underlying structure embedded in a collection of high-dimensional data (Tenenbaum et al., 2000; Roweis & Saul, 2000). Instead of using the Euclidean distance to measure the similarity of samples in the input space, the similarity of samples in the latent space is measured by their geodesic, or shortest-path, distance. In this way the deceptively small distances in the high-dimensional input space can be corrected.
Spectral clustering is a graph-theory-based manifold learning method that partitions the data graph into clusters for exploratory data analysis. Compared with traditional algorithms such as k-means, spectral clustering has several fundamental advantages: it is more flexible and captures a wider range of cluster geometries, it is simple to implement, and it can be solved efficiently by standard linear algebra methods. It has been successfully deployed in numerous applications in areas such as computer vision, speech recognition, and robotics. Moreover, there is a substantial theoretical literature supporting spectral clustering (Kannan et al., 2004; Luxburg, 2007, 2008).
In most PHM applications, the groups of data from different failure modes are often nonlinearly distributed and mixed in a high-dimensional feature space. An "unfolded" feature space is therefore needed so that a designed classifier can differentiate these degradation patterns.
In this chapter, we first propose a spectral clustering based feature selection method for machine fault feature extraction and evaluation. The samples with the selected features are then input into a density-adjustable spectral kernel based transductive support vector machine, which is trained to produce the prognosis results.

Basics of graph theory
Consider a set of d-dimensional data points $\{x_1, \ldots, x_n\}$, where the similarity between each pair of data points $x_i$ and $x_j$ is denoted $w_{ij}$. According to graph theory, the data points can be represented by an undirected data graph $G = (V, E)$. Each node in this graph represents a data point $x_i$. Two nodes are connected if the similarity $w_{ij}$ between the corresponding data points $x_i$ and $x_j$ is positive or larger than a certain threshold, and the edge is weighted by $w_{ij}$. The data points can then be divided into several groups such that points in the same group are similar and points in different groups are dissimilar.
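The graph construction described above can be illustrated with a short sketch. The text only requires some pairwise similarity $w_{ij}$; the Gaussian similarity, the function name `build_similarity_graph`, and the parameters `sigma` and `threshold` are assumptions made for this illustration.

```python
import numpy as np

def build_similarity_graph(X, sigma=1.0, threshold=0.0):
    """Return a symmetric weighted adjacency matrix W for the rows of X.

    Assumption: similarity is a Gaussian function of the Euclidean
    distance; the text itself does not fix the similarity measure.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)      # no self-loops
    W[W <= threshold] = 0.0       # keep an edge only above the threshold
    return W

# Two tight pairs of points placed far apart: only the edges inside
# each pair survive the threshold.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
W_demo = build_similarity_graph(X_demo, sigma=1.0, threshold=0.5)
```

With these settings the resulting graph has two connected components, which is the kind of group structure that the clustering step is meant to recover.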

Laplacian embedding
Belkin and Niyogi (2003) showed that Laplacian Eigenmaps use spectral techniques to perform dimensionality reduction. This technique relies on the basic assumption that the data lie on a low-dimensional manifold in a high-dimensional space. The Laplacian of the graph obtained from the data points may be viewed as an approximation to the Laplace-Beltrami operator defined on the manifold. The embedding maps for the data come from approximations to a natural map that is defined on the entire manifold.
The popular Laplacian Embedding algorithm includes the following steps, as shown in Fig. 1.
Step 1: The d-dimensional dataset is viewed as an undirected data graph $G = (V, E)$ with node set $V = \{X_1, \ldots, X_n\}$. Every node in the graph is one point in $\mathbb{R}^d$. An edge links node $i$ and node $j$ if they are close in the sense of ε-neighborhoods, which means that the distance between $X_i$ and $X_j$ is smaller than $\varepsilon$, or if $X_i$ is among the $k$ nearest neighbors of $X_j$ or $X_j$ is among the $k$ nearest neighbors of $X_i$.
Step 2: Each edge between two nodes $X_i$ and $X_j$ carries a non-negative weight $w_{ij} \ge 0$. The weighted adjacency matrix of the graph is the matrix $W = (w_{ij})_{i,j=1,\ldots,n}$. The degree of a node $X_i \in V$ is defined as $d_i = \sum_{j=1}^{n} w_{ij}$. The degree matrix $D$ is defined as the diagonal matrix with $\{d_1, d_2, \ldots, d_n\}$ on its diagonal. The un-normalized graph Laplacian matrix is defined by Luxburg (2007) as $L = D - W$.
Step 3: The Laplacian Eigenmap (on the normalized Laplacian matrix) is computed by spectral decomposition, that is, by solving the generalized eigenvector problem $L y = \lambda D y$. The image of $X_i$ under the embedding into the lower-dimensional space $\mathbb{R}^m$ is given by $(y_1(i), y_2(i), \ldots, y_m(i))$, where the eigenvectors $y_1, \ldots, y_m$ are ordered by their eigenvalues. This decomposition provides significant information about the graph and the distribution of all points. It has been shown experimentally that the natural groups within a dataset are recovered by mapping the original dataset into the space spanned by eigenvectors of the Laplacian matrix (Belkin & Niyogi, 2003).
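Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `laplacian_eigenmap` and the toy graph are assumptions, every node is assumed to have a positive degree, and the eigenvector belonging to the eigenvalue near zero is skipped because it is constant on a connected graph.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(W, m):
    """Embed the nodes of a weighted graph into R^m via L y = lambda D y.

    Assumes W is symmetric and every node has positive degree, so that
    D is positive definite.
    """
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)            # node degrees d_i = sum_j w_ij
    D = np.diag(d)
    L = D - W                    # un-normalized graph Laplacian
    # Generalized symmetric eigenproblem; eigenvalues are returned in
    # ascending order.
    eigvals, eigvecs = eigh(L, D)
    # Skip the first (constant) eigenvector, keep the next m.
    return eigvals[1:m + 1], eigvecs[:, 1:m + 1]

# Toy graph: two fully connected triangles joined by one weak edge.
W_two = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                W_two[i, j] = 1.0
W_two[2, 3] = W_two[3, 2] = 0.01

eigvals_demo, Y_demo = laplacian_eigenmap(W_two, m=1)
```

On this toy graph the single embedding coordinate takes one sign on the first triangle and the opposite sign on the second, so the two natural groups are separated by a simple threshold at zero.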

Supervised feature selection criterion by Laplacian scores
Given a graph $G$, the Laplacian matrix $L$ of $G$ is a linear operator on any feature vector $f = (f_1, \ldots, f_n)^T$:

$$f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2 \qquad (4)$$

The equation quantifies how consistent the feature vector is with the local structure of $G$. A feature that takes similar values on instances that are close to each other is more consistent with the data structure. The flatter the feature value is over all instances, the smaller the value of the equation. However, rather than considering only instances separated by a small distance, a complete definition of feature consistency with the data structure is given as follows.

Definition 1 (feature local consistency with data graph): Given a data graph $G = (V, E)$ with $V = \{X_1, \ldots, X_n\}$ and $E = \{W_{ij}\}$, the feature $f$ is a locally consistent variant of $G$ at level $h$ ($0 < h < 1$) for a clustering $C$ over $G$ if, for every cluster $C_k$ of $C$, the ratio $h_k$ of the within-cluster variation of $f$ to its between-cluster variation is at most $h$.

The definition is a ratio between the intra-cluster and inter-cluster variation caused by the individual feature. A perfect clustering has small variance within clusters and large variance between clusters. If the feature $f$ contributes to a better clustering, the numerator tends to be smaller and the denominator larger, so $h_k$ is expected to be smaller. The feature consistency index $h$ indicates the feature's weakest separability for the clustering $C$.

In terms of graph theory, a similar criterion can be formulated based on Eq. (4) by configuring the data graph $G$ with two similarity measurements: $w^{(1)}_{ij}$, the similarity of samples within the same class, and $w^{(2)}_{ij}$, that of samples in different classes. The sequence of instances can then be reordered so that the adjacency matrix carries closer instances along its diagonal.
As defined by He et al. (2006), the Laplacian score of the $r$-th feature is

$$L_r = \frac{\tilde{f}_r^T L \tilde{f}_r}{\tilde{f}_r^T D \tilde{f}_r}, \qquad \tilde{f}_r = f_r - \frac{f_r^T D \mathbf{1}}{\mathbf{1}^T D \mathbf{1}} \mathbf{1}.$$

Because $\tilde{f}_r^T L \tilde{f}_r = f_r^T L f_r$ (He et al., 2006), and with the weight matrix configured as $W^{(1)}$ or $W^{(2)}$, it follows from Eq. (14) that, instead of the feature consistency index in Definition 1, the ratio of two Laplacian scores, $L_r^{(1)}$ and $L_r^{(2)}$, can also be considered an equivalent estimate of feature consistency.
These scores are computed over the data graph configured with $W^{(1)}$ and $W^{(2)}$, respectively. If the feature is consistent with these data graphs, $L_r^{(1)}$ should be smaller and $L_r^{(2)}$ larger. Therefore, from the graph theory perspective, the supervised feature selection criterion by Laplacian score can be defined as the ratio

$$\frac{L_r^{(1)}}{L_r^{(2)}}$$

Based on this criterion, the features can be ranked, and a simple search procedure can be defined to select an appropriate number of features from the list.
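The ranking criterion can be illustrated with a small sketch. The Laplacian score follows the form given by He et al. (2006); the 0/1 weights used for the within-class graph $W^{(1)}$ and the between-class graph $W^{(2)}$, the function names, and the toy data are assumptions for this illustration, since the chapter's exact weight definitions are not reproduced here.

```python
import numpy as np

def laplacian_score(F, W):
    """Laplacian score of every column (feature) of F on the graph W."""
    F = np.asarray(F, dtype=float)
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)
    L = np.diag(d) - W
    # Remove the degree-weighted mean of every feature.
    F_tilde = F - (d @ F) / d.sum()
    num = np.einsum("ir,ij,jr->r", F_tilde, L, F_tilde)   # f~^T L f~
    den = np.einsum("ir,i,ir->r", F_tilde, d, F_tilde)    # f~^T D f~
    return num / den

def supervised_score_ratio(F, labels):
    """Ratio of within-class to between-class Laplacian scores.

    Assumption: both graphs use 0/1 weights derived from the labels.
    """
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    W1 = same.astype(float)
    np.fill_diagonal(W1, 0.0)          # within-class graph, no self-loops
    W2 = (~same).astype(float)         # between-class graph
    return laplacian_score(F, W1) / laplacian_score(F, W2)

# Feature 0 separates the two classes, feature 1 does not.
F_demo = np.array([[0.0, 0.3], [0.1, 0.9], [0.2, 0.1],
                   [1.0, 0.8], [1.1, 0.2], [1.2, 0.7]])
labels_demo = np.array([0, 0, 0, 1, 1, 1])
ratio_demo = supervised_score_ratio(F_demo, labels_demo)
ranking_demo = np.argsort(ratio_demo)  # smaller ratio ranks first
```

In this toy example the discriminative feature receives a much smaller ratio than the uninformative one, so it is ranked first. A constant feature would give a zero denominator and should be removed beforehand.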

Density-adjustable spectral clustering
Commonly, the weight of an edge in a graph is defined by the Euclidean distance between the two nodes, which works well for linearly structured data.
But for nonlinear data, such as the two clusters shown in Fig. 2, data points a and c belong to the same cluster, yet the Euclidean distance between points a and b is less than that between points a and c. It is therefore necessary to measure the similarity of data points in a different way, one that lengthens paths passing through low-density areas and shortens those that do not. The minimum path length can then replace the Euclidean distance. This is very useful for machine failure prognosis, because nonlinearity is almost always present when a machine anomaly occurs. Chapelle and Zien (2005) proposed a density-sensitive distance based on a density-adjustable path length, defined as follows:

$$l(x_i, x_j) = \rho^{\mathrm{dist}(x_i, x_j)} - 1$$

where $\mathrm{dist}(x_i, x_j)$ is the Euclidean distance between data $x_i$ and $x_j$, and $\rho$ is the density-adjustable factor ($\rho > 1$). This definition satisfies the cluster assumption, and it can describe the consistency of the data structure by adjusting the factor $\rho$ to lengthen or shorten the path between two data points. The similarity of data points $x_i$ and $x_j$ is therefore expressed through $dsp(l(x_i, x_j))$, the minimum distance between $x_i$ and $x_j$, that is, the length of the shortest path between them after density adjustment.
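The effect of the density adjustment can be seen in a small sketch. It assumes the edge length $\rho^{\mathrm{dist}} - 1$ with $\rho > 1$ and computes all-pairs shortest paths on the complete graph; the function name `density_adjusted_distance` and the toy points are hypothetical.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def density_adjusted_distance(X, rho=2.0):
    """Shortest-path distances under density-adjusted edge lengths.

    Assumption: the edge length between two points is rho**dist - 1,
    with dist the Euclidean distance and rho > 1. Off-diagonal zero
    lengths (duplicate points) are treated as missing edges by SciPy.
    """
    dist = cdist(X, X)                      # Euclidean distances
    edge_len = np.power(rho, dist) - 1.0    # long hops are penalized
    # Dijkstra on the complete, undirected graph.
    return shortest_path(edge_len, method="D", directed=False)

# Points 0..3 form a dense chain; point 4 is isolated but closer to
# point 0 than point 3 is in the Euclidean sense.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
                   [0.0, 2.5]])
D_demo = density_adjusted_distance(X_demo, rho=2.0)
```

Here the chain end (point 3) is reached through three short hops, so its adjusted distance from point 0 is smaller than that of the isolated point 4, even though point 4 is nearer in Euclidean terms.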

Transductive support vector machine
The support vector machine (SVM) is a supervised learning method based on statistical learning theory (Vapnik, 1998). It follows Structural Risk Minimization (SRM) rather than Empirical Risk Minimization (ERM); SRM is an inductive principle for model selection when learning from finite training data sets, and it enhances the generalization ability of the SVM. The key to the SVM is the "kernel trick", by which a nonlinear map from a low-dimensional space to a high-dimensional space is realized. The nonlinear classification task in the low-dimensional space can therefore be converted to a linear one, which is solved by finding the best hyperplane in the high-dimensional space.
For two-class data points, there are many hyperplanes that might classify the data. The best hyperplane is the one with the largest margin between the two classes, that is, the one for which the distance from the hyperplane to the nearest data point on each side is maximized. A hyperplane can be written as the set of points $x$ satisfying

$$w \cdot x + b = 0,$$

where '·' denotes the dot product, $w$ is the normal vector to the hyperplane, and $b$ is the offset of the hyperplane.
The Transductive Support Vector Machine (TSVM) is a semi-supervised learning method, which trains on labelled data together with a large amount of unlabelled data. TSVM uses the idea of maximizing the separation between labelled and unlabelled data (Vapnik, 1998). The learning problem is

$$\min_{y_1^*, \ldots, y_k^*,\, w,\, b,\, \xi,\, \xi^*} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i + C^* \sum_{j=1}^{k} \xi_j^*$$

$$\text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, l; \qquad y_j^* (w \cdot x_j^* + b) \ge 1 - \xi_j^*,\ \xi_j^* \ge 0,\ j = 1, \ldots, k,$$

where $C$ and $C^*$ are the penalty factors corresponding to the labelled and unlabelled data, $\xi_i$ and $\xi_j^*$ are the respective slack variables, $l$ is the number of labelled data and $k$ that of unlabelled data. The penalty factors are set by the user, and they allow trading off margin size against misclassifying training samples or excluding test samples.
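To make the roles of the two penalty factors concrete, the following sketch evaluates a soft-margin TSVM objective for a fixed linear classifier. It assumes the common form $\frac{1}{2}\|w\|^2 + C\sum_i \xi_i + C^*\sum_j \xi_j^*$ with hinge-type slacks; the function name `tsvm_objective` and the toy data are hypothetical, and the sketch evaluates the objective only, it does not train a TSVM.

```python
import numpy as np

def tsvm_objective(w, b, X_lab, y_lab, X_unl, y_unl, C=1.0, C_star=0.1):
    """Value of an assumed soft-margin TSVM objective.

    y_unl holds the labels currently assigned to the unlabelled samples;
    a TSVM solver would search over these assignments as well.
    """
    w = np.asarray(w, dtype=float)
    X_lab = np.asarray(X_lab, dtype=float)
    X_unl = np.asarray(X_unl, dtype=float)
    # Slack of each labelled sample: max(0, 1 - y_i (w.x_i + b)).
    xi = np.maximum(0.0, 1.0 - np.asarray(y_lab) * (X_lab @ w + b))
    # Slack of each unlabelled sample under its assigned label.
    xi_star = np.maximum(0.0, 1.0 - np.asarray(y_unl) * (X_unl @ w + b))
    return 0.5 * w @ w + C * xi.sum() + C_star * xi_star.sum()

X_lab_demo = np.array([[2.0, 0.0], [-2.0, 0.0]])
y_lab_demo = np.array([1.0, -1.0])
X_unl_demo = np.array([[0.5, 0.0]])

# The same unlabelled point scored under both candidate labels.
obj_pos = tsvm_objective([1.0, 0.0], 0.0, X_lab_demo, y_lab_demo,
                         X_unl_demo, np.array([1.0]))
obj_neg = tsvm_objective([1.0, 0.0], 0.0, X_lab_demo, y_lab_demo,
                         X_unl_demo, np.array([-1.0]))
```

Assigning the unlabelled point to the side of the hyperplane on which it already lies gives the lower objective, which is how the unlabelled data push the decision boundary toward low-density regions.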

Density-adjustable spectral kernel based TSVM
Combining the ideas of density-adjustable spectral clustering (Chapelle & Zien, 2005) and TSVM, we obtain the density-adjustable spectral kernel based TSVM algorithm, called DSTSVM. The data are pre-processed by density-adjustable spectral decomposition, the processed data are input into the TSVM, which is trained by gradient descent with a Gaussian kernel, and the data are then classified. The DSTSVM algorithm is implemented as follows.
Input: n-dimensional data $X = \{X_1, \ldots, X_m\}$ (some labelled and others unlabelled)
Parameters: density-adjustable factor $\rho$, penalty factor $C$, and kernel width $\sigma$ of the Gaussian kernel (set by the user)
Output: the labels of the unlabelled data and the correctness of classification
Step 1: Calculate the Euclidean distance matrix $S$ of the data $X$.
Step 2: Calculate the shortest path matrix $S_0$ according to Eq. (16).
Step 3: Construct the graph $G$ based on the matrix $S_0$. Define the similarity between nodes as $w_{ij} = e^{-s_0(x_i, x_j)^2 / 2\sigma^2}$; the degree diagonal matrix is then $D(i,i) = \sum_j w_{ij}$.
Step 4: Calculate the Laplacian matrix.
Step 5: Select the first $r$ nonnegative eigenvectors according to $\sum_{i=1}^{r} \lambda_i / \sum_{j=1}^{n} \lambda_j \ge 85\%$.
Step 6: Get the new data set from the selected eigenvectors.
Step 7: Train the TSVM by gradient descent using the new data and obtain the classification result.
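The pre-processing part of the algorithm (Steps 1 to 6) might look like the sketch below. Several details are assumptions and are not confirmed by the text: the edge length $\rho^{\mathrm{dist}} - 1$ for the shortest paths, the symmetrically normalized matrix $D^{-1/2} W D^{-1/2}$ standing in for the Laplacian of Step 4, and the use of the retained eigenvectors themselves as the new data. The function name `dstsvm_features` is hypothetical, and the TSVM training of Step 7 is not included.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def dstsvm_features(X, rho=2.0, sigma=1.0, energy=0.85):
    """Density-adjusted spectral features (sketch of Steps 1 to 6)."""
    S = cdist(X, X)                                   # Step 1: distances
    S0 = shortest_path(np.power(rho, S) - 1.0,        # Step 2: assumed
                       method="D", directed=False)    # edge length
    W = np.exp(-S0 ** 2 / (2.0 * sigma ** 2))         # Step 3: similarity
    d = W.sum(axis=1)                                 # degrees D(i, i)
    M = W / np.sqrt(np.outer(d, d))                   # Step 4 (assumed form)
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1]                    # largest first
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > 0                                   # drop non-positive ones
    vals, vecs = vals[keep], vecs[:, keep]
    # Step 5: smallest r whose eigenvalues reach the energy threshold.
    ratio = np.cumsum(vals) / vals.sum()
    r = int(np.searchsorted(ratio, energy) + 1)
    return vecs[:, :r]                                # Step 6: new data

rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
                    rng.normal(3.0, 0.3, size=(10, 2))])
Z_demo = dstsvm_features(X_demo, rho=2.0, sigma=1.0)
```

Each row of the returned matrix is the new representation of one sample, which would then be passed, together with the available labels, to the TSVM.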

Case study
To demonstrate that the proposed feature selection method and the DSTSVM classifier are effective in machine failure prognosis, we applied them to feature selection and classification of feed axis faults.

Experiments
The feed axis is one of the critical components of a high-precision numerical control machine tool, and it always works under conditions such as high speed, heavy duty and large travel distance. These conditions accelerate the degradation of mechanical parts such as bearings and ball nuts. From a preventive maintenance perspective, autonomous fault detection and feed axis health assessment could reduce the possibility of more severe damage and downtime of the machine tool.
TechSolve Inc. collaborated with the NSF Intelligent Maintenance System Center (IMS) to investigate intelligent maintenance techniques for autonomous feed axis failure diagnosis and health assessment. For the investigation, designed experiments were conducted on a feed axis test-bed built by TechSolve. Multiple seeded failures were tested on the system, such as axis front and back ball nut misalignment and bearing misalignment (Siegal et al., 2011). Data from 13 channels (bearing and ball nut accelerometers, temperature and speed, motor power, encoder position and so on) were collected from the test-bed over a period of approximately 6 months. Since all the tests were designed to carry certain failures under different working conditions, the collected information was labelled in terms of four condition indices: the test index, the load, the ball nut condition, and the bearing condition.

[Table: test conditions by mode, with columns Test index, Load, and BallNut Misalignment]

Feature selection and fault classification
The samples were collected under 4 modes: Health, Failure 1 (Bearing misalignment 0.007μm), Failure 2 (Ballnut misalignment 0.007μm), and Mode 4 (Failure 1 accompanied by Failure 2). Every mode had two working conditions, with load at 0 and 300Kw, and 25 samples were collected at each condition. Each sample had 154 features: 117 vibration features (RMS, kurtosis, crest factor at different time periods, and average energy of selected frequency bands) and 37 other features (torque, temperature, position error, and power at different time periods). In total, 200 154-D samples were therefore used for the investigation.
All the features were evaluated and ranked by Laplacian score using the proposed feature selection criterion. Of the 154 features, 22 were selected, which reflect the data structure well and give the best classification performance, as shown in Fig. 4. The input data dimension was therefore reduced from 154-D to 22-D. From the 50 22-D samples within every class, 25 were selected randomly as labelled samples (100 samples in total), and the remaining 100 samples were regarded as unlabelled. All these labelled and unlabelled samples were then input into the DSTSVM classifier for co-training. This process was repeated 10 times, and through 5-fold cross validation we predicted which class the unlabelled samples belong to.
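The random labelled/unlabelled split described above can be reproduced with a short sketch. The class sizes follow the text (4 modes with 50 samples each, 25 labelled per class); the function name `split_labelled` and the random seed are assumptions, and the classifier itself is not included.

```python
import numpy as np

def split_labelled(y, n_labelled_per_class, rng):
    """Randomly pick the same number of labelled samples in every class."""
    y = np.asarray(y)
    labelled = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        # Sample without replacement inside the class.
        labelled.extend(rng.choice(idx, size=n_labelled_per_class,
                                   replace=False))
    labelled = np.sort(np.array(labelled))
    unlabelled = np.setdiff1d(np.arange(len(y)), labelled)
    return labelled, unlabelled

y_demo = np.repeat(np.arange(4), 50)     # 4 modes x 50 samples
rng = np.random.default_rng(0)
lab_demo, unl_demo = split_labelled(y_demo, 25, rng)
```

Calling the function again with 20 or 10 labelled samples per class, and repeating it with fresh random draws, gives the other experimental settings used in the comparison.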
To test the performance of the designed DSTSVM classifier, we reduced the number of labelled samples per class to 20 and 10, respectively, and repeated the procedure above. To verify the effectiveness and correctness, the results were compared with those obtained using SVM (supervised) and TSVM (semi-supervised).
The results of the 10th classification run using the data with 10 labelled and 40 unlabelled samples per class are shown in Fig. 5, Fig. 6, and Fig. 7. All of the classifiers were trained with a Gaussian kernel, and the kernel width $\sigma$ was set to the optimal value for each classifier. Two parameters, $\sigma$ and $\rho$, influence the DSTSVM classification; the density-adjustable factor $\rho$ reflects the data similarity measure, which also affects the kernel function. The optimal pair ($\sigma$, $\rho$) can be chosen in terms of classification correctness. The comparison results for different numbers of labelled samples are listed in Table 2, where the average correctness is the average over the 10 test runs with 5-fold CV. It can be observed that the proposed method outperforms the TSVM and matches the supervised SVM for the different numbers of labelled samples. Moreover, when the labelled data were reduced to 10 samples, it performed better than the SVM, which is very meaningful for practical machine failure prognosis applications.

Conclusion
The proposed feature selection method can capture the structure of the input data, reduce the dimension of the data and speed up the computation. More importantly, the classification result is also improved by this feature selection method. Compared with traditional supervised SVM learning and the semi-supervised TSVM learning method, the proposed DSTSVM outperformed the TSVM, matched the SVM, and surpassed the SVM when only 10 labelled samples per class were available. The experimental results demonstrate that the proposed DSTSVM method is effective and capable of classifying incipient failures, and that it has great potential for machine fault prognosis in practice. Building on the current work, the proposed approach can be used to quantify and assure the sufficiency of the data for prognostics applications.
In summary, a spectral clustering based method was proposed to evaluate data and select sensitive features for prognostics, and the spectral kernel based TSVM classifier was shown to be effective in PHM applications.