Condition Monitoring and Fault Diagnosis of Roller Element Bearing

Rolling element bearings play a crucial role in determining the overall health condition of a rotating machine. An effective condition-monitoring program on bearing operation can improve a machine’s operation efficiency, reduce the maintenance/replacement cost, and prolong the useful lifespan of a machine. This chapter presents a general overview of various condition-monitoring and fault diagnosis techniques for rolling element bearings in the current practice and discusses the pros and cons of each technique. The techniques introduced in the chapter include data acquisition techniques, major parameters used for bearing condition monitoring, signal analysis techniques, and bearing fault diagnosis techniques using either statistical features or artificial intelligent tools. Several case studies are also presented in the chapter to exemplify the application of these techniques in the data analysis as well as bearing fault diagnosis and pattern recognition.


Introduction
Rolling element bearings are the most critical but vulnerable mechanical components in a rotating machine. A bearing failure can lead to a complete machine breakdown causing unintended interruption to a production process and financial losses. It is important to have an effective bearing condition monitoring (CM) and fault diagnosis system in place so that incipient bearing faults can be detected and correctly diagnosed on time to prevent them from deteriorating further to cause damage to a machine. For instance, an early detection of incipient defect of a rolling element bearing in a high speed train or a wind turbine can lead to a timely maintenance/replacement to prevent potential disastrous consequence and human loss caused by unexpected failure of critical mechanical components.
Many condition-monitoring and fault diagnosis techniques have been developed in the last few decades to improve the reliability of rolling element bearings. This chapter provides an overview on the most commonly employed condition-monitoring, signal analysis, and fault diagnosis techniques for rolling element bearings and discusses some of the pros and cons of these techniques.
A starting but most fundamental information in bearing condition monitoring is the characteristic bearing defect frequencies. The characteristic defect frequency components in a signal are generated by flaws or faults presented in a bearing when the bearing is operated at a specific machine rotating speed under certain loading conditions. Alternatively, defect signals can also be produced accompanying the normal wear process during a bearing's operational life. Figure 1 shows the graphical and the cross-sectional representations of a rolling element bearing. The bearing comprises four mechanical components: an outer race, an inner race, rollers (balls), and a cage that hold the rollers (balls) in place. Correspondingly, there are four possible characteristic defect frequencies for a rolling element bearing: ball (roller) pass frequency at the outer race (BPFO), ball (roller) pass frequency at the inner race (BPFI), ball (roller) spin frequency (BSF), and fundamental train frequency (cage frequency) (FTF). The formulae for these four characteristic bearing defect frequencies are listed in Table 1. Note: N is the shaft speed in revolutions per minute (RPM), n is the number of roller elements in a bearing, α is the contact angle of the bearing due to the load from the radial plane, d is the diameter of the roller, and D is the mean diameter of the bearing as shown in Figure 1. Table 1. Formulae of the bearing defect frequencies.
A bearing defect signal can be simulated using the following equations [1]: OðtÞ r sn , t < n þ 1 BDF , n ¼ 0, 1, 2, …: (1a) and sðt, nÞ ¼ Qe −γðt− nþ1 BDF Þ sin 2πf r t− n þ 1 BDF ! þ OðtÞ r sn , t ≥ n þ 1 BDF , n ¼ 0, 1, 2, …: where Q is the assumed maximum loading intensity for a bearing defect and t is the time variable, BDF represents a bearing defect frequency, f r is the assumed bearing resonance frequency and α is the energy decay constant of the bearing race. The first part in Eqs. (1a) and (1b) is the signal produced by a bearing defect, and the second part of the equations is the superimposed white Gaussian noise representing the machine background noise. n is the pulse index of the bearing defect frequency, O(t) is the white Gaussian noise and r sn is the assumed signal-to-noise ratio (SNR). A typical bearing defect signal is shown in Figure 2. The parameters used in the simulation of the signal are listed in Table 2.

Condition-monitoring techniques
Condition-monitoring (CM) techniques are required to acquire the operation condition data of a rotating machine. The CM data can then be processed and analyzed using appropriate signal analysis techniques to obtain the most relevant characteristic parameters before being used in a diagnosis or prognosis algorithm to evaluate/predict the health state of the machine. The data employed in a condition-monitoring program can be vibration, noise, electric current, oil and grease, temperature or a combination of these data. The CM data can be analyzed using either time-domain techniques, frequency-domain techniques, or time-frequency techniques according to the data properties such as linearity or nonlinearity, stationary or nonstationary, to extract the most useful and effective features for bearing fault diagnosis. This section summarizes some of the most commonly employed techniques in condition-monitoring applications of rolling element bearings.

Vibration technique
Vibration technique is the most frequently employed technique for machine condition monitoring. A change of the vibration signal in a machine without changing the operation condition can imply a change of health state of the machine. Vibration signals are typically generated by defects in the moving components of the machine such as defects in a bearing, gearboxes, reciprocating components, and so on. For example, when a rolling element bearing operates under a faulty condition, an impulse signal will be generated as other bearing components passing through the faulty position. This will lead to an increase in overall vibration amplitude of the machine. The defect component of the bearing and its severity can be determined by the characteristic defect frequency component contained in the condition monitoring signal (see Figure 2). The frequency range of vibration measurement can be as low as in the infrasonic region (below 20 Hz) and span across to the upper limit of the audible frequency (20 kHz). Though the vibration technique can be problematic in acquiring useful signals when it is deployed for condition monitoring of large low-speed rotating machine where the energy of an incipient defective signal is usually weak and often submerges under the background noise. High-frequency techniques such as acoustic emission (AE) technique can be employed to overcome this limitation.

Acoustic emission technique
Acoustic emission is a transient elastic wave generated by the sudden release or redistribution of stress in a material. For example, in bearing condition-monitoring applications, AE is generated by the sudden release of energy caused by the material deformation as other bearing components passing through a defect part. The signal then propagates to the bearing house to be detected by a monitoring AE sensor. Compared to vibration signals, the AE signal is less likely to be affected by the dominating noise and vibration generated by the moving mechanical components of the monitored system due to the high-frequency nature of the signal (typically above 100 kHz). Care should be taken in choosing the mounting locations of AE sensors to minimize the energy loss along the AE propagation path for better signal clarity. Furthermore, AE also comes with inherited problems such as calibration, nonlinearity, data storage and transfer, data processing and interpretation. Recently, Lin et al. [1][2][3] developed a number of signal-processing algorithms to overcome the problems of nonlinearity and large AE data. Nowadays, a large industry deployment of AE technique in bearing condition monitoring is still restricted by the expensive highly specialized AE data acquisition devices.

Current signal technique
Current signal technique is mainly used to monitor the bearing condition of electric motors. The technique is based on the principle that the vibration signal generated in a motor is closely related to the change of magnetic flux density in the motor. Stator current technique is the frequently employed current signal technique. When a motor bearing operates on a faulty condition, the radial motion of the motor axis would lead to a small shift of the rotor causing a change in the magnetic flux density between the stator and the rotor. The induced voltage will then cause the variation of the stator current. The stator current technique employs noninvasive sensors to monitor the variation of the stator current; therefore, it is easy to implement and simple to operate.

Oil and debris-monitoring technique
The application of oil and debris-monitoring technique is a frequently employed tribology approach for machine condition monitoring. The health condition of a roller element bearing can be monitored by analysis of the properties and particle contents of the lubrication oil. The application of oil and debris-monitoring technique is rather straightforward. Though such tribology analysis is normally undertaken at laboratories using spectrometers and scanning electron microscopes rather than in situ. The technique is also limited to condition-monitoring applications of lubrication-related or wear-related problems.

Thermography technique
Thermography technique detects faults in a bearing by measuring the emission of infrared energy using thermo-infrared devices during bearing operation process. Laser thermography devices or thermo-infrared cameras are the most common instruments used for such measurement. This technique can be applied to monitor the heat variation caused by the change of bearing lubrication, load, and operation speed. It is particularly sensitive in monitoring the overheating phenomenon caused by improper lubrication but less sensitive in detecting incipient fault developed in a bearing such as early pitting, slight wear, or peeling.

Data analysis and fault diagnosis techniques
Many signal analysis and machine fault diagnosis techniques have been developed in the last few decades in order to improve the reliability, efficiency, and lifespan of machines, as well as to reduce the maintenance and operation cost. In this section, the most frequently employed signal-processing and fault diagnosis techniques for rolling element bearings are briefly discussed and summarized.

Time series analysis
Time series analysis is a mathematical method, which handles an observed data series in a statistic manner. Time series analysis is based on the assumption that the variation trend of a historical observed data can be extended to provide an indication/prediction of future data variation of a same monitored system. A typical time series analysis approach is to establish a predictive model based on the observed data series. The three widely adopted univariate time series models in machine fault diagnosis are autoregressive (AR) model, moving average (MA) model, and autoregressive moving average (ARMA) model [4].
A general linear ARMA (p, q) model can be expressed as where y t is the time series needed to be modeled, c is a constant, p is the number of autoregressive orders, q is the number of moving average orders, φ i is the autogressive coefficients, θ j is the moving average coefficients, and ε t is the independent and identically distributed terms, which are commonly assumed to have zero mean and a constant variance, or (i) Eðε t Þ ¼ 0; (ii) Eðε t ε T Þ ¼ 0 for t ≠ T ; and (iii) Eðε 2 t Þ ¼ σ 2 . An AR model and an MA model can be viewed as the special case of an ARMA model. For instance, when all autoregressive coefficients φ i equal to 0 ðφ i ¼ 0, 1 ≤ i ≤ pÞ, an ARMA model degrades to an MA model. On the other hand, when all moving average coefficients θ j equal to 0 ðθ j ¼ 0, 1 ≤ j ≤ qÞ, an ARMA model degrades to an AR model.
After selecting the most suitable time series model for the data analysis, the next step is to determine the orders of AR or MA models. The commonly adopted order parameter determination criteria for these time series models are the final prediction error (FPE) criterion, Akaike information criterion (AIC), and Bayesian information criterion (BIC).
On the other hand, the common methods used to determine the autoregressive coefficients and moving average coefficients are least squares estimation, maximum likelihood estimation, Yuler-Walker estimation, and so on.
It is worth mentioning that ARMA models are developed based on the assumption that the signal is stationary. If the signal is not stationary, some data preprocessing steps need to be adopted. For example, (1) performing a differential operation of the data term by term until the signal satisfies the stationary criterion and (2) decomposing a nonstationary signal by empirical mode decomposition (EMD) to obtain the stationary intrinsic mode functions (IMFs).

Minimum entropy deconvolution
Minimum entropy deconvolution (MED) technique is proposed originally by Wiggins, which has been successfully employed in dealing with the seismic response signal [5]. The basic idea of MED is to find an inverse filter that counteracts the effect of the transmission path [6]. It is designed to reduce the spread of impulse response functions (IRFs) and then obtain signals, which are close to the original impulses giving rise to them. The MED technique has been successfully employed in bearing fault diagnosis such as in [6]. Figure 3 illustrates the basic idea of the MED technique. In this process, an unknown impact signal x(n), which can be as highly impulsive as possible, passes through a structural filter h and then mixes with a noise e(n) to produce an intermediate output o(n). The signal o(n) then passes through an inverse (MED) filter f to produce a final output y(n). The process eliminates the structure resonance and the final output y(n) after the inverse filter needs to be as close as possible to the original input x(n).
The inverse filter f(l) can be modeled as a finite impulse response (FIR) filter with L coefficients: and f ðlÞ Ã hðlÞ ¼ δðn − l m Þ: where δ is a Dirac delta function and l m is the time delay after the MED process, which displaces the entire signal by l m but keeps the impulse spacing of the signal.
The inverse filter f(l) is implemented for the MED technique by the objective function method (OFM). The OFM is an optimization process designed to maximize the kurtosis of the output signal, y(t). OFM achieves this by changing the coefficients of the filter f(l). The kurtosis is taken as the normalized fourth-order moment given by and the maximum kurtosis of y(t) can be obtained according to f(l) for which the derivative of the objective function is zero: where the filter coefficients of f(l) can converge to a given tolerance by the iterative process [7].

Spectral kurtosis
Spectral kurtosis (SK) was first proposed in the 1980s for detecting impulsive events in sonar signals. SK was first applied in bearing fault diagnosis by Antoni [8]. The method basically computes a kurtosis at "each frequency line" in order to discover the presence of hidden nonstationarities and to indicate in which frequency band it takes place. The method has been proved to be relatively robust against strong additive noise. An SK of nonstationary signals is defined based on the Wold-Cramer decomposition, which describes any stochastic random process x(t) as the output of a causal, linear, and time-varying system [9]: where dZ x ð f Þ is an orthonormal spectral process of unit variance and H(t, f) is the time-varying transfer function interpreted as the complex envelope of x(t) at frequency f. The SK of a signal x (t) is defined as a normalized fourth-order spectral accumulation given by [8,9]: in which K x ( f) is the spectral kurtosis of signal x(t) around frequency f and 〈•〉 denotes the averaging over time.
Antoni and Randall [9] proposed two techniques to calculate the spectral kurtosis, one is based on short-time Fourier transform (STFT) (the so-called kurtogram for finding the optimal filter) and the other is based on one-third binary filter banks (fast kurtogram for on-line condition monitoring and fault diagnosis). Kurtogram is a powerful tool for the analysis of nonstationary signals in bearing fault diagnosis, though it has been reported that the technique fails to detect a bearing fault when the defect signal has low signal-to-noise ratio and contains non-Gaussian noise with high peaks [10,11].

Singular-value decomposition
Singular-value decomposition (SVD) is a numerical method which states that a matrix A of rank L can be decomposed into the product of three matrices: U (an orthogonal matrix), Λ (a diagonal matrix) and V T (the transpose of an orthogonal matrix V) [12]. This method is usually presented as where U T U ¼ 1 and V T V ¼ 1; Λ m · n is a diagonal matrix containing the square roots of eigenvalues of A T A which can be expressed as The non-zero diagonal terms σ i ði ¼ 1, 2, ⋯, lÞ together with the zero terms σ lþ1 ¼ … ¼ σ L ¼ 0 form the singular values of the matrix A.
SVD can be an effective tool for data reduction, for example, a reduction of the fault feature dimensions. It is also a useful tool for signal de-noising. The application of SVD in signal de-noising is mainly operated as follows: Suppose a bearing CM signal x(k) can be modeled as where y(k) and n(k) are respectively, the uncontaminated signal and noise. A conversion method such as phase space reconstruction can be used to transform the signal from a onedimensional vector into a two-dimensional matrix as follows: where the matrix A represents the uncontaminated data y(k), which contains characteristic fault information and the matrix N signifies the unwanted noise part n(k). From Eqs. (9) and (11), we have Λ 1 in Eq. (12) contains the significant singular values σ i ði ¼ 1, 2, ⋯, lÞ which are used to construct the uncontaminated data and Λ 0 contains small singular values σ i ði ¼ l þ 1, …, LÞ, which can be viewed as noise.
From Eq. (12), the de-noising signal matrix can be written as A is a reconstructed matrix using only the largest l number of singular values whose values are greater than a pre-set threshold value ε. The rest of the singular values are replaced by zero such that

Power spectrum
Spectrum analysis is the most popular frequency-domain analysis techniques, which transforms a time-domain data into discrete frequency components by Fourier transform. The power spectrum of a time-domain signal is defined as the square of the magnitude of the Fourier transform of a signal. It can be written as where XðωÞ ¼ ∫ þ∞ −∞ xðtÞe −jωt dt is the Fourier transform of a signal, X Ã ðωÞ is its complex conjugate and ω is the angular frequency in radian/s. The power spectrum is frequently employed to extract useful characteristic defect frequency components of a stationary CM signal.

Cepstrum
A cepstrum is defined as the power spectrum of the logarithm of the power spectrum of a signal [13]. The name of cepstrum was derived by reversing the first four letters of spectrum. There are four types of cepstra: a real cepstrum, a complex cepstrum, a power cepstrum and a phase cepstrum.
The real cepstrum of a signal x(t) is given by The complex cepstrum of a signal x(t) is given by The power cepstrum of a signal x(t) is given by The most commonly used cepstrum in machine condition monitoring is power cepstrum. The following steps can be taken to calculate the power cepstrum of a signal: A signal ! power spectrum of the signal ! logarithm of the power spectrum ! power spectrum of the data from the previous step ! inverse Fourier transform of the log power spectrum from the previous step (power cepstrum). The main application of cepstrum analysis in machine condition monitoring is for signals containing families of harmonics and sidebands where it is the whole family rather than individual frequency component characterizing the fault [13] (typical for bearing CM signals). The technical terms used in a cepstrum are quefrency, gamnitude and rahmonics corresponding to frequency, amplitude and harmonics in a spectrum analysis. The cepstrum can be used for the harmonics generated by bearing faults, but only if they are well separated. In contrast, envelope analysis to be introduced in the next section is not limited by such restriction.

Envelope spectrum
It has been pointed out by Randall [13] that the benchmark method for rolling element bearing diagnosis is envelope analysis as the spectrum of raw bearing CM signals often contains little information about bearing faults. In envelope analysis, a time waveform is bandpass filtered in a high-frequency band and the fault signal is amplified by the structural resonances and amplitude modulated to form the envelope signal for bearing diagnosis [13]. The procedures for envelope analysis are briefly described below: Given a real signal xðtÞ, its Hilbert transform, hðtÞ ¼ HfxðtÞg is defined as The real signal xðtÞ and its Hilbert transform hðtÞ together form a new complex analytic signal zðtÞ, The envelope signal EðtÞ is simply the absolute value of the analytic signal zðtÞ, After taking a fast Fourier transform on the envelope signal EðtÞ, an envelope spectrum can be obtained. An envelope spectrum can reveal the repetition characteristic defect frequencies caused by bearing faults even in the presence of a small random fluctuation. The envelope spectrum of the noise-added bearing defect signal shown in Figure 2(b) is given in Figure 4. It is noted that due to a high noise level in the signal (SNR = 0.05, representing an initial bearing defect), the envelope spectrum is compromised by many artificial frequency components and produces a subtle bearing defect information. Also due to such interference, the calculated defect frequency (the modulated frequency) of the outer race fault and its higher harmonics

Higher-order spectra
Higher-order spectra (HOS) (also known as polyspectra) consist of higher-order moment of spectra, which are able to detect nonlinear interactions between frequency components [14]. Higherorder spectra are defined as the Fourier transform of the corresponding cumulant sequences of a signal. For instance, the first-order cumulant of a stationary process is the mean of the signal: The second-and third-order cumulants of a stationary process are defined as and From Eqs. (23) and (24), it is clear that the power spectrum is, in fact, the FT of the second-order cumulant of a signal and the third-order spectrum also termed as bispectrum is the FT of the third-order cumulant. Note that the bispectrum S 3x ðω 1 , ω 2 Þ is a function of two frequencies.
Therefore, it can detect phase coupling between two frequency components which appears as a third frequency component at the sum or difference of the first two (frequencies) with a phase that is also the sum or difference of the first two.
Traditionally, power spectrum is used to break down a time waveform signal into a series of frequency components. However, power spectrum cannot determine whether peaks at harmonically related positions are phase coupling since power spectrum uses only the magnitude of Fourier components and the phase information is neglected. Higher-order spectra such as bispectrum use the phase information of the signal and are capable to detect phase coupling of frequency components in the spectra. Therefore, a bispectrum can provide additional phase information than a power spectrum analysis.
The motivation behind the use of higher-order spectrum analysis is summarized as follows. Firstly, the technique can suppress Gaussian noise in the data processing of unknown spectral characteristics for fault detection, parameter estimation and classification problems. If Gaussian noise is embedded in a non-Gaussian signal, a HOS transform can eliminate the noise. On the other hand, periodic, quasi-periodic signals and self-emitting signals from complex machinery in practical applications are typical non-Gaussian signals which will be preserved in the transform. Secondly, a HOS analysis can preserve the phase information. For example, there are situations in practice in which the interaction between two harmonic components creates a third component at their sum and/or difference frequencies. Thirdly, HOS can play a key role in detecting and characterizing the type of nonlinearity in a system from its output data.
For a better illustration of the signal-processing techniques discussed above in data analysis, a number of case studies are given in the following section to exemplify the usage of the above algorithms in bearing fault detection.

Case studies
Case 1: (AR & MED de-noising): In this case study, the AR model described in Section 3.1.1 and the MED method described in Section 3.1.2 are employed in the analysis to filter the noise-added bearing defect signal shown in Figure 2(b) prior to an envelope analysis to enhance the bearing defect frequency components in the envelope spectrum for a more reliable bearing fault diagnosis. The results are presented in Figures 5 and 6. Compared to the result in Figure 4, it is shown that the two-step de-noising by the AR and MED models has successfully suppressed the artificial components and enhanced the defect signal representation in the spectrum.
Case 2 (spectrum kurtogram): The signal used in this case study is shown in Figure 7. The signal is generated by adding 0-dB white noise to the simulated bearing defect signal presented in Figure 2(a). In the analysis of the signal, the fast kurtogram algorithm [9] described in Section 3.1.3 is first employed to determine the bearing resonance band (to obtain the center   frequency and the bandwidth) having the highest band energy (corresponding to the highest kurtosis value) in the signal. A five-level fast kurtogram based on a filter band and fast Fourier transform is shown in Figure 8 and the highest band energy is found to occur in the band encircled by the white ellipse in the figure. The band is found to be centered at 3958.33 Hz (close to the simulated bearing resonance frequency of 4000 Hz as listed in Table 2) and has the bandwidth of 416.67 Hz. The band has the highest kurtosis value of 0.1 which occurs at level 4.5 in the decomposition. The next step after the decomposition is to take an envelope analysis based on the band-filtered optimum frequency band signal obtained from the fast kurtogram and the result is shown in Figure 9(a) and (b). It is shown that the spectrum kurtosis technique can detect the characteristic defect frequency from weak-bearing defect signals.
where wðτ−tÞ is a finite time window function centered at time t. The asterisk sign (*) in the time window indicates a complex conjugate and the analysis window can be regarded as the impulse response of a low-pass filter. A spectrogram, which is the squared amplitude of an  STFT transform, is often used in the display of the transformation result for signal analysis and interpretation. A major limitation of STFT is that the analysis has a uniform resolution in both time and frequency planes implying that a small time window will have a good time resolution but poor frequency resolution and vice versa. This drawback limits the technique for the analysis of signals with slow-changing dynamic only. Another time-frequency analysis technique, wavelet transform, has then been developed to overcome this limitation.

Wavelet transform
Wavelet transform (WT) is a time scale representation of a signal. It is the inner product of a signal with the translated and scaled family of a mother wavelet function ψ(t). In general, WT analysis can be categorized into three forms: a continuous wavelet transform (CWT), a discrete wavelet transform (DWT), and a more general wavelet packet decomposition (WPD).
In CWT, the wavelet transform of a continuous signal xðtÞ is calculated using where s is the scale parameter and u is the time translation. If one considers the wavelet function ψ(t) as a bandpass impulse response, then the wavelet transform is simply a bandpass analysis. Care should be taken to ensure that the decomposed signal can be perfectly reconstructed from its wavelet representation when using a wavelet transform. Thus, a WT has to meet the criterion which is also known as the admissibility condition.
A CWT is mainly used in data analysis of scientific research. In practical applications, the discrete version, a discrete wavelet transform (DWT), is more popular due to the small computation cost and excellent signal compaction properties. A DWT decomposes a signal into different frequency bands by passing it through a series of filters. In each step of decomposition, a signal is passed through a pair of low-and high-pass filters simultaneously accompanying by down sampling. A more general form of wavelet analysis is the so-called wavelet packet decomposition (WPD). In WPD, a signal is split into two parts, one contains a vector of approximation coefficients and the other contains a vector of detail coefficients in each stage of decomposition. Both the details and the approximations can be split further in the next level of decomposition which offer a great range of possibilities to decode a signal than ordinary wavelet analysis. Wavelet transform has been widely employed in signal processing for condition monitoring and fault diagnosis of rotating machine [15].

Wigner-Ville distribution
Another popular time-frequency analysis technique is Wigner-Ville distribution (WVD). WVD is the core distribution for the quadratic class of quadratic time-frequency distributions. It yields an ideal resolution for mono-component, linearly frequency-modulated signals, but produces undesired cross-terms for multicomponent and nonlinearly frequency-modulated signals. Wigner-Ville distribution can be viewed as a particular case of the Cohen class distributions which yields a time-frequency energy density computed by correlating the signal with a time and frequency translation of itself.
The WVD of a signal xðtÞ is defined by where x* denotes the conjugate of x. Thus, the Wigner-Ville integral is in fact a Fourier transform of the inner product of a signal and its conjugate with a time delay variable τ. The bilinear nature of this procedure therefore avoids the loss of time-frequency resolution in the transform such as the resolution problem encountered when performing the finite sliding time windowing in STFT.
Compared with other distributions, the WVD has the desirable property of fulfilling the marginal condition, thus the total signal energy can be calculated in time or in frequency using the Plancherel formula: The values jjxðtÞjj 2 and jjXðωÞjj 2 can be interpreted as the energy densities in time and frequency domain, respectively. This enables a direct computation of the energy present at a given time-frequency box from the WVD output.
Another important feature of WVD is that its first conditional moment for a given time t c equals the instantaneous frequency: Therefore, it can be computed as the average of all frequencies ω present in the time-frequency plane at a time t c .
In discrete form, the WVD is defined as where R[n, q] is the instantaneous correlation given by in which n is the number of samples of the analytical or interpolated form of the discrete signal x[n] and q is an odd integer.
Since the instantaneous correlation is centered on a value, the delay q is distributed between the delayed sample x½n − q=2 and the corresponding advanced sample, ½n þ q=2 . It is thus necessary to calculate the value of x½n at the two half integer positions using an interpolation. In addition, positions at the extremes of x½n are padded with zeros in order to compute the Fourier transform. A major drawback of WVD analysis is that it can induce artifacts and negative values which need to be properly compensated in the signal analysis [16].

Adaptive signal decomposition
Empirical mode decomposition (EMD) is an adaptive time-frequency analysis technique originally proposed by Huang [17] in 1998. It is based on the local characteristic time scales of a signal and can decompose the signal into a set of complete and almost orthogonal components termed as intrinsic mode functions (IMF). Lei et al. [18] provided a review on the successful application examples of EMD technique in fault diagnosis of rotating machines.
In EMD analysis, the decomposed signal can be represented by where C i ðtÞ represents the ith IMF component and r N ðtÞ is the residual after the EMD decomposition.
The IMFs represent the natural oscillatory modes imbedded in the signal and can serve as the basis functions, which are determined by the signal itself, rather than predetermined kernels. Thus, the decomposition is a self-adaptive signal process suitable for nonlinear and nonstationary data analysis.
Although the EMD technique has been successfully employed in the analysis of nonlinear and nonstationary signals in various applications, the algorithm itself also has a number of weaknesses, for instance, a lack of a solid theoretical foundation, end effects, a sifting stop criterion and extremum interpolation. To overcome some of these deficiencies, an improved EMD algorithm or a so-called ensemble empirical mode decomposition (EEMD) technique has been developed [19]. EEMD is a noise-assisted data analysis technique which imposes a white Gaussian noise into a signal and then decomposes the mixed signal by using the EMD algorithm. A major advantage of the EEMD technique is that no missing scales will be presented in the decomposition and the IMF components in different scales of the signal are automatically projected into proper reference scales established by the white noise in the background [20].

Case studies
Case 3 (EMD de-noising): In this analysis, the noise-added signal shown in Figure 7 is adaptively decomposed into 14 IMF components and a residual component using the EMD technique described in the previous section. The decomposition result is shown in Figure 10. The correlation coefficient of each IMF component with the original signal can be calculated using where x i ði ¼ 1, 2, …, nÞ is the original data, y j, i ði ¼ 1, 2, …, nÞ is the data of one of the jth IMF components and n is the data record length. The calculated correlation coefficients for the 14 IMF components and the residual are listed in Table 3.
It is shown in the figure that the first IMF component (IMF1) has the highest correlation coefficient with the original signal implying that it is most closely related to the defect signal. Therefore, an envelope analysis is undertaken on the IMF1 component in the next step of analysis. The result is shown in Figure 11 where the bearing defect frequency and its secondorder harmonic can be clearly discriminated from the envelope spectrum.  Case 4 (Signal de-noising using SVD decomposition): The singular-value decomposition described in Section 3.1.4 can be employed to filter out the noise from bearing conditionmonitoring signals to enhance the impulses produced by a bearing defect. In this approach, the one-dimension time waveform of the bearing condition-monitoring signal x ¼ ðx 1 , x 2 , …, x n Þ (as shown in Figure 12(a)) is rearranged by an incrementing process to form a Hankel matrix Aðp, gÞ which is then decomposed into three matrices using Eq. (9) in Section 3.1.4 to obtain the singular values σ i , i ¼ 1, 2, …L, whose values are shown in Figure 13(a).  The next step is to calculate the difference between the subsequent singular values which reflects the variation of the two neighboring singular values. Vector B is plotted in Figure 13(b) for illustration. When the difference between two neighboring singular values is large, they will form a larger peak in the difference vector B (see Figure 13(b)) indicating that there is a small correlation between the defect and noise signals at the corresponding singular values.
Keeping the singular values prior to as well as the largest difference peak and letting the singular values after this peak to zero and then substituting the modified singular-value vector to Eq. (9) to obtain a new Hankel matrix (note, as the first difference peak happens to be the largest peak in our case, the singular values associated with the sequential three peaks are also used in the modified singular value vector). We can then reverse the process of the first step to obtain the de-noised bearing defect signal as shown in Figure 14(a). The envelope spectrum of the de-noised signal (Figure 14(a)) is shown in Figure 14(b). It is shown that the SVD de-noise can yield a much clean spectrum mainly containing the bearing defect frequency component and its higher harmonics.

Statistical features in the time domain
Some useful statistical features obtained directly from a time waveform signal can be used to evaluate the health condition of a rolling element bearing. Such features can be grouped into two categories: (a) dimensional features and (b) nondimensional features. The group of dimensional statistical features includes peak value, root-mean-square (RMS) value, absolute mean value and variance. This feature group is closely related to bearing fault severity, for instance, their values increase as the bearing fault severity increases. Though, care needs to be taken as their values are also influenced by the working conditions such as load or rotate speed of a machine.  For a smooth and ergodic continuous time-domain signal xðtÞ, its peak value can be calculated as Its RMS value which reflects the power level of the signal is calculated by where pðxÞ is the probability density function of the signal xðtÞ, which represents the probability level that the signal xðtÞ fall into a certain interval.
Alternatively, an approximate formula can be used to calculate the RMS value as The absolute mean value is defined as Or it can be calculated using the approximate formula: Variance is used to depict the fluctuation of a signal that deviated from the center, which can be viewed as the dynamic feature of a signal. The variance of a signal xðtÞ is Figure 14. (a) The de-noised bearing defect signal after SVD decomposition; (b) the envelope spectrum.
where μ x is the mean value and σ x is the standard deviation of the signal.
Variance can also be calculated approximately using the following formula: Table 4 lists the mathematical formula for calculating the same features for a corresponding discrete time waveform, fxðnÞjn ¼ 1, 2, ⋯, Ng.
The group of nondimensional statistical features includes crest factor, shape factor, impulsion factor, clearance factor, skewness, and kurtosis factors which are the ratio of two-dimensional statistical features. This type of features is insensitive to the change amplitude or frequency in a signal and thus is not influenced by the working condition of a machine and can accurately reflect a fault condition of a bearing. The formulas for these features are listed in Table 5: Statistical features Formula Table 4. The dimensional statistical features of a discrete time series.

Statistical features Formula
Crest factor C ¼ xp xrms

Shape factor
S ¼ xrms xav Impulsion factor In Table 5, x r is the root amplitude of a continuous time waveform, which is given by Its discrete form is given by Shape factor and impulsion factor are often used to examine whether there exists pulse shocks in a signal. Clearance factor is sometimes used to examine the wear condition of a machine. Yet, the most frequently employed nondimensional features in data analysis are the skewness and kurtosis factors. The skewness is the third statistic moment that measures the degree of asymmetric distribution of a time waveform. The kurtosis is the fourth statistic moment that measures the "Peakness" of the data distribution.

Statistical features in the frequency domain
The power spectrum depicts the power amplitude level of the frequency components in a signal. If the power amplitude levels for some of the frequency components change, the weight-averaged center frequency of a power spectrum will also change. For example, if the number of frequency components with dominant amplitude in a spectrum increases, the energy distribution will be more dispersed. On the contrary, the energy distribution will be concentrated more around the dominant frequency components if there are only a few of such components in the spectrum. Hence, the fluctuation of a signal in the frequency domain can reflect the change of machine condition by observing the change in the weight-averaged center frequency of power spectrum or the disperse degree of power amplitude level distribution.
The typical statistical featured used to depict the weight-averaged center of a power spectrum are "Frequency center" and "Mean-square frequency," which are defined as follows: and Mean-square frequency : where Sð f Þ represents the power spectrum of a continuous waveform xðtÞ.
The statistical feature used to depict the disperse degree of energy distribution in the frequency domain is the "Variance of frequency," which is defined as Accordingly, for a discrete time series fxðnÞjn ¼ 1, 2, ⋯, Ng, the three frequency-domain statistical features are given by and where Δ is the sampling frequency, SðωÞ is the power spectrum of a discrete time series xðnÞ, which can be obtained using the following formula: and XðωÞ ¼ where ω is the angular frequency.

Data complexity index
The complexity of a signal can be described by two measures: entropy and Lempel-Ziv complexity.
Entropy is a measure of randomness, suggested by Shannon in 1948 [21]. Entropy is used to depict the randomness (or "uncertainty") existed in a signal or the amount of information carried by the signal.
For a discrete random variable x with probability density function fx i ji ¼ 1, 2, ⋯, Ng, the Shannon entropy is given by where pðiÞ is the probability of the ith event of the random variable x, b is the base of the logarithm which takes the common logarithm base values of either 2, e, or 10 depending on the application. For events with a probability around 0 or 1, the term pðiÞlog b pðiÞ converges to zero and the entropy is zero. For random signals with uniform probability density function such as pure noise, the entropy is maximum.
The entropy-based techniques have been widely used in bearing fault diagnosis in the last decade. It has been shown that the entropy value is closely related to the working condition of a machine and its value decreases monotonously with aggravation of faults or conditions [22]. Entropy is often combined with other techniques to capture the detail changes of the nonlinearity and nonstationary properties of a signal in machine fault diagnosis [23,24]. Typical entropies used in machine fault diagnosis are approximate entropy, sample entropy, fuzzy entropy and permutation entropy. Yet, some of these features are still not ideal and problematic in CM applications. For example, approximate entropy is heavily dependent on the data record length which could yield lower estimation value. Sample entropy uses Heaviside step function which is mutational and discontinue at the boundary. Fuzzy entropy is calculated based on the membership function which is difficult to determine accurately. Permutation entropy requires the reconstruction of phase space though the embedding dimension and time lag of the reconstructed matrix need to be selected manually which has so far limits its application.
Ziv and Lempel [25] presented a specific complexity algorithm termed as Lempel-Ziv complexity (LZC) to calculate the complexity of a finite length time series. LZC can reflect the rate for generating the new condition pattern feature as the nonlinearity of a time series grows [26]. Its value represents the degree of random variation of a time series. In their algorithm, the complexity values of a time series is evaluated based on a ''coarse-graining'' operation by which the data sequence is transformed into a pattern of only a few symbols, for example, 0 and 1 and involves data sequence comparison and number counting in one dimension only. A flow chart of the LZC algorithm is shown in Figure 15 and the process is described below: 1. Coarse-grain process to a finite binary sequence. A discrete time series A ¼ fa 1 , a 2 , ⋯, a n g is converted into a binary sequence S N ¼ {s 1 , s 2 , ⋯, s n } in the initiation and preprocessing phase.
2. Copy and Insert. A binary sequence up to s r ð1 < r < NÞ of complexity c N can be reconstructed by simply copying and inserting some of the existing vocabulary of SQv r ¼ {s 1 , s 2 , ⋯, s v } ðv < rÞ. To check the rest string S N−r ¼ {s rþ1 , ⋯, s N } can be reproduced by the same approach, the process is executed by the following steps: Step 1: Take Q r ¼ {s r } and check whether this string belongs to the vocabulary of SQv r . If so, string Q r ¼ {s r } is a simple repetition of an existing substring of SQv r (i.e., a simple "copy" of the existing vocabulary can restore it) and the complexity remains unchanged or c N ðrÞ ¼ c N ðr−1Þ.
Step2: Read the next string and take Q rþ1 ¼ {s r , s rþ1 }. Check if Q rþ1 ¼ {s r , s rþ1 } belongs to SQv rþ1 .
Step 4: Repeat the above process until S N ¼ {s 1 , s 2 , ⋯, s n } is covered. The resulting c N ðnÞ is the complexity of a given string.
3. Normalization of the complexity value. The complexity obtained above equals the number of nullification of Q. It indicates that the complexity is affected by the length of the string, or the number of data sample N. To find a robust complexity measure, Lempel and Ziv [27] suggested a normalized measure c LZ ¼ cN ðnÞ bNðnÞ , which is termed as LZC after their names and defined by It is shown in Figure 15 that the coarse-grained process of a finite time sequence serves as the basis for the LZC algorithm. Commonly employed technique in coarse-grained process is a single-scale process which converts a discrete time series A ¼ fa 1 , a 2 , ⋯, a n g into a symbolic binary sequence S N ¼ {s 1 , s 2 , ⋯, s n } using the following formula: where a ¼ ða 1 þ a 2 þ ⋯ þ a n Þ=n is the mean value of the discrete time series which is used as a threshold in the coarse-graining preprocess.
In the single-scale coarse-grain process, all segments of a discrete time series larger than or equal to the mean value are set to 1 while all segments smaller than the mean value are set to 0. The process neglects the fluctuation between the data intervals and loses many detailed information of the data series during the process. In order to capture the details contained in a discrete time series, the division interval should be reduced and a multidivision scale should be adopted so that the variation in the data can be reflected in the binary sequence at a multiscale process. The preprocess of a revised multiscale coarse grain is outlined below: 1. Divide the discrete time series into various scales. A two-scale division process is used here as an example. In this process, a discrete time series is divided into two regions, with the mean value of each region as the boundary. The same approach can be applied to a multiscale division where a discrete time series can be divided into several regions.

2.
When the first number in the discrete time series is larger than the mean value of the entire discrete time series, this point is set to 1 or 0 vice versa. Starting from the second data point of the discrete time series, the binary value is determined by comparing its value with the value of the previous point in the discrete time series. If the two points are in the same division interval, the binary value of the latter point will be the same as the previous one. When the value of the latter point increases to another division interval, the point will be assigned to the binary value of 1. When the value of the latter point decreases into another division interval, the point will be assigned to the binary value of 0.
A data sample given by Figure 16 is used here as an example to illustrate the difference between the single-scale and multiscale coarse-grain process. For single-scale coarse-grain process, the binary sequence of the data sample is S N ¼ ð1;1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0Þ which does not reflect the fluctuation between the data interval of the time series. When a two-scale coarse-grain process is employed to process the same data, the binary sequence becomes S N =(1,1,0,1,0,1,0,1,0,1,0,1). Therefore, a multiscale coarse-grain process can better capture the detailed fluctuation in a discrete time series.

Manifold learning
Classical dimensionality reduction techniques, such as multidimensional scaling (MDS) and independent component analysis (ICA), are only applicable to linearly structured data sets but not suitable for high-dimensional, nonlinear data sets such as bearing CM data. To overcome this problem, a nonlinear dimensionality reduction technique, manifold learning, has recently developed for machine fault diagnosis [28]. Manifold learning technique projects the original high-dimensional data onto a lower-dimension feature space while preserving the local neighborhood structure to detect the intrinsic structure of nonlinear high-dimensional data. Manifold learning can be realized through several algorithms including locally linear embedding (LLE), isometric feature mapping (IsoMap), local tangent space alignment (LTSA) and local preserving projection (LPP).
The application of manifold learning in mechanical fault diagnosis can be in twofolds. Firstly, fault features with a large dimension can have many redundant components, which can increase the complexity and operation time of a fault diagnosis process. Manifold learning can be used to eliminate the redundant components and extract the nonlinear features for fault classification in this case. Secondly, manifold learning can discard the noise components and extract the intrinsic manifold features related to nonlinear dynamic of a CM signal; therefore, it can also be used as a de-noise technique. It should be noted that manifold learning only operates on a matrix, so a preprocessing such as a reconstruction of phase space converting an original one-dimensional signal into a two-dimensional data is required. For a detailed information of the manifold learning algorithm and its application in fault diagnosis, interested readers are referred to [28,29].  The most commonly employed machine-learning techniques in fault diagnosis of rotating machines are hidden Markov models (HMMs), support vector machines (SVMs) and artificial neural networks (ANNs), which exploit shallow architectures either contain a single hidden layer or without a hidden layer. Such shallow architecture-based models have achieved great success both in theory and in practical applications. Though, shortcomings of these algorithms such as poor universality, lacking of theory basis in parameter selection, easier to fall into a local optimum value have limited the application of the algorithms in machine fault diagnosis.

Deep neural network
A consensus criterion for an effective bearing fault diagnosis technique is that it should not only be able to identify various bearing fault conditions but also be able to discriminate different fault severities in each fault condition [30]. This leads to a stricter requirement on identification procedures where a classifier must have a greater capability in discriminating different fault classes. When fault data contain more than one level of fault severities in each fault condition, the accuracy of fault diagnostic result using shallow architecture classifiers described in the previous section will reduce dramatically. Hinton and Salakhutdinov [31] proposed a deep learning technique to overcome the deficiencies of single-layer architecture classifiers for a better pattern recognition capability. The concept is further extended to become a so-called deep belief networks (DBNs) [32] which relieves the training difficulties of deep network structures by adopting a layer-by-layer unsupervised forward pretraining learning and then back fine-tuning mechanism. A description of the training process of a DBN is given in Figure 17.
Bengio et al. [33] proposed a deep stack auto-encoder (SAE) network by using a network structure similar to DBNs which is stacked with a number of auto-encoder networks. Furthermore, Le Cun et al. [34] proposed a convolutional neural network (CNN), a multilayer network where each layer is composed of several two-dimensional planes, to reduce the number of parameters in the learning process using a unique weight-sharing mechanism. The above cited deep-learning-based neural network algorithms have been widely employed in machine fault diagnosis nowadays [35,36].

A case study on bearing fault diagnosis
Case 5: In this case study, a combination of a four-level wavelet packet decomposition (WVD), a locality preserving projection (LPP), a particle swarm optimization (PSO) and support vector machine (SVM) algorithms are employed for an intelligent bearing fault diagnosis and recognition. Bearing condition-monitoring data on various operation conditions such as healthy bearing, outer race fault, inner race fault and ball fault acquired from a bearing fault simulation test-rig as shown in Figure 18 are used in this analysis.
Three types of bearing defects are simulated in this experiment: an outer race fault, an inner race fault and a ball fault as shown in Figure 19. The measured vibration signals from an accelerometer mounted on the test bearing house for the four bearing operation conditions (the three simulated faults and a healthy bearing) are shown in Figure 20. Hundred data sets are acquired in the experiment which are divided into two groups: one contains 70 sets of data and is used as the training data set and the other contains 30 sets of data which is used as the test data set.   Five major steps are taken in the analysis of the bearing condition-monitoring data: 1. Each data is decomposed by a four-level wavelet packet decomposition leading to 16 components of different frequency contents (mutual orthogonal subspaces). The wavelet components of the condition-monitoring data for the outer race fault are shown in Figure 21 for illustration.   each data. The energy distribution of the 16 wavelet components for the four bearing operation conditions is shown in Figure 22.
3. For the 100 data sets (of four bearing operation conditions), the process in Step (2) will generate a 400 · 16 feature set. The dimension of the large feature set can cause a problem for the following algorithm leading to misclassification of the bearing fault classes in the diagnosis step using SVM algorithm. The locality preserving projection (LPP) [37], a linear dimensionality reduction algorithm, which can effectively reduce the dimension while preserving the neighborhood structure of a data set, is utilized to reduce the feature set dimension to 400 · 3. The three-dimensional feature distribution for the four bearing operation conditions after the dimensional reduction by LPP is shown in Figure 23.   4. Particle swarm optimization (PSO) is adopted in this step to find the optimal kernel parameters and penalty factor used in SVM algorithm from the training data set to enhance the performance of the SVM algorithm. The optimization result of the PSO is shown in Figure 23 where the fitness function of the optimization is above 99% (close to zero misclassification rate) throughout the time evolution progress.

5.
Training and prediction (recognition) of the bearing experimental data using the SVM algorithm. The prediction result by the SVM is shown in Figure 24.

Conclusion
This chapter presented a comprehensive, step-by-step approach on condition monitoring of roller element bearings aiming to provide an introductory and referencing material for engineers and researchers new to the field. It summarized the most frequently employed data acquisition, signal analysis, feature and parameter extraction and fault diagnosis techniques in the current practice. Pros and cons of each technique are briefly discussed in the text. The formulation and discussion are also supported by ample tables, graphs and figures throughout the text for a better illustration and understanding of how to utilize the techniques presented in the chapter for real-life problems.