Process Fault Diagnosis for Continuous Dynamic Systems Over Multivariate Time Series

Fault diagnosis in continuous dynamic systems can be challenging, since the variables in these systems are typically characterized by autocorrelation, as well as time variant parameters, such as mean vectors, covariance matrices, and higher order statistics, which are not handled well by methods designed for steady state systems. In dynamic systems, steady state approaches are extended to deal with these problems, essentially through feature extraction designed to capture the process dynamics from the time series. In this chapter, recent advances in feature extraction from signals or multivariate time series are reviewed. These methods can subsequently be considered in a classical statistical monitoring framework, such as used for steady state systems. In addition, an extension of nonlinear signal processing based on the use of unthresholded or global recurrence quantification analysis is discussed, where two multivariate image methods based on gray level co-occurrence matrices and local binary patterns are used to extract features from time series. When considering the well-known simulated Tennessee Eastman process in chemical engineering, it is shown that time series features obtained with this approach can be an effective means of discriminating between different fault conditions in the system. The approach provides a general framework that can be extended in multiple ways to time series analysis.


Introduction
In the process industries, advanced process control is widely recognized as essential to meet the challenges arising from the trend toward more complex, larger scale circuit configurations, plant-wide integration, and having to make do with fewer personnel. In these environments, characterized by highly automated process operations, algorithms to detect and classify abnormal trends in process measurements are critically important.
Process diagnostic algorithms can be derived from a continuum spanning first principle models on one end to entirely data driven or statistical models on the other. The latter is typically based on historical process data and is seen as the most cost effective approach to deal with complex systems. As a consequence, diagnostic methods have seen considerable growth over the last couple of decades.
Data-driven fault diagnosis can be traced back to control charts invented by Walter Shewhart at Bell Laboratories in the 1920s to improve the reliability of their telephony transmission systems. In these statistical process control charts, variables of interest were plotted as time series within statistical upper and lower limits. Shewhart's methodology was subsequently popularized by Deming and these statistical concepts, such as Shewhart control charts (1931), cumulative sum charts (1954), and exponentially weighted moving average charts were well established by the 1960s [1].
These univariate control charts do not exploit the correlation that may exist between process variables. In the case of process data, crosscorrelation is present, owing to restrictions enforced by mass and energy conservation principles, as well as the possible existence of a large number of different sensor readings on essentially the same process variable. These shortcomings have given rise to multivariate methods or multivariate statistical process control and related methods that have proliferated exponentially over the last number of years. These approaches can be viewed on the basis of the elementary operations involved in the fault diagnostic process, as outlined in Figure 1 [2].
In this diagram, (i) a data matrix ( X), representative of the process, is preprocessed or transformed to (ii) data matrix X and then mapped to (iii) a feature space (F) within some bounded region (iv) L F . These features can be used to (v) reconstruct the data (X), from which (vi) an error matrix (E) is generated, with scores again mostly confined to some bounded region L E (vii).
Fault detection and fault diagnosis are typically done in both the feature space (F) and the error space (E), based on the use of forward (ℑ) and reverse mapping (ℜ) models and suitable confidence limits L F and L E for the feature and error spaces. Alternatively, forward mapping in to the feature space can be only used for process monitoring.
Preprocessing of the data prior to fault diagnosis has received considerable attention over the last decade or so as a basis for the development of methods that can deal with nonlinearities in the data, lagged variables and unfolding of higher dimensional data. These approaches will mostly be discussed in the second part of the chapter.

Steady state systems
Linear steady state Gaussian processes and the use of principal component analysis will first be considered as an example on the basis of this general framework, after which other methods proposed over the last few decades will be reviewed.
As mentioned in the previous section, univariate control charts do not exploit the correlation that may exist between process variables and when the assumptions of linearity, steady state, and Gaussian behavior hold, multivariate statistical process control based on the use of principal component analysis can be used very effectively for early detection and analysis of any abnormal plant behavior. Since principal component analysis plays such a major role in the design of these diagnostic models, a brief outline of the methodology is in order.
Analysis, monitoring, and diagnosis of process operating performance based on the use of principal components is well established. The basic theory can be summarized as follows, where X ∈ R NxM comprises the data matrix representative of the process with M variables and N observations, S is the covariance matrix of the process variables typically scaled to zero mean and unit variance, P is the loading matrix of the first k < M principal components, Λ is a diagonal matrix containing the k eigenvalues of the decomposition, e P is the loading matrix of the M À k remaining principal components, and e Λ is a diagonal matrix containing the M À k remaining eigenvalues of the decomposition. The T 2 and Q-diagnostics (Eqs. 2 and 3) are commonly used in process monitoring schemes.
In classical multivariate statistical process control based on principal component analysis, the control limits required for automated process monitoring are based on the assumption that the data are normally distributed. The α upper control limit for T 2 is calculated from N observations based on the F-distribution, that is, Then upper control limit for Q is calculated by means of the χ 2 distribution as: The standard normal deviates, c α corresponding to the upper (1Àα) percentile, while M is the total number of principal components (variables). The residual Q i is more likely to have a normal distribution than the principal component scores, since it is a measure of the nondeterministic behavior of the system.

Unsteady state systems
Unlike steady state systems, unsteady state or dynamic systems show time dependence. This time dependence implies the presence of autocorrelation and/or nonstationarity [3]. Autocorrelation arises when the observations within a time series are not independent, while nonstationarity means that the parameters governing a process change over time, for example, the mean, covariance or other higher order statistics. Therefore, in principle at least, these systems cannot be treated directly by the methods dealing with steady state systems.
Broadly speaking, methodologies dealing with dynamic process systems are all aimed at dealing with the issues arising from the time dependence of the data. Essentially, these approaches are based on the analysis of a segment of the time series data, as captured by a fixed or a moving window, as indicated in Figure 2. The time series segment amounts to observation of the process over a time interval, and the window length should be sufficient to capture the dynamics of the systems.
Dynamic process monitoring can be as simple as monitoring the mean or the variance of a signal, in which case, a test window as shown in Figure 2 would not be required, and model maintenance would not be an issue. In more complex systems, as could be characterized by large multivariate sets of signals or high-dimensional signals, such as streaming video or hyperspectral data, feature extraction is often model-based. That is, a model derived from the data in the base window is applied to the data in the test window. For example, principal component models can be used for this purpose.
Where models are used and the nature of the signals changes as a result of process drift, recalibration of the models need to be done either at regular intervals or episodically, that is, when a change occurs. Some models, such as those based on principal and independent components can be updated recursively, as discussed in more detail in Sections 4 and 5. Alternatively, the model is updated ab initio at regular intervals.
Moreover, most feature extraction methods are unsupervised, that is, the time series data are unlabeled. Where supervised methods are used, features are extracted based on their ability to predict some label, such as the future evolution of the time series.

Unsupervised feature extraction
In principle, any low-dimensional representation of the time series data would constitute a feature set, that is, the data in the time series window, X ∈ R NxM containing N measurements of the M plant variables with time lagged copies of these variables. These features can subsequently be dealt with by the same methods used for steady state systems, such as principal component analysis, independent component analysis, kernel methods, etc., some of which are considered in more detail below.

Dynamic principal component analysis (DPCA)
In dynamic PCA, first proposed by Ku et al. [4], the PCA model is built on the data matrix X residing in the window, to account for auto-and crosscorrrelation between variables. This approach implicitly estimates the autoregressive structure of the data (e.g., [5]). As functions of the model, the T 2 and Q-statistics will also be functions of the lag parameters. Since the mean and covariance structures are assumed to be invariant, the same global model is used to evaluate observations at any future time point.
Although dynamic PCA is designed to deal with autocorrelation in the data, the resultant score variables will still be autocorrelated or even crosscorrelated when no autocorrelation is present [4, 6]. These autocorrelated score variables have the drawback that they can lead to higher rates of false alarms when using Hotelling's T 2 statistic.
Several remedies have been proposed to alleviate this problem, for example, wavelet filtering [7], ARMA filtering [6], and the use of residuals from predictive models [8]. Nonlinear PCA models have been considered by several authors [9-13].

Stefatos and Hamza [14]
and Hsu et al. [15] have introduced diagnostic methods using an approach based on dynamic independent component analysis capable of accurately detecting and isolating the root causes of individual faults. Nonlinear variants of these approaches have been investigated by Cai et al. [16], who have integrated the kernel FastICA algorithm with a manifold learning method known as locality preserving projection. Moreover, kernel FastICA was used to integrate FastICA and kernel PCA to exploit the advantages of both algorithms, as indicated by Zhang and Qin [17], Zhang [18], and Zhang et al. [19].

Slow feature analysis
Slow feature analysis [20] is an unsupervised learning method, whereby functions g x ð Þ are identified to extract slowly varying features y t ð Þ from rapidly varying signals x t ð Þ. This is done virtually instantaneously, that is, one time slice of the output is based on very few time slices of the input. Extensions of the method have been proposed by other authors [21][22][23].

Multiscale methods
Multiscale methods can be seen as a complementary approach preceding feature extraction from the time series. In this case, each process variable is extended or replaced by different versions of the variable at different scales. For example, with multiscale PCA, wavelets are used to decompose the process variables under scrutiny into multiple scale representations before application of PCA to detect and identify faulty conditions in process operations. In this way, autocorrelation of variables is implicitly accounted for, resulting in a more sensitive method for detecting process anomalies. Multiscale PCA constitutes a promising extension of multivariate statistical process control methods, and several authors have reported successful applications thereof [24][25][26][27].

Wavelets
Bakshi [28,29] has proposed the use of a nonlinear multiscale principal component analysis methodology for process monitoring and fault detection based on multilevel wavelet decomposition and nonlinear component extraction by the use of input-training neural networks. In this case, wavelets are first used to decompose the data into different scales, after which PCA was applied to the reconstituted time series data. Choi et al. [30] have proposed nonlinear multiscale multivariate monitoring of dynamic processes based on kernel PCA, while Xuemin and Xiaogang [31] have proposed an integrated multiscale approach where kernel PCA is used on measured process signals decomposed with wavelets and have also proposed a similarity factor to identify fault patterns.

Singular spectrum analysis
With SSA, the time series is first embedded into a p-dimensional space known as the trajectory matrix. Singular value decomposition is then applied to decompose the trajectory matrix into a sum of elementary matrices [32][33][34], each of which is associated with a process mode.
Subsequently, the elementary matrices that contribute to the norm of the original matrix are grouped, with each group giving an approximation of the original matrix. Finally, the smoothed approximations or modes of the time series are recovered by diagonal averaging of the elementary matrices obtained from decomposing the trajectory matrix. Although SSA is a linear method, it can readily be extended to nonlinear forms, such as kernel-based SSA or SSA with autoassociative neural networks. Nonetheless, it has not been used widely in statistical process monitoring as yet, although some studies have provided promising results [2, 35,36]. Table 1 gives a summary of multiscale methods that have been considered in process monitoring schemes over the last two decades.

Phase space methods
Phase space methods rely on the embedding of the data in a so-called phase space, by the use of delayed vector methods, that is, . Embedding can also be done by the use of principal components or singular value decomposition of X ∈ R NÀmþ1 ð Þ Â m , where k ¼ 1 and m is comparatively large. In the latter case, the scores of the eigenvectors would represent an orbit or attractor with some geometrical structure, depending on the frequencies with which different regions of the phase space are visited. of this attractor is a direct result of the underlying dynamics of the system being observed, and the changes in the topology are usually an indication of a change in the parameters or structure of the system dynamics. Therefore, descriptors of the attractor geometry can serve as sensitive diagnostic variables to monitor abnormal system behavior.

Phase space attractor descriptors
For process monitoring purposes, the data captured in a moving window are embedded in a phase space, and descriptors such as correlation dimension [49][50][51], Lyapunov exponents, and information entropy [49] have been proposed to monitor deterministic or potentially chaotic systems. These approaches have not found widespread adoption in the industry yet, since the reliability of the descriptors may be compromised by high levels of signal noise.

Complex networks
Process circuits or plants lend themselves naturally to representation by networks and process monitoring schemes can exploit this. For example, Cai et al. [52] have essentially considered a lagged trajectory matrix in the form of a complex network, whereby the variables and their lagged versions served as network vertices. The edges of the network were determined by means of kernel canonical correlation analysis (a nonlinear approach to correlation relationships between sets of variables). Features were extracted from the variables based on the dynamic average degree of each vertex in the network. A standard PCA model, as described in Section 1.1 was consequently used to monitor the process. Case studies have indicated that this could yield considerable improvement in the reliability of the model to detect process disturbances.

Local recurrence quantification analysis
Any given sequence of numbers or time series can be characterized by similarity matrix containing measures of similarity (e.g., Euclidean distances) between all pair-wise points in the time series. A recurrence matrix is generated by binary quantization of the similarity matrix, based on a user specified threshold value. This thresholded matrix can be portrayed graphically as a recurrence plot, amenable to qualitative interpretation. The recurrence matrix, consisting of zeros and ones, can also be used as a basis to extract features that are representative of the dynamic behavior of the time series. This approach is widely referred to as recurrence quantification analysis, and in process engineering, it has mainly been used in the description of electrochemical phenomena and corrosion [53-58], but in principle has general applicability to any dynamic system.

Global recurrence quantification analysis
More recent extensions of recurrence quantification analysis have been considered by using the unthresholded similarity matrix as a basis for feature extraction. This is also referred to as global, as opposed to (local) recurrence quantification described in Section 2.5.3. The resulting recurrence plot can consequently be treated as an artificial image amenable to analysis by a large variety of algorithms normally applied to textural images, as discussed in more detail in Section 4.

Autoregressive models
Autocorrelated data can be addressed by fitting models to the data and analyzing the residuals, instead of the variables. With ARIMA models, crosscorrelation between the variables is not accounted for, and although multivariate models can also be employed using this approach, it becomes a complex task when there are many variables (m > 10), owing to the high number of parameters that must be estimated, as well as the presence of crosscorrelation [3,59].

State space models
If it is assumed that the data matrix X contains all the dynamic information of the system, then the use of predictive models can be viewed as an attempt to remove all the dynamic information from the system to yield Gaussian residuals that can be monitored in the normal way. State space models offer a principled approach for the identification of the subspaces containing the data. This can be summarized as follows x k and y k are the respective state and measurement vectors of the system, and w k and v k are the plant disturbances and measurement errors, respectively, at time k. State space models and their variants have been considered by several authors [65][66][67][68][69][70][71][72][73][74][75].

Machine learning models
In principle, machine learning models are better able to deal with complex nonlinear systems than linear models, and some authors have considered the use of these approaches. For example, Chen and Liao [62] have used a multilayer perceptron neural network to remove the nonlinear and dynamic characteristics of processes to generate residuals that could be used as input to a PCA model for the construction of simple monitoring charts. Guh and Shiue [63] have used a decision tree to detect shifts in the multivariate means of process data. Auret and Aldrich [48] have considered the use of random forests in the detection of change points in process systems. In addition, Aldrich and Auret [2] have compared the use of random forests with autoassociative neural networks and singular spectrum analysis in a conventional process monitoring framework.
The application of deep learning in process monitoring is an emerging area of research that shows particular promising. This includes the use of stacked autoencoders [76], deep long short term memory (LSTM) neural networks [77], and convolutional neural networks. Table 2 gives an overview of the feature extraction methods that have been investigated over the last few decades.

Case study: Tennessee Eastman process
Finally, as an example of the application of a process monitoring scheme incorporating feature extraction from time series data in a moving window, the following study can be considered. It is based on the Tennessee Eastman benchmark process widely used in these types of studies. The feature extraction process considered here is an extension of recurrent quantitative analysis discussed in Section 2.5.2. Instead of using thresholded recurrence plots, unthresholded or global recurrence plots are considered, as explained in more detail in below.

Tennessee Eastman process data
The Tennessee Eastman (TE) process as proposed by Downs and Vogel [97] and has been used as a benchmark in numerous process control and monitoring studies [98]. It captures the dynamic behavior of an actual chemical process, the layout of which is shown in Figure 3.
The plant consists of 5 units, namely a reactor, condenser, compressor, stripper and separator, as well as eight components (four gaseous reactants A, C, D, E, one inert reactant B, and three liquid products F, G, H) [97]. In this instance, the plantwide control structure suggested by Lyman and Georgakis [99] was used to simulate the process and to generate data related to varying operating conditions. The data set is available at http://web.mit.edu/braatzgroup.
A total of four data sets were used, that is, one data set associated with NOC and the remaining three associated with three different faults conditions. The TE process comprises 52 variables, of which 22 are continuous process measurements, 19 are composition measurements, and the remaining 11 are manipulated variables. These variables are presented in Table 3. Each data set consisted of 960 measurements sampled at 3 min intervals.
The NOC samples were used to construct an off-line process monitoring model that consisted of a moving window of length b, moving sliding along the time series with a step size s. The three fault conditions are summarized in Table 4. Fault conditions 3, 9, and 15 are the most difficult to detect, and many fault diagnostic approaches fail to do so reliably.
In this case study, the approach previously proposed by Bardinas et al. [96] is applied to the three fault conditions in the TE process. The methodology can be briefly summarized as shown in Figure 4.
A window of user defined length b slides along the time series (A) with a user defined step size s, yielding time series segments (B), each of which can be represented by a similarity matrix (C) that is subsequently considered as an image from which features can be extracted via algorithms normally used in multivariate image analysis (D).

Feature extraction
Two sets of features were extracted from the similarity or distance matrices, namely features from the gray level co-occurrence matrices of the images, as well as local binary pattern features, as briefly discussed below.

Gray level co-occurrence matrices (GLCMs)
GLCMs assign distributions of gray level pairs of neighboring pixels in an image based on the spatial relationships between the pixels. More formally, if y i; j ð Þ is an element of a GLCM associated with an image I of size R Â S, having L gray levels, then y i; j ð Þ can be defined as   where r; s ð Þ and r þ Δr; s þ Δs ð Þ denote the positions of the reference and neighboring pixels, respectively. From this matrix, various textural descriptors can be defined. Only four of these were used, as defined by Haralick et al. [100], namely contrast, correlation, energy, and homogeneity.

Local binary patters (LBPs)
LBPs are nonparametric descriptors of the local structure of the image [101]. The LBP operator is defined for a pixel in the image as a set of binary values obtained by comparing the center pixel intensity with its neighboring pixels. If the neighboring pixel exceeds the intensity of the center pixel value, this pixel is set to 1 (otherwise 0). Formally, given the central pixel's coordinates x 0 ; y 0 À Á , the resulting LBP can be obtained in the decimal form as where the gray level intensity value of the central pixel is i 0 , that of its p'th neighbor is i p . Moreover, the function s • ð Þ is defined as

Selection of window length and step size
Apart from the selection of a feature extraction method, one of the main choices that need to be made in the process monitoring scheme is the length of the sliding window. If this is too small, the essential dynamics of the time series would not be captured. On the other hand, if it is too large, it would result in a considerable lag before any change in the process can be detected. There is also the possibility that transient changes may go undetected altogether. In the case of a moving window, the step size of the moves also needs to be considered. The selection of these two parameters can be done by means of a grid search, and the results of which are shown in Figure 5.
As indicated in Figure 5, the optimal window size was b = 1000 and the step size was s = 20 for both the GLCM and LBP features that were used as predictors. With these settings, the 500-tree random forest model [102] was able to differentiate between the normal operating conditions and the three fault classes with a reliability of approximately 82%. In Figure 6, principal component score plots of the two optimal feature sets are shown. The large LBP feature set could not be visualized reliably, as the first two principal components could only capture 52.5% of the variance of the features. The variance in the smaller GLCM feature set could be captured with high reliability by the first two principal components of the four features. Here, the differences between the normal operating data ("0" legend) and the other fault conditions are clear ("3," "9," and "15").

Discussion
The approach outlined in Section 2.5.4. considered in more detail in the above case study is an extension of recurrent quantification analysis with the advantage that information is not lost when the similarity matrix of the signal is thresholded. Also, while thresholding does not preclude the use of a wide range of feature extraction algorithms, recurrent quantification has mostly been applied to dynamic systems based on a set of engineered features that allow some modicum of physical interpretation.
In most diagnostic systems, this is not essential and therefore more predictive feature sets may be constructed. These features could be engineered, as was considered in the case study or they could be learned, by taking advantage of stateof-the-art developments in deep learning.  In addition, the following general observations can be made not only with regard to the approach considered in this case study but also to other approaches reviewed in this chapter.
• Most of the nonlinear approaches used in steady state systems can be used in dynamic systems, and as a consequence, principal and independent component analysis and kernel methods have figured strongly in recent advances in dynamic process monitoring.
• With the routine acquisition of ever larger volumes of data and more complex processing, it can be expected that the field will continue to benefit from advances in machine learning. The application of deep learning methods in particular is a highly promising emerging area of research.
• Likewise, dynamic process monitoring is also likely to continue to benefit from closely related fields, such as process condition monitoring, structural health monitoring, change point detection, and novelty detection in other engineering or technical systems.
• As with steady state process monitoring, fault identification has received comparatively little attention to date.

Conclusions and future work
Data-driven fault diagnosis of dynamic systems has advanced considerably over the last decade or more. In this chapter, the large variety of algorithms currently available has been discussed in terms of a feature extraction problem associated with the data captured by sliding a window across the time series or in some cases making use of a fixed window. These features could be used in statistical process monitoring frameworks that are well established for steady state systems.
In addition, extension of a recent approach to nonlinear time series analysis, namely recurrence quantification analysis, has been considered and shown to be an effective means of monitoring dynamic process systems, such as represented by the Tennessee Eastman benchmark problem in chemical engineering.
As mentioned in Section 4.4., a wide range of feature extraction algorithms can be used with unthresholded or global recurrence quantification analysis. In future work, the application of convolutional neural networks to extract features from global recurrence plots will be considered. This does not necessarily require a large amount of data, as pretrained networks, such as AlexNet, ResNet, and VGG architectures, and others could possibly be used as is, in what would essentially be a texture analysis problem, similar to the work done by Fu and Aldrich [103,104] in the recognition of flotation froth textures, for example.

Conflict of interest
The author declares no conflict of interest in this contribution.