Open access

Diagnosis of Intermittent Faults and its dynamics

Written By

A. Correcher, E. Garcia, F. Morant, E. Quiles and L. Rodriguez

Published: March 1st, 2010

DOI: 10.5772/9507


1. Introduction

Intermittent faults (IFs) are difficult to diagnose and may cause great disruption in industrial processes. Most IFs are related to the gradual degradation of components or systems. For instance, the evolution of connection failures is shown in Fig. 1 (Correcher et al., 2004), (Sorensen et al., 1998). Connection failures are rarely repaired, so their behaviour worsens over time. In stage 1 of their development, intermittent faults behave as small noise fluctuations. As the amplitude and duration of the fluctuations increase (stage 2), IFs start to occur. The effects of IFs are severe in stage 3.

Therefore, in many instances, the occurrence of IFs in a device is a prelude to permanent failures (PFs). In these cases, if IFs can be detected, appropriate actions can be taken to minimise the economic impact.

In (Correcher et al., 2004) the authors presented an IF diagnosis tool. This tool was able to diagnose the failure and recovery events in a system with IFs. This chapter extends the work in (Correcher et al., 2004) to include not only event detection but also fault dynamics detection, defined as the evolution of IF occurrence over time. Other approaches to IF diagnosis (Contant et al., 2004), (Jiang et al., 2003) do not consider IF dynamics.

Figure 1.

Connection IF evolution through device life.

However, the existence of IF dynamics is experimentally shown in (Sorensen et al., 1998) and in the destructive tests presented in this paper. Therefore, we can introduce the diagnosis problem to be addressed.

Definition 1: IF diagnosis problem.

Starting from a temporal input (U) and output (Y) sequence, obtained by means of sensors in the process, compute the presence of any failure f (with f included in a failure set), its recoveries and its dynamics.

The proposed solution has a clear industrial interest, as it not only diagnoses IFs but also provides valuable information on the fault evolution for maintenance purposes. First, section 2 presents a study of the evolution of the IF during its on-line diagnosis. This approach is able to extract some characteristic parameters of the IF. These parameters are useful for estimating the future behaviour of the fault. The approach is also applied to experimental data.

Section 3 analyses the effectiveness of the approach in the solution of the problem stated in definition 1 when diagnosing Discrete Event Systems (DES) (Cassandras & Lafortune, 1999). Section 4 presents an application of IF dynamics diagnosis with Coloured Petri Nets. Finally, we present some conclusions in section 5.


2. Temporal modeling

IF dynamics characterization generates useful information for preventive maintenance scheduling. Two complementary parameters are defined in this section to characterize the IF dynamics: temporal failure density (DF) and pseudoperiod (Ps). Both can be computed on-line, and both can be used to predict the future behaviour of the faulty device.

DF and Ps are computed from the IF occurrence time and the IF duration (defined as the time difference between fault and recovery). Therefore, two arrays are computed for each fault: the fault time vector ($FT_{F_j} = [FT^{(1)}_{F_j}, FT^{(2)}_{F_j}, \ldots, FT^{(n)}_{F_j}]$) and the duration time vector ($T_{F_j} = [T^{(1)}_{F_j}, T^{(2)}_{F_j}, \ldots, T^{(n)}_{F_j}]$), where $F_j$ stands for a fault included in the fault set and "n" is the number of detected faults. Arrays and parameters are computed recursively on-line over a moving time window of a given duration.

2.1. Temporal failure density

Temporal failure density (DF or density in the rest of the chapter) is defined as the average time a particular fault (Fj) is active within a sliding time window of duration W. If the current time is "ti", the density is computed over the interval from "ti-W" to "ti". DF at time "ti" is:

$$DF_{t_i} = \frac{\sum_{i=k}^{CNT} T^{(i)}_{F_j} + T_A}{W}; \quad CNT \geq 0 \tag{1}$$

where CNT is the number of faults inside the window and "k" is the index of the first fault detected inside the window, {k: $FT^{(k)}_{F_j} > (t_i - W)$ and $FT^{(k-1)}_{F_j} < (t_i - W)$}, if it exists; otherwise {k = CNT+1}. "T A" accounts for the duration of a fault that occurred before "ti-W" and remains active inside the window. Therefore:

$$T_A = FT^{(k-1)}_{F_j} + T^{(k-1)}_{F_j} - (t_i - W) \tag{2}$$

Equation (2) is valid only if "T A " is positive, otherwise "T A =0", as this fact would indicate that the first considered fault is completely outside of the window.
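As an illustration, equations (1) and (2) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `failure_density` and its arguments are hypothetical, fault times are assumed to be sorted, and every fault that starts inside the window is assumed to have recovered before "ti".

```python
from bisect import bisect_right


def failure_density(fault_times, durations, t_i, window):
    """Temporal failure density DF at time t_i (sketch of equations (1)-(2)).

    fault_times: sorted fault start times FT for one fault Fj (assumption).
    durations:   matching fault durations T.
    """
    lower = t_i - window
    # Only faults detected up to t_i are known on-line.
    n = bisect_right(fault_times, t_i)
    # k: index of the first fault that starts strictly inside the window.
    k = bisect_right(fault_times, lower, 0, n)
    active = sum(durations[k:n])
    # T_A: part of fault k-1 that is still active inside the window,
    # set to zero when that fault ended before the window started.
    t_a = 0.0
    if k > 0:
        t_a = max(0.0, fault_times[k - 1] + durations[k - 1] - lower)
    return (active + t_a) / window


# Window [10, 20]: the fault at t=9 contributes 2 time units (T_A),
# the fault at t=12 contributes its full duration of 2.
print(failure_density([2.0, 9.0, 12.0], [1.0, 3.0, 2.0], t_i=20.0, window=10.0))
```

The binary search keeps the per-sample cost low when the fault history is long, which matters if the density is refreshed at every sampling time.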

In a real system, DF tends to increase with time; thus confirming the hypothesis that IFs progressively damage the faulty device. Figures 2 and 3 show the computed DF from experimental data and its filtered signal (low-pass fourth order Butterworth filter). The experimental data has been obtained from ten million operation tests on relays switching a resistive load. As the time between operations was 100 milliseconds, the overall duration of each experiment was 277.8 hours.

Figure 2.

Fault density. Window size is 10000 operations.

Figure 3.

Fault density. Window size is 100000 operations.

The rising characteristic of the density can be used to estimate the optimal time to repair or substitute the faulty device. Effectively, a certain maximum density threshold can be defined as the limit for unacceptable behaviour. Then, an adequate extrapolation model can be used to predict the number of operations the device is capable of carrying out before reaching the unacceptable behaviour limit. Obviously, the unacceptable density threshold should be defined specifically for each process device, depending on its specific functionality and reliability requirements.

The simplest prediction model assumes a linear increase in density. In this case, filtered density data can be treated with classical techniques (Leemis, 1995) such as least squares (LS) or recursive least squares (RLS). Therefore, we obtain a model:

$$D = m_{t_i}\, t + n_{t_i} \tag{3}$$

where D stands for the density, t stands for the time and the subindex ti stands for the time at which the model is estimated. If we consider a threshold "D0", the time (tD0) when the density "D0" will be reached is:

$$t_{D_0} = \frac{D_0 - n_{t_i}}{m_{t_i}} \tag{4}$$

Therefore, "t D0 " is the time when the faulty device should be replaced. This time is called the Linear Substitution Time at time "ti" (LST ti ). In addition, it is possible to define another parameter that is more suitable for preventive maintenance: Operations to Replacement at time "ti" (OTR ti ). This parameter estimates the useful operations left on a device, and can be computed as:

$$OTR_{t_i} = LST_{t_i} - t_i \tag{5}$$

Obviously, only positive values of OTR ti are meaningful, otherwise the corresponding OTR ti is considered equal to zero.
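Equations (3) to (5) can be illustrated with a batch least-squares fit. The sketch below is only an illustration under stated assumptions: the function names and the sample data are hypothetical, the density samples are assumed to be already filtered, and a non-rising density is reported as `None` because the linear model then predicts no threshold crossing.

```python
def linear_substitution_time(times, densities, d0):
    """Fit D = m*t + n by least squares and return the time at which D reaches d0."""
    count = len(times)
    mean_t = sum(times) / count
    mean_d = sum(densities) / count
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, densities))
    m = sxy / sxx
    n = mean_d - m * mean_t
    if m <= 0:
        return None  # density is not rising: no crossing is predicted
    return (d0 - n) / m  # equation (4)


def operations_to_replacement(lst, t_i):
    """Equation (5), clamped at zero as only positive values are meaningful."""
    return max(0.0, lst - t_i)


# Hypothetical filtered density samples growing by 0.01 per time unit.
times = [0.0, 1.0, 2.0, 3.0]
densities = [0.01, 0.02, 0.03, 0.04]
lst = linear_substitution_time(times, densities, d0=0.15)
print(lst, operations_to_replacement(lst, t_i=3.0))
```

When time is measured in operations, as in the relay experiments, the returned values are directly numbers of operations.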

The proposed linear prediction model has been found to be suitable for the experimental data used throughout the chapter. However, this kind of model might not be adequate for other devices. For instance, if the fault density follows first order dynamics, then LST ti will predict an optimistic substitution time. This problem can be solved by using RLS with a forgetting factor to fit LST ti . In any case, LST ti and OTR ti reflect the underlying fault dynamics, and could be used to model systems that do not follow linear increase laws.
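A recursive least squares estimator with a forgetting factor for the two parameters of the linear density model can be sketched as follows. This is the standard textbook RLS recursion written out for a two-parameter model, not code from the chapter; the class name, the forgetting factor value and the initial covariance are assumptions.

```python
class ForgettingRLS:
    """RLS with forgetting factor for the model D = m*t + n (two parameters)."""

    def __init__(self, lam=0.98, delta=1e6):
        self.lam = lam              # forgetting factor, 0 < lam <= 1 (assumed value)
        self.theta = [0.0, 0.0]     # current estimate [m, n]
        self.P = [[delta, 0.0], [0.0, delta]]  # large initial covariance

    def update(self, t, d):
        phi = [t, 1.0]              # regressor for D = m*t + n
        P = self.P
        p_phi = [P[0][0] * phi[0] + P[0][1] * phi[1],
                 P[1][0] * phi[0] + P[1][1] * phi[1]]
        denom = self.lam + phi[0] * p_phi[0] + phi[1] * p_phi[1]
        gain = [p_phi[0] / denom, p_phi[1] / denom]
        # Prediction error of the current model on the new sample.
        err = d - (self.theta[0] * phi[0] + self.theta[1] * phi[1])
        self.theta = [self.theta[0] + gain[0] * err,
                      self.theta[1] + gain[1] * err]
        # Covariance update: P = (P - gain * phi^T * P) / lam
        phi_p = [phi[0] * P[0][0] + phi[1] * P[1][0],
                 phi[0] * P[0][1] + phi[1] * P[1][1]]
        self.P = [[(P[r][c] - gain[r] * phi_p[c]) / self.lam for c in range(2)]
                  for r in range(2)]
        return self.theta


rls = ForgettingRLS()
for step in range(50):
    m, n = rls.update(float(step), 0.01 * step + 0.01)
print(m, n)
```

A forgetting factor below one weights recent samples more heavily, so the fitted slope can follow a density curve that flattens or steepens over time.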

Therefore, the fault diagnosis system will predict the time when a device must be replaced in two stages. First, fault density will be computed and, from this value, LST ti and OTR ti will be predicted.

The sliding window size should be chosen carefully: short windows imply high variability and noise in the calculated failure density, whereas long windows, which exhibit a greater filtering effect, involve high computational costs and might mask part of the underlying fault dynamics.

Figures 2 and 3 show the effect of window size in the variability of the density. From calculations with a range of window sizes, it has been found that windows greater than 5000 operations include the same low frequency component, as shown in figs. 2 and 3. Therefore, LST ti and OTR ti computed from window sizes greater than 5000 operations are identical. Figure 4 shows the evolution of OTR ti obtained from the experiments in figs 2 and 3. Figure 4 shows that the device should be replaced when reaching 6 million operations instead of the substitution time recommended by the manufacturer (10 million operations).

Figure 4.

OTR ti for an acceptable density threshold below 15%.

2.2. Pseudoperiod.

Fault density can be used to predict the device substitution time. However, it does not completely explain IF dynamics. For instance, fig. 5 shows two cases with exactly the same failure density whose effects on the device are clearly different.

Figure 5.

IF with the same density but different dynamics.

The difference between the two behaviours can be modelled by the time difference between the occurrences of two consecutive faults.

The time difference can be measured with a new parameter, the Pseudoperiod (Ps). Ps is defined as the average time difference between faults inside a sliding window. Moreover, Ps is normalized by the number of faults, and can be computed at time "ti" as:

$$Ps_{t_i} = \frac{\sum_{i=j}^{k} \left( FT_{i+1} - FT_i \right)}{k - j + 1} \tag{6}$$

where "ti" stands for current time, "FT" is the detection time for fault "i", and "j", "k" are the first and last fault indexes in the window, respectively.
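The pseudoperiod can be sketched in Python as the mean gap between consecutive faults detected inside the window. This sketch makes one assumption that the text does not state explicitly: "k" is read as the last index for which the following fault is still inside the window, so the sum in equation (6) only uses faults already detected. The function name is hypothetical.

```python
def pseudoperiod(fault_times, t_i, window):
    """Mean time between consecutive faults inside the window (t_i - window, t_i]."""
    in_window = [ft for ft in fault_times if t_i - window < ft <= t_i]
    if len(in_window) < 2:
        return None  # fewer than two faults: no interval to average
    gaps = [later - earlier for earlier, later in zip(in_window, in_window[1:])]
    return sum(gaps) / len(gaps)


# Faults at 1, 3, 6 and 10 fall inside the window; the fault at 50 is in the future.
print(pseudoperiod([1.0, 3.0, 6.0, 10.0, 50.0], t_i=12.0, window=12.0))
```

Two fault histories with the same density, for example one long fault and many short ones, give different pseudoperiods, which is the distinction this parameter is meant to capture.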

Pseudoperiod is clearly a magnitude related to mean time between failures (MTBF), commonly used to model the reliability of repairable systems. Moreover, pseudoperiod is a dynamic magnitude. Therefore, we can compute a Ps curve for any IF. This curve can be used to predict the substitution time of the device.

The evolution of the Pseudoperiod (figs. 6 and 7) shows an increase up to a maximum value, followed by a decrease towards a value close to zero. This dynamic behaviour is consistent with the nature of IFs (fig. 1). The computed Pseudoperiod remains in the range 600 to 800 from 4 million operations onwards. Therefore, it is possible to conclude that, on average, the number of failures remains reasonably constant from the 4 millionth operation onwards. Moreover, the average duration of each failure slowly increases with the number of operations, as the density (fig. 3) increases.

Figure 6.

Pseudoperiod. Window size is 100000 operations.

Figure 7.

Zoom in Pseudoperiod. Window size is 100000 operations.

Pseudoperiod can be used to compute a limit on the desirable behaviour of the system. The shape of the Pseudoperiod signal suggests estimating this limit by RLS with a forgetting factor. Another solution could be a mixed model with a polynomial estimation for one side of the signal and a linear estimation for the other. This last approach seems more promising, but computing it is not trivial: continuity and differentiability must be guaranteed. Moreover, the order of the filter applied to the Pseudoperiod signal will affect the delay of the model. These problems will be addressed in future work.


3. Intermittent fault dynamics diagnosability

It is necessary to ascertain whether the proposed dynamic parameters can be used to perform the complete fault diagnosis as per definition 1. To complete definition 1, a definition of fault dynamics diagnosis is introduced, based on the definition of discrete-time system observability (Smolensky et al., 1996): "A discrete-time system is observable if a finite k exists such that knowledge of the outputs to k-1 is sufficient to determine the initial state of the system."

Definition 2. IF dynamics diagnosis problem.

Given a temporal fault sequence, $TF = \{tf_i\}\ (i = 0 \ldots n)$, and a temporal recovery sequence, $TRF = \{trf_j\}\ (j = 0 \ldots m)$, compute the next value in $TF$ if $m = n$ or the next value in $TRF$ if $m < n$.

Definition 2 states that IF dynamics is diagnosable if we can compute the next time when a fault or recovery will occur. IFs are asynchronous and non-deterministic. So, IF dynamics cannot be diagnosed with deterministic precision. Therefore, we propose a relaxed definition.

Definition 3. Bounded IF dynamics diagnosis problem.

Given a temporal fault sequence, $TF = \{tf_i\}\ (i = 0 \ldots r)$, and a temporal recovery sequence, $TRF = \{trf_j\}\ (j = 0 \ldots s)$, compute the next value in $TF$ if $s = r$ or the next value in $TRF$ if $s < r$, with a bounded uncertainty.

This uncertainty can be used to compare different methods of diagnosis.

3.1. Bounded IF dynamics diagnosis with DFTj.

IF dynamics diagnosis starts from a temporal fault event sequence and a temporal recovery event sequence up to the current time $\tau$: $TF = \{tf_i\}\ (i = 0 \ldots s)$, $TRF = \{trf_j\}\ (j = 0 \ldots r)$. DF can be computed as in equations (1) and (2). Let us assume that T A = 0, so that the next event will be a fault. In this case, the problem in definition 3 consists of computing the next fault time. As historical data is known until time $\tau$, it is possible to compute a sequence of fault densities {Dl} (l = 0...t). Density RLS prediction computes the estimated density for a given instant of time $\upsilon$: $D_\upsilon = m\upsilon + n$, where m and n are the results of the RLS estimation. Density increases when a new fault is diagnosed in the system. Moreover, in one sampling period the density increases by at most "tm/w", where "tm" is the sampling period and "w" is the window size. Therefore, the next failure will occur when the linear model estimates this density value.

$$D_\tau + \frac{t_m}{w} = m\, TF_{s+1} + n \tag{7}$$
$$TF_{s+1} = \frac{w\,(D_\tau - n) + t_m}{w\, m} \tag{8}$$

The estimation error is therefore the error in the RLS estimation (Smolensky et al., 1996).

If we want to compute the time for a recovery, (TA≠0) the density will decrease in, at least, "tm/w", so:

$$TRF_{r+1} = \frac{w\,(D_\tau - n) - t_m}{w\, m} \tag{9}$$

We can conclude that the density allows for the complete diagnosis of the IF dynamics with a bounded error.
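Equations (8) and (9) translate directly into code. The following sketch only restates the two formulas; the function names and the numeric values in the usage lines are hypothetical, and "m" and "n" are assumed to come from an RLS fit of the density.

```python
def next_fault_time(d_tau, m, n, t_m, w):
    """Equation (8): time at which the linear model reaches D_tau + t_m / w."""
    return (w * (d_tau - n) + t_m) / (w * m)


def next_recovery_time(d_tau, m, n, t_m, w):
    """Equation (9): time at which the linear model reaches D_tau - t_m / w."""
    return (w * (d_tau - n) - t_m) / (w * m)


# Hypothetical values: current density 0.05, fitted model D = 0.01*t + 0.01,
# sampling period 0.1 and window size 10.
print(next_fault_time(0.05, m=0.01, n=0.01, t_m=0.1, w=10.0))
print(next_recovery_time(0.05, m=0.01, n=0.01, t_m=0.1, w=10.0))
```

Both predictions inherit the error of the fitted parameters, which is the bounded uncertainty referred to in definition 3.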

In order to compute the IF density, the fault diagnosis system should include the identification of fault starting time and duration. The fault diagnosis system should also be able to compute the corresponding fault densities.


4. Latent Nestling.

The previous section showed that the diagnosis of an IF involves not only the diagnosis of fault and recovery events, but also the diagnosis of its dynamics. This section shows how to diagnose IF dynamics with the methodology based on Coloured Petri Nets (CPN) presented in (Garcia et al., 2008) (Rodriguez et al., 2008). This methodology allows IF dynamics diagnosis because it includes timing information. We also include a complete diagnosis example of an industrial process.

A Coloured Petri Net for Fault Diagnosis (DCPN) is defined as:

$$D = (P, T, Pre, Post, M_0, C, f, PLN_f, T_f, PV_f) \tag{10}$$


  • P is a finite set of places.

  • T is a finite set of transitions.

  • Pre and Post are input and output arc functions.

  • M0 is the initial marking.

  • C is the colour set assigned to different identifiers, C = N ∪ f, where N is the subset of coloured tokens representing the normal system behaviour.

  • f = {f1, f2, ..., fi} is the subset of coloured fault tokens.

Fault verification places are P-timed. Therefore, they include pairs <R, TimR>; where R is a coloured mark and TimR is a timer. TimR will be used to compute IF Density and Pseudoperiod.

Throughout this chapter the notation M(Pl(V)) {Pl ∈ P; V ∈ C} refers to the marking of place "Pl". For Pre functions, this notation indicates that there is a "V" coloured token in place "Pl". For Post functions, it indicates that a "V" coloured token is placed in "Pl".
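The tuple in equation (10) and the timed tokens of the fault verification places can be pictured as plain data structures. The sketch below is purely illustrative: all class and field names are hypothetical, arc functions are stored only by name, and reading "Tf" as the fault transitions and "PVf" as the fault verification places is an assumption based on the rest of the section.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple


@dataclass
class TimedToken:
    """Pair <R, TimR> held by a P-timed fault verification place."""
    colour: str          # R: the coloured mark
    timer: float = 0.0   # TimR: elapsed time, used for density and pseudoperiod


@dataclass
class DCPN:
    places: FrozenSet[str]                        # P
    transitions: FrozenSet[str]                   # T
    pre: Dict[Tuple[str, str], str]               # (place, transition) -> arc function name
    post: Dict[Tuple[str, str], str]              # (transition, place) -> arc function name
    initial_marking: Dict[str, Tuple[str, ...]]   # M0
    normal_colours: FrozenSet[str]                # N
    fault_colours: FrozenSet[str]                 # f
    latent_nestling_places: FrozenSet[str]        # PLNf
    fault_transitions: FrozenSet[str]             # Tf (assumed meaning)
    verification_places: FrozenSet[str]           # PVf (assumed meaning)

    @property
    def colours(self) -> FrozenSet[str]:
        """C, assumed here to be the union of normal and fault colours."""
        return self.normal_colours | self.fault_colours


net = DCPN(
    places=frozenset({"Pc1", "Pc2", "PVF"}),
    transitions=frozenset({"Tr1"}),
    pre={("Pc1", "Tr1"): "g"},
    post={("Tr1", "Pc2"): "gi"},
    initial_marking={"Pc1": ("n",)},
    normal_colours=frozenset({"n"}),
    fault_colours=frozenset({"f1", "f2"}),
    latent_nestling_places=frozenset({"Pc2"}),
    fault_transitions=frozenset(),
    verification_places=frozenset({"PVF"}),
)
print(sorted(net.colours))
```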

4.1. System modelling

The first step of the method consists of modelling the dynamic system. The system model is designed with generalized Petri Nets (PNs) (David & Alla, 2005) for simple systems, or with CPNs (Jensen, 1997) for complex systems.

Let us show the methodology applied to an industrial rectifying machine. The machine rectifies blocks of a synthetic compound that imitates natural stone. Figure 8 shows the process scheme. The rectifying process consists of the mechanical removal of material in order to achieve the desired width. The system can be divided into four subsystems. Since each milling works with an independent motor, each subsystem consists of a motor-milling pair with a blade cooling and lubrication system. The blade cooling system consists of a pump-valve pair that pours cutting oil over the millings. The motors can suffer IFs resulting from fretting corrosion in their electrical contacts. The millings fail when there is any defect in the tool. This failure is a PF that can be due to a maintenance failure. A milling failure will cause a high torque and, therefore, a power consumption greater than usual.

Figure 8.

Artificial Stone rectifying process.

Moreover, a fault in a previous subsystem can cause the same symptoms because the milling will cut more material. The cooling and lubrication system can also suffer IFs. Typical IFs in a pump-valve system are electrical contact failures and valve blocking.

Figure 9 and table 1 show the PN system model. Table 2 shows the relationship between places and system states.

Figure 9.

System PN model.

Table 1.

Transitions in PN system model.

Table 2.

Places in PN system model.

The next step consists of folding the model into a CPN (Garcia et al., 2008). Figure 10 shows the result. The CPN system model starts in Pc1. Tr1 starts all subsystems and generates normal working tokens in Pc2. Arc functions g and gi denote the relationship between the general normal working token and the particular normal working tokens:

$$g(n) = C \tag{11}$$
[ gi ( » ) = n [ g i ( C ) = E12
$$C = S_1^n + S_2^n + S_3^n + S_4^n \tag{13}$$

where $S_j^n$ ($j \in \{1..4\}$) stands for the normal coloured token of subsystem j.

Transition Tr2 puts a stone in the machine. This action starts the first subsystem (place Pc4) and moves the other three subsystems to the "waiting for stone arrival" state (Pc3). Functions gs1 and gsk model the different paths followed by the subsystems:

$$g_{s1}(S_1^n) = S_1^n \tag{14}$$
$$g_{sk}(S_2^n + S_3^n + S_4^n) = S_2^n + S_3^n + S_4^n \tag{15}$$

Transition Tr3 = {T7, T8, T9} sequentially starts the other subsystems by marking Pc4. Transition Tr4 = {T6, T10, T11, T12} sequentially stops the cutting process of each subsystem. Transition Tr5 starts the shutdown routine for every subsystem.

4.2. Fault set definition

The next step is the fault analysis of the system devices. The goal is to define the faults to be diagnosed in each device. Each fault has a coloured fault token. Therefore, the fault set consists of the union of the coloured fault tokens f = {f1, f2, ..., fi}. Fault isolation is guaranteed because every fault is associated with a device. Each subsystem includes four devices: motor, milling, pump, and valve. Table 3 shows the faults included in the example:

Table 3.

Fault set definition

4.3. Places of latent nestling faults, PLNf

The next step consists of marking all fault coloured tokens in specific CPN places. These places are called "Places of latent nestling faults" (PLNf). Expert's empirical knowledge sets the rules for this operation. Figure 10 shows the marking for this example.

Figure 10.

System CPN model and fault allocation (i = {1,2,3,4}).

Computing the thresholds that generate the Ii(N) and Ii(S) events is not trivial, because the sensor observes an overcurrent when the tool touches the block (normal working). Nevertheless, these events can be generated easily if a normal-working current pattern is available. Let us suppose that we have a current pattern for each motor, Pati(t), and that the continuous measure of the sensor is Ii(t). Therefore, for each time tk:

$$|Pat_i(t_k) - I_i(t_k)| \leq d_{max} \Rightarrow \text{normal working}; \qquad |Pat_i(t_k) - I_i(t_k)| > d_{max} \Rightarrow \text{faulty working} \tag{16}$$

where dmax is the maximum difference allowed between both signals. Therefore, we can define the three current events as:

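$$\begin{aligned} |Pat_i(t_k) - I_i(t_k)| \leq d_{max} \ &\text{and} \ \{\, |I_i(t_k)| \leq \gamma \Rightarrow I_i(0); \ |I_i(t_k)| > \gamma \Rightarrow I_i(N) \,\} \\ |Pat_i(t_k) - I_i(t_k)| > d_{max} \ &\text{and} \ \{\, |I_i(t_k)| \leq \gamma \Rightarrow I_i(0); \ |I_i(t_k)| > \gamma \Rightarrow I_i(S) \,\} \end{aligned} \tag{17}$$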

where γ is close to zero and allows noise filtering.
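The generation of the three current events can be sketched as a small classifier. This sketch rests on an assumed reading of equations (16) and (17): a measured current whose magnitude is at most γ is treated as "no current", a current within dmax of the pattern as normal, and any other current as an overcurrent. The function name and the numeric values are hypothetical.

```python
def classify_current(pattern, measured, d_max, gamma):
    """Return the current event for one sample: 'I(0)', 'I(N)' or 'I(S)'."""
    if abs(measured) <= gamma:
        return "I(0)"   # below the noise threshold: no current
    if abs(pattern - measured) <= d_max:
        return "I(N)"   # follows the normal-working pattern
    return "I(S)"       # deviates from the pattern: abnormal current


# Hypothetical samples against a pattern value of 5.0 A.
for sample in (0.01, 5.2, 7.0):
    print(sample, classify_current(5.0, sample, d_max=0.5, gamma=0.05))
```

Comparing against a time-varying pattern rather than a fixed threshold is what keeps the expected current peak at tool contact from being reported as a fault.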

Flow sensors, Fli (i ∈ {1..4}), generate two levels: Fli(0), no flow; Fli(1), flow. Pressure sensors, Pri (i ∈ {1..4}), generate two levels: Pri(0), no pressure; Pri(1), pressure.

The set of sensor values is SM. The set of possible sensor value combinations for marking Mk is denoted as "SROVj(Mk)" (j ∈ {1...n}), where "n" is the number of possible combinations. "SROV(Mk)" can be split into two subsets: SROV(Mk) = SROVev(Mk) ∪ SROVnev(Mk). "SROVev(Mk)" stands for the subset of expected values and "SROVnev(Mk)" stands for the subset of non-expected values. We use "SROVnev(Mk)" to generate the trajectories of fault verification, and we use this information to build the system diagnoser. The diagnoser extends the system model with the trajectories of fault verification. The complete algorithm is shown in (Garcia et al., 2008). Figure 12 shows the CPN diagnoser. Places Pc2, Pc3, and Pc4 are duplicated in figure 12 only to make the figure more readable; the diagnoser includes a single place Pc2, a single Pc3, and a single Pc4.
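The split of the sensor readings for one marking into expected and non-expected combinations can be sketched with sets. The sensor names, levels and the expected combination below are hypothetical and only illustrate the set operation; the actual combinations for each marking come from the system model.

```python
from itertools import product


def split_sensor_readings(sensor_levels, expected):
    """Return (SROVev, SROVnev) for one marking.

    sensor_levels: dict mapping each sensor to its possible levels.
    expected:      set of combinations expected under this marking.
    """
    all_combinations = set(product(*sensor_levels.values()))
    unknown = expected - all_combinations
    if unknown:
        raise ValueError(f"expected combinations not in sensor range: {unknown}")
    return expected, all_combinations - expected


# One subsystem with a current, a flow and a pressure sensor (hypothetical).
levels = {"I1": ("0", "N", "S"), "Fl1": (0, 1), "Pr1": (0, 1)}
srov_ev, srov_nev = split_sensor_readings(levels, {("N", 1, 1)})
print(len(srov_ev), len(srov_nev))
```

Each combination in the non-expected subset is a candidate starting point for a fault verification trajectory.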

Table 4 defines the trajectories of fault verification for the diagnoser. "*" stands for any value of other sensors. Function g2 evaluates place markings and returns fault marks when necessary.

g2(Pl, V) {Pl ∈ P; V ∈ f}: if M(Pl(V)) then null else M(Pl(V))

The diagnoser must solve the problem of chained faults. The main effect of any subsystem fault is less material cut. Therefore, the following subsystem will observe greater currents. Nevertheless, this situation does not involve two faults. Tf3 (table 4) solves the problem by including previous subsystem faults in the diagnosis.

Figure 12.


Table 4.

Fault and recovery transition definition

PVF place diagnoses subsystem faults. IF dynamics can also be diagnosed by collecting diagnosed faults temporal information. Table 5 shows the information required for IF dynamics diagnosis:

Table 5.

Fault information.

Where CNT, FT, T, Ta, DF, Ps, LST, and OTR have the same meaning as in equations (1), (2), (3), (4), (5), and (6).

The diagnoser builds a table like table 5 for each IF. Moreover, the diagnoser updates the tables each sampling time. Figure 13 shows the updating algorithm.

Figure 13.

Updating algorithm.

Therefore, the diagnoser computes LST and OTR each sampling time. Moreover, each table must be cleared when the faulty device is replaced with a new one.
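The bookkeeping described above can be sketched as a small per-fault record. This is not the algorithm of figure 13, which is not reproduced here, but a minimal illustration under assumptions: the class and method names are hypothetical, only FT, T and DF are tracked, and the density is computed by clipping each fault interval to the window, which handles both the "T A" term and a fault that is still active at the sampling time.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FaultTable:
    window: float
    fault_times: List[float] = field(default_factory=list)   # FT
    durations: List[float] = field(default_factory=list)     # T
    active_since: Optional[float] = None                      # start of a fault not yet recovered

    def on_fault(self, t: float) -> None:
        if self.active_since is None:
            self.active_since = t

    def on_recovery(self, t: float) -> None:
        if self.active_since is not None:
            self.fault_times.append(self.active_since)
            self.durations.append(t - self.active_since)
            self.active_since = None

    def density(self, t_i: float) -> float:
        lower = t_i - self.window
        active = 0.0
        for start, duration in zip(self.fault_times, self.durations):
            # Part of this fault that lies inside [t_i - W, t_i].
            overlap = min(start + duration, t_i) - max(start, lower)
            active += max(0.0, overlap)
        if self.active_since is not None:
            active += t_i - max(self.active_since, lower)
        return active / self.window

    def clear(self) -> None:
        """Reset the table when the faulty device is replaced."""
        self.fault_times.clear()
        self.durations.clear()
        self.active_since = None


table = FaultTable(window=10.0)
table.on_fault(2.0)
table.on_recovery(3.0)
table.on_fault(9.0)
table.on_recovery(12.0)
table.on_fault(15.0)
print(table.density(20.0))
table.clear()
print(table.density(20.0))
```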


5. Conclusions

This chapter presents the problem of diagnosing IF dynamics. We have presented a way of modelling IF dynamics and we have tested it with real data. The density model allows us to compute the best time to repair or substitute the faulty device. This model does not need historical data.

IF dynamics diagnosis requires the choice of a sliding window size. This problem will be treated in future work.

We have stated a new definition for IF dynamics and we have proved that the model is able to diagnose the IF dynamics.

We have also presented the integration of IF dynamics diagnosis within a diagnosis technique for discrete event systems based on CPNs. This integration allows the acquisition of temporal information required to compute density and pseudoperiod. Therefore, the diagnosis system will be able to diagnose the IF dynamics.


  1. Cassandras C. G., Lafortune S. (1999). Introduction to Discrete Event Systems. Kluwer Academic Publishers.
  2. Contant O., Lafortune S., Teneketzis D. (2004). "Diagnosis of intermittent failures". Discrete Event Dynamic Systems: Theory and Applications, 14, 171-202.
  3. Correcher A., García E., Morant F., Quiles E., Blasco R. (2004). "Intermittent Failure Diagnosis based on discrete event models". Proceedings of the 7th Workshop on Discrete Event Systems, WODES04, 151-157.
  4. David R., Alla H. (2005). "Discrete, Continuous, and Hybrid Petri Nets". Springer-Verlag, Berlin.
  5. García E., Rodríguez L., Morant F., Correcher A., Quiles E. (2008). "Latent Nestling Method: A new fault diagnosis methodology for complex systems". Proceedings of the 34th Annual Conference of the IEEE Industrial Electronics Society, IECON08.
  6. Jensen K. (1997). "Coloured Petri Nets: Basic Concepts, Analysis Methods and Practical Use", vol. 1, Second Edition. Springer.
  7. Jensen F. (2003). Electronic component reliability. John Wiley and Sons Ltd.
  8. Jiang S., Kumar R., Garcia H. E. (2003). "Diagnosis of repeated/intermittent failures in Discrete Event Systems". IEEE Transactions on Robotics and Automation, 19(2), 310-323.
  9. Leemis L. M. (1995). Reliability: Probabilistic Models and Statistical Methods. Prentice Hall.
  10. Rodríguez L., García E., Morant F., Correcher A., Quiles E. (2008). "Application of Latent Nestling Method using Coloured Petri Nets for the Fault Diagnosis in the Wind Turbine Subsets". Proceedings of ETFA'08, Hamburg, Germany.
  11. Smolensky P., Mozer M., Rumelhart D. D. (1996). Mathematical perspectives on neural networks. Mahwah, NJ: Lawrence Erlbaum Publishers.
  12. Sorensen B. A., Kelly G., Sajecki A., Sorensen P. W. (1998). "An analyzer for detecting aging faults in electronic devices". Proceedings of AUTOTESTCON'94, IEEE Systems Readiness Technology Conference, "Cost Effective Support Into the Next Century", pp. 417-421. Available:
