1. Introduction
The study of component and process reliability underlies many efficiency evaluations in the Operations Management discipline. For example, the calculation of the Overall Equipment Effectiveness (OEE) introduced by Nakajima [1] requires an estimate of a crucial parameter called availability, which is strictly related to reliability. Similarly, in the study of service level it is important to know the availability of machines, which again depends on their reliability and maintainability.
Reliability is defined as the probability that a component (or an entire system) will perform its function for a specified period of time, when operating in its design environment. The definition therefore requires an unambiguous criterion for judging whether something is working or not, and an exact specification of the environmental conditions and usage. Reliability can then be defined as the time-dependent probability of correct operation, provided that the component is used for its intended function in its design environment and that we clearly define what we mean by "failure". Given this definition, any discussion of reliability basics starts with the key concepts of probability.
A broader definition of reliability is that "reliability is the science to predict, analyze, prevent and mitigate failures over time." It is a science, with its theoretical basis and principles. It also has subdisciplines, all related, in some way, to the study and knowledge of faults. Reliability is closely related to mathematics, and especially to statistics, physics, chemistry, mechanics and electronics. Finally, since the human element is almost always part of the system, it often also involves psychology and psychiatry.
In addition to predicting system durability, reliability also tries to answer other questions. We can, for instance, derive the availability performance of a system from its reliability: availability depends on the time between two consecutive failures and on how long it takes to restore the system. Reliability studies can also be used to understand how faults can be avoided, by acting on design, materials and maintenance to prevent potential failures.
Reliability involves almost all aspects of owning an asset: cost management, customer satisfaction, the proper management of resources, the ability to sell products or services, and the safety and quality of the product.
This chapter presents a discussion of reliability theory, supported by practical examples of interest in operations management. Basic elements of probability theory, such as the sample space, random events and Bayes' theorem, should be reviewed for a deeper understanding.
2. Reliability basics
The period of regular operation of a piece of equipment ends when a chemical-physical phenomenon, called a fault, occurs in one or more of its parts and changes its nominal performance. This makes the behavior of the device unacceptable: the equipment passes from the operating state to the non-functioning state.
In Table 1 faults are classified according to their origin. For each failure mode an extended description is given.


Table 1. Main causes of failure, with a detailed description of each.

| Cause | Description |
| --- | --- |
| Stress, shock, fatigue | Function of the temporal and spatial distribution of the load conditions and of the response of the material. The structural characteristics of the component play an important role and should be assessed as broadly as possible, also taking into account possible design errors, embodiments, material defects, etc. |
| Temperature | Operational variable that depends mainly on the specific characteristics of the material (thermal inertia), as well as on the spatial and temporal distribution of heat sources. |
| Wear | State of physical degradation of the component; it results from the aging phenomena that accompany normal activity (friction between materials, exposure to harmful agents, etc.). |
| Corrosion | Phenomenon that depends on the characteristics of the environment in which the component operates. These conditions can lead to material degradation or to chemical and physical processes that make the component no longer suitable. |
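To study reliability, reality must be translated into a model whose behavior can be analyzed by applying known laws [2]. Reliability models can be divided into static and dynamic ones.
In the traditional paradigm of static reliability, individual components have a binary status: either working or failed. Systems, in turn, are composed of an integer number $n$ of components.
Let's consider a generic system consisting of $n$ elements. The operating status of the $i$-th component is represented by the state variable $X_i$:
$$X_i = \begin{cases} 1 & \text{if the } i\text{-th component works} \\ 0 & \text{if the } i\text{-th component has failed} \end{cases}$$
The state of operation of the system is modeled by the state function $\Phi(X)$, with $X = (X_1, \dots, X_n)$:
$$\Phi(X) = \begin{cases} 1 & \text{if the system works} \\ 0 & \text{if the system has failed} \end{cases}$$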
The most common configuration of the components is the series system. A series system works if and only if all components work. Therefore, the status of a series system is given by the state function:
$$\Phi(X) = \prod_{i=1}^{n} X_i = \min_{i} X_i$$
where $X_i \in \{0, 1\}$ is the state of the $i$-th component and the symbol $\prod$ indicates the product of the arguments.
System configurations are often represented graphically with Reliability Block Diagrams (RBDs), where each component is represented by a block and the connections between blocks express the configuration of the system. The system operates if the diagram can be crossed from left to right passing only through working elements. Figure 1 contains the RBD of a four-component series system.
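The second most common configuration of the components is the parallel system. A parallel system works if and only if at least one component is working; it does not work if and only if all components do not work. So, if $\bar{\Phi}(X) = 1 - \Phi(X)$ represents the non-functioning state of the system and $\bar{X}_i = 1 - X_i$ the non-functioning of the $i$-th element, we can write:
$$\bar{\Phi}(X) = \prod_{i=1}^{n} \bar{X}_i$$
Accordingly, the state of a parallel system is given by the state function:
$$\Phi(X) = 1 - \prod_{i=1}^{n} (1 - X_i) = \coprod_{i=1}^{n} X_i = \max_{i} X_i$$
where the symbol $\coprod$ indicates the complement of the product of the complements of the arguments.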
Another common configuration of the components is the series-parallel system. In these systems, components are combined in series and parallel configurations. An example of such a system is shown in Figure 3.
State functions for series-parallel systems are obtained by decomposition of the system. With this approach, the system is broken down into subsystems or configurations that are in series or in parallel. The state functions of the subsystems are then combined appropriately, depending on how they are configured. A schematic example is shown in Figure 4.
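The decomposition described above can be sketched in a few lines of Python. This is only an illustration: the three-component layout (component 0 in series with the parallel pair 1 and 2) and all function names are assumptions made for the example, not taken from the figures of the chapter.

```python
from typing import Sequence


def series_state(x: Sequence[int]) -> int:
    """State of a series block: 1 only if every element works."""
    return min(x)


def parallel_state(x: Sequence[int]) -> int:
    """State of a parallel block: 1 if at least one element works."""
    return max(x)


def example_system_state(x: Sequence[int]) -> int:
    """Hypothetical layout: component 0 in series with the pair (1, 2) in parallel."""
    return series_state([x[0], parallel_state([x[1], x[2]])])


print(example_system_state([1, 0, 1]))  # component 0 and one branch of the pair work
print(example_system_state([0, 1, 1]))  # the series component has failed
```

Each subsystem is reduced to a single 0/1 value, which is then treated as a component of the enclosing block; deeper nestings follow the same pattern.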
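A particular component configuration, widely recognized and used, is the parallel $k$ out of $n$. A $k$-out-of-$n$ system works if and only if at least $k$ of its $n$ components work. A series system can therefore be seen as an $n$-out-of-$n$ system, and a parallel system as a 1-out-of-$n$ system. The state function is:
$$\Phi(X) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} X_i \ge k \\ 0 & \text{otherwise} \end{cases}$$
The RBD for a $k$-out-of-$n$ system looks like that of a parallel system of $n$ components, with the addition of a "$k$ out of $n$" label.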
A Minimal Path Set (MPS) is a subset of the components of the system such that the operation of all the components in the subset implies the operation of the system. The set is minimal because removing any element from the subset eliminates this property. An example is shown in Figure 5.
A Minimal Cut Set (MCS) is a subset of the components of the system such that the failure of all the components in the subset implies the failure of the system. Again, the set is called minimal because removing any component from the subset eliminates this property (see Figure 6).
MCS and MPS can be used to build equivalent configurations of more complex systems that cannot be reduced to the simple series-parallel model. The first equivalent configuration is based on the consideration that the operation of all the components of at least one MPS entails the operation of the system. This configuration is therefore built by creating a series subsystem for each MPS, using only the components of that set, and then connecting these subsystems in parallel. An example of an equivalent system is shown in Figure 7.
The second equivalent configuration is based on the logical principle that the failure of all the components of any MCS implies the failure of the system. This configuration is built by creating a parallel subsystem for each MCS, using only the components of that set, and then connecting these subsystems in series (see Figure 8).
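The two equivalent configurations can be checked against each other with a short Python sketch. The five-component bridge used here, with its component numbering and its lists of minimal path and cut sets, is a hypothetical example chosen for illustration and is not necessarily the system drawn in the figures.

```python
from itertools import product

# Hypothetical five-component bridge: 0 and 1 form the upper branch,
# 2 and 3 the lower branch, 4 is the bridge between the two branches.
PATH_SETS = [{0, 1}, {2, 3}, {0, 4, 3}, {2, 4, 1}]
CUT_SETS = [{0, 2}, {1, 3}, {0, 4, 3}, {2, 4, 1}]


def state_from_path_sets(x, path_sets):
    """Parallel arrangement of series subsystems, one per minimal path set."""
    return int(any(all(x[i] for i in path) for path in path_sets))


def state_from_cut_sets(x, cut_sets):
    """Series arrangement of parallel subsystems, one per minimal cut set."""
    return int(all(any(x[i] for i in cut) for cut in cut_sets))


# The two equivalent configurations agree on all 2**5 component states.
agree = all(
    state_from_path_sets(x, PATH_SETS) == state_from_cut_sets(x, CUT_SETS)
    for x in product((0, 1), repeat=5)
)
print(agree)
```

Enumerating every component state is feasible only for small systems; it is used here simply to show that the path-based and cut-based constructions describe the same state function.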
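After examining the components and the status of the system, the next step in the static modeling of reliability is to consider the probability of operation of the components and of the system.
The reliability $R_i$ of a component is defined as the probability that it works:
$$R_i = P(X_i = 1)$$
while the reliability $R$ of the system is defined as the probability that the system works:
$$R = P(\Phi(X) = 1)$$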
The methodology used to calculate the reliability of the system depends on the configuration of the system itself. For a series system, the reliability of the system is given by the product of the individual reliabilities (Lusser's law, named after the German engineer Robert Lusser, who formulated it in the 1950s):
$$R = \prod_{i=1}^{n} R_i$$
For an example, see Figure 9.
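For a parallel system, reliability is:
$$R = 1 - \prod_{i=1}^{n} (1 - R_i) = \coprod_{i=1}^{n} R_i$$
In fact, from the definition of system reliability and the properties of event probabilities, and since the components are independent, it follows:
$$R = P(\Phi(X) = 1) = 1 - P(\Phi(X) = 0) = 1 - P(X_1 = 0, \dots, X_n = 0) = 1 - \prod_{i=1}^{n} P(X_i = 0) = 1 - \prod_{i=1}^{n} (1 - R_i)$$
In many parallel systems, components are identical. In this case, the reliability of a parallel system with $n$ elements, each of reliability $R_c$, is:
$$R = 1 - (1 - R_c)^n$$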
For a series-parallel system, system reliability is determined using the same decomposition approach used to construct the state function for such systems. Consider, for instance, the system drawn in Figure 11, consisting of 9 elements with reliability
For all other types of systems, which cannot be reduced to a series-parallel scheme, a more computationally intensive approach [3] must be adopted to calculate the overall reliability, normally with the aid of dedicated software.
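As a minimal sketch of the series, parallel and decomposition rules for static reliability, the following Python functions assume independent components. The numerical reliabilities and the layout (one component in series with a redundant pair) are hypothetical and serve only to show the mechanics.

```python
from math import prod


def series_reliability(reliabilities):
    """Product of the component reliabilities (Lusser's law)."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """Complement of the product of the component unreliabilities."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical series-parallel system: one component (0.90) in series
# with a redundant pair of components (0.80 each).
pair = parallel_reliability([0.80, 0.80])
system = series_reliability([0.90, pair])
print(f"pair: {pair:.3f}, system: {system:.3f}")
```

The redundant pair is first collapsed into a single equivalent block, which is then combined in series with the remaining component, mirroring the decomposition used for state functions.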
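Reliability functions of the system can also be used to calculate measures of reliability importance.
These measures are used to assess which components of a system offer the greatest opportunity to improve the overall reliability. The most widely recognized definition of the reliability importance $I_i$ of a component is the marginal gain in system reliability obtained by a marginal increase in the reliability of that component:
$$I_i = \frac{\partial R}{\partial R_i}$$
For other system configurations, an alternative approach facilitates the calculation of the reliability importance of the components. Let $R(1_i)$ be the reliability of the system modified so that $R_i = 1$, and $R(0_i)$ the reliability of the system modified so that $R_i = 0$, with all other components left unchanged. The reliability importance is then:
$$I_i = R(1_i) - R(0_i)$$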
In a series system, this formulation is equivalent to writing:
$$I_i = \prod_{j \ne i} R_j$$
Thus, the most important component (in terms of reliability) in a series system is the least reliable one. For example, consider three elements of reliability
If the system is arranged in parallel, the reliability importance becomes:
$$I_i = \prod_{j \ne i} (1 - R_j) \qquad (16)$$
It follows that the most important component in a parallel system is the most reliable one. With the same data as the previous example, this time having a parallel arrangement, we can verify Eq. 16 for the first item:
To calculate the reliability importance of components belonging to complex systems, which cannot be reduced to the simple series-parallel scheme, the reliabilities of several modified systems must be computed. For this reason the calculation is often done using automated algorithms.
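The difference between the system reliability with a component forced to be perfect and with the same component forced to be failed can be computed generically, as in the following Python sketch. The three reliabilities used (0.7, 0.8, 0.9) are hypothetical values chosen for illustration, and the function names are assumptions of this example.

```python
from math import prod


def series_reliability(r):
    return prod(r)


def parallel_reliability(r):
    return 1.0 - prod(1.0 - ri for ri in r)


def reliability_importance(system_reliability, r, i):
    """R(1_i) - R(0_i): gain from a perfect component i versus a failed one."""
    r_up = list(r)
    r_up[i] = 1.0
    r_down = list(r)
    r_down[i] = 0.0
    return system_reliability(r_up) - system_reliability(r_down)


r = [0.7, 0.8, 0.9]  # hypothetical component reliabilities
series_imp = [reliability_importance(series_reliability, r, i) for i in range(3)]
parallel_imp = [reliability_importance(parallel_reliability, r, i) for i in range(3)]
print(series_imp)    # largest for the least reliable component
print(parallel_imp)  # largest for the most reliable component
```

Because `reliability_importance` only needs a function that maps component reliabilities to system reliability, the same helper can be reused for any configuration whose reliability function is available.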
3. Fleet reliability
Suppose you have studied the reliability of a component, and found that it is 80% for a mission duration of 3 hours. Knowing that we have 5 identical items simultaneously active, we might be interested in knowing what the overall reliability of the group would be. In other words, we want to know what is the probability of having a certain number of items functioning at the end of the 3 hours of mission. This issue is best known as fleet reliability.
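Consider a set of $n$ identical and independent systems, each with reliability $R$ for the given mission. The number $N$ of systems still functioning at the end of the mission is a binomial random variable:
$$P(N = k) = \binom{n}{k} R^k (1 - R)^{n-k}, \qquad k = 0, 1, \dots, n$$
The expected value of $N$ is $E[N] = nR$, and its standard deviation is $\sqrt{nR(1 - R)}$.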
Let's consider, for example, a corporate fleet consisting of 100 independent and identical systems. All systems have the same mission, independent of the other missions. Each system has a mission reliability of 90%. We want to calculate the average number of missions completed and the probability that at least 95% of the systems complete their mission. This involves analyzing the distribution of the binomial random variable characterized by $R = 0.90$ and $n = 100$. The expected number of completed missions is $E[N] = nR = 100 \cdot 0.90 = 90$.
The probability that at least 95% of the systems complete their mission is the sum of the probabilities that exactly 95, 96, 97, 98, 99 and 100 elements of the fleet complete their mission:
$$P(N \ge 95) = \sum_{k=95}^{100} \binom{100}{k} \, 0.90^{k} \, 0.10^{100-k} \approx 0.058$$
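The fleet calculation can be reproduced with the Python standard library alone. The sketch below uses the figures of the example (100 systems, 90% mission reliability); the function names are introduced here for illustration.

```python
from math import comb


def fleet_pmf(n: int, r: float, k: int) -> float:
    """Probability that exactly k of n independent systems complete the mission."""
    return comb(n, k) * r**k * (1.0 - r) ** (n - k)


def prob_at_least(n: int, r: float, k_min: int) -> float:
    """Probability that at least k_min of n systems complete the mission."""
    return sum(fleet_pmf(n, r, k) for k in range(k_min, n + 1))


n, r = 100, 0.90
expected_completed = n * r
p_95_or_more = prob_at_least(n, r, 95)
print(expected_completed, round(p_95_or_more, 4))
```

The same two functions answer the question posed at the start of this section for 5 items with 80% reliability, for instance with `prob_at_least(5, 0.80, 4)`.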
4. Time dependent reliability models
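When reliability is expressed as a function of time, the non-negative continuous random variable of interest is $T$, the time to failure of the device. Let $f(t)$ be the probability density function of $T$ and $F(t)$ its cumulative distribution function, also known as the failure function or unreliability function.
In the context of reliability, two additional functions are often used: the reliability function and the hazard function. The reliability $R(t)$ is defined as the survival function:
$$R(t) = P(T \ge t) = 1 - F(t)$$
The Mean Time To Failure (MTTF) is defined as the expected value of the failure time:
$$MTTF = E[T] = \int_0^{\infty} t \, f(t) \, dt$$
Integrating by parts, we can prove the equivalent expression:
$$MTTF = \int_0^{\infty} R(t) \, dt$$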
5. Hazard function
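Another very important function is the hazard function, denoted by $\lambda(t)$: the instantaneous failure rate at time $t$ of an element that has survived up to that time.
The hazard function $\lambda(t)$ is given by:
$$\lambda(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$$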
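Thanks to Bayes' theorem, it can be shown that the relationship between the hazard function, the probability density of failure and the reliability is the following:
$$\lambda(t) = \frac{f(t)}{R(t)}$$
Thanks to the previous equation, with some simple mathematical manipulations, we obtain the following relation:
$$R(t) = e^{-\int_0^t \lambda(u) \, du} \qquad (24)$$
In fact, since $f(t) = -dR(t)/dt$ and $\ln R(0) = \ln 1 = 0$, we have:
$$\lambda(t) = -\frac{1}{R(t)} \frac{dR(t)}{dt} = -\frac{d \ln R(t)}{dt} \;\Rightarrow\; \ln R(t) = -\int_0^t \lambda(u) \, du$$
From equation 24 we derive the other two fundamental relations:
$$F(t) = 1 - e^{-\int_0^t \lambda(u) \, du}, \qquad f(t) = \lambda(t) \, e^{-\int_0^t \lambda(u) \, du}$$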
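The link between the hazard function and the reliability can be explored numerically. The following Python sketch integrates an arbitrary hazard function with the trapezoidal rule; the two hazard functions used (a constant one and a linearly increasing one) and their parameters are hypothetical examples.

```python
import math


def reliability_from_hazard(hazard, t, steps=10_000):
    """R(t) = exp(-integral of hazard from 0 to t), using the trapezoidal rule."""
    h = t / steps
    area = 0.5 * (hazard(0.0) + hazard(t))
    area += sum(hazard(k * h) for k in range(1, steps))
    return math.exp(-area * h)


def constant_hazard(t):
    return 1e-3  # hypothetical constant rate, failures per hour


def increasing_hazard(t):
    return 2e-6 * t  # hypothetical linearly increasing rate


r_const = reliability_from_hazard(constant_hazard, 1000.0)
r_incr = reliability_from_hazard(increasing_hazard, 1000.0)
print(r_const, r_incr)
```

Both hazards were chosen so that the area under the curve up to 1000 hours equals one, so both reliabilities are close to $1/e$ even though the two devices age very differently.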
The most popular conceptual model of the hazard function is the bathtub curve. According to this model, the failure rate is relatively high and decreasing in the first part of the device's life, because of early failures, typically linked to manufacturing, design or installation defects.
Later, at the end of the life of the device, the failure rate increases because of wear phenomena, caused by alterations of the component due to material and structural aging. The beginning of the wear-out period is identified by an increase in the frequency of failures, which continues as time goes by.
Between the period of early failures and that of wear-out, the failure rate is approximately constant: failures are due to random events and are called random failures.
The most common mathematical classifications of the hazard curve are the so-called Constant Failure Rate (CFR), Increasing Failure Rate (IFR) and Decreasing Failure Rate (DFR) models.
The CFR model is based on the assumption that the failure rate does not change over time. Mathematically, this model is the simplest and is based on the principle that faults are purely random events. The IFR model is based on the assumption that the failure rate increases over time: faults become more likely as time goes by because of wear, as is frequently found in mechanical components. The DFR model is based on the assumption that the failure rate decreases over time: failures become less likely as time goes by, as occurs in some electronic components.
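Since the failure rate may change over time, one can define a reliability parameter that behaves as if there were a counter accumulating hours of operation. The residual reliability function $R(t \mid t_0)$ is the probability that a component survives a further interval $t$, given that it has already operated for a time $t_0$:
$$R(t \mid t_0) = P(T > t_0 + t \mid T > t_0)$$
Applying Bayes' theorem we have:
$$P(T > t_0 + t \mid T > t_0) = \frac{P(T > t_0 \mid T > t_0 + t) \, P(T > t_0 + t)}{P(T > t_0)}$$
And, given that $P(T > t_0 \mid T > t_0 + t) = 1$, we obtain:
$$R(t \mid t_0) = \frac{R(t_0 + t)}{R(t_0)}$$
The residual Mean Time To Failure (residual MTTF) is the expected value of the residual life of a device that has already survived a time $t_0$:
$$MTTF(t_0) = E[T - t_0 \mid T > t_0] = \int_0^{\infty} R(t \mid t_0) \, dt$$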
For an IFR device, the residual reliability and the residual MTTF decrease progressively as the device accumulates hours of operation. This behavior explains the use of preventive actions to avoid failures. For a DFR device, both the residual reliability and the residual MTTF increase as the device accumulates hours of operation. This behavior motivates the use of an intensive run-in (burn-in) to avoid failures in the field.
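The characteristic life of a device is the time $t_c$ at which the reliability $R(t_c)$ equals $1/e \approx 0.368$, that is, the time at which the area under the hazard function is equal to one:
$$R(t_c) = e^{-1} \;\Leftrightarrow\; \int_0^{t_c} \lambda(u) \, du = 1$$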
The
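Let us consider a CFR device with a constant failure rate $\lambda$. Its time to failure is an exponential random variable, with probability density function:
$$f(t) = \lambda e^{-\lambda t}, \qquad t \ge 0$$
The corresponding cumulative distribution function $F(t)$ is:
$$F(t) = \int_0^t \lambda e^{-\lambda u} \, du = 1 - e^{-\lambda t}$$
The reliability function $R(t)$ is the survival function:
$$R(t) = 1 - F(t) = e^{-\lambda t}$$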
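For CFR items, the residual reliability and the residual MTTF both remain constant as the device accumulates hours of operation. In fact, from the definition of residual reliability,
$$R(t \mid t_0) = \frac{R(t_0 + t)}{R(t_0)} = \frac{e^{-\lambda (t_0 + t)}}{e^{-\lambda t_0}} = e^{-\lambda t} = R(t)$$
Similarly, the residual MTTF does not change over time:
$$MTTF(t_0) = \int_0^{\infty} R(t \mid t_0) \, dt = \int_0^{\infty} R(t) \, dt = MTTF$$
This behavior implies that preventive maintenance and burn-in are useless for CFR devices. Figure 13 shows the trend of the function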
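The probability that a fault which has not yet occurred at time $t$ occurs in the next interval $dt$ can be written as:
$$P(t < T < t + dt \mid T > t)$$
Recalling Bayes' theorem, which gives the probability of a hypothesis $H$ given the evidence $E$:
$$P(H \mid E) = \frac{P(E \mid H) \, P(H)}{P(E)}$$
we can replace the evidence $E$ with the fact that the fault has not yet taken place, from which we obtain:
$$P(t < T < t + dt \mid T > t) = \frac{P(T > t \mid t < T < t + dt) \, P(t < T < t + dt)}{P(T > t)}$$
Since $P(T > t \mid t < T < t + dt) = 1$, it follows that:
$$P(t < T < t + dt \mid T > t) = \frac{f(t) \, dt}{R(t)} = \frac{\lambda e^{-\lambda t} \, dt}{e^{-\lambda t}} = \lambda \, dt$$
As can be seen, this probability does not depend on $t$, that is, on the operating time already elapsed. The component has no memory of its own history, which is why the exponential distribution is called memoryless.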
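The use of the constant failure rate model facilitates the calculation of the characteristic life of a device. In fact, for a CFR item, the characteristic life $t_c$ is the reciprocal of the failure rate:
$$R(t_c) = e^{-\lambda t_c} = e^{-1} \;\Rightarrow\; t_c = \frac{1}{\lambda}$$
Therefore, the characteristic life, in addition to being calculated as the time value $t_c$ for which the reliability is $1/e \approx 0.368$, can more easily be evaluated as the reciprocal of the failure rate.
The definition of MTTF, in the CFR model, can be integrated by parts to give:
$$MTTF = \int_0^{\infty} t \, \lambda e^{-\lambda t} \, dt = \left[ -t e^{-\lambda t} \right]_0^{\infty} + \int_0^{\infty} e^{-\lambda t} \, dt = \frac{1}{\lambda}$$
In the CFR model, then, the MTTF and the characteristic life coincide and are equal to $1/\lambda$.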
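The CFR relations above (exponential reliability, MTTF equal to the characteristic life, and the memoryless property) can be checked numerically with a few lines of Python. The failure rate and the times used below are hypothetical values, not those of the worked example in the text.

```python
import math

LAMBDA = 1e-4  # hypothetical constant failure rate, failures per hour


def reliability(t: float, lam: float = LAMBDA) -> float:
    """R(t) = exp(-lambda * t) for a CFR item."""
    return math.exp(-lam * t)


def residual_reliability(t: float, t0: float, lam: float = LAMBDA) -> float:
    """Probability of surviving a further t hours after t0 hours of operation."""
    return reliability(t0 + t, lam) / reliability(t0, lam)


mttf = 1.0 / LAMBDA                         # also the characteristic life
r_at_mttf = reliability(mttf)               # about 0.368
fresh = reliability(500.0)                  # new item, 500 h mission
aged = residual_reliability(500.0, 2000.0)  # same mission after 2000 h of service
print(mttf, r_at_mttf, fresh, aged)
```

The values of `fresh` and `aged` coincide up to rounding, which is the numerical counterpart of the memoryless property: for a CFR item the hours already accumulated carry no information about the remaining life.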
Let us consider, for example, a component with constant failure rate equal to
From equation 43 we have:
For the law of the reliability
The probability that the component survives other
Suppose now that it has worked without failure for
6. CFR in series
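Let us consider $n$ different elements arranged in series, each with its own constant failure rate $\lambda_i$ and reliability $R_i(t) = e^{-\lambda_i t}$, and let us evaluate the overall reliability $R_S(t)$:
$$R_S(t) = \prod_{i=1}^{n} R_i(t) = \prod_{i=1}^{n} e^{-\lambda_i t} = e^{-\left( \sum_{i=1}^{n} \lambda_i \right) t}$$
Since the reliability of the overall system takes the form $R_S(t) = e^{-\lambda_S t}$, we can conclude that:
$$\lambda_S = \sum_{i=1}^{n} \lambda_i$$
In a system of CFR elements arranged in series, then, the failure rate of the system is equal to the sum of the failure rates of the components. The MTTF can thus be calculated using the simple relation:
$$MTTF = \frac{1}{\lambda_S} = \frac{1}{\sum_{i=1}^{n} \lambda_i}$$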
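A short Python sketch of the series rule for CFR elements follows. The two failure rates are hypothetical and are not the pump and filter data of the example discussed in the text; one year of continuous operation is taken as 8760 hours.

```python
import math


def series_failure_rate(rates):
    """Failure rate of CFR elements in series: the sum of the component rates."""
    return sum(rates)


def series_reliability(rates, t):
    """Reliability at time t of a series of CFR elements."""
    return math.exp(-series_failure_rate(rates) * t)


rates = [1e-4, 4e-4]  # hypothetical rates, failures per hour
lambda_s = series_failure_rate(rates)
mttf_s = 1.0 / lambda_s
r_one_year = series_reliability(rates, 8760.0)  # one year of continuous operation
print(lambda_s, mttf_s, r_one_year)
```

Since the rates simply add, the least reliable element dominates the system: here the second component accounts for 80% of the system failure rate.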
Consider the following example. A system consists of a pump and a filter, used to separate two parts of a mixture: the concentrate and the squeezing. Knowing that the failure rate of the pump is constant and is
To begin, we compare the physical arrangement with the reliability one, as represented in the following figure:
As can be seen, it is a simple series, for which we can write:
As a year of continuous operation is
7. CFR in parallel
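If two components arranged in parallel are identical and have constant failure rate $\lambda$, the reliability of the system $R_P(t)$ is:
$$R_P(t) = 1 - \left( 1 - e^{-\lambda t} \right)^2 = 2 e^{-\lambda t} - e^{-2 \lambda t}$$
The calculation of the MTTF leads to:
$$MTTF = \int_0^{\infty} R_P(t) \, dt = \frac{2}{\lambda} - \frac{1}{2\lambda} = \frac{3}{2\lambda}$$
Therefore, the MTTF increases compared to the single CFR component. The equivalent failure rate of the parallel system $\lambda_P$, taken as the reciprocal of the MTTF, is:
$$\lambda_P = \frac{1}{MTTF} = \frac{2}{3} \lambda$$
As you can see, the failure rate is not halved, but is reduced by one third.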
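The parallel result can be verified, and extended to more than two identical components, with the following Python sketch. The generalization to $n$ components uses the standard result that the MTTF of $n$ identical exponential components in parallel is $(1/\lambda) \sum_{k=1}^{n} 1/k$; the failure rate is a hypothetical value.

```python
import math


def parallel_reliability(t: float, lam: float, n: int = 2) -> float:
    """Reliability at time t of n identical CFR components in parallel."""
    return 1.0 - (1.0 - math.exp(-lam * t)) ** n


def parallel_mttf(lam: float, n: int = 2) -> float:
    """MTTF of n identical CFR components in parallel: (1/lam) * sum of 1/k."""
    return sum(1.0 / k for k in range(1, n + 1)) / lam


lam = 1e-3  # hypothetical failure rate, failures per hour
mttf_single = 1.0 / lam
mttf_pair = parallel_mttf(lam, 2)      # 3 / (2 * lam)
equivalent_rate = 1.0 / mttf_pair      # two thirds of lam
print(mttf_single, mttf_pair, equivalent_rate)
```

Each additional redundant unit adds progressively less to the MTTF (1/2, then 1/3, then 1/4 of the single-unit value), which is one reason why high levels of active redundancy are rarely cost-effective.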
For example, let us consider a safety system consisting of two batteries, each of which is able to compensate for the lack of electric power from the grid. The two batteries are identical and have a constant failure rate
As in the previous case, we start with a reliability block diagram of the problem, as visible in Figure 15.
It is a parallel arrangement, for which the following equation is applicable:
The MTTF is the reciprocal of the failure rate and is:
As a year of continuous operation is
It is interesting to calculate the reliability of a system of identical elements arranged in a parallel configuration
Let us consider, for example, three electric generators, arranged in parallel and with failure rate
We’ll have:
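A particular arrangement of components is the so-called parallel with standby: the second component comes into operation only when the first fails. Otherwise, it is idle.
If the components are identical, with constant failure rate $\lambda$, and the switch to the standby unit is assumed to be perfect, the system lifetime is the sum of the two component lifetimes, so that:
$$MTTF = \frac{1}{\lambda} + \frac{1}{\lambda} = \frac{2}{\lambda}$$
Thus, in parallel with standby, the MTTF is doubled.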
8. Repairable systems
Devices whose functionality can be restored after a failure deserve special attention. A repairable system [6] is a system that, after a failure, can be restored to a functional condition by a maintenance action, including replacement of the entire system. Maintenance actions performed on a repairable system can be classified into two groups: Corrective Maintenance (CM), performed in response to a system failure, and Preventive Maintenance (PM), performed before a failure occurs in order to delay or prevent it.
Like corrective actions, preventive activities may consist of either repair or replacement. Finally, note that operational maintenance actions (servicing), such as refueling a vehicle, are not considered PM [7].
Preventive maintenance can be divided into two subcategories: scheduled maintenance, performed at predetermined intervals, and Condition-Based Maintenance (CBM), performed according to the monitored condition of the equipment.
Adopting a CBM policy requires investment in instrumentation and in prediction and control systems: a thorough feasibility study is needed to verify that the cost of implementing the apparatus is truly offset by the reduction in maintenance costs.
The CBM approach consists of the following steps:
collect the data from the sensors;
diagnose the condition;
estimate the Remaining Useful Life – RUL;
decide whether to maintain or to continue to operate normally.
The CBM schedule is modeled with algorithms that aim at high effectiveness, in terms of cost minimization, subject to constraints such as the maximum time for the maintenance action, periods of high production rate, spare-parts lead times, the maximization of availability, and so on.
In support of the prognosis, diagrams are now widely used that show, graphically, when the sensor outputs reach alarm levels. They also set out the alert thresholds that identify the ranges of values for which maintenance action must be taken [9].
Starting from a state of degradation, detected by a measurement at the time
continue to operate: if we are in the area of non-alarming values. It is also possible that, while being in the area of preventive maintenance, we opt to postpone maintenance because a replacement has already been scheduled within a short interval of time;
stop the task: if we are in the area of values above the threshold established for condition-based preventive maintenance.
The modeling of repairable systems is commonly used to evaluate the performance of one or more repairable systems and of the related maintenance policies. The information can also be used in the initial phase of design of the systems themselves.
In the traditional modeling paradigm, a repairable system can only be in one of two states: working (up) or inoperative (down). Note that a system may be down not only because of a fault, but also because of preventive or corrective maintenance.
9. Availability
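Availability may be generically defined as the percentage of time that a repairable system is in an operating condition. However, in the literature, there are four specific measures of repairable system availability. We consider only the limit availability, defined as the limit, as $t$ tends to infinity, of the probability $A(t)$ that the system is working at time $t$:
$$A = \lim_{t \to \infty} A(t)$$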
The limit availability just seen is also called
Models of the impact of preventive and corrective maintenance on the age of the component distinguish between perfect, minimal and imperfect maintenance.
The average duration of a maintenance activity is the expected value of the probability distribution of the repair time and is called the Mean Time To Repair (MTTR).
Figure 17 shows the state functions of two repairable systems with increasing failure rate, maintained with perfect and minimal repair.
10. The general substitution model
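The general substitution model states that the failure time of a repairable system is an unspecified random variable. The duration of (perfect) corrective maintenance is also a random variable. In this model it is assumed that preventive maintenance is not performed.
Let's denote by $T_i$ the duration of the $i$-th interval of operation of the repairable system. Under the assumption of perfect maintenance (as good as new), $\{T_1, T_2, \dots\}$ is a sequence of independent and identically distributed random variables.
Let us now designate with $D_i$ the duration of the $i$-th corrective maintenance action, and assume that these random variables are also independent and identically distributed.
Regardless of the probability distributions governing $T_i$ and $D_i$, the fundamental result of the general substitution model is:
$$A = \frac{E[T_i]}{E[T_i] + E[D_i]} = \frac{MTTF}{MTTF + MTTR}$$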
11. The substitution model for CFR
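Let us consider the special case of the general substitution model where the operating time $T_i$ is an exponential random variable with constant failure rate $\lambda$, and the repair time $D_i$ is an exponential random variable with constant repair rate $\mu$. For this system the limit availability is:
$$A = \frac{\mu}{\lambda + \mu} = \frac{MTTF}{MTTF + MTTR} \qquad (63)$$
Let us analyze, for example, a repairable system subject to a replacement policy, with exponentially distributed failure and repair times, MTTF = 1000 hours and MTTR = 10 hours.
Let's calculate the limit availability of the system. The formulation of the limit availability in this system is given by eq. 63, so, with $\lambda = 1/MTTF$ and $\mu = 1/MTTR$, we have:
$$A = \frac{\mu}{\lambda + \mu} = \frac{MTTF}{MTTF + MTTR} = \frac{1000}{1000 + 10} \approx 0.990$$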
This means that the system is available for 99% of the time.
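The availability calculation of this example is short enough to be checked directly. The Python sketch below computes the limit availability both from the mean times and from the corresponding rates; the function names are introduced here for illustration.

```python
def limit_availability(mttf: float, mttr: float) -> float:
    """Limit availability from mean time to failure and mean time to repair."""
    return mttf / (mttf + mttr)


def limit_availability_from_rates(failure_rate: float, repair_rate: float) -> float:
    """Same quantity expressed with the failure rate and the repair rate."""
    return repair_rate / (failure_rate + repair_rate)


a_times = limit_availability(1000.0, 10.0)
a_rates = limit_availability_from_rates(1.0 / 1000.0, 1.0 / 10.0)
print(round(a_times, 4), round(a_rates, 4))
```

Both forms give the same value, since the rates are the reciprocals of the mean times.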
12. General model of minimal repair
After examining the substitution model, we now consider a second model for repairable systems: the general model of minimal repair. According to this model, the time of system failure is a random variable. Corrective maintenance is instantaneous, the repair is minimal, and no preventive activity is performed.
The times of arrival of faults, in a repairable system corresponding to the general model of minimal repair, correspond to a process of random experiments, each of which is regulated by the same negative exponential distribution. As known, having neglected the repair time, the number of faults detected by time $t$, $\{N(t), t \ge 0\}$, is a Non-Homogeneous Poisson Process (NHPP).
13. Minimal repair with CFR
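A well-known special case of the general model of minimal repair is obtained if the failure time $T$ is an exponential random variable with failure rate $\lambda$.
In this case, the general model of minimal repair is simplified because the number $N(t)$ of faults that occur within time $t$ is described by a homogeneous Poisson process with constant intensity $z(t) = \lambda$, and its expected value is:
$$E[N(t)] = Z(t) = \int_0^t z(u) \, du = \lambda t$$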
If, for example, we consider
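Finally, we can obtain the probability mass function of $N(t)$, which is a Poisson distribution:
$$P[N(t) = n] = \frac{(\lambda t)^n}{n!} e^{-\lambda t}, \qquad n = 0, 1, 2, \dots$$
Also, the probability mass function of $N(t + s) - N(s)$, that is, the number of faults in an interval of length $t$ shifted forward by $s$, is identical:
$$P[N(t + s) - N(s) = n] = \frac{(\lambda t)^n}{n!} e^{-\lambda t}$$
Since the two expressions are equal, the conclusion is that in the homogeneous Poisson process (CFR), the number of faults in a given interval depends only on the length of the interval.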
The behavior of a Poisson mass probability distribution, with rate equal to 5 faults each year, representing the probability of having
Since in the model of minimal repair with CFR, repair time is supposed to be zero (MTTR = 0), the following relation applies:
Suppose that a system, subjected to a repair model of minimal repair, shows failures according to a homogeneous Poisson process with failure rate
The estimate of the average number of failures in 5000 hours, can be carried out with the expected value function:
The probability of having not more than 15 faults in a period of 5000 hours of operation, is calculated with the sum of the probability mass function evaluated between 0 and 15:
14. Minimal repair: Power law
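A second special case of the general model of minimal repair is obtained if the failure time $T$ is a random variable with a Weibull distribution, with shape parameter $\beta$ and scale parameter $\alpha$.
In this case the sequence of failure times is described by a Non-Homogeneous Poisson Process (NHPP) with intensity:
$$z(t) = \frac{\beta}{\alpha} \left( \frac{t}{\alpha} \right)^{\beta - 1}$$
Since the cumulative intensity of the process is defined by:
$$Z(t) = \int_0^t z(u) \, du$$
the cumulative function is:
$$Z(t) = \left( \frac{t}{\alpha} \right)^{\beta}$$
As can be seen, the average number of faults occurring within time $t \ge 0$, $E[N(t)] = Z(t)$, follows the so-called power law.
In fact, if we take $\beta > 1$, the intensity function $z(t)$ increases, which means that faults tend to occur more frequently over time. Conversely, if $\beta < 1$, faults become less frequent over time.
The probability mass function of $N(t)$ thus becomes:
$$P[N(t) = n] = \frac{Z(t)^n}{n!} e^{-Z(t)} = \frac{(t/\alpha)^{\beta n}}{n!} e^{-(t/\alpha)^{\beta}}, \qquad n = 0, 1, 2, \dots$$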
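The power-law process can be evaluated with a few helper functions. In this Python sketch the Weibull parameters ($\alpha = 500$ hours, $\beta = 2$) are hypothetical, and the expected number of faults in an interval is computed as the difference of the cumulative intensity at its endpoints, a standard property of the NHPP.

```python
import math


def cumulative_intensity(t: float, alpha: float, beta: float) -> float:
    """Z(t) = (t / alpha) ** beta: expected number of faults in (0, t]."""
    return (t / alpha) ** beta


def expected_faults(t_start: float, t_end: float, alpha: float, beta: float) -> float:
    """Expected number of faults in (t_start, t_end]."""
    return cumulative_intensity(t_end, alpha, beta) - cumulative_intensity(t_start, alpha, beta)


def prob_n_faults(n: int, t: float, alpha: float, beta: float) -> float:
    """Probability of exactly n faults in (0, t]."""
    z = cumulative_intensity(t, alpha, beta)
    return z**n * math.exp(-z) / math.factorial(n)


alpha, beta = 500.0, 2.0  # hypothetical Weibull scale (hours) and shape
first_1000 = expected_faults(0.0, 1000.0, alpha, beta)
next_1000 = expected_faults(1000.0, 2000.0, alpha, beta)
p_two_or_more = 1.0 - prob_n_faults(0, 1000.0, alpha, beta) - prob_n_faults(1, 1000.0, alpha, beta)
print(first_1000, next_1000, round(p_two_or_more, 4))
```

With $\beta > 1$ the second 1000 hours are expected to produce more faults than the first 1000 hours, which is the deterioration that motivates the replacement policy discussed later.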
For example, let us consider a system that fails, according to a power law, having
The average number of failures that occur during the first 1000 hours of operation, is calculated with the expected value of the distribution:
The probability of two or more failures during the first 1000 hours of operation can be calculated as complementary to the probability of having zero or one failure:
The average number of faults in the succeeding 1000 hours of operation is calculated using the equation:
$$E[N(t + s) - N(s)] = Z(t + s) - Z(s)$$
that, in this case, is:
15. Conclusion
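After seeing the main definitions of reliability and maintenance, let us finally see how reliability knowledge can also be used to carry out an economic optimization of replacement activities.
Consider a process that follows the power law with $\beta > 1$, so that the expected number of faults up to time $t$ is $Z(t) = (t/\alpha)^{\beta}$. As time goes by, faults take place more and more frequently and, at some point, it becomes convenient to replace the system.
Let us define with $\tau$ the time at which the replacement (assumed instantaneous) takes place. A cost model can be built to determine the optimal replacement time $\tau^*$.
Let's denote by $C_f$ the cost of a failure and by $C_r$ the cost of replacing the repairable system.
If the repairable system is replaced every $\tau$ time units, in each cycle we incur the replacement cost $C_r$ plus the cost $C_f$ for each of the faults expected in the interval $(0, \tau]$, whose number is $E[N(\tau)]$.
The average cost per unit of time $c(\tau)$, in the long term, can then be calculated as:
$$c(\tau) = \frac{C_f \, E[N(\tau)] + C_r}{\tau}$$
Then follows:
$$c(\tau) = \frac{C_f \, (\tau/\alpha)^{\beta} + C_r}{\tau}$$
Differentiating $c(\tau)$ with respect to $\tau$ and setting the derivative equal to zero, we find the minimum of the cost, that is, the optimal replacement time:
$$\tau^* = \alpha \left[ \frac{C_r}{C_f \, (\beta - 1)} \right]^{1/\beta} \qquad (81)$$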
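The cost model can be explored numerically with the Python sketch below. It assumes a power-law process with $\beta > 1$ and cumulative intensity $(\tau/\alpha)^{\beta}$; the parameter values and the costs are hypothetical and are not those of the example in the text.

```python
def cost_rate(tau: float, alpha: float, beta: float, c_f: float, c_r: float) -> float:
    """Long-run cost per unit time when the system is replaced every tau."""
    return (c_f * (tau / alpha) ** beta + c_r) / tau


def optimal_interval(alpha: float, beta: float, c_f: float, c_r: float) -> float:
    """Replacement interval that minimizes the cost rate (requires beta > 1)."""
    if beta <= 1.0:
        raise ValueError("no finite optimum: the failure intensity does not increase")
    return alpha * (c_r / (c_f * (beta - 1.0))) ** (1.0 / beta)


# Hypothetical data: scale 1000 h, shape 2, failure cost 100, replacement cost 400.
alpha, beta, c_f, c_r = 1000.0, 2.0, 100.0, 400.0
tau_star = optimal_interval(alpha, beta, c_f, c_r)
best = cost_rate(tau_star, alpha, beta, c_f, c_r)
print(tau_star, best)
print(cost_rate(0.95 * tau_star, alpha, beta, c_f, c_r))
print(cost_rate(1.05 * tau_star, alpha, beta, c_f, c_r))
```

Evaluating the cost rate slightly before and after the computed interval gives higher values in both cases, consistent with the interval being a minimum. For $\beta \le 1$ the intensity does not increase, so periodic replacement brings no benefit and the function raises an error instead of returning a value.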
Consider, for example, a system that fails according to a Weibull distribution with
The application of eq. 81 provides the answer to the question:
Nomenclature
RBD: Reliability Block Diagram
CBM: Condition-Based Maintenance
CFR: Constant Failure Rate
CM: Corrective Maintenance
DFR: Decreasing Failure Rate
IFR: Increasing Failure Rate
MCS: Minimal Cut Set
MPS: Minimal Path Set
MTTF: Mean Time To Failure
MTTR: Mean Time To Repair
NHPP: Non-Homogeneous Poisson Process
PM: Preventive Maintenance
References
1. Nakajima S. Introduction to TPM: Total Productive Maintenance. Productivity Press, Inc.; 1988:129.
2. Barlow RE. Engineering Reliability. SIAM; 2003.
3. De Carlo F. Impianti industriali: conoscere e progettare i sistemi produttivi. New York: Mario Tucci; 2012.
4. O’Connor P, Kleyner A. Practical Reliability Engineering. John Wiley & Sons; 2011.
5. Meyer P. Understanding Measurement: Reliability. Oxford University Press; 2010.
6. Ascher H, Feingold H. Repairable systems reliability: modeling, inference, misconceptions and their causes. M. Dekker; 1984.
7. De Carlo F, Borgia O, Adriani PG, Paoli M. New maintenance opportunities in legacy plants. 34th ESReDA Seminar, San Sebastian, Spain: 2008.
8. Gertler J. Fault detection and diagnosis in engineering systems. Marcel Dekker; 1998.
9. Borgia O, De Carlo F, Tucci M. From diagnosis to prognosis: A maintenance experience for an electric locomotive. Safety, Reliability and Risk Analysis: Theory, Methods and Applications. Proceedings of the Joint ESREL and SRA-Europe Conference, vol. 1, 2009, pp. 211–8.
10. Racioppi G, Monaci G, Michelassi C, Saccardi D, Borgia O, De Carlo F. Availability assessment for a gas plant. Petroleum Technology Quarterly 2008;13:33–7.