Failure prediction is one of the key challenges that have to be mastered for a new arena of fault tolerance techniques: the proactive handling of faults. As a definition, prediction is a statement about what will happen or might happen in the future. A failure is defined as “an event that occurs when the delivered service deviates from correct service.” The main point here is that a failure refers to misbehavior that can be observed by the user, which can either be a human or another computer system. Things may go wrong inside the system, but as long as it does not result in incorrect output (including the case that there is no output at all) there is no failure. Failure prediction is about assessing the risk of failure for some time in the future. In my approach, failures are predicted by analysis of error events that have occurred in the system. As, of course, not all events that have occurred ever since can be processed, only events of a time interval called embedding time are used. Failure probabilities are computed not only for one point of time in the future, but for a time interval called prediction interval.
Failure rates and their projected consequences are important factors in insurance, business, and regulatory practice, and they are fundamental to the design of safe systems throughout a national or international economy. From an economic point of view, downtime caused by machinery failures can be very costly. Repairing broken-down machines is also expensive, because breakdowns consume resources: manpower, spare parts, and lost production. Repair costs are therefore an important component of total machine ownership costs. Traditional maintenance policies include corrective maintenance (CM) and preventive maintenance (PM). Under a CM policy, maintenance is performed after a breakdown or the occurrence of an obvious fault. Under a PM policy, maintenance is performed to prevent equipment breakdown. As an example, it has been reported that in developing countries almost 53% of total machine expenses are spent on repairing breakdowns, compared with 8% in developed countries, and that establishing an effective and practicable repair and maintenance program could reduce these costs by up to 50%.
Maintenance activities, methodologies and tools together aim to ensure the continuity of the productive process. Traditionally, this objective was achieved by overhauling and replacing critical systems, or through operational and functional redundancy that guaranteed excess productive capacity. Both approaches have proved partially inefficient: redundant systems and surplus capacity tie up capital that could be used more profitably in production, while very conservative overhaul policies are a rather expensive way to meet the required standards. Maintenance has thus evolved from a simple repair activity into a complex managerial task whose main aim is the prevention of failure. An optimal maintenance approach is a key support to industrial production in the contemporary process industry, and many tools have been developed for improving and optimizing this task.
Most industrial systems are highly complex; nevertheless, in many cases they can be repaired. Moreover, historical or benchmarking data on system failure and repair patterns are difficult to obtain and are often not reliable enough because of various practical constraints. In such circumstances, a good reliability, availability and maintainability (RAM) analysis can play a key role in the design phase and in any modification required to optimize the performance of such systems. Assessing component reliability is a basic requirement for appropriate maintenance; available reliability assessment procedures rely on knowledge about component states. However, component states are often uncertain or unknown, particularly during the early stages of the development of new systems. In these cases, it is essential to understand how uncertainties affect the evaluation of system reliability. System reliability often depends on age, intrinsic factors (dimensioning, component quality, material, etc.) and conditions of use (environment, load rate, stress, etc.). The parameter that defines a machine’s reliability is the failure rate (λ), which characterizes the frequency of breakdowns. In this context, failure rate analysis is a strategic method for integrating reliability, availability and maintainability, using methods, tools and engineering techniques (such as mean time to failure, equipment downtime and system availability values) to identify and quantify the equipment and system failures that prevent a system from achieving its objectives. We first define some common terms related to failure rate:
Failure: A failure occurs when a component is not available. Components fail for different reasons: they may have been randomly chosen and marked as failed in order to assess the effect, or they may fail because another component on which they depend has broken down. In reliability engineering, a failure is the event in which a component or system does not perform its intended function and is therefore considered unavailable.
Error: In reliability engineering, an error is a discrepancy that is the root cause of a failure.
Fault: In reliability engineering, a fault is a malfunction that is the root cause of an error. Within this chapter, however, we may refer to a component failure as a fault that can lead to a system failure. This is done where there is a risk of confusion between a failure occurring at an intermediate level (referred to as a fault) and one occurring at the final, system level (referred to as a failure).
2. Failure Rate
The reliability of a machine is the probability that it performs its function for a defined period under stated conditions. Reliability is closely related to a machine’s operational availability; it can therefore be described in terms of the period during which the machine operates without any breakdown. Equipment reliability depends on the frequency of failures, which is expressed by the MTBF 1 . Reliability predictions are based on failure rates. The failure intensity, λ(t) 2 , can be defined as “the anticipated number of times an item will fail in a specified time period, given that it was as good as new at time zero and is functioning at time t”. This computed value provides a measure of reliability for a piece of equipment. It is commonly expressed in failures per million hours (fpmh). For example, a component with a failure rate of 10 fpmh would be expected to fail 10 times in 1 million hours of operation. Failure rate calculations are based on complex models that include factors using specific component data such as stress, environment and temperature. In the prediction model, assembled components are treated as a series arrangement, so the failure rate of an assembly is the sum of the individual failure rates of its components. The MTBF is determined using Eq. (1). The failure rate, which is equal to the reciprocal of the MTBF expressed in hours, is calculated using Eq. (2).
$$\mathrm{MTBF} = \frac{T}{n} \tag{1}$$
$$\lambda = \frac{1}{\mathrm{MTBF}} = \frac{n}{T} \tag{2}$$
where MTBF is the mean time between failures, h; T is the total time, h; n is the number of failures; and λ is the failure rate, failures per $10^n$ h.
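As a minimal sketch of Eqs. (1) and (2) and of the series (summation) rule for assemblies, the following Python functions may help. The function names and the sample component rates are illustrative assumptions, not part of the chapter; the 10 failures in 1 million hours case mirrors the 10 fpmh example in the text.

```python
def mtbf_hours(total_time_h, n_failures):
    """Eq. (1): MTBF = T / n, with T in hours."""
    if n_failures <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    return total_time_h / n_failures


def failure_rate_per_hour(total_time_h, n_failures):
    """Eq. (2): lambda = 1 / MTBF = n / T, in failures per hour."""
    return 1.0 / mtbf_hours(total_time_h, n_failures)


def assembly_failure_rate(component_rates):
    """Series assumption: the assembly rate is the sum of component rates."""
    return sum(component_rates)


if __name__ == "__main__":
    # 10 failures observed over 1 million operating hours (10 fpmh).
    T, n = 1_000_000.0, 10
    print("MTBF (h):", mtbf_hours(T, n))
    print("lambda (fpmh):", failure_rate_per_hour(T, n) * 1e6)
    # Hypothetical component failure rates, in failures per hour.
    print("assembly lambda (1/h):", assembly_failure_rate([2e-6, 3e-6, 5e-6]))
```

Keeping the rate in failures per hour internally and converting to fpmh only for display avoids unit mix-ups when rates are summed.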
There are some common basic categories of failure rates:
Mean Time Between Failures (MTBF)
Mean Time To Failure (MTTF)
Mean Time To Repair (MTTR)
Mean Down Time (MDT)
Probability of Failure on Demand (PFD)
Safety Integrity Level (SIL)
2.1. Mean time between failures (MTBF)
The basic measure of reliability for repairable equipment is the mean time between failures (MTBF). MTBF can be described as the time that elapses before a component, assembly, or system breaks down, under the assumption of a constant failure rate. Equivalently, the MTBF of a repairable system is the expected time between two successive failures. It is a commonly used quantity in reliability and maintainability analyses. For constant failure rate systems, MTBF is the inverse of the failure rate, λ. For example, for a component with a failure rate of 2 failures per million hours, the MTBF is:
$$\mathrm{MTBF} = \frac{1}{\lambda} = \frac{1}{2 \times 10^{-6}\ \text{failures/h}} = 500{,}000\ \text{h}$$
NOTE: Although MTBF was designed for use with repairable items, it is commonly used for both repairable and non-repairable items. For non-repairable items, MTBF is the time until the first (and only) failure after t0.
Any unit of time can be used for the failure rate, but hours are the most common in practice. Other units, such as miles or revolutions, can also be used in place of time units.
In engineering notation, failure rates are often expressed as failures per million (10^−6), particularly for individual components, because their values are often very low.
The failures in time (FIT) rate of a component is the number of failures that can be expected in one billion (10^9) device-hours of operation (e.g., 1000 components for 1 million hours each, or 1 million components for 1000 hours each, or some other combination). This unit is currently used by the semiconductor industry.
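The unit conversions described above can be sketched as follows. The helper names are assumptions introduced for illustration; the only facts used are that 1 FIT is one failure per 10^9 device-hours and 1 fpmh is one failure per 10^6 hours.

```python
def fpmh_to_fit(fpmh):
    """Convert failures per million hours to FIT (failures per 1e9 hours)."""
    return fpmh * 1000.0


def fit_to_mtbf_hours(fit):
    """MTBF in hours for a constant failure rate given in FIT."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return 1e9 / fit


def fit_from_test(failures, units, hours_per_unit):
    """FIT estimated from a test with `units` devices run `hours_per_unit` each."""
    device_hours = units * hours_per_unit
    return failures * 1e9 / device_hours


if __name__ == "__main__":
    # The 2 fpmh component from the MTBF example, expressed in FIT.
    fit = fpmh_to_fit(2.0)
    print("FIT:", fit, "MTBF (h):", fit_to_mtbf_hours(fit))
    # One failure among 1000 components run for 1 million hours each.
    print("FIT from test:", fit_from_test(1, 1000, 1_000_000))
```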
Example 1 To estimate the failure rate of a certain component, we can carry out the following test. Suppose 10 identical components are each tested until they either break down or reach 1000 hours, at which point the test ends for that component. The results are shown in Table 1.
Table 1. Component failures during use hours.
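A test of this kind is usually evaluated by dividing the number of observed failures by the total accumulated operating hours. The sketch below illustrates that calculation; the hours in `hours_on_test` are hypothetical values invented for illustration and are not the data of Table 1, and the function name is likewise an assumption.

```python
def estimate_failure_rate(hours_on_test, failed):
    """Estimate a constant failure rate as failures / total unit-hours.

    hours_on_test: hours each unit accumulated before failing or being stopped.
    failed: True where the unit failed, False where it survived to the cutoff.
    """
    total_hours = sum(hours_on_test)
    n_failures = sum(1 for f in failed if f)
    if n_failures == 0:
        raise ValueError("no failures observed; the rate cannot be estimated this way")
    return n_failures / total_hours


# HYPOTHETICAL data: ten units, test stopped at 1000 h for survivors.
hours_on_test = [1000, 1000, 400, 1000, 600, 500, 1000, 300, 700, 900]
failed = [False, False, True, False, True, True, False, True, True, True]

if __name__ == "__main__":
    rate = estimate_failure_rate(hours_on_test, failed)
    print("failures per hour:", rate)
    print("failures per million hours:", rate * 1e6)
    print("MTBF (h):", 1.0 / rate)
```

Units that reach the cutoff without failing still contribute their hours to the denominator, which is why the survivors are not simply discarded.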
Example 2 Suppose a tractor is operated for 6540 hours in one year (continuous operation, 24 hours a day and 7 days a week, would be 8760 hours) and the MTBF of the tractor is 1,050,000 hours:
$$\frac{6540\ \text{h}}{1{,}050{,}000\ \text{h}} \approx 0.0062 = 0.62\%$$
In an average year, we can therefore expect about 0.62% of these tractors to fail.
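Example 3 Now assume that a tractor is operated for 6320 hours a year and that its MTBF is 63,000 hours:
$$\frac{6320\ \text{h}}{63{,}000\ \text{h}} \approx 0.100 = 10.0\%$$
If instead the identical tractor runs 24 hours a day, 7 days a week (8760 hours a year):
$$\frac{8760\ \text{h}}{63{,}000\ \text{h}} \approx 0.139 = 13.9\%$$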
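Examples 2 and 3 use the ratio of operating hours to MTBF as the expected fraction of units failing. A short sketch comparing that linear ratio with the exponential expression 1 − exp(−t/MTBF), which it approximates when t is much smaller than the MTBF, is given below. The function names are illustrative assumptions, and the exponential form assumes a constant failure rate.

```python
import math


def fraction_failing_linear(hours, mtbf_h):
    """Simple ratio t / MTBF used in the worked examples."""
    return hours / mtbf_h


def fraction_failing_exponential(hours, mtbf_h):
    """1 - exp(-t / MTBF), assuming a constant failure rate."""
    return 1.0 - math.exp(-hours / mtbf_h)


if __name__ == "__main__":
    cases = [
        ("Example 2", 6540, 1_050_000),
        ("Example 3", 6320, 63_000),
        ("Example 3, continuous operation", 8760, 63_000),
    ]
    for label, hours, mtbf in cases:
        lin = fraction_failing_linear(hours, mtbf)
        exp_ = fraction_failing_exponential(hours, mtbf)
        print(f"{label}: linear {lin:.2%}, exponential {exp_:.2%}")
```

The two figures are nearly identical for Example 2, while for Example 3 the linear ratio slightly overstates the fraction failing because the operating time is no longer small compared with the MTBF.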
3.1. Mean time to failure (MTTF)
One of the basic measures of reliability for non-repairable systems is the mean time to failure (MTTF). This statistical value is defined as the average time expected until the first failure of a component or piece of equipment. MTTF is intended to be the mean over a long period of time and a large number of units. For constant failure rate systems, MTTF can be calculated as the inverse of the failure rate, 1/λ. If the failure rate, λ, is given in failures per million hours, then for components with exponentially distributed lifetimes MTTF = 1,000,000/λ hours, or:
$$\mathrm{MTTF} = \frac{1}{\lambda\ \text{failures}/10^{6}\ \text{h}} = \frac{10^{6}}{\lambda}\ \text{h}$$
For repairable systems, MTTF is the expected time from a repair to the first or next breakdown.
3.2. Mean time to repair (MTTR)
Mean time to repair (MTTR) can be described as the total time spent performing all corrective or preventive maintenance repairs divided by the total number of repairs. It is the expected time from a failure (or shutdown) to the completion of the repair or maintenance. The term is typically used only for repairable systems.
Four failure frequencies are commonly used in reliability analyses:
Failure density f(t): The failure density of a component or system is the probability per unit time that the component or system experiences its first failure at time t, given that it was operating at time zero.
Failure rate r(t): The failure rate of a component or system is the probability per unit time that the component or system experiences a failure at time t, given that it was operating at time zero and has survived to time t.
Conditional failure rate or conditional failure intensity λ(t): The conditional failure rate of a component or system is the probability per unit time that a failure occurs in the component or system at time t, given that it was operating, or was repaired to be as good as new, at time zero and is operating at time t.
Unconditional failure intensity or failure frequency ω(t): The unconditional failure intensity of a component or system is the probability per unit time that the component or system fails at time t, given that it was operating at time zero. The following relations (4) exist between failure parameters.
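For orientation, the standard textbook relations between these quantities can be written as below. This is a sketch under stated assumptions rather than a reproduction of the chapter's own expression: F(t) (the probability of a first failure by time t) and R(t) (the reliability) are symbols introduced here for illustration.

```latex
\begin{aligned}
R(t) &= 1 - F(t), \\
f(t) &= \frac{dF(t)}{dt} = -\frac{dR(t)}{dt}, \\
r(t) &= \frac{f(t)}{1 - F(t)} = \frac{f(t)}{R(t)}, \\
R(t) &= \exp\!\left(-\int_0^t r(u)\,du\right).
\end{aligned}
```

The last line follows from the third by integrating $r(t) = -\frac{d}{dt}\ln R(t)$ with $R(0) = 1$.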
The definitions of the failure rate r(t) and the conditional failure intensity λ(t) differ in that the failure rate refers to the first failure of the component or system, rather than to any failure. In particular, if the failure rate is constant over the time considered, or if the component is non-repairable, the two quantities are the same. So:
$$r(t) = \lambda(t)$$
The conditional failure intensity (CFI) λ(t) and the unconditional failure intensity ω(t) differ because the CFI carries the additional condition that the component or system has survived to time t. Equation (5) expresses the relationship between these two quantities mathematically.
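One common way to write this relationship is sketched below. It assumes a symbol Q(t), not defined in the chapter, for the probability that the component or system is in the failed state at time t (its unavailability); the chapter's own notation may differ.

```latex
\omega(t) = \lambda(t)\,\bigl[1 - Q(t)\bigr]
\qquad\Longleftrightarrow\qquad
\lambda(t) = \frac{\omega(t)}{1 - Q(t)}
```

The reasoning is that a failure in a short interval after t requires the item to be operating at t, which has probability 1 − Q(t), and then to fail at the conditional rate λ(t). For a non-repairable item Q(t) equals the probability of failure by time t, so the relation reduces to the failure rate expression r(t) = f(t)/(1 − F(t)), with F(t) defined as that probability.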
3.3. Constant failure rates
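If the failure rate is constant, λ(t) = λ, then the following expressions (6) apply, where F(t) is the probability of failure by time t and R(t) is the reliability:
$$F(t) = 1 - e^{-\lambda t}, \qquad R(t) = e^{-\lambda t}, \qquad f(t) = \lambda e^{-\lambda t} \tag{6}$$
As can be seen from the equation above, a constant failure rate results in an exponential failure density distribution.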
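The constant failure rate case can be checked numerically with a few lines of Python. The sketch assumes the standard exponential forms R(t) = exp(−λt) and f(t) = λ exp(−λt); the function names and the rate of 10^−3 per hour are illustrative assumptions.

```python
import math


def reliability(t, lam):
    """R(t) = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-lam * t)


def failure_density(t, lam):
    """f(t) = lambda * exp(-lambda * t)."""
    return lam * math.exp(-lam * t)


def hazard(t, lam):
    """f(t) / R(t); equals lambda at every t for the exponential case."""
    return failure_density(t, lam) / reliability(t, lam)


def conditional_survival(t, s, lam):
    """Probability of surviving s more hours given survival to t."""
    return reliability(t + s, lam) / reliability(t, lam)


if __name__ == "__main__":
    lam = 1e-3  # hypothetical rate, failures per hour (MTBF = 1000 h)
    print("R at t = MTBF:", reliability(1000.0, lam))
    print("hazard at 500 h:", hazard(500.0, lam))
    print("survive 100 h more, given 2000 h:", conditional_survival(2000.0, 100.0, lam))
    print("survive first 100 h:", reliability(100.0, lam))
```

Two points are worth noting: only about 37% of units survive to t = MTBF, and the conditional survival over the next 100 hours does not depend on the age already reached, which is the memoryless property of the exponential distribution.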
3.4. Mean down time (MDT)
In organizational management, mean down time (MDT) is defined as the mean time for which a system is not usable. It includes all downtime, such as repair, corrective and preventive maintenance, self-imposed downtime, and any logistics or administrative delays. MDT differs from mean time to repair (MTTR) in that MDT includes any and all delays involved, whereas MTTR looks specifically at repair time.
Sometimes, MTTR is used instead of MDT when availability is calculated. But MTTR may not be the same as MDT because:
The breakdown may not be noticed immediately after it has happened
The decision may be made not to repair the equipment immediately
The equipment may not be put back into service immediately after it is repaired
Whether you use MDT or MTTR, it is important that it reflects the total time for which the equipment is unavailable for service; otherwise the computed availability will be incorrect.
In the process industries, MTTR is often taken to be 8 hours, the length of a common work shift, but the actual repair time may be different, particularly in a given installation.
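The effect of using MTTR in place of MDT can be illustrated with the widely used steady-state availability expression, availability = MTBF / (MTBF + MDT). That formula is assumed here as the standard one; the MTBF of 2000 hours and the 48-hour MDT are hypothetical values, while the 8-hour MTTR is the shift-length figure mentioned above.

```python
def availability(mtbf_h, downtime_h):
    """Steady-state availability = MTBF / (MTBF + mean downtime)."""
    if mtbf_h <= 0 or downtime_h < 0:
        raise ValueError("MTBF must be positive and downtime non-negative")
    return mtbf_h / (mtbf_h + downtime_h)


if __name__ == "__main__":
    mtbf = 2000.0   # hypothetical MTBF, hours
    mttr = 8.0      # repair time only (one work shift)
    mdt = 48.0      # hypothetical total downtime including delays
    print("availability using MTTR:", availability(mtbf, mttr))
    print("availability using MDT: ", availability(mtbf, mdt))
```

Using the repair time alone overstates availability whenever detection, decision, or return-to-service delays are significant.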
3.5. Probability of failure on demand (PFD)
PFD is the probability of failure on demand. Safety systems are often designed to work in the background, monitoring a process but not doing anything until a safety limit is exceeded, at which point they must take some action to keep the process safe. These safety systems are often known as emergency shutdown (ESD) systems.
PFD is the unavailability of a safety function. If a demand to act occurs after some time, what is the probability that the safety function has already failed? As you might expect, the PFD equation (7) looks like the equation for general unavailability:
$$\mathrm{PFD}_{avg} = \lambda_{DU}\,\mathrm{MDT} \tag{7}$$
Note that we refer to PFDavg here, the mean probability of failure on demand, which is the correct term to use, since the probability changes over time: the failure probability of a system depends on how long ago it was tested.
λDU is the failure rate of dangerous undetected failures. We do not count failures that are deemed “safe,” perhaps because they cause the process to shut down, only those failures that remain hidden but will defeat the safety function when it is called upon.
This is important, because it warns us not to assume that a safety-related product is generally more reliable than a general-purpose product. A safety-related product is designed to have an especially low failure rate of the safety function, but its total failure rate (and hence its MTBF) may not be so impressive.
As for the MDT of a safety function, a dangerous undetected failure will not become apparent until either a demand comes along or a proof test reveals it.
Suppose we proof test our safety function every year or two, say every T1 hours. The safety function is equally likely to fail at any time between one proof test and the next, so, on average, it is down for T1/2 hours.
From this we get the simplest form of PFD calculation for safety functions:
$$\mathrm{PFD}_{avg} = \lambda_{DU}\,\frac{T_1}{2}$$
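The simplest PFD calculation, PFDavg = λDU × T1/2 as implied by the mean downtime of T1/2 described above, can be sketched as follows. The function name and the numerical values (a dangerous undetected failure rate of 10^−6 per hour and yearly or two-yearly proof tests) are hypothetical, and the approximation is only reasonable when λDU × T1 is much smaller than 1.

```python
def pfd_avg(lambda_du_per_hour, proof_test_interval_h):
    """Simplest PFDavg estimate: lambda_DU * T1 / 2.

    Valid as an approximation only while lambda_DU * T1 << 1.
    """
    if lambda_du_per_hour < 0 or proof_test_interval_h <= 0:
        raise ValueError("rate must be non-negative and interval positive")
    return lambda_du_per_hour * proof_test_interval_h / 2.0


if __name__ == "__main__":
    lambda_du = 1e-6  # hypothetical dangerous undetected failures per hour
    for years in (1, 2):
        t1 = years * 8760.0  # proof test interval in hours
        print(f"T1 = {years} year(s): PFDavg = {pfd_avg(lambda_du, t1):.2e}")
```

Doubling the proof test interval doubles PFDavg, which shows why the test interval is as important as the failure rate itself.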
3.6. Safety integrity level (SIL)
In reliability engineering, SIL is one of the most abused terms. “SIL” is often used to imply that a piece of equipment or a system has better quality, higher reliability, or some other desirable feature. It does not. SIL stands for safety integrity level and ranges from 1 to 4. It describes the degree of safety protection required by a process and, consequently, the safety reliability that the safety system needs in order to achieve that protection. SIL 4 is the highest level of safety protection and SIL 1 is the lowest.
Many products are marketed as “SIL” rated, meaning that they are suitable for use in safety systems. Whether this is true depends on a lot of detail, which is beyond the scope of this chapter. But remember that even when a product does meet “SIL” requirements, this only tells you that it will do a certain job in a safety system. Its safety reliability may be high, but its general reliability may not be, as mentioned in the previous section.
Useful to remember
If an item works for a long time without breakdown, it can be said to be highly reliable.
If an item does not fail very often and, when it does, it can be quickly returned to service, it is highly available.
If a system is reliable in performing its safety function, it is considered to be safe. The system may fail much more frequently in modes that are not considered to be dangerous.
Finally, a safety system may have a lower total MTBF than a non-safety system performing a similar function.
“SIL” does not mean a guarantee of quality or reliability, except in a defined safety context.
MTBF is a measure of reliability, but it is not the expected life, the useful life, or the average life.
Calculations of reliability and failure rate of redundant systems are complex and often counter-intuitive.
4. Failure types
Failures can generally be grouped into three basic types, though a particular failure may have more than one cause. The three types are early failures, random failures and wear-out failures. In the early life stage, failures (infant mortality) are often due to defects that escape the manufacturing process. As the defective parts fail, leaving a population of defect-free products, the number of failures caused by manufacturing problems decreases. Consequently, the early-stage failure rate decreases with age. During the useful life, failures may be related to freak accidents and mishandling that subject the product to unexpected stress conditions. The failure rate over the useful life is generally assumed to be very low and constant. As the equipment reaches the wear-out stage, it degrades as a result of repeated or sustained stress. The failure rate during the wear-out stage increases dramatically as more and more units fail from wear-out. When the failure rate is plotted over time, as illustrated in Figure 1, these stages form the so-called “bathtub” curve.
4.1. Early life period
Many methods are used to ensure the integrity of a design. Design techniques include burn-in (to stress devices under constant operating conditions); power cycling (to stress devices under the surges of turn-on and turn-off); temperature cycling (to stress devices mechanically and electrically over the temperature extremes); vibration; testing at the thermal destruct limits; and highly accelerated stress and life testing. Despite all these design tools, and manufacturing tools such as Six Sigma and quality improvement techniques, there will still be some early failures because processes cannot be controlled at the molecular level. There is always a risk that early breakdowns will happen even when the most up-to-date techniques are used in design and manufacture. To remove these risks, especially in newer products, some of the early useful life of a module is consumed by stress screening. The initial peak at the start of operating life represents the highest risk of failure; with screening, units begin their operating life somewhere closer to the flat portion of the bathtub curve. Burn-in and temperature cycling both consume operating life. The amount of screening needed for acceptable quality is a function of the process grade as well as history. M-Grade modules are screened more than I-Grade modules, and I-Grade modules are screened more than C-Grade units.
4.2. Useful life period
As the product matures, the weaker units die out, the failure rate becomes nearly constant, and modules enter what is considered the normal life period. This period is characterized by a relatively constant failure rate. Its length is referred to as the system life of the product or component. The failure rate is at its lowest during this period; notice that the bathtub curve is at its lowest level here. The useful life period is the most common time frame for making reliability predictions.
4.3. MTBF vs. useful life
Sometimes MTBF is mistakenly used in place of a component’s useful life. Consider a battery with a useful life of 10 hours and an MTBF of 100,000 hours. This means that in a set of 100,000 batteries, there will be about one battery failure every hour during their useful lives.
MTBF values are sometimes very high because they are based on the failure rate during the useful life period of the component, with the assumption that the component remains in this stage for a long time. In the example above, wear-out limits the component life, so the useful life is much shorter than the MTBF; there is not necessarily a direct correlation between the two.
Consider another example: a sample of 15,000 18-year-old humans observed for 1 year. During this period, the death rate is 15/15,000 = 0.1%/year. The inverse of the failure rate, or MTBF, is 1/0.001 = 1000 years. This example shows that a high MTBF value is different from life expectancy. As people grow older, more deaths occur, so the best way to obtain a realistic mean life would be to monitor the sample until all members reach the end of life and then average their life spans. The result would be on the order of 75–80 years, which is far more realistic.
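The arithmetic behind both examples can be reproduced with a short sketch. The function names are assumptions introduced for illustration; the numbers are those of the battery and 18-year-old examples above.

```python
def fleet_failures_per_hour(n_units, mtbf_h):
    """Expected failures per hour across a fleet during the useful life."""
    return n_units / mtbf_h


def mtbf_from_observation(failures, population, period):
    """MTBF implied by `failures` in `population` units observed for `period`.

    The result is in the same time unit as `period`.
    """
    if failures <= 0:
        raise ValueError("at least one failure is needed")
    return population * period / failures


if __name__ == "__main__":
    # 100,000 batteries, each with an MTBF of 100,000 h.
    print("battery failures per hour:", fleet_failures_per_hour(100_000, 100_000.0))
    # 15 deaths among 15,000 people observed for 1 year.
    print("implied MTBF (years):", mtbf_from_observation(15, 15_000, 1.0))
```

Both results describe the failure rate of a population during the observation window; neither says how long an individual unit will last before wear-out sets in.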
4.4. Wear-out period
As fatigue or wear-out occurs in components, failure rates increase sharply. Power supply wear-out is usually due to the breakdown of electrical components that are subject to physical wear and to electrical and thermal stress. Furthermore, the MTBFs or FIT rates calculated in the useful life period no longer apply in this region of the curve. A product with an MTBF of 10 years can still exhibit wear-out in 2 years. The wear-out time of components cannot be predicted by the parts count method. Electronics in general, and Vicor power supplies in particular, are designed so that the useful life extends past the design life. This way, wear-out should never occur during the useful life of a module.
4.5. Failure sources
There are two major categories of system outage: (1) unplanned outages (failures) and (2) planned outages (maintenance), both of which lead to downtime. In terms of cost, unplanned and planned outages are comparable, but the use of redundant components may mitigate it. A planned outage usually has a tolerable impact on system availability if it is scheduled appropriately. Planned outages mostly happen because of maintenance; causes of planned downtime include periodic backups, configuration changes, software upgrades and patches. According to prior research, 44% of downtime at service providers is unscheduled. This downtime can cost a great deal of money.
Another categorization can be:
Specification and design flaws, manufacturing defects and wear-out are categorized as internal factors. Radiation, electromagnetic interference, operator error and natural disasters can be considered external factors. Even when a system or its components are well designed and highly reliable, failures are unavoidable, but their impact on the system can be mitigated.
4.6. Failure rate data
Failure rate data are most commonly obtained in the following ways:
Historical data about the device or system under consideration.
Many organizations record failure information for the equipment or systems that they produce, which can be used to calculate failure rates for those devices or systems. For recently produced equipment or systems, historical data for similar equipment or systems can serve as a useful estimate.
Government and commercial failure rate data.
Handbooks of failure rate data for various equipment are available from government and commercial sources. MIL-HDBK-217F, Reliability Prediction of Electronic Equipment, is a military standard that provides failure rate data for many military electronic components. Several failure rate data sources are available commercially that focus on commercial components, including some non-electronic components.
Testing.
The most accurate source of data is to test samples of the actual devices or systems in order to generate failure data. This is often prohibitively expensive or impractical, so the previous data sources are often used instead.
4.7. Failure distribution types
The different types of failure distribution are provided in Table 2 . For an exponential failure distribution the hazard rate is a constant with respect to time (that is, the distribution is “memoryless”). For other distributions, such as a Weibull distribution or a log-normal distribution, the hazard function is not constant with respect to time. For some such as the deterministic distribution it is monotonic increasing (analogous to “wearing out”), for others such as the Pareto distribution it is monotonic decreasing (analogous to “burning in”), while for many it is not monotonic.
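The behavior of the hazard function can be illustrated with the two-parameter Weibull distribution, whose hazard is h(t) = (β/η)(t/η)^(β−1). This is a sketch under the standard Weibull parameterization; the shape and scale values below are hypothetical. A shape β below 1 gives a decreasing hazard, β equal to 1 reduces to the constant exponential hazard, and β above 1 gives an increasing hazard.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard h(t) = (shape / scale) * (t / scale) ** (shape - 1), t > 0."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1.0)


if __name__ == "__main__":
    scale = 100.0  # hypothetical characteristic life, hours
    for shape, label in [(0.5, "decreasing"), (1.0, "constant"), (3.0, "increasing")]:
        values = [weibull_hazard(t, shape, scale) for t in (10.0, 100.0, 500.0)]
        formatted = ", ".join(f"{v:.5f}" for v in values)
        print(f"shape {shape} ({label}): {formatted}")
```

Taken together, the three regimes correspond to the early life, useful life and wear-out regions of the bathtub curve.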
|Multinomial|Beyond the scope|Lognormal|Covered|
|Extreme value|Beyond the scope| | |
4.8. Derivations of failure rate equations for series and parallel systems
This section shows the derivations of the system failure rates for series and parallel configurations of constant failure rate components in Lambda Predict.
4.9. Series system failure rate equations
Consider a system consisting of n components in series. For this configuration, the system reliability, RS, is given by:
$$R_S = R_1 R_2 \cdots R_n = \prod_{i=1}^{n} R_i$$
where R1, R2, …, Rn are the reliabilities of the n components. If the failure rates of the components are λ1, λ2, …, λn, then the system reliability is:
$$R_S(t) = e^{-\lambda_1 t}\, e^{-\lambda_2 t} \cdots e^{-\lambda_n t} = e^{-\left(\sum_{i=1}^{n} \lambda_i\right) t}$$
Therefore, the system reliability can be expressed in terms of the system failure rate, λS, as:
$$R_S(t) = e^{-\lambda_S t}, \qquad \lambda_S = \sum_{i=1}^{n} \lambda_i$$
It should be pointed out that if n blocks with non-constant (i.e., time-dependent) failure rates are arranged in a series configuration, then the system failure rate has a similar equation to the one for constant failure rate blocks arranged in series and is given by:
$$\lambda_S(t) = \sum_{i=1}^{n} \lambda_i(t)$$
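The series result can be checked numerically: the product of the component reliabilities should equal the exponential of minus the summed failure rates times t. The sketch below assumes constant failure rates; the function names and the three component rates are hypothetical.

```python
import math


def series_failure_rate(rates):
    """System failure rate of a series system: sum of component rates."""
    return sum(rates)


def series_reliability(rates, t):
    """R_S(t) = exp(-lambda_S * t) for constant component failure rates."""
    return math.exp(-series_failure_rate(rates) * t)


def product_of_component_reliabilities(rates, t):
    """R_1(t) * R_2(t) * ... * R_n(t), computed component by component."""
    result = 1.0
    for lam in rates:
        result *= math.exp(-lam * t)
    return result


if __name__ == "__main__":
    rates = [2e-6, 3e-6, 5e-6]  # hypothetical rates, failures per hour
    t = 10_000.0
    print("lambda_S (1/h):", series_failure_rate(rates))
    print("R_S from summed rate:", series_reliability(rates, t))
    print("R_S from product:   ", product_of_component_reliabilities(rates, t))
```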
4.10. Parallel system failure rate equations
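Consider a system with n identical constant failure rate components arranged in a simple parallel configuration. For this case, the system reliability equation is given by:
$$R_S = 1 - (1 - R_C)^n$$
where RC is the reliability of each component. Substituting the expression for component reliability in terms of the constant component failure rate, λC, yields:
$$R_S(t) = 1 - \left(1 - e^{-\lambda_C t}\right)^n$$
Notice that this equation does not reduce to the form of a simple exponential distribution, as it does for a system of components arranged in series. In other words, the reliability of a system of constant failure rate components arranged in parallel cannot be modeled using a constant system failure rate model.
To find the failure rate of a system of n components in parallel, the relationship between the reliability function, the probability density function and the failure rate is employed. The failure rate is defined as the ratio between the probability density and reliability functions, or:
$$\lambda(t) = \frac{f(t)}{R(t)}$$
Because the probability density function can be written in terms of the time derivative of the reliability function, the previous equation becomes:
$$\lambda(t) = -\frac{1}{R(t)}\,\frac{dR(t)}{dt}$$
The reliability of a system of n components in parallel is:
$$R_S(t) = 1 - \left(1 - R_C(t)\right)^n$$
and its time derivative is:
$$\frac{dR_S(t)}{dt} = n\left(1 - R_C(t)\right)^{n-1}\frac{dR_C(t)}{dt}$$
Substituting into the expression for the system failure rate yields:
$$\lambda_S(t) = \frac{-n\left(1 - R_C(t)\right)^{n-1}\dfrac{dR_C(t)}{dt}}{1 - \left(1 - R_C(t)\right)^n}$$
For constant failure rate components, $R_C(t) = e^{-\lambda_C t}$ and $dR_C(t)/dt = -\lambda_C e^{-\lambda_C t}$, so the system failure rate becomes:
$$\lambda_S(t) = \frac{n \lambda_C\, e^{-\lambda_C t}\left(1 - e^{-\lambda_C t}\right)^{n-1}}{1 - \left(1 - e^{-\lambda_C t}\right)^n}$$
Thus, the failure rate for identical constant failure rate components arranged in parallel is time-dependent. Taking the limit of the system failure rate as t approaches infinity leads to the following expression for the steady-state system failure rate:
$$\lambda_{S,\text{steady state}} = \lim_{t \to \infty} \frac{n \lambda_C\, e^{-\lambda_C t}\left(1 - e^{-\lambda_C t}\right)^{n-1}}{1 - \left(1 - e^{-\lambda_C t}\right)^n}$$
Both numerator and denominator tend to zero. Applying L’Hôpital’s rule (differentiating both with respect to $x = e^{-\lambda_C t}$, which tends to zero) one obtains:
$$\lambda_{S,\text{steady state}} = \lambda_C$$
So the steady-state failure rate for a system of constant failure rate components in a simple parallel arrangement is the failure rate of a single component. It can be shown that for a k-out-of-n parallel configuration with identical components:
$$\lambda_{S,\text{steady state}} = k\,\lambda_C$$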
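The time dependence of the parallel system failure rate, and its approach to the single-component rate, can be explored numerically. The sketch below evaluates the standard expression for n identical constant failure rate components in simple parallel, λS(t) = n λC x (1 − x)^(n−1) / (1 − (1 − x)^n) with x = exp(−λC t). The function name and the component rate are hypothetical, and `expm1`/`log1p` are used only to keep the denominator accurate when x is very small.

```python
import math


def parallel_failure_rate(n, lam, t):
    """Failure rate at time t of n identical components in simple parallel."""
    if n < 1 or lam <= 0 or t < 0:
        raise ValueError("need n >= 1, lam > 0 and t >= 0")
    x = math.exp(-lam * t)  # component reliability R_C(t)
    if x == 0.0:
        # exp underflowed for very large t: return the steady-state value
        return lam
    if x == 1.0:
        # t = 0: a redundant system (n > 1) cannot fail instantly
        return lam if n == 1 else 0.0
    numerator = n * lam * x * (1.0 - x) ** (n - 1)
    denominator = -math.expm1(n * math.log1p(-x))  # 1 - (1 - x)^n
    return numerator / denominator


if __name__ == "__main__":
    lam_c = 1e-3  # hypothetical component failure rate, per hour
    for t in (100.0, 1000.0, 5000.0, 20000.0):
        rate = parallel_failure_rate(3, lam_c, t)
        print(f"t = {t:>7.0f} h: lambda_S = {rate:.6e} (lambda_C = {lam_c:.1e})")
```

Early in life the redundant system has a far lower failure rate than a single component, but the advantage shrinks over time as components fail, until only one survivor is left and the system rate equals λC.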
1. Mean time between failures.
2. Conditional failure rate.