Test results showing proportions of failure mechanisms for given
The multiple temperature operational life (MTOL) testing method calculates the failure in time (FIT) as a linear combination of constant‐rate failure mechanisms. This chapter demonstrates that, unlike conventional qualification procedures, the MTOL testing procedure gives a broad description of reliability from sub‐zero to high temperatures. The procedure can replace the standard single‐condition high‐temperature operational life (HTOL) test: it predicts the system failure rate by testing a small number of components under more extreme accelerated conditions for much shorter times than is conventional. The outcome is a more accurate failure rate, and hence mean time to failure (MTTF), obtained from much shorter tests on only a fraction of the usual number of components. Rather than testing 77 parts for 1000 h, a reliable failure rate prediction can be obtained from as few as 15 parts tested for only 200 h.
- failure rate
- multiple mechanisms
1. Introduction to MTOL
The traditional high‐temperature operational life (HTOL) test strategy is based on an outdated JEDEC standard that has not been supported or updated for many years. Its major drawback is that it is not based on a model that predicts failures in the field. Nonetheless, the electronics industry continues to provide data from tests of fewer than 100 parts, subjected to their maximum allowed voltages and temperatures for as many as 1000 h. A result based on zero, or at most one, failure out of the number of parts tested has little predictive value. This null result is then fit to an average acceleration factor (AF), which is the product of a thermal factor and a voltage factor. The outcome is a reported failure rate expressed in the standard failure in time (FIT) model, which is the number of expected failures per billion part hours of operation. FIT remains an important metric for failure rate in today’s technology; however, a single averaged AF does not account for the fact that multiple failure mechanisms cannot simply be averaged for either thermal or voltage acceleration.
One of the major limitations of advanced electronic systems qualification, including advanced microchips and components, is providing reliability specifications that match the variety of user applications. The standard HTOL qualification, which is based on a single high‐voltage and high‐temperature burn‐in, does not reflect the actual failure mechanisms that would lead to a failure in the field. Rather, the manufacturer is expected to meet the system’s reliability criteria without any real knowledge of the possible failure causes or the relative importance of any individual mechanism. Moreover, as a consequence of the non‐linear nature of individual mechanisms, the dominant mechanism at the HTOL test condition cannot be expected to reflect the dominant mechanism at operating conditions. The test therefore sweeps the potential cause of failure under the rug while generating an overly optimistic picture of the actual reliability.
Two problems exist with the current HTOL approach, as recognized by JEDEC in publication JEP122G:
Multiple failure mechanisms actually compete for dominance in our modern electronic devices, and
Each mechanism has vastly different voltage and temperature acceleration factors depending on the device operation.
This more recent JEDEC publication recommends explicitly that multiple mechanisms should be addressed in a sum‐of‐failure‐rates approach. We agree that a single point HTOL test with zero failures can, by no means, account for a multiplicity of competing mechanisms.
In order to address this fundamental limitation, we developed a multiple‐mechanism qualification approach that allows companies to tailor specifications to a variety of customers’ needs. We call this approach the multiple temperature operational life (MTOL) test. Parts are tested at multiple conditions, and the results are matched with the foundry’s reliability models to make accurate FIT calculations for a specific customer’s environment, including voltage, temperature, and speed. The basic strategy is outlined in Figure 1. Time to fail models are put into the matrix as failure rates (
This new MTOL system allows the FIT value to be calculated with the assumption of not just one but multiple degradation mechanisms that are characterized by multiple acceleration factors. This chapter will describe the advantages of considering multiple failure mechanisms and how they can be linearly combined with a simple matrix solution that accounts for each mechanism proportionally based on data rather than based on a zero‐failure result.
1.1. Limitation of traditional HTOL
The semiconductor industry provides an expected FIT for every product that is sold based on operation within the specified conditions of voltage, frequency, heat dissipation, etc. Hence, a system reliability model is a prediction of the expected MTBF’s, or as we will use here, mean time to fail (MTTF), for a system that is not replaced as the sum of the FIT rates for every component.
A FIT is defined in terms of an acceleration factor (AF) and MTTF as:
where #failures and #tested are the numbers of actual failures that occurred as a fraction of the total number of units subjected to an accelerated test per total test time in hours. From a statistical perspective, this calculation would be correct if there is only a single known mechanism that is completely characterized by a single acceleration factor, AF. However, if multiple mechanisms are present, there is no way to average the acceleration factor, and thus, the denominator cannot be characterized as one AF for any set of operating conditions. The true AF must be based on the physics of the actual mechanisms, including different activation energies for different physical processes. Without testing at multiple accelerated conditions, a standard HTOL qualification cannot distinguish effects of more than one thermally activated process, rather only give an approximation for the dominant mechanism at the test condition. The test consists of stressing some number of parts, usually around 77, for an extended time, usually 1000 h, at an accelerated voltage and temperature.
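As a rough illustration of the single‐AF calculation described above, the following sketch computes a FIT from the counts of an accelerated test. It assumes the commonly used form FIT = 10^9 × #failures / (#tested × hours × AF); the function name and the example AF of 100 are hypothetical and are not taken from the chapter.

```python
def fit_rate(failures, tested, hours, af):
    """Failures per 1e9 device-hours at use conditions (single-AF assumption).

    failures : number of failures observed (may be fractional, e.g. 0.5)
    tested   : number of parts on test
    hours    : test duration in hours
    af       : single acceleration factor assumed for the whole test
    """
    if tested <= 0 or hours <= 0 or af <= 0:
        raise ValueError("tested, hours and af must be positive")
    # Equivalent device-hours at use conditions.
    equivalent_device_hours = tested * hours * af
    return failures / equivalent_device_hours * 1e9


if __name__ == "__main__":
    # 77 parts for 1000 h, one failure, AF = 100 (hypothetical value).
    print(round(fit_rate(1, 77, 1000, 100.0), 2))
```

The whole result scales inversely with the one AF that is chosen, which is why the choice of a single factor matters so much when several mechanisms are present.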
In order to excite multiple mechanisms, testing must be performed at multiple conditions of accelerated stress so that sufficient statistical data are obtained. Furthermore, there needs to be a statistically significant number of observed or extrapolated failures during the testing so that a proper average can be obtained. We cannot rely on a “zero failure” pass criterion when multiple mechanisms are involved, since there needs to be a distinction between the effects of different accelerated stress conditions. The qualification tests are designed inevitably to result in zero failures, which allows the assumption (with only 60% confidence!) that no more than ½ a failure occurred during the accelerated test. The fallacy in this approach is the assumption that the only dominant mechanism that would be seen during the test is the one with the reported AF. However, if that mechanism is not modelled or observed, there is no way to prove that this mechanism would actually be the cause of a field failure.
It hardly needs proof that, in most systems, multiple failure mechanisms contribute to the overall reliability. Reliability mathematics assumes that the influences are time‐independent, occurring at a constant rate, and that each is independent of the others. In reality, most systems experience failures at an approximately constant rate, at least for the first few “random” occurrences. When we consider that the defects responsible for earlier failures are generally distributed in time, the assumption of multiple failure mechanisms explains why the random failures occurring during the useful life of a product are caused not by a single mechanism but by a proportional combination of all the likely failure and wear‐out mechanisms. However, due to the physics involved, each mechanism is accelerated differently depending on the thermal, electrical, or environmental stresses responsible for it. Hence, when an accelerated test is performed at an arbitrary voltage and temperature chosen on the basis of a single failure mechanism, only that mechanism will be properly accelerated. When the failure rate (FIT) is then calculated from the non‐occurrence of a failure (i.e., the zero‐failure assumption), it naturally over‐estimates the reliability by whatever contribution the second or third mechanism, not accounted for in the model, would have made.
Unfortunately for the test and qualification industry, the final test procedure and failure rate calculation have not kept pace with the depth of understanding that we have today about the actual failure mechanisms. Also, manufacturing processes are so tightly controlled that each known mechanism is designed to be theoretically non‐existent in the field. Since no single mechanism will cause a known end‐of‐life, it is logical that multiple mechanisms will affect the final failure rate. Furthermore, HTOL tests are known to reveal multiple failure mechanisms during final qualification, which also suggests that no single failure mechanism would dominate FIT in the field. Thus, in order to make a more accurate model for FIT, a preferable approximation is that all mechanisms contribute and the resulting overall failure distribution resembles combination of
1.2. MTOL methodology
The key innovation of the multiple temperature operational life (MTOL) testing method is its ability to separate different failure mechanisms so that predictions can be made for any user‐defined operating conditions. This is in contrast to the common approach for assessing device reliability today, high‐temperature operating life (HTOL) testing, which is based on the assumption that just one dominant failure mechanism is responsible for a failure of the device in the field. In reality, however, multiple failure mechanisms act simultaneously on any system, so a failure at any time may result from more than a single mechanism.
Our new approach, MTOL, deals with this issue. This method predicts the reliability of electronic components by combining separately measured FITs of multiple failure mechanisms. Our data reveal that different failure mechanisms act on a component in different regimes of operation, causing different mechanisms to dominate depending on the stress and the particular technology. When multiple mechanisms are known to affect the failure of a product, JEDEC standard publication JEP‐122G states that “
Failure rates sum linearly only if all the mechanisms are treated as constant‐rate processes; under that assumption, they can be combined linearly to calculate the reliability of the system, measured in FIT, based on the physics of degradation at specific operating conditions. In a more recent publication, we present experimental results of the MTOL method tested on both 45 and 28 nm FPGA devices from Xilinx that were processed at TSMC (according to the Xilinx data sheets). The FPGAs were tested over a range of voltages, temperatures and frequencies. We measured the frequencies of multiple asynchronous ring oscillators simultaneously during stress in a single FPGA. Hundreds of oscillators and the corresponding frequency counters were burned into a single FPGA to gather statistical information in real time. Since the frequency of a ring oscillator itself monitors the device speed and performance, there is no recovery effect, giving a true measure of the effects of all the failure mechanisms in real time. Our results produced an acceleration factor (AF) for each failure mechanism as a function of core voltage, temperature and frequency.
The failure rates of all of the mechanisms were then combined using a matrix to normalize the AF of the mechanisms to find the overall failure in time or FIT of the device. In other words, we found an accurate estimate of the device’s mean lifetime and thus the reliability that can be conveniently transposed to other technologies and ASICs and not necessarily only FPGAs, as was the basis of our previous work. In this chapter, we show that the MTOL methodology is general and can apply to any system that is characterized by multiple failure mechanisms, which can individually be treated as approximately occurring at a constant rate, having its own FIT per mechanism.
2. Multiple mechanism considerations
The acceleration of the rate of occurrence of a single failure mechanism is a highly non‐linear function of temperature and/or voltage as is well known through studies of the physics of failure [3–5]. The temperature acceleration factor (
Calculated acceleration factors (AF) are universally used as the industry standard for device qualification. However, the usual calculation only approximates a single dielectric breakdown type of failure mechanism and does not correctly predict the acceleration of other mechanisms. Similarly, an acceleration factor can be determined for any other type of applied stress, for example, vibration, radiation, number of cycles, etc. However, when only a single AF is assumed to contribute to the expected time to fail, based on high‐temperature, high‐voltage acceleration, there is no way to account for the effect of multiple mechanisms.
The goal here is to improve the approach from standard HTOL to one in which a true “sum of failure rates” model is considered, with each mechanism contributing in proportion to its relative influence. Each mechanism acts on the system in combination with the others to cause an eventual failure. When more than one mechanism affects the reliability of a system or component, the relative acceleration of each one must be defined and calculated at the applied condition. Every potential failure mechanism should be identified, and its unique AF should be known at a given temperature and voltage so that the FIT rate can be approximated separately for each mechanism. Thus, the actual FIT will be the sum of the failure rates per mechanism, as is described by:
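To make the non‐linearity concrete, the sketch below evaluates the widely used Arrhenius thermal factor and a simple exponential voltage factor for two mechanisms with different activation energies. The functional forms are the textbook ones and are assumed here rather than taken from this chapter; the activation energies, the voltage coefficient `gamma`, and the use and test conditions are hypothetical illustration values.

```python
import math

K_B = 8.617e-5  # Boltzmann constant in eV/K


def thermal_af(ea_ev, t_use_c, t_test_c):
    """Arrhenius thermal acceleration factor between use and test temperature."""
    t_use = t_use_c + 273.15
    t_test = t_test_c + 273.15
    return math.exp(ea_ev / K_B * (1.0 / t_use - 1.0 / t_test))


def voltage_af(gamma, v_use, v_test):
    """Exponential voltage acceleration (one common model; an assumption here)."""
    return math.exp(gamma * (v_test - v_use))


if __name__ == "__main__":
    # Two hypothetical mechanisms stressed at 125 C, used at 55 C.
    for name, ea in (("mechanism A", 0.5), ("mechanism B", 0.9)):
        print(name, thermal_af(ea, 55.0, 125.0))
    # Combined AF for one mechanism: product of thermal and voltage factors.
    print(thermal_af(0.5, 55.0, 125.0) * voltage_af(3.0, 1.2, 1.4))
```

Because the activation energy sits in the exponent, the two mechanisms are accelerated by very different amounts at the same test temperature, so no single averaged AF can represent both.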
whereby each mechanism is described by its own FIT, which leads to its own expected failure unit per mechanism, FIT
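The bookkeeping behind a sum‐of‐failure‐rates model is simple once each mechanism has its own FIT at the condition of interest. The sketch below adds the per‐mechanism rates and reports each mechanism's share; the mechanism names follow the chapter, but the numerical FIT values are hypothetical.

```python
def system_fit(fit_by_mechanism):
    """Total FIT as the sum of constant-rate contributions per mechanism."""
    return sum(fit_by_mechanism.values())


def mttf_hours(fit):
    """Mean time to fail in hours for a constant failure rate given in FIT."""
    if fit <= 0:
        raise ValueError("fit must be positive")
    return 1e9 / fit


def shares(fit_by_mechanism):
    """Fraction of the total failure rate contributed by each mechanism."""
    total = system_fit(fit_by_mechanism)
    return {name: value / total for name, value in fit_by_mechanism.items()}


if __name__ == "__main__":
    # Hypothetical per-mechanism FITs at one operating condition.
    fits = {"HCI": 2.0, "BTI": 60.0, "EM": 38.0}
    total = system_fit(fits)
    print(total, mttf_hours(total))
    print(shares(fits))
```

The sum is only meaningful under the constant‐rate assumption stated in the text; the shares change with operating condition because each term has its own acceleration.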
The qualification of device reliability, as reported by a FIT rate, must be based on an acceleration factor, which represents the failure model for the tested device. Since multiple mechanisms are known to lead to degradation and thus failure in any complex system, it is obvious that a single mechanism model with a single AF assumption will never produce a useful result for reliability prediction. This will be explained by way of example. Suppose there are two identifiable, constant rate competing failure modes (assume an exponential distribution). One failure mode is accelerated only by temperature. We denote its failure rate as AF
where the measured mean time to fail (MTTF, in hours) would be different for each mechanism. However, since only one condition of voltage and temperature is applied, yet the calculated FIT is based on a combination of two mechanisms, each with its own acceleration factor, there is no way to determine which mechanism dominates. Because the effective acceleration factor for any given set of test conditions is related to the inverse of the acceleration factor, without separately testing each mechanism, the resulting FIT will have no relation to the actual tested results.
Due to the exponential nature of the acceleration factor as a function of
3. MTOL test system example
A test system was built on off‐the‐shelf Xilinx FPGA evaluation boards. The system ran hundreds of internal oscillators at several different frequencies asynchronously, allowing independent measurements across the chip and the separation of current‐induced from voltage‐induced degradation effects. In order to create a measurable accelerating system, ring oscillators (ROs) consisting of inverter chains were used. The last inverter in the chain is connected to the first, forming a ring (Figure 2). When the number of stages is odd, every sampled cell in the chain will invert its logic level. Additionally, as no clock is fed into the RO, the frequency of the alternating logical states depends only on the internal delay of the cells and the latency of the connections between them, where the frequency of each RO is given by ½
For optimal testing and chip coverage, ROs of different sizes were selected, ranging from three inverters, giving the maximum frequency possible in accordance with the intrinsic delays of the FPGA employed (400–700 MHz), up to 1001‐inverter oscillators, giving a much lower frequency (around 800 kHz). The system implemented on the chip starts operating immediately when the FPGA core voltage is connected. Using a wide range of ROs enabled us to measure the frequency and the internal delay of a real, de‐facto system on a chip. This allows the frequency dependence of each failure mechanism to be seen without any recovery effect. The set of ROs consisted of:
150 oscillators of 3 stages
50 oscillators of 5 stages
20 oscillators of 33 stages
3 oscillators of 333 stages
1 oscillator of 1001 stages
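For orientation, the idealized ring‐oscillator relation f = 1/(2·N·t_d), with N stages and a per‐stage delay t_d, can be evaluated for the ring sizes listed above. This is a sketch under a strong assumption: a single, hypothetical stage delay of 0.4 ns is used for every ring, and routing latency, which the text notes also contributes, is ignored. The output will therefore not reproduce the measured frequencies exactly.

```python
# Ring sizes and counts as listed in the text: stages -> number of oscillators.
RO_SET = {3: 150, 5: 50, 33: 20, 333: 3, 1001: 1}


def ring_frequency_hz(stages, stage_delay_s):
    """Idealized RO frequency: one period is two passes around the ring."""
    if stages < 3 or stages % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of stages >= 3")
    return 1.0 / (2.0 * stages * stage_delay_s)


if __name__ == "__main__":
    stage_delay = 0.4e-9  # hypothetical per-stage delay in seconds
    print("total oscillators:", sum(RO_SET.values()))
    for stages in sorted(RO_SET):
        f_mhz = ring_frequency_hz(stages, stage_delay) / 1e6
        print(f"{stages:5d} stages: {f_mhz:10.3f} MHz")
```

The inverse dependence on N is what spreads the rings over roughly three decades of frequency on a single chip.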
It is important to note, here, that the size of the ring determines the interdependence of any degradation. The shortest oscillators containing only three stages will have the greatest variability as well as the highest frequency. This is because a shorter critical electrical path will be much more sensitive to minor variations that lead to greater or smaller degradation over time. This means that the lower frequency oscillators containing as many as 1001 stages will average out the effects of individual degradations. Furthermore, the random statistical variability of individual devices will be exaggerated by the statistical distribution in wear‐out slopes seen at high frequencies. Thus, we made 150 of the smallest ring size devices, which would need to be averaged to find the average degradation at those frequencies exhibiting more random times to fail. Interestingly, we see that the variability of three‐ring oscillators is quite diverse, nearly randomly distributed about an average, whereas the lower frequency rings are much more narrowly distributed, indicating a more predictable time to fail, as compared to circuits having a much shorter critical path.
3.1. Testing methods
The testing system was synthesized and downloaded to the FPGA card. The test conditions were predefined for allowing separation and characterization of the relative contributions of the various failure mechanisms by controlling voltage, temperature, and frequency. Extreme core voltages and environmental temperatures, beyond the specifications, were imposed to cause failure acceleration of individual mechanisms to dominate others at each condition, for example, sub‐zero temperatures, at very high operating voltages, to exaggerate HCI.
For each test, the FPGA board was placed in a temperature‐controlled oven dedicated to the MTOL testing, with an appropriate voltage set at the FPGA core. The board was connected to a computer via USB, and the external clock signal was fed into the chip. The tests ran for 200–500 h while the device was working in the accelerated conditions. The frequencies of all ring oscillators, of every size, were measured. Initial sampling started after one working hour in the accelerated environment, and samples were then taken automatically at 5‐min intervals. The frequency measurement data were stored in a database from which statistical information about the degradation in device performance could be drawn.
The acceleration conditions for each failure mechanism allowed us to examine the specific effect of voltage and temperature versus frequency on that particular mechanism at the system level and thus define its unique physical characteristics even from a finished product. A close inspection of test results yielded more precise parameters for the acceleration factors (AF) equations and allowed adjusting them to the device under test. Finally, after completing the tests, some of the experiments with different frequency, voltage and temperature conditions were chosen to construct the MTOL Matrix.
3.2. Separation of mechanisms
Our tests for the various mechanisms included exposing the core of the FPGA to accelerating voltages above nominal. The nominal voltage is 1.2 V for the 45 nm technology and 1.0 V for 28 and 20 nm. Our method of separating mechanisms allowed the evaluation of actual activation energies for the three failure mechanisms, which are hot carrier injection (HCI), bias temperature instability (BTI) and electromigration (EM). We plotted the degradation in frequency and attributed it to one of the three failure mechanisms.
We need to justify our approach for accounting for current in the devices. Both‐and HCI have
The results of our experiments give both
The degradation slope, α, describes the decay from the initial frequency; it is approximated by taking the change in frequency, divided by the initial frequency, over time. In our experiments, we found that when the decay was dominated by BTI, it was proportional to the fourth root of time, while HCI and EM, being diffusion‐related mechanisms, have a decay that is proportional to the square root of time, as seen in Figure 3.
For each oscillator on the 45 nm boards, the ring frequency was measured and plotted against the square root of time. The slope, α, was then converted to a FIT for each test by extrapolating the degradation slope to 10% degradation from the initial value. Each set is plotted as an exponential decay dependent on the square root of time, as shown by example in Figure 3. This slope is then used to find the time to fail, as seen in the development of FIT below (Eqs. (8)–(11)). We defined the exponent as 1/
The time to fail (TTF) was then calculated as the square of the inverse slope times the failure criterion, which is 10% degradation in the 45 nm technology. Hence, the FIT for each slope is simply determined as the (10*
Two typical plots are shown in Figure 4(a and b), where the FITs, determined by the slopes, are plotted against frequency for two different experiments. The data demonstrate the clear advantage of RO‐generated frequencies in a single chip. In the examples of Figure 4, we see that FIT is directly proportional to frequency, consistent with Eq. (5). Figure 4(b) shows a chip that was stressed at high voltage and temperature, showing a strong BTI degradation at low frequency and a much shallower slope due to EM in combination with a small HCI effect. Such curves were made for each experiment, incorporating all the oscillators across the chip spanning the range of frequencies, and reflecting also the averaging effect of the longer chains. Hence, the variability at low frequencies is much lower than at higher frequencies, demonstrating that the averaging of many variations results in a consistent mean degradation. The slope of FIT versus frequency is then attributed, at low temperatures, to HCI alone, while at higher voltages and temperatures it can be due to BTI and EM. BTI is responsible only for low‐frequency degradation.
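The extrapolation from a short test to a time to fail can be sketched as follows. The sketch assumes that the fractional frequency loss grows as α·t^(1/n), with n = 2 for square‐root‐of‐time behaviour and n = 4 for fourth‐root behaviour, so that TTF = (criterion/α)^n; this is one reading of the description above, and the function names and the synthetic data are hypothetical.

```python
def estimate_slope(times_h, freqs_hz, f0_hz, n=2):
    """Least-squares slope through the origin of (f0 - f)/f0 versus t**(1/n)."""
    xs = [t ** (1.0 / n) for t in times_h]
    ys = [(f0_hz - f) / f0_hz for f in freqs_hz]
    sxx = sum(x * x for x in xs)
    if sxx == 0.0:
        raise ValueError("need at least one sample at t > 0")
    return sum(x * y for x, y in zip(xs, ys)) / sxx


def time_to_fail_h(alpha, criterion=0.10, n=2):
    """Hours until the fractional degradation alpha * t**(1/n) hits the criterion."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return (criterion / alpha) ** n


def fit_from_ttf(ttf_h):
    """FIT for a constant-rate process with the given mean time to fail."""
    return 1e9 / ttf_h


if __name__ == "__main__":
    # Synthetic square-root-of-time decay with alpha = 1e-3 per sqrt(hour).
    f0 = 500e6
    times = [1.0, 25.0, 100.0, 225.0]
    freqs = [f0 * (1.0 - 1e-3 * t ** 0.5) for t in times]
    alpha = estimate_slope(times, freqs, f0, n=2)
    ttf = time_to_fail_h(alpha, criterion=0.10, n=2)
    print(alpha, ttf, fit_from_ttf(ttf))
```

Because the time to fail goes as the n‐th power of the inverse slope, small errors in α are amplified, which is one reason to average over many oscillators.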
In order to determine the dependence of each mechanism, the activation energy as relating to the temperature factor (
3.3. 1000 h extrapolation
We verified that measurement to 1% degradation over relatively short times gives the same slope as longer‐term measurements that were carried all the way to 1000 h. We found that the failure criterion of 10% degradation was reached in these ring oscillators. This is seen in Figure 5, where the frequency was recorded at accelerated conditions all the way to 1000 h at various voltages and temperatures. The slopes are all very close to
4. Linear matrix solution
We assume here that the linear, Poisson, model for constant rate is associated with the probability of failure for each separable mechanism. As we showed in Eq. (3) above, each FIT adds linearly to the other FITs in order to obtain an average system failure rate. By observation of the procedure in Figure 1, it is clear that each FIT will have its own value that is uniquely determined by the acceleration factor for each mechanism depending on the voltage (
This approach is exactly what JEDEC describes as a sum of failure rates methodology as it sums the expected failure rate of each mechanism distinctly from the other mechanisms. The combination results from actual accelerated life tests where there is an extrapolated mean time to fail based on the known operating conditions of
Of course, we assume that each component is composed of multiple sub‐components; for example, a certain percentage is effectively ring oscillator, static SRAM, DRAM, etc. Each type of circuit, based on its operation, can be seen to affect the potential steady‐state (defect‐related) failure mechanisms differently depending on the accelerated test conditions. However, unlike traditional reliability systems, rather than treating each sub‐system or component as an individual source with a known failure rate, we separate the system into distinct mechanisms, each of which is known to have its own acceleration factor with voltage, temperature, frequency, cycles, etc. Hence, the standard system reliability FIT can be modeled using traditional MIL‐handbook‐217 type algorithms and adapted to known system reliability tools; however, instead of treating each component as an individual item, we propose treating each complex component as a series system of various mechanisms, each with its own reliability.
The matrix is arranged as in Table 1. The three left‐most columns show the temperature,
The second from the right‐hand column shows the ratio of the extrapolated failure rate and the calculated FIT. These values serve to show the closeness of fit to the model parameters by comparing the other measured FIT values with the calculations. This matrix will have a unique solution that will fit the percentages of each mechanism (
| Temperature (°C) | Voltage (V) | Frequency | HCI | BTI | EM | Measured FIT | Ratio | Calculated FIT |
|---|---|---|---|---|---|---|---|---|
| −62.5 | 1.2 | 1 | 99.99% | 0.01% | 0.00% | 30 | 94% | 2.83E+01 |
| 125 | 1.2 | 1 | 0.00% | 86.86% | 13.14% | 997.4 | 102% | 1.01E+03 |
| 153 | 1.2 | 1 | 0.00% | 63.79% | 36.21% | 3672 | 100% | 3.67E+03 |
| −35 | 2.5 | 0.5 | 100.00% | 0.00% | 0.00% | 23,750,000 | 100% | 2.38E+07 |
| 154 | 1.2 | 0 | 0.00% | 100.00% | 0.00% | 2420 | 100% | 2.42E+03 |
| 140 | 2.2 | 0 | 0.00% | 100.00% | 0.00% | 66,200 | 102% | 6.76E+04 |
| −22.5 | 2.8 | 1 | 100.00% | 0.00% | 0.00% | 240,000,000 | 101% | 2.43E+08 |
| 7.3 | 3 | 1 | 100.00% | 0.00% | 0.00% | 156,000,000 | 106% | 1.66E+08 |
Once the parameters for the three mechanisms have been calculated and verified against the other test data, a full set of extrapolated values for FIT can then be calculated using the equations for each mechanism times the same
|−4.36972 E−29||4.76285 E−18||−1.10403 E−20||1.13118 E−10|
|1.19767E + 14||−2040.515932||−1.15932E + 14||1.59226E + 17|
Since the matrix is linear, as are the calculations for FIT at any given
|−50||1.2||2||1.45382E + 11||2.84438 E−10||1.99008 E−27||16.5|
|−10||1.2||2||6,131,362,305||1.61083 E−08||2.65294 E−23||1.1|
|20||1.2||2||1,006,254,891||1.61337 E−07||6.00169 E−21||4.4|
|30||1.2||2||596,524,778.1||3.14239 E−07||2.8808 E−20||8.4|
|40||1.2||2||365,644,331.1||5.86524 E−07||1.2509 E−19||15.6|
|50||1.2||2||231,020,972.9||1.05325 E−06||4.95957 E−19||28.0|
|80||1.2||2||68,110,854.71||4.99845 E−06||1.9353 E−17||135.5|
|100||1.2||2||33,650,811.61||1.22819 E−05||1.60476 E−16||350.9|
The unique solution that solves all three equations with the three extrapolated acceleration factors gives a percentage contribution for each of the failure mechanisms. We report the reliability as FIT, which is 10^9/MTTF for each condition. The percentages for each mechanism are shown, based on the relative contributions that were extrapolated from the physics of failure equations normalized to the measured FIT of each test. The dispersion of FIT values per test supports the approximation of a constant rate, meaning a random distribution in time, as the proper statistical model for these results. Figure 6 shows the resulting FIT plotted versus temperature (°C) for the measured 45 nm technology FPGA.
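The structure of such a solution can be sketched as a 3×3 linear system: for each of three tests, the measured FIT is written as a sum of one unknown coefficient per mechanism times that mechanism's model value at the test condition. This formulation is an interpretation of the description above, not the authors' code, and all numbers below are hypothetical. A small Gaussian elimination is used so that the example needs no external libraries.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular; tests do not separate mechanisms")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def contributions(model_row, coeffs):
    """Fractional contribution of each mechanism to the FIT of one test."""
    terms = [v * c for v, c in zip(model_row, coeffs)]
    total = sum(terms)
    return [t / total for t in terms]


if __name__ == "__main__":
    # Rows: three tests; columns: model values for HCI, BTI, EM (hypothetical).
    model = [[4.0, 0.1, 0.0],
             [0.2, 5.0, 1.0],
             [0.1, 3.0, 6.0]]
    measured_fit = [8.3, 15.9, 12.2]  # hypothetical measured FIT per test
    coeffs = solve_linear(model, measured_fit)
    print(coeffs)
    print(contributions(model[1], coeffs))
```

A singular or nearly singular matrix signals that the chosen test conditions do not separate the mechanisms, which is why the tests are run at conditions where different mechanisms dominate.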
One advantage of plotting our data as failure rate versus temperature allows one to determine effective activation energy as a function of temperature and stressor parameters,
If we assume that
Hence, if we plot the change in
This representation allows a designer to consider the temperature range as a function of the stressor factors that affect the reliability of a product, especially under extreme conditions. We see very clearly that at low frequencies, the reliability is completely dominated by BTI, where the activation energy is around 0.53 eV, whereas at very high temperatures and very low temperatures, the effect of frequency becomes dominant. At very low temperatures, a negative activation energy is seen for higher frequency operation, while at high temperatures, the EM effect becomes more important. Both are current‐related effects; hence, they are frequency dependent.
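An effective activation energy can be estimated between any two points of a FIT‐versus‐temperature curve from the Arrhenius slope, E_a = k·ln(FIT_2/FIT_1)/(1/T_1 − 1/T_2). The sketch below applies this standard relation, which is assumed here rather than quoted from the chapter, to FIT values taken from the extrapolation table above (1.2 V rows at −50, −10, 80 and 100 °C). The function name is hypothetical.

```python
import math

K_B = 8.617e-5  # Boltzmann constant in eV/K


def effective_ea_ev(t1_c, fit1, t2_c, fit2):
    """Effective activation energy (eV) from two (temperature, FIT) points."""
    t1 = t1_c + 273.15
    t2 = t2_c + 273.15
    if t1 == t2:
        raise ValueError("temperatures must differ")
    return K_B * math.log(fit2 / fit1) / (1.0 / t1 - 1.0 / t2)


if __name__ == "__main__":
    # FIT values read from the extrapolated table in the text.
    hot = effective_ea_ev(80.0, 135.5, 100.0, 350.9)
    cold = effective_ea_ev(-50.0, 16.5, -10.0, 1.1)
    print(f"80-100 C: {hot:.2f} eV")
    print(f"-50 to -10 C: {cold:.2f} eV")
```

With these table values the high‐temperature pair gives a positive value of about half an electron volt and the low‐temperature pair gives a negative value, in line with the behaviour described in the text.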
What is most important to understand about this Matrix solution to linear, constant, failure rate models is that this methodology is not limited to only microelectronics. We must understand that all that is needed are the appropriate physics of failure relations to whatever stresses will be experienced during the expected life of the product. It is also important to know that this method of combining mechanisms is limited to failure mechanisms that have a generally constant rate over time. That is to say that the slope from a Weibull distribution is close to 1. If, however, the failure distribution of a particular mechanism is known to be highly predictable, that is with a wear‐out characteristic, having a Weibull slope of 2 or more, then this methodology will not properly work to combine mechanisms. On the other hand, if one mechanism is known to dominate or be the limitation, that one mechanism can be separated from the other more random mechanisms, as shown in Figure 7 and based on our extrapolation from Table 3.
One clear conclusion from this graph is that it is not possible to choose simply one accelerating temperature and voltage, or any one condition for an accelerated test, and expect that a simple extrapolation can be made based on a single failure mechanism. The mechanisms combine such that any single‐condition accelerated test is likely to give misleading results, and thus the traditional HTOL test is not sufficient for reliability prediction. Furthermore, the MTOL multiple‐stressor qualification gives an accurate prediction of the failure rate under any given operating conditions from a fraction of the number of samples, tested over a much shorter period of time. Hence, this methodology saves a large proportion of the effort of the standard qualification procedure and gives much more accurate and meaningful results.