Open access peer-reviewed chapter

Reliability Prediction Considering Multiple Failure Mechanisms

Written By

Joseph B. Bernstein

Submitted: 11 December 2016 Reviewed: 27 April 2017 Published: 20 December 2017

DOI: 10.5772/intechopen.69500

From the Edited Volume

System Reliability

Edited by Constantin Volosencu

Abstract

The multiple temperature operational life (MTOL) testing method is used to calculate the failure in time (FIT) by a linear combination of constant‐rate failure mechanisms. This chapter demonstrates that, unlike conventional qualification procedures, the MTOL testing procedure gives a broad description of the reliability from sub‐zero to high temperatures. This procedure can replace the more standard single‐condition high‐temperature operational life (HTOL) test: it predicts the system failure rate by testing a small number of components at more extreme accelerated conditions for much shorter times than are conventionally used. The result is a much more accurate estimate of the failure rate and of the mean time to failure (MTTF), based on much shorter testing of only a fraction of the number of components. Rather than testing 77 parts for 1000 h, a reliable failure rate prediction can be obtained from as few as 15 parts tested for only 200 h.

Keywords

  • MTTF
  • MTOL
  • HTOL
  • FIT
  • failure rate
  • multiple mechanisms

1. Introduction to MTOL

The traditional high‐temperature operational life (HTOL) test strategy is based on an outdated JEDEC standard that has not been supported or updated for many years. The major drawback of this method is that it is not based on a model that predicts failures in the field. Nonetheless, the electronics industry continues to provide data from tests of fewer than 100 parts, subjected to their maximum allowed voltages and temperatures for as many as 1000 h. A result based on zero, or at most one, failure out of the number of parts tested does not actually predict the failure rate in the field. This null result is then fit into an average acceleration factor (AF), which is the product of a thermal factor and a voltage factor. The outcome is a reported failure rate as described by the standard failure in time (FIT) model, which is the number of expected failures per billion part hours of operation. FIT is still an important metric for failure rate in today’s technology; however, the single‐AF calculation does not account for the fact that multiple failure mechanisms simply cannot be averaged for either thermal or voltage acceleration factors.

One of the major limitations of advanced electronic systems qualification, including advanced microchips and components, is providing reliability specifications that match the variety of user applications. The standard HTOL qualification that is based on a single high‐voltage and high‐temperature burn‐in does not reflect actual failure mechanisms that would lead to a failure in the field. Rather, the manufacturer is expected to meet the system’s reliability criteria without any real knowledge of the possible failure causes or the relative importance of any individual mechanism. More than this, as a consequence of the non‐linear nature of individual mechanisms, it is impossible for the dominant mechanism at the HTOL test condition to reflect the expected dominant mechanism at operating conditions, essentially sweeping the potential cause of failure under the rug while generating an overly optimistic picture of the actual reliability.

Two problems exist with the current HTOL approach, as recognized by JEDEC in publication JEP122G:

  1. Multiple failure mechanisms actually compete for dominance in our modern electronic devices and

  2. Each mechanism has vastly different voltage and temperature acceleration factors depending on the device operation.

This more recent JEDEC publication recommends explicitly that multiple mechanisms should be addressed in a sum‐of‐failure‐rates approach. We agree that a single point HTOL test with zero failures can, by no means, account for a multiplicity of competing mechanisms.

In order to address this fundamental limitation, we developed a special multiple‐mechanism qualification approach that allows companies to tailor specifications to a variety of customers’ needs. We call this approach the multiple temperature operational life (MTOL) test. We test at multiple conditions and match the results with the foundry’s reliability models to make accurate FIT calculations based on a specific customer’s environment, including voltage, temperature, and speed. The basic strategy is outlined in Figure 1. Time to fail models are put into the matrix as failure rates (λ_i) for each given set of conditions of temperature, voltage, frequency, etc. Then, the left‐hand side of the matrix takes measured failure rates as extrapolated from measurements on the actual system under investigation. Hence, the relative acceleration factor for each mechanism is calculated based on measurement data rather than from zero failures, as is traditionally done through HTOL.

Figure 1.

Matrix methodology for reliability prediction.

This new MTOL system allows the FIT value to be calculated with the assumption of not just one but multiple degradation mechanisms that are characterized by multiple acceleration factors. This chapter will describe the advantages of considering multiple failure mechanisms and how they can be linearly combined with a simple matrix solution that accounts for each mechanism proportionally based on data rather than based on a zero‐failure result.

1.1. Limitation of traditional HTOL

The semiconductor industry provides an expected FIT for every product that is sold, based on operation within the specified conditions of voltage, frequency, heat dissipation, etc. Hence, a system reliability model predicts the expected mean time between failures (MTBF) or, as we will use here for a system that is not replaced, the mean time to fail (MTTF), from the sum of the FIT rates of every component.

A FIT is defined in terms of an acceleration factor (AF) and MTTF as:

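$$FIT = \frac{10^{9}}{AF \cdot MTTF} = \frac{\#\text{failures}}{\#\text{tested} \cdot \text{hours} \cdot AF} \cdot 10^{9} \tag{1}$$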

where #failures is the number of actual failures that occurred, #tested is the total number of units subjected to the accelerated test, and hours is the total test time in hours. From a statistical perspective, this calculation would be correct if there is only a single known mechanism that is completely characterized by a single acceleration factor, AF. However, if multiple mechanisms are present, there is no way to average the acceleration factor, and thus, the denominator cannot be characterized as one AF for any set of operating conditions. The true AF must be based on the physics of the actual mechanisms, including different activation energies for different physical processes. Without testing at multiple accelerated conditions, a standard HTOL qualification cannot distinguish the effects of more than one thermally activated process; it can only give an approximation for the dominant mechanism at the test condition. The test consists of stressing some number of parts, usually around 77, for an extended time, usually 1000 h, at an accelerated voltage and temperature.

In order to excite multiple mechanisms, testing must be performed at multiple conditions of accelerated stress in order to obtain sufficient statistical data. Furthermore, there needs to be a statistically significant number of observed or extrapolated failures during the testing so that a proper average can be obtained. We cannot rely on a “zero failure” pass criterion when multiple mechanisms are involved since there needs to be a distinction between the effects of different accelerated stress conditions. The qualification tests are designed inevitably to result in zero failures, which allows the assumption (with only 60% confidence!) that no more than half a failure occurred during the accelerated test. The fallacy with this approach is the assumption that the only dominant mechanism that would be seen during the test is the one with the reported AF. However, if that mechanism is not modelled or observed, there is no way to prove that this mechanism would actually be the cause of a field failure.
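As a rough illustration of the conventional single‐AF calculation criticized above, the following Python sketch computes a FIT from a test result, assuming the usual definition FIT = 10^9 · #failures / (#tested · hours · AF). The function name and the value AF = 100 are hypothetical choices made for this example only; the 77 parts, 1000 h and half‐a‐failure figures are the ones quoted in the text.

```python
def fit_from_test(failures, parts_tested, hours, acceleration_factor):
    """Failures per 1e9 device-hours at use conditions (conventional FIT)."""
    # Equivalent device-hours at use conditions under a single assumed AF.
    equivalent_device_hours = parts_tested * hours * acceleration_factor
    return 1e9 * failures / equivalent_device_hours


# Zero observed failures is counted as half a failure, as described in the text.
# AF = 100 is an illustrative assumption, not a value from the chapter.
htol_fit = fit_from_test(failures=0.5, parts_tested=77, hours=1000,
                         acceleration_factor=100.0)
print(f"HTOL-style FIT: {htol_fit:.1f}")  # about 64.9
```

The number that comes out depends entirely on the assumed AF, which is exactly the quantity a zero‐failure test cannot verify.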

We don’t need to prove that in most systems, multiple failure mechanisms contribute to the overall reliability of a system. Reliability mathematics assumes that the influences are time‐independent, occurring at a constant rate, while each is independent of the others. In reality, most systems experience failures at approximately a constant rate, at least for the first few “random” occurrences. When we consider that the defects responsible for earlier failures are generally distributed in time, the assumption of multiple failure mechanisms explains why the random failures occurring during the useful life of a product will, in fact, be caused not by a single mechanism but by a proportional combination of all the likely failure and wear‐out mechanisms. However, due to the physics involved with each cause of failure, each will be accelerated differently depending on the thermal, electrical, or environmental stresses that are responsible for each mechanism. Hence, when an accelerated test is performed at an arbitrary voltage and temperature chosen for acceleration based on only a single failure mechanism, only that mechanism will be accelerated. When the failure rate (FIT) is calculated based on the non‐occurrence of a failure (i.e., the zero‐failure assumption), it naturally over‐estimates the reliability by whatever factor would have been contributed by the second or third mechanism that was not accounted for in the model.

Unfortunately for the test and qualification industry, the final test procedure and failure rate calculation have not kept pace with the depth of understanding that we have today about the actual failure mechanisms. Also, manufacturing processes are so tightly controlled that each known mechanism is designed to be theoretically non‐existent in the field. Thus, since there is no single mechanism that will cause a known end‐of‐life, it is logical that multiple mechanisms will affect the final failure rate. Furthermore, HTOL tests are known to reveal multiple failure mechanisms during final qualification, which also suggests that no single failure mechanism would dominate FIT in the field. Thus, in order to make a more accurate model for FIT, a preferable approximation is that all mechanisms contribute and that the resulting overall failure distribution resembles a combination of constant failure rate processes, which is consistent with the mil‐handbook and JEDEC standards.

1.2. MTOL methodology

The key innovation of the multiple temperature operational life (MTOL) testing method is its ability to separate different failure mechanisms so that predictions can be made for any user‐defined operating conditions. This is opposed to the common approach for assessing device reliability today, using high‐temperature operating life (HTOL) testing [1], which is based on the assumption that just one dominant failure mechanism is responsible for a failure of the device in the field [2]. However, it is known that, in reality, multiple failure mechanisms act simultaneously on any system, so that failure is caused by more than a single mechanism at any time [3].

Our new approach, MTOL, deals with this issue [4]. This method predicts the reliability of electronic components by combining separately measured FITs of multiple failure mechanisms [5]. Our data reveal that different failure mechanisms act on a component in different regimes of operation, causing different mechanisms to dominate, depending on the stress and the particular technology. When multiple mechanisms are known to affect the failure of a product, JEDEC standard publication JEP‐122G states that “When multiple failure mechanisms and thus multiple acceleration factors are involved, then a proper summation technique, for example, sum‐of‐the‐failure rates method, is required.” The only question that is not answered by the JEDEC standard is how to “sum” the failure rates. As a practical solution to reaching the desired goal of a calculated failure rate that combines multiple mechanisms, we have proposed the following approach.

Because failure rates sum linearly only if they are all considered as constant‐rate processes, we treat each mechanism as a constant‐rate process; the rates can then be combined linearly to calculate the actual reliability of the system, as measured in FIT, based on the physics of degradation at specific operating conditions. In a more recent publication [6], we present experimental results of the MTOL method tested on both 45 and 28 nm FPGA devices from Xilinx that were processed at TSMC (according to the Xilinx data sheets). The FPGAs were tested over a range of voltages, temperatures and frequencies. We measured ring frequencies of multiple asynchronous ring oscillators simultaneously during stress in a single FPGA. Hundreds of oscillators and the corresponding frequency counters were burned into a single FPGA to monitor statistical information in real time. Since the frequency of a ring oscillator itself monitors the device speed and performance, there is no recovery effect, giving a true measure for the effects of all the failure mechanisms in real time. Our results produced an acceleration factor (AF) for each failure mechanism as a function of core voltage, temperature and frequency.

The failure rates of all of the mechanisms were then combined using a matrix to normalize the AF of the mechanisms to find the overall failure in time or FIT of the device. In other words, we found an accurate estimate of the device’s mean lifetime and thus the reliability that can be conveniently transposed to other technologies and ASICs and not necessarily only FPGAs, as was the basis of our previous work. In this chapter, we show that the MTOL methodology is general and can apply to any system that is characterized by multiple failure mechanisms, which can individually be treated as approximately occurring at a constant rate, having its own FIT per mechanism.

2. Multiple mechanism considerations

The acceleration of the rate of occurrence of a single failure mechanism is a highly non‐linear function of temperature and/or voltage as is well known through studies of the physics of failure [35]. The temperature acceleration factor (AFT) and voltage acceleration factor (AFV) can be calculated separately for each known mechanism in a combined model. The total acceleration factor of the different stress combinations for each mechanism will be the product of the acceleration factors of temperature and voltage or any other stress‐related factor that could include current, frequency, humidity, etc.

E2
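The product rule for a single mechanism can be sketched in Python as follows. This is only an illustration: the Arrhenius form for the thermal term and the exponential form for the voltage term are commonly used models that are assumed here, not necessarily the exact expressions of this chapter, and the function names and parameter values (0.7 eV, γ = 3 per volt) are hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def thermal_af(ea_ev, t_use_c, t_test_c):
    """Arrhenius thermal acceleration factor (assumed model form)."""
    t_use_k = t_use_c + 273.15
    t_test_k = t_test_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_test_k))


def voltage_af(gamma_per_volt, v_use, v_test):
    """Exponential voltage acceleration factor (assumed model form)."""
    return math.exp(gamma_per_volt * (v_test - v_use))


def total_af(ea_ev, gamma_per_volt, t_use_c, t_test_c, v_use, v_test):
    """Total AF of one mechanism: product of its thermal and voltage factors."""
    return thermal_af(ea_ev, t_use_c, t_test_c) * voltage_af(gamma_per_volt, v_use, v_test)


# Hypothetical mechanism: Ea = 0.7 eV, gamma = 3 per volt.
af = total_af(0.7, 3.0, t_use_c=55.0, t_test_c=125.0, v_use=1.2, v_test=1.4)
print(f"Total AF for this mechanism: {af:.0f}")
```

A mechanism with a different activation energy or voltage exponent would give a very different AF at the same test condition, which is why one AF cannot stand in for several mechanisms.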

Calculated acceleration factors (AF) are universally used as the industry standard for device qualification. However, such a factor only approximates a single dielectric breakdown type of failure mechanism and does not correctly predict the acceleration of other mechanisms. Similarly, an acceleration factor can be determined using any other type of stress applied, for example, vibration, radiation, number of cycles, etc. However, when only a single AF is assumed to contribute to the expected time to fail based on high‐temperature, high‐voltage acceleration, there is no way to account for the effect of multiple mechanisms.

The goal here is to improve the approach from standard HTOL to one where a true “sum of failure rates” model is considered, based on a proportional contribution of each mechanism according to its relative influence. Each mechanism acts on the system in combination with the others to cause an eventual failure. When more than one mechanism affects the reliability of a system or component, the relative acceleration of each one must be defined and calculated at the applied condition. Every potential failure mechanism should be identified, and its unique AF should then be known at a given temperature and voltage so that the FIT rate can be approximated separately for each mechanism. Thus, the actual FIT will be the sum of the failure rates per mechanism, as described by:

$$FIT_{\mathrm{total}} = FIT_1 + FIT_2 + \dots + FIT_i \tag{3}$$

whereby each mechanism is described by its own FIT, FIT_i, which gives the expected failure rate for that mechanism alone. Because each mechanism has its own acceleration, it is impossible to accelerate more than one mechanism with a single set of accelerated stress conditions. Thus, more than a single test is necessary to determine the actual FIT that would be found at any given expected operating conditions.

The qualification of device reliability, as reported by a FIT rate, must be based on an acceleration factor, which represents the failure model for the tested device. Since multiple mechanisms are known to lead to degradation and thus failure in any complex system, it is obvious that a single mechanism model with a single AF assumption will never produce a useful result for reliability prediction. This will be explained by way of example. Suppose there are two identifiable, constant rate competing failure modes (assume an exponential distribution). One failure mode is accelerated only by temperature; we denote its acceleration factor as AF_{T1}. The other failure mode is accelerated only by voltage, with acceleration factor AF_{V2}, and the corresponding combined failure rate is denoted as

$$FIT = \frac{10^{9}}{AF_{T1} \cdot MTTF_{1}} + \frac{10^{9}}{AF_{V2} \cdot MTTF_{2}} \tag{4}$$

where the measured mean time to fail (MTTF), in hours, would be different for each mechanism. However, since only one condition of voltage and temperature is applied, yet the calculated FIT is based on a combination of two mechanisms, each with its own acceleration factor, there is no way to determine which mechanism dominates. Because the FIT at any given set of test conditions depends on the inverse of each acceleration factor, without separately testing each mechanism, the resulting FIT will have no relation to the actual tested results.
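The two‐mechanism example can be made concrete with a small Python sketch. All numbers are hypothetical and chosen only to make the arithmetic easy to follow: each mechanism is assumed to contribute the same failure rate at the stress condition, with the first accelerated only by temperature and the second only by voltage. The sketch compares the sum of failure rates, each divided by its own AF, with the single averaged AF taken as the product of the thermal and voltage factors.

```python
def fit_at_use(fit_at_test, acceleration_factor):
    """Scale a failure rate observed under stress back to use conditions."""
    return fit_at_test / acceleration_factor


# Hypothetical failure rates at the stress condition (in FIT) and their AFs.
fit1_test, af_t1 = 5000.0, 1000.0  # mechanism 1: accelerated only by temperature
fit2_test, af_v2 = 5000.0, 10.0    # mechanism 2: accelerated only by voltage

# Sum-of-failure-rates: each mechanism is de-accelerated by its own AF.
true_fit = fit_at_use(fit1_test, af_t1) + fit_at_use(fit2_test, af_v2)

# Single-AF shortcut: total rate divided by the product of both factors.
single_af_fit = (fit1_test + fit2_test) / (af_t1 * af_v2)

print(f"sum of failure rates: {true_fit:.0f} FIT")       # 505 FIT
print(f"single averaged AF:   {single_af_fit:.0f} FIT")  # 1 FIT
```

In this constructed case the single‐AF shortcut is optimistic by a factor of about 500, because the weakly accelerated mechanism dominates at use conditions.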

Due to the exponential nature of the acceleration factor as a function of V or T, or any other stress‐inducing parameter, including vibration, humidity, radiation, etc., if only a single parameter is changed, then it is not likely for more than one mechanism to be accelerated significantly compared to the others. As we will see in the next section, at least three mechanisms should be considered, and perhaps many more depending on the system and use environment. Also, the voltage and temperature dependencies must be considered separately for each mechanism in order to make a reasonable reliability model for the whole device.

3. MTOL test system example

A test system was built on off‐the‐shelf Xilinx FPGA evaluation boards. The system ran hundreds of internal oscillators at several different frequencies asynchronously, allowing independent measurements across the chip and the separation of current versus voltage induced degradation effects. In order to create a measurable accelerating system, ring oscillators (ROs) consisting of inverter chains were used. The last inverter in the chain is connected to the first, forming a cycle/ring (Figure 2). When the number of stages is odd, every sampled cell in the chain will invert its logic level. Additionally, as no clock is fed into the RO, the frequency of the alternating logical states depends just on the internal delay of the cells and the latency of the connections between them, where the frequency of each RO is given by f = 1/(2·N·T_p), where N is the number of inverters and T_p is the propagation delay of each inverter. Each inverter chain was implemented as a complete logical cell using predefined Xilinx primitives, and thus, each ring oscillator was made up of the basic components of the FPGA. When degradation occurred in the FPGA, a decrease in performance and frequency of the RO could be observed and attributed to either an increase in resistance or a change in threshold voltage of the transistors.

Figure 2.

Ring oscillator is made of 2N + 1 Inverters connected in a chain.
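The ring oscillator relation can be tried out with a short Python sketch, reading the formula as f = 1/(2·N·T_p). The function names and the 0.5 ns stage delay are hypothetical; the ideal formula also ignores the routing latency between cells mentioned above, so real rings will not follow it exactly.

```python
def ring_frequency_hz(stages, stage_delay_s):
    """Ideal ring oscillator frequency for N inverters of delay Tp each."""
    return 1.0 / (2.0 * stages * stage_delay_s)


def effective_stage_delay_s(stages, frequency_hz):
    """Invert the relation: per-stage delay implied by a measured frequency."""
    return 1.0 / (2.0 * stages * frequency_hz)


stage_delay = 0.5e-9  # hypothetical 0.5 ns per inverter, interconnect included
for stages in (3, 33, 1001):
    f_mhz = ring_frequency_hz(stages, stage_delay) / 1e6
    print(f"{stages:5d} stages -> {f_mhz:8.2f} MHz")
```

Because frequency is inversely proportional to the stage delay, any degradation that slows the inverters shows up directly as a drop in the measured ring frequency.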

For optimal testing and chip coverage, different sized ROs were selected, ranging from three inverters, giving the maximum frequency possible in accordance with the intrinsic delays of the FPGA employed (400–700 MHz), up to 1001‐inverter oscillators, giving a much lower frequency (around 800 kHz). The system implemented on the chip starts operating immediately when the FPGA core voltage is connected. Using a wide range of ROs enabled us to measure the frequency and the internal delay of a real, de‐facto system on a chip. This allows seeing the frequency dependence of each failure mechanism without any recovery effect. The set of ROs consisted of:

  • 150 oscillators of 3 stages

  • 50 oscillators of 5 stages

  • 20 oscillators of 33 stages

  • 3 oscillators of 333 stages

  • 1 oscillator of 1001 stages

It is important to note, here, that the size of the ring determines the interdependence of any degradation. The shortest oscillators containing only three stages will have the greatest variability as well as the highest frequency. This is because a shorter critical electrical path will be much more sensitive to minor variations that lead to greater or smaller degradation over time. This means that the lower frequency oscillators containing as many as 1001 stages will average out the effects of individual degradations. Furthermore, the random statistical variability of individual devices will be exaggerated by the statistical distribution in wear‐out slopes seen at high frequencies. Thus, we made 150 of the smallest ring size devices, which would need to be averaged to find the average degradation at those frequencies exhibiting more random times to fail. Interestingly, we see that the variability of three‐ring oscillators is quite diverse, nearly randomly distributed about an average, whereas the lower frequency rings are much more narrowly distributed, indicating a more predictable time to fail, as compared to circuits having a much shorter critical path.
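The averaging effect of long chains can be illustrated with a simple statistical sketch in Python. It assumes, purely for illustration, that each stage varies independently with the same relative standard deviation, so that the relative spread of the whole ring falls as one over the square root of the number of stages. This independence assumption and the 5% per‐stage figure are not taken from the measurements.

```python
import math


def relative_spread(stages, per_stage_relative_sigma):
    """Relative std-dev of the ring period for independent stage variations."""
    return per_stage_relative_sigma / math.sqrt(stages)


sigma_stage = 0.05  # hypothetical 5% variation per stage
short_ring = relative_spread(3, sigma_stage)
long_ring = relative_spread(1001, sigma_stage)
print(f"3-stage spread:    {short_ring:.4f}")
print(f"1001-stage spread: {long_ring:.4f}")
print(f"ratio:             {short_ring / long_ring:.1f}")  # sqrt(1001/3), about 18.3
```

Under this assumption a three‐stage ring is roughly 18 times noisier than a 1001‐stage ring, which is consistent with the need to average many of the short oscillators.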

3.1. Testing methods

The testing system was synthesized and downloaded to the FPGA card. The test conditions were predefined for allowing separation and characterization of the relative contributions of the various failure mechanisms by controlling voltage, temperature, and frequency. Extreme core voltages and environmental temperatures, beyond the specifications, were imposed to cause failure acceleration of individual mechanisms to dominate others at each condition, for example, sub‐zero temperatures, at very high operating voltages, to exaggerate HCI.

For each test, the FPGA board was placed in a temperature‐controlled oven, dedicated to the MTOL testing, with an appropriate voltage set at the FPGA core. The board was connected to a computer via USB and the external clock signal was fed into the chip. The tests were performed for 200–500 h, while the device was working in the accelerated conditions. Frequencies of every ring oscillator, of different sizes, were measured. Initial sampling started after one working hour in the accelerated environment, and then samples were taken automatically at 5‐min intervals. The frequency measurement data were stored in a database from which one could draw statistical information about the degradation in the device performance.

The acceleration conditions for each failure mechanism allowed us to examine the specific effect of voltage and temperature versus frequency on that particular mechanism at the system level and thus define its unique physical characteristics even from a finished product. A close inspection of test results yielded more precise parameters for the acceleration factors (AF) equations and allowed adjusting them to the device under test. Finally, after completing the tests, some of the experiments with different frequency, voltage and temperature conditions were chosen to construct the MTOL Matrix.

3.2. Separation of mechanisms

Our tests for various mechanisms included exposing the core of the FPGA to accelerating voltages above nominal. The nominal voltage is defined as 1.2 V for the 45 nm technology and 1.0 V for 28 and 20 nm. Our method of separating mechanisms allowed the evaluation of actual activation energies for the three failure mechanisms, which are hot carrier injection (HCI), bias temperature instability (BTI) and electromigration (EM). We plotted the degradation in frequency and attributed it to one of the three failure mechanisms.

We need to justify our approach for accounting for current in the devices. Both EM and HCI have J^γ factors; however, in a completely packaged CMOS digital circuit, there is no way to directly measure the current, I, or the current density, J. We assume, in our experiments, that the stored gate charge strictly determines the current transferred for each switching transition, that is, from a 0 to a 1 and vice versa. Whatever the current is for any state transition, it will be the same for each transition, and the current will therefore be directly proportional to the frequency. Hence, the degradation will be directly proportional to the measured frequency, f. The voltage exponent will depend on the frequency, but the measured exponent, γ, will be the effective voltage acceleration parameter, and it enters the equations for EM and HCI as f·V^γ.

The results of our experiments give both E_A and γ for the three mechanisms we studied at temperatures ranging from −50 to 150°C. The Eyring model [1] is utilized here to describe the failure in time (FIT) for all of the failure mechanisms. The specific FIT of each failure mechanism follows these formulae:

E5

E6

E7

The degradation slope, α, is measured as the degradation from the initial frequency as an exponential decay, approximated by taking the difference in frequency divided by the initial frequency over time. In our experiments, we found that when the decay was dominated by BTI, the decay was proportional to the fourth root of time, while HCI and EM, being diffusion‐related mechanisms, have decay that is proportional to the square root of time [2], as seen in Figure 3.

Figure 3.

Typical frequency versus square root of time showing degradation slope α.

For each oscillator on the 45 nm boards, the ring frequency was measured and plotted against the square root of time. The slope, α, was then converted to a FIT for each test by extrapolating the degradation slope to 10% degradation from its initial value. Each set is plotted as an exponential decay dependent on the square root of time, as shown by example in Figure 3. This slope is then used to find the time to fail, as seen in the development of FIT below (Eqs. (8)–(11)). We defined the exponent as 1/n so that n = 2 gives the square‐root law, which describes degradation that is dominated by HCI or EM. For 28 and 20 nm technology and below, we found that n = 4 fits much more closely, as seen later with our 1000 h evaluation.

E8

E9

E10

E11

The time to fail (TTF) was then calculated as the square of the inverse slope times the failure criterion, which is 10% degradation in the 45 nm technology [1]. Hence, the FIT for each slope is simply determined as (10·α)^n with n = 2, that is, (10·α)^2. The average FIT is the metric that determines the reliability, since it corresponds to the MTTF in Eq. (9). This FIT value is plotted as a function of the frequency in order to determine the failure mechanisms and to fit the model parameters for each mechanism.
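The conversion from degradation slope to time to fail can be sketched in Python as follows. The sketch assumes the relative frequency degradation follows α·t^(1/n) with t in hours, fits α by least squares through the origin, and converts the time to fail into FIT as 10^9/TTF. These are illustrative assumptions with hypothetical function names and synthetic data, not the chapter's exact Eqs. (8) to (11).

```python
def fit_slope_through_origin(times_h, rel_degradation, n=2):
    """Least-squares alpha for rel_degradation = alpha * t**(1/n)."""
    xs = [t ** (1.0 / n) for t in times_h]
    return sum(x * y for x, y in zip(xs, rel_degradation)) / sum(x * x for x in xs)


def time_to_fail_h(alpha, criterion=0.10, n=2):
    """Time at which the degradation alpha * t**(1/n) reaches the criterion."""
    return (criterion / alpha) ** n


def fit_from_ttf(ttf_h):
    """Failures per 1e9 device-hours, treating TTF as the mean time to fail."""
    return 1e9 / ttf_h


# Synthetic square-root-law data with alpha = 0.001 per sqrt(hour).
times = [1.0, 4.0, 9.0, 16.0, 25.0]
degradation = [0.001 * t ** 0.5 for t in times]

alpha = fit_slope_through_origin(times, degradation)
ttf = time_to_fail_h(alpha)  # (0.1 / 0.001)**2 = 10,000 h
print(f"alpha = {alpha:.4f}, TTF = {ttf:.0f} h, FIT = {fit_from_ttf(ttf):.0f}")
```

For n = 2 and a 10% criterion, 1/TTF equals (10·α)^2, which matches the expression given in the text up to the 10^9 scaling of FIT.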

Two typical plots are shown in Figure 4(a and b), where the FITs, determined by the slopes, are plotted against frequency in two different experiments. The data demonstrate the clear advantage of RO generated frequencies in a single chip [4]. In the examples of Figure 4, we see that FIT is directly proportional to frequency [6], consistent with Eq. (5). Figure 4(b) shows a chip that was stressed at high voltage and temperature, showing a strong BTI degradation at low frequency and a much shallower slope due to EM in combination with a small HCI effect. Such curves were made for each experiment, incorporating all the oscillators across the chip spanning the range of frequencies, reflecting also the averaging effect of the longer chains. Hence, the variability at low frequencies is much lower than at higher frequencies, demonstrating that the averaging of many variations results in a consistent mean degradation. The slope of FIT versus frequency is then attributed, at low temperatures, to HCI alone, while at higher voltages and temperatures it can be due to BTI [6] and EM. BTI is only responsible for low frequency degradation.

Figure 4.

Failure rate, FIT/1000, versus frequency in MHz for (a) HCI, stressed at −35°C with 2.0 V core voltage and (b) BTI, stressed at 145°C with 2.4 V at the core.

In order to determine the dependence of each mechanism, the activation energy as relating to the temperature factor (TF) and voltage acceleration factors (VF) is determined from Eqs. (2) to (4) and is presented in Ref. [6].

3.3. 1000 h extrapolation

We verified that the measurement to 1% degradation over relatively shorter times gives the same slope as longer‐term measurements that were carried all the way to 1000 h. We found that the failure criterion of 10% degradation was reached in these ring oscillators. This is seen in Figure 5, where the frequency was recorded at accelerated conditions all the way to 1000 h at various voltages and temperatures. The slopes are all very close to t^{1/4}, which is seen to be typical of the 28 nm devices. These are the devices that show only BTI, which is consistent with a ¼ power law signature. Furthermore, we see that the initial degradation extrapolates to the 10% failure criterion, verifying the approach of measuring for only a few hundred hours instead of requiring a complete 1000‐h test.

Figure 5.

1000 h degradation data for 28 nm devices over a range of core voltages from 1.3 to 1.6 V at 30 and 120°C as indicated.
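One way to check a power‐law signature such as t^(1/4) is to fit a straight line to the logarithm of the degradation against the logarithm of time; the slope of that line is the exponent. The Python sketch below does this with ordinary least squares on synthetic data. The function name and the data are hypothetical and are not the measurements of Figure 5.

```python
import math


def power_law_exponent(times_h, rel_degradation):
    """Slope of log(degradation) versus log(time) by ordinary least squares."""
    xs = [math.log(t) for t in times_h]
    ys = [math.log(d) for d in rel_degradation]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic BTI-like data: degradation = 0.005 * t**0.25.
times = [10.0, 50.0, 100.0, 200.0, 500.0, 1000.0]
degradation = [0.005 * t ** 0.25 for t in times]

exponent = power_law_exponent(times, degradation)
print(f"fitted exponent: {exponent:.3f}")  # 0.250 for this synthetic data
```

If the exponent fitted over the first few hundred hours agrees with the one fitted over the full record, the short test supports the same extrapolation to the failure criterion.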

4. Linear matrix solution

We assume here that the linear, Poisson, model for constant rate is associated with the probability of failure for each separable mechanism. As we showed in Eq. (3) above, each FIT adds linearly to the other FITs in order to obtain an average system failure rate. By observation of the procedure in Figure 1, it is clear that each FIT will have its own value that is uniquely determined by the acceleration factor for each mechanism depending on the voltage (V), temperature (T), and operational frequency (F). For this example, we found that there was no evidence of time‐dependent dielectric breakdown (TDDB), and therefore, we included only HCI, BTI, and EM.

This approach is exactly what JEDEC describes as a sum of failure rates methodology, as it sums the expected failure rate of each mechanism distinctly from the other mechanisms. The combination results from actual accelerated life tests where there is an extrapolated mean time to fail based on the known operating conditions of V, T and F. Hence, we are sure to test over a large range of temperatures, including very high and very low temperatures, as well as at core voltages as high as is practical while still achieving reliable operation.

Of course, we assume that each component is composed of multiple sub‐components, for example, a certain percentage is effectively ring‐oscillator, static SRAM, DRAM, etc. Each type of circuit, based on its operation, can be seen to affect the potential steady‐state (defect related) failure mechanisms differently based on the accelerated test conditions. However, unlike traditional reliability systems, rather than treat each sub‐system or component as an individual source with a known failure rate, we separate the system into distinct mechanisms, each of which is known to have its own acceleration factor with voltage, temperature, frequency, cycles, etc. Hence, the standard system reliability FIT can be modeled using traditional MIL‐handbook‐217 type algorithms and adapted to known system reliability tools; however, instead of treating each component as an individual, we propose treating each complex component as a series system of various mechanisms, each with its own reliability.

The matrix is arranged as in Table 1. The three leftmost columns show the temperature, T, voltage, V, and frequency, F, used for the accelerated test. The measured value for FIT is put in the third column from the right, after the relative (un‐normalized) calculations for each mechanism, which are placed in the column for that mechanism. Here, they are labeled HCI, BTI, and EM. Any three rows can be used to solve the matrix, and the product of the solution parameters with the relative mechanism values is then put in the FIT column on the right‐hand side of the matrix. The three rows that are used to solve the matrix will then reproduce exactly the measured FIT values used to calibrate the matrix.

The second column from the right shows the ratio of the calculated FIT to the measured (extrapolated) failure rate. These values show the closeness of fit to the model parameters by comparing the other measured FIT values with the calculations. This matrix will have a unique solution that fits the proportion of each mechanism (P_i) to the measured failure rate, FIT.

| T (°C) | V (V) | F (GHz) | HCI | BTI | EM | Measured FIT | Ratio | Calculated FIT |
|---|---|---|---|---|---|---|---|---|
| −62.5 | 1.2 | 1 | 99.99% | 0.01% | 0.00% | 30 | 94% | 2.83E+01 |
| 125 | 1.2 | 1 | 0.00% | 86.86% | 13.14% | 997.4 | 102% | 1.01E+03 |
| 153 | 1.2 | 1 | 0.00% | 63.79% | 36.21% | 3672 | 100% | 3.67E+03 |
| −35 | 2.5 | 0.5 | 100.00% | 0.00% | 0.00% | 23,750,000 | 100% | 2.38E+07 |
| 154 | 1.2 | 0 | 0.00% | 100.00% | 0.00% | 2420 | 100% | 2.42E+03 |
| 140 | 2.2 | 0 | 0.00% | 100.00% | 0.00% | 66,200 | 102% | 6.76E+04 |
| −22.5 | 2.8 | 1 | 100.00% | 0.00% | 0.00% | 240,000,000 | 101% | 2.43E+08 |
| 7.3 | 3 | 1 | 100.00% | 0.00% | 0.00% | 156,000,000 | 106% | 1.66E+08 |

Table 1.

Test results showing proportions of failure mechanisms for given V, T, and F compared with the calculated as well as the measured failure rate (FIT).

Once the parameters for the three mechanisms have been calculated and verified against the other test data, a full set of extrapolated values for FIT can then be calculated using the equations for each mechanism times the same P values used to fit the three exemplary rows. Table 2 shows the inverse matrix of the values under the three mechanisms with the corresponding P values for HCI, BTI, and EM, respectively.

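| Mechanism | Inverse matrix, column 1 | Inverse matrix, column 2 | Inverse matrix, column 3 | P_i |
|---|---|---|---|---|
| HCI | −4.36972E−29 | 4.76285E−18 | −1.10403E−20 | 1.13118E−10 |
| BTI | 0 | 0 | 10,946.04333 | 26,489,424.87 |
| EM | 1.19767E+14 | −2040.515932 | −1.15932E+14 | 1.59226E+17 |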

Table 2.

Inverse matrix (left three columns) and respective P values (right‐hand column).
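As a consistency check on Table 2, the P values can be recomputed by multiplying the inverse matrix by a vector of three measured FIT values. The sketch below is illustrative only. The text does not say which three rows of Table 1 were used for calibration; the choice here (the 153 °C, −35 °C, and 154 °C rows, in that order) is an inference, adopted because it reproduces the tabulated P values to within rounding. The names `INVERSE`, `MEASURED_FIT`, and `solve_p` are hypothetical.

```python
# Inverse matrix from Table 2 (rows correspond to HCI, BTI, EM).
INVERSE = [
    [-4.36972e-29, 4.76285e-18, -1.10403e-20],
    [0.0, 0.0, 10946.04333],
    [1.19767e14, -2040.515932, -1.15932e14],
]

# ASSUMPTION (inferred, not stated in the chapter): the calibration rows of
# Table 1 are 153 C / 1.2 V / 1 GHz, -35 C / 2.5 V / 0.5 GHz and
# 154 C / 1.2 V / 0 GHz, taken in this order.
MEASURED_FIT = [3672.0, 23_750_000.0, 2420.0]


def solve_p(inverse, measured):
    """Return the matrix-vector product inverse * measured."""
    return [sum(a * m for a, m in zip(row, measured)) for row in inverse]


P_HCI, P_BTI, P_EM = solve_p(INVERSE, MEASURED_FIT)

for name, value in (("HCI", P_HCI), ("BTI", P_BTI), ("EM", P_EM)):
    print(f"P_{name} = {value:.5e}")
```

The EM row involves the difference of two large, nearly equal terms, so the last printed digits are sensitive to the six‐digit rounding of the tabulated matrix entries.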

Since the matrix is linear, as are the calculations for FIT at any given T, V, and F, the full matrix of FIT calculations is simply the sum of each P value times the calculated relative significance of each mechanism. Figure 6 shows the calculated reliability curves, giving the expected FIT versus temperature, and Table 3 lists calculated values for a set of typical operating conditions.

Figure 6.

Reliability curves for 45 nm technology showing FIT versus temperature for voltages. These curves are for 1.0, 1.2, and 1.4 V core voltage at 10 MHz (dashed) and at 1 GHz (solid).

| T (°C) | V (V) | F (GHz) | HCI | BTI | EM | FIT |
|---|---|---|---|---|---|---|
| −50 | 1.2 | 2 | 1.45382E+11 | 2.84438E−10 | 1.99008E−27 | 16.5 |
| −10 | 1.2 | 2 | 6,131,362,305 | 1.61083E−08 | 2.65294E−23 | 1.1 |
| 20 | 1.2 | 2 | 1,006,254,891 | 1.61337E−07 | 6.00169E−21 | 4.4 |
| 30 | 1.2 | 2 | 596,524,778.1 | 3.14239E−07 | 2.8808E−20 | 8.4 |
| 40 | 1.2 | 2 | 365,644,331.1 | 5.86524E−07 | 1.2509E−19 | 15.6 |
| 50 | 1.2 | 2 | 231,020,972.9 | 1.05325E−06 | 4.95957E−19 | 28.0 |
| 80 | 1.2 | 2 | 68,110,854.71 | 4.99845E−06 | 1.9353E−17 | 135.5 |
| 100 | 1.2 | 2 | 33,650,811.61 | 1.22819E−05 | 1.60476E−16 | 350.9 |

Table 3.

Calculated FIT based on the solved matrix for typical use conditions.
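The linear combination behind Table 3 can be reproduced directly: each FIT value is the sum of the three P values from Table 2 multiplied by the relative HCI, BTI, and EM terms of that row. The following is a minimal sketch using only the tabulated numbers; the names `P`, `TABLE_3`, and `fit_from_mechanisms` are hypothetical.

```python
# P values from Table 2, in the order HCI, BTI, EM.
P = (1.13118e-10, 26_489_424.87, 1.59226e17)

# Rows of Table 3 (1.2 V, 2 GHz):
# (T in C, relative HCI term, relative BTI term, relative EM term, tabulated FIT)
TABLE_3 = [
    (-50, 1.45382e11, 2.84438e-10, 1.99008e-27, 16.5),
    (-10, 6_131_362_305.0, 1.61083e-08, 2.65294e-23, 1.1),
    (20, 1_006_254_891.0, 1.61337e-07, 6.00169e-21, 4.4),
    (30, 596_524_778.1, 3.14239e-07, 2.8808e-20, 8.4),
    (40, 365_644_331.1, 5.86524e-07, 1.2509e-19, 15.6),
    (50, 231_020_972.9, 1.05325e-06, 4.95957e-19, 28.0),
    (80, 68_110_854.71, 4.99845e-06, 1.9353e-17, 135.5),
    (100, 33_650_811.61, 1.22819e-05, 1.60476e-16, 350.9),
]


def fit_from_mechanisms(p_values, relative_terms):
    """Sum of P_i times the relative term of mechanism i."""
    return sum(p * r for p, r in zip(p_values, relative_terms))


CALCULATED = [
    (t_c, fit_from_mechanisms(P, (hci, bti, em)), tabulated)
    for t_c, hci, bti, em, tabulated in TABLE_3
]

for t_c, fit, tabulated in CALCULATED:
    print(f"{t_c:>5} C  calculated FIT = {fit:8.2f}  (Table 3: {tabulated})")
```

Printing the three products separately for each row also shows which mechanism dominates at that temperature: the HCI term at the coldest condition and the BTI term, with a growing EM share, at the hottest.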

The unique solution that solves all three equations with the three extrapolated acceleration factors gives a percentage contribution for each of the failure mechanisms. We report the reliability as FIT, which is 10^9/MTTF (with MTTF in hours) for each condition. The percentages for each mechanism are shown, based on the relative contributions that were extrapolated from the physics‐of‐failure equations normalized to the measured FIT of each test. The dispersion of FIT values per test shows that the approximation of a constant rate, meaning a random distribution in time, is the proper statistical model for these results. Figure 6 shows the resulting FIT plotted versus temperature (°C) for the measured 45 nm technology FPGA.
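To make the FIT unit concrete, the sketch below converts a FIT value to MTTF and, under the constant‐rate assumption used throughout this chapter, to a survival probability R(t) = exp(−λt). The input is the 100 °C entry of Table 3; the 10‐year horizon, the 8760 h year, and the function names are illustrative assumptions.

```python
import math

HOURS_PER_YEAR = 8760.0  # assumption: 365-day year


def mttf_hours(fit):
    """MTTF in hours for a constant failure rate given in FIT (failures per 1e9 h)."""
    return 1e9 / fit


def survival_probability(fit, hours):
    """R(t) = exp(-lambda * t) for a constant failure rate lambda = FIT * 1e-9 per hour."""
    return math.exp(-fit * 1e-9 * hours)


FIT_100C = 350.9  # Table 3: 100 C, 1.2 V, 2 GHz

mttf = mttf_hours(FIT_100C)
r_10y = survival_probability(FIT_100C, 10 * HOURS_PER_YEAR)

print(f"MTTF = {mttf:.3e} h ({mttf / HOURS_PER_YEAR:.0f} years)")
print(f"Probability of surviving 10 years = {r_10y:.4f}")
```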

One advantage of plotting our data as failure rate versus temperature is that it allows one to determine an effective activation energy as a function of temperature and of the stressor parameters, V and F. The principle follows the assumption that the failure rate, λ, is exponentially dependent on the activation energy divided by the absolute temperature, T:

$$\lambda \propto \exp\left(-\frac{E_A}{kT}\right) \tag{12}$$

If we assume that $E_{A,\mathrm{eff}}$ is a function of V and F, we can calculate this from our solutions shown above,

$$\frac{\partial \lambda}{\partial (1/kT)} = -E_{A,\mathrm{eff}}(V,F)\,\lambda \tag{13}$$

Hence, if we take the derivative of λ with respect to (1/kT) and divide it by λ at any temperature, T, we get

$$E_{A,\mathrm{eff}}(V,F) = -\frac{1}{\lambda}\,\frac{\partial \lambda}{\partial (1/kT)} \tag{14}$$

The advantage of this representation is that it allows a designer to consider the temperature range as a function of the stressor factors that affect the reliability of a product, especially under extreme conditions. We see very clearly that at low frequencies the reliability is completely dominated by BTI, where the activation energy is around 0.53 eV, whereas at very high and very low temperatures the effect of frequency becomes dominant. At very low temperatures, a negative activation energy is seen for higher‐frequency operation, while at high temperatures the EM effect becomes more important; both are current‐related effects and hence frequency dependent.
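An effective activation energy of this kind can be estimated numerically from any two points on a FIT versus temperature curve, using the finite‐difference form of the logarithmic derivative of λ with respect to 1/kT. The sketch below applies this to the calculated FIT values of Table 3 (1.2 V, 2 GHz), which are not the same operating conditions as the curves in Figure 7. The function name and the choice of temperature pairs are illustrative assumptions.

```python
import math

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def effective_ea(t1_c, fit1, t2_c, fit2):
    """Finite-difference estimate of -d ln(lambda) / d(1/kT), in eV.

    Temperatures are given in degrees Celsius and converted to kelvin.
    """
    x1 = 1.0 / (K_B * (t1_c + 273.15))
    x2 = 1.0 / (K_B * (t2_c + 273.15))
    return -(math.log(fit2) - math.log(fit1)) / (x2 - x1)


# Calculated FIT values from Table 3 (1.2 V, 2 GHz).
ea_hot = effective_ea(80, 135.5, 100, 350.9)
ea_cold = effective_ea(-50, 16.5, -10, 1.1)

print(f"80 to 100 C:  E_A,eff = {ea_hot:+.2f} eV")
print(f"-50 to -10 C: E_A,eff = {ea_cold:+.2f} eV")
```

With these inputs the hot‐end estimate is positive and of the order of 0.5 eV, while the cold‐end estimate is negative, consistent with the qualitative behaviour described above.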

What is most important to understand about this matrix solution to linear, constant‐failure‐rate models is that the methodology is not limited to microelectronics. All that is needed are the appropriate physics‐of‐failure relations for whatever stresses will be experienced during the expected life of the product. It is also important to know that this method of combining mechanisms is limited to failure mechanisms that have a generally constant rate over time; that is, the slope of the Weibull distribution is close to 1. If, however, the failure distribution of a particular mechanism is known to be highly predictable, that is, it has a wear‐out characteristic with a Weibull slope of 2 or more, then this methodology will not work properly to combine mechanisms. On the other hand, if one mechanism is known to dominate or to be the limitation, that mechanism can be separated from the other, more random mechanisms, as shown in Figure 7 and based on our extrapolation from Table 3.
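The role of the Weibull slope can be illustrated with the standard Weibull hazard function, h(t) = (β/η)(t/η)^(β−1): for β = 1 the hazard is constant in time, so failure rates can be summed as single numbers, whereas for β = 2 the hazard grows with time and no single FIT value describes the mechanism. The sketch below is a generic illustration; the scale parameter and time points are hypothetical and are not taken from the measurements in this chapter.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


SCALE_HOURS = 1.0e6  # hypothetical characteristic life, for illustration only
TIMES = (1.0e3, 1.0e4, 1.0e5)  # hours

# Shape 1: constant hazard (random failures). Shape 2: rising hazard (wear-out).
constant_rate = [weibull_hazard(t, 1.0, SCALE_HOURS) for t in TIMES]
wear_out_rate = [weibull_hazard(t, 2.0, SCALE_HOURS) for t in TIMES]

for t, h1, h2 in zip(TIMES, constant_rate, wear_out_rate):
    print(f"t = {t:8.0f} h  h(shape=1) = {h1:.2e} /h  h(shape=2) = {h2:.2e} /h")
```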

Figure 7.

Activation energy versus temperature based on the data above in Figure 6 for the same voltages, 1.0, 1.2, and 1.4 V on the 45 nm FPGA technology.

One clear conclusion from this graph is that it is not possible to choose a single accelerating temperature and voltage, or any one condition for an accelerated test, and expect that a simple extrapolation can be made based on a single failure mechanism. The mechanisms interact such that any single‐condition accelerated test will give incorrect results; thus, the traditional HTOL test is not sufficient for reliability prediction. Furthermore, the MTOL multiple‐stressor qualification gives an accurate prediction of the failure rate under any given operating conditions from a fraction of the number of samples, tested over a much shorter period of time. Hence, this methodology will save a large proportion of the time and effort of the standard qualification procedure and give much more accurate and meaningful results.

References

  1. Xilinx. Device Reliability Report, UG116 (v10.3.1). 8 September 2015 (as an example)
  2. Bernstein JB. Reliability Prediction from Burn‐in Data Fit to Reliability Models. Academic Press; 2014 (ISBN‐10: 0‐128007‐47‐8)
  3. Bernstein JB, et al. Physics‐of‐Failure Based Handbook of Microelectronic Systems. Utica, NY: Reliability Information Analysis Center; 2008 (ISBN‐10: 1‐933904‐29‐1)
  4. Bernstein JB, Gabbay M, Delly O. Reliability matrix solution to multiple mechanism prediction. Microelectronics Reliability. 2014;54:2951–2955
  5. Bernstein JB. Reliability Prediction for Aerospace Electronics. Descriptive Note: Final rept. 15 Jul 014-14 Apr 2015. Accession Number: ADA621707
  6. Bernstein JB, Bensoussan A, Bender E. Reliability prediction with MTOL. Microelectronics Reliability. 2017;68:91–97
