Vulnerability Analysis and Risk Assessment for SoCs Used in Safety-Critical Embedded Systems



Introduction
Intelligent systems, such as intelligent automotive systems or intelligent robots, require rigorous reliability and safety while in operation. As systems-on-chip (SoCs) become more and more complicated, they face growing reliability problems due to the increased likelihood of faults or radiation-induced soft errors, especially as chip fabrication enters very deep submicron technology [Baumann, 2005; Constantinescu, 2002; Karnik et al., 2004; Zorian et al., 2005]. SoCs are becoming prevalent in intelligent safety-related applications, and therefore fault-robust design with safety validation is required to guarantee that a developed SoC complies with the safety requirements defined by international norms such as IEC 61508 [Brown, 2000; International Electrotechnical Commission [IEC], 1998-2000]. Safety is therefore a key metric in the design of SoC systems, and it is essential to perform a safety validation and risk reduction process to guarantee the safety of a SoC before it is put into use.
If the system safety level is not adequate, a risk reduction process, consisting of vulnerability analysis and fault-robust design, is activated to raise the safety to the required level. For complicated IP-based SoCs or embedded systems, it is impractical and not cost-effective to protect the entire SoC or system. Analyzing the vulnerability of microprocessors or SoCs helps designers not only invest limited resources in the most crucial regions but also understand the gain derived from those investments [Hosseinabady et al., 2007; Kim & Somani, 2002; Mariani et al., 2007; Mukherjee et al., 2003; Ruiz et al., 2004; Tony et al., 2007; Wang et al., 2004].
The previous literature estimates the vulnerability and failure rate of systems using either analytical methodologies or fault injection at various system modeling levels. Fault injection was used to assess the vulnerability of high-performance microprocessors described in the Verilog hardware description language at the register-transfer level (RTL) [Kim & Somani, 2002; Wang et al., 2004]. The authors of [Mukherjee et al., 2003] proposed a systematic methodology based on the concept of architecturally correct execution to compute the architectural vulnerability factor. [Hosseinabady et al., 2007] and [Tony et al., 2007] proposed analytical methods that adopt the concepts of the timing vulnerability factor and the architectural vulnerability factor [Mukherjee et al., 2003], respectively, to estimate system vulnerability. The authors of [Mariani et al., 2007] presented an innovative failure mode and effects analysis (FMEA) method for SoC-level design in RTL description to design in compliance with IEC 61508. Their methodology was based on the concept of a sensible zone to analyze the vulnerability and validate the robustness of the target system, and a memory sub-system embedded in fault-robust microcontrollers for automotive applications was used to demonstrate its feasibility. However, the scheme of [Mariani et al., 2007] operates at the RTL level, which may still require considerable time and effort to implement a SoC, given the rapidly increasing complexity of upcoming SoCs. A dependability benchmark for automotive engine control applications was proposed in [Ruiz et al., 2004]. That work showed the feasibility of the proposed dependability benchmark using a prototype diesel electronic control unit (ECU) engine control system, with fault injection campaigns conducted to measure the dependability of the benchmark prototype.
The domain of application for the dependability benchmark specification presented in [Ruiz et al., 2004] is confined to automotive engine control systems built from commercial off-the-shelf (COTS) components. Because dependability evaluation is performed after the physical system has been built, performing fault injection campaigns is difficult, and the cost of re-designing a system due to inadequate dependability can be prohibitively expensive.
It is well known that FMEA [Mikulak et al., 2008] and fault tree analysis (FTA) [Stamatelatos et al., 2002] are two effective approaches for SoC vulnerability analysis. However, due to the high complexity of the SoC, incorporating FMEA/FTA and fault-tolerance demands into the SoC further raises the design complexity. We therefore need to describe/model the SoC at the behavioral level or a higher level of abstraction, for example using SystemC, to tackle the complexity of SoC design and verification. An important issue in SoC design is how to validate system dependability as early as possible in the development phase to reduce re-design cost and time-to-market. As a result, a SoC-level safety process is required to help designers assess and enhance the safety/robustness of a SoC in an efficient manner.
Previously, the issue of SoC-level vulnerability analysis and risk assessment has seldom been addressed, especially at the SystemC transaction-level modeling (TLM) design level [Thorsten et al., 2002; Open SystemC Initiative [OSCI], 2003]. At the TLM design level, we can more effectively deal with the issues of design complexity, simulation performance, development cost, fault injection, and dependability for safety-critical SoC applications. In this study, we investigate the effect of soft errors on SoCs for safety-critical systems. An IP-based SoC-level safety validation and risk reduction (SVRR) process combining FMEA with a fault injection scheme is proposed to identify the potential failure modes in a SoC modeled at the SystemC TLM design level, to measure the risk scales of the consequences resulting from various failure modes, and to locate the vulnerable parts of the system. A SoC system safety verification platform was built on the SystemC CoWare Platform Architect design environment to demonstrate the core idea of the SVRR process. The verification platform comprises a system-level fault injection tool and a vulnerability analysis and risk assessment tool, created to help us understand the effect of faults on system behavior, measure the robustness of the system, and identify the critical parts of the system during the SoC design process under the CoWare Platform Architect environment.
Since the modeling of SoCs is raised to the TLM level of abstraction, safety-oriented analysis can be carried out efficiently in the early design phase to validate the safety/robustness of the SoC and to identify the critical components and failure modes to be protected if necessary. The proposed SVRR process and verification platform are valuable in that they provide the capability to quickly assess SoC safety; if the measured safety cannot meet the system requirement, the results of the vulnerability analysis and risk assessment help us develop a feasible and cost-effective risk reduction process. We use an ARM-based SoC to demonstrate the robustness/safety validation process, in which soft errors were injected into the register file of the ARM CPU, the memory system, and the AMBA AHB.
The remainder of this paper is organized as follows. Section 2 presents the SVRR process. Section 3 proposes a risk model for vulnerability analysis and risk assessment. In Section 4, based on the SVRR process, we develop a SoC-level system safety verification platform under the CoWare Platform Architect environment. A case study with experimental results and a thorough vulnerability and risk analysis is given in Section 5. Section 6 concludes the paper.

Safety validation and risk reduction process
We propose the SVRR process shown in Fig. 1 to develop safety-critical electronic systems. The process consists of three phases, described as follows:

Phase 1 (fault hypothesis): identify the potential interferences and develop the fault injection strategy to emulate the interference-induced errors that could occur during system operation.

Phase 2 (vulnerability analysis and risk assessment): perform fault injection campaigns based on the Phase 1 fault hypothesis. Through the fault injection campaigns, we identify the failure modes of the system caused by the faults/errors injected while the system is in operation, and we derive the probability distribution of the failure modes. The risk-priority number (RPN) [Mollah, 2005] is then calculated for the components inside the electronic system. A component's RPN rates the risk of the consequences caused by that component's failures and can be used to locate the critical components to be protected. The robustness of the system is computed according to the adopted robustness criterion, such as the safety integrity level (SIL) defined in IEC 61508 [IEC, 1998-2000]. If the robustness of the system meets the safety requirement, the system passes the validation; otherwise, the robustness/safety is not adequate, and Phase 3 is activated to enhance it.

Phase 3 (fault-tolerant design and risk reduction): develop a feasible risk-reduction approach through fault-tolerant design, such as the schemes presented in [Austin, 1999; Mitra et al., 2005; Rotenberg, 1999; Slegel et al., 1999], to improve the robustness of the critical components identified in Phase 2. The enhanced version then returns to Phase 2 to recheck whether the adopted risk-reduction approach satisfies the safety/robustness requirement.

Vulnerability analysis and risk assessment
Analyzing the vulnerability of SoCs or systems helps designers not only invest limited resources in the most crucial regions but also understand the gain derived from the investment. In this section, we propose a SoC-level risk model to quickly assess a SoC's vulnerability at the SystemC TLM level. Conceptually, our risk model combines the FMEA method with the fault injection approach to measure the robustness of SoCs. From the assessment results, the components can be ranked by the risk scale of their causing a system failure. The notation used in the risk model is developed below.

Fault hypothesis
It is well known that the rate of soft errors caused by single event upsets (SEUs) increases rapidly as chip fabrication enters very deep submicron technology [Baumann, 2005; Constantinescu, 2002; Karnik et al., 2004; Zorian et al., 2005]. Radiation-induced soft errors can cause serious dependability problems for SoCs, electronic control units, and nodes used in safety-critical applications. Soft errors may occur in flip-flops, register files, memory systems, system buses, and combinational logic. In this work, a single soft error is considered in the derivation of the risk model.

Risk model
The potential effects of faults on a SoC can be identified from fault injection campaigns: we inject faults into a specific component and then investigate the effect of that component's errors on SoC behavior. Through the injection campaigns for each component, we identify the SoC failure modes caused by the errors of components in the SoC. The parameter P(i, FM(k)), the probability that a fault in the i-th component results in the k-th failure mode, can be derived from the fault injection campaigns.
In general, the following failure behaviors, observed in our previous work, represent the possible SoC failure modes caused by faults occurring in the components: fatal failure (FF), such as a system crash or process hang; silent data corruption (SDC); correct data/incorrect time (CD/IT); and infinite loop (IL), declared when the execution of a benchmark exceeds 1.5 times the normal execution time. We therefore adopt these four SoC failure modes in this study to demonstrate our risk assessment approach. We note that a fault may not cause any trouble at all; this outcome is called no effect (NE).
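A minimal sketch of how such a classification procedure might be implemented is given below. The record fields and function name are hypothetical illustrations; only the five outcome categories and the 1.5x timeout rule come from the text.

```python
def classify(golden, faulty):
    """Classify one fault-injection campaign against the fault-free run.

    golden/faulty are hypothetical result records (dicts) with the keys
    'crashed', 'output', and 'exec_time'. Returns one of:
    'FF', 'IL', 'SDC', 'CD/IT', or 'NE' (no effect).
    """
    if faulty["crashed"]:                       # system crash or process hang
        return "FF"                             # fatal failure
    if faulty["exec_time"] > 1.5 * golden["exec_time"]:
        return "IL"                             # infinite loop (1.5x rule)
    if faulty["output"] != golden["output"]:
        return "SDC"                            # silent data corruption
    if faulty["exec_time"] != golden["exec_time"]:
        return "CD/IT"                          # correct data / incorrect time
    return "NE"                                 # the fault had no effect
```

In practice the comparison of outputs and timing would be taken from the simulation traces; the dict-based records here simply make the classification rules explicit.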
It should be pointed out that, to obtain highly reliable experimental results for analyzing the robustness/safety and vulnerability of the target system, we need to perform an adequate number of fault injection campaigns to guarantee the validity of the statistics obtained. In addition, the features of the benchmarks can also affect the system's response to faults, so several representative benchmarks are required in the injection campaigns to raise the confidence level of the statistical data.
To derive P(i, FM(k)), we perform fault injection campaigns to collect fault simulation data. Each fault injection campaign represents one experiment that injects a fault into the i-th component and records the fault simulation data, which is then used in the failure mode classification procedure to identify which failure mode, or no effect, the SoC encountered in that campaign. The failure mode classification procedure takes as inputs the fault-free simulation data and the fault simulation data derived from the fault injection campaigns, and analyzes the effect of faults occurring in the i-th component on SoC behavior according to the classification rules for the potential failure modes.
The derivation of P(i, FM(k)) by fault injection proceeds as follows. The failure mode classification procedure classifies the SoC failure modes caused by a component's faults. For a specific benchmark program, we first perform a fault-free simulation to acquire the golden results, which assist the failure mode classification procedure in identifying which failure mode, or no effect, the SoC encountered in each fault injection campaign.

Failure mode classification procedure:
Inputs: the fault-free simulation golden data and the fault simulation data for an injection campaign. Output: the SoC failure mode caused by the component's fault, or no effect of the fault, in this injection campaign.

After carrying out the above injection experiments, the parameter P(i, FM(k)) can be computed as the fraction of the injection campaigns on the i-th component that are classified as the k-th failure mode. The terms P(i, SF) and P(i, NE) are evaluated analogously as the fractions of campaigns that result in a SoC failure of any mode and in no effect, respectively; note that P(i, SF) = Σ (k = 1 to z) P(i, FM(k)) and P(i, NE) = 1 - P(i, SF).
The derivation of a component's raw error rate is beyond the scope of this paper, so we assume the data ER_C(i), for 1 ≤ i ≤ n, are given. The part of the SoC failure rate contributed by the error rate of the i-th component can then be calculated as ER_C(i) × P(i, SF). If each component C(i), 1 ≤ i ≤ n, must operate correctly for the SoC to operate correctly, and the components not listed in C(i) are assumed fault-free, the SoC failure rate can be written as the sum of ER_C(i) × P(i, SF) over i from 1 to n. The meaning of the parameter SR_FM(k), and the role it plays, can be explained from the perspective of the FMEA process [Mollah, 2005]. The FMEA method identifies all possible failure modes of a SoC and analyzes the effects or consequences of the identified failure modes. In general, an FMEA records each potential failure mode, its effect at the next level, and the cause of the failure. We note that faults occurring in different components can cause the same SoC failure mode, whereas the severity of the consequences resulting from different SoC failure modes need not be identical. The parameter SR_FM(k) expresses the severity rate of the consequence resulting from the k-th failure mode, where 1 ≤ k ≤ z.
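The evaluation of these terms from injection tallies can be sketched as follows. The component name, tally counts, and error rate in the example are illustrative assumptions, not data from this chapter.

```python
def p_fm(counts, i, k):
    """P(i, FM(k)): fraction of campaigns on component i classified as mode k."""
    n_inj = sum(counts[i].values())
    return counts[i].get(k, 0) / n_inj

def p_sf(counts, i):
    """P(i, SF): probability that a fault in component i causes any SoC failure.
    Equivalent to the sum of P(i, FM(k)) over all failure modes."""
    return 1.0 - p_fm(counts, i, "NE")

def soc_failure_rate(counts, er_c):
    """Sum of ER_C(i) * P(i, SF) over components, assuming single faults and
    that every listed component must work for the SoC to work."""
    return sum(er_c[i] * p_sf(counts, i) for i in counts)

# counts[i][k] = number of campaigns on component i classified as mode k
# (100 illustrative campaigns on a hypothetical "AHB" component)
counts = {"AHB": {"FF": 25, "SDC": 35, "CD/IT": 5, "IL": 8, "NE": 27}}
er_c = {"AHB": 1e-6}   # assumed raw error rate, purely illustrative
```

With these made-up tallies, p_fm gives P(AHB, SDC) = 0.35, p_sf gives P(AHB, SF) = 0.73, and the contributed SoC failure rate is 1e-6 × 0.73.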
We illustrate risk evaluation with the FMEA idea using the following example. An ECU running engine control software is employed for automotive engine control, and its outputs are used to control the engine operation. The ECU can encounter several types of output failure due to hardware or software faults, and the various failure modes of the ECU outputs result in different levels of risk/criticality for the controlled engine. A risk assessment is performed to identify the potential failure modes of the ECU outputs and the likelihood of their occurrence, and to estimate the resulting risks to the ECU-controlled engine.
In the following, we propose an effective SoC-level FMEA method to assess the risk-priority number (RPN) for the components inside the SoC and for the potential SoC failure modes. A component's RPN rates the risk of the consequences caused by the component's faults; in other words, it represents how serious the impact of the component's errors is on system safety. A risk assessment should be carried out to identify the critical components within a SoC and to mitigate the risks they cause. Once the critical components and their risk scales have been identified, a risk-reduction process, for example fault-tolerant design, should be activated to improve the system dependability. The RPN also gives the protection priority among the analyzed components. As a result, a feasible risk-reduction approach can be developed to effectively protect the vulnerable components and enhance system robustness and safety.
The parameter RPN_C(i), i.e., the risk scale of failures occurring in the i-th component, can be computed as

RPN_C(i) = Σ (k = 1 to z) ER_C(i) × P(i, FM(k)) × SR_FM(k).

The expression for RPN_C(i) contains three terms, which are, from left to right: the error rate of the i-th component, the probability of FM(k) given a fault in the i-th component, and the severity rate of the k-th failure mode. As stated previously, a component's fault can result in several different system failure modes, and each identified failure mode has its own potential impact on system safety; hence RPN_C(i) is the sum of ER_C(i) × P(i, FM(k)) × SR_FM(k) over k from 1 to z. The term ER_C(i) × P(i, FM(k)) represents the occurrence rate of the k-th failure mode caused by the i-th component failing to perform its intended function.
RPN_FM(k) represents the risk scale of the k-th failure mode, which can be calculated as

RPN_FM(k) = SR_FM(k) × Σ (i = 1 to n) ER_C(i) × P(i, FM(k)),

where the summation Σ (i = 1 to n) ER_C(i) × P(i, FM(k)) expresses the occurrence rate of the k-th failure mode in the SoC. This sort of assessment reveals the risk level each failure mode poses to the system and identifies the major failure modes to protect against, so as to reduce the impact of failures on system safety.
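The two RPN measures can be sketched directly from their definitions. All numbers below (error rates, probabilities, severity rates) are invented for illustration only.

```python
MODES = ("FF", "SDC", "CD/IT", "IL")   # the four SoC failure modes

def rpn_component(i, er_c, p, sr):
    """RPN_C(i) = sum over k of ER_C(i) * P(i, FM(k)) * SR_FM(k)."""
    return sum(er_c[i] * p[i][k] * sr[k] for k in MODES)

def rpn_failure_mode(k, er_c, p, sr):
    """RPN_FM(k) = SR_FM(k) * sum over i of ER_C(i) * P(i, FM(k))."""
    return sr[k] * sum(er_c[i] * p[i][k] for i in er_c)

# Illustrative inputs (arbitrary units): two hypothetical components
er_c = {"bus": 2.0, "mem": 1.0}                      # ER_C(i)
p = {"bus": {"FF": 0.3, "SDC": 0.4, "CD/IT": 0.1, "IL": 0.1},
     "mem": {"FF": 0.2, "SDC": 0.3, "CD/IT": 0.1, "IL": 0.2}}  # P(i, FM(k))
sr = {"FF": 10, "SDC": 7, "CD/IT": 3, "IL": 5}       # SR_FM(k)
```

Ranking components by rpn_component then gives the protection priority described above, while rpn_failure_mode ranks the failure modes themselves.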

System safety verification platform
We have created an effective safety verification platform that provides the capability to quickly handle fault injection campaigns and dependability analysis for system designs in SystemC. The core of the verification platform is the fault injection tool [Chang & Chen, 2007; Chen et al., 2008] under the CoWare Platform Architect environment [CoWare, 2006], together with the vulnerability analysis and risk assessment tool. The tool can perform fault injection at the following levels of abstraction [Chang & Chen, 2007; Chen et al., 2008]: bus-cycle accurate level, untimed functional TLM with the primitive channel sc_fifo, and timed functional TLM with hierarchical channels. An interesting feature of our fault injection tool is that it offers not only time-triggered but also event-triggered methodologies for deciding when to inject a fault. Consequently, our injection tool can significantly reduce the effort and time needed to perform fault injection campaigns.
Combining the fault injection tool with the vulnerability analysis and risk assessment tool, the verification platform dramatically increases the efficiency of system robustness validation, vulnerability analysis, and risk assessment. For details of our fault injection tool, please refer to [Chang & Chen, 2007; Chen et al., 2008].
However, IP-based SoCs designed with CoWare Platform Architect in the SystemC design environment encounter an injection controllability problem: the simulation-based fault injection scheme cannot access fault targets inside IP components imported from other sources. As a result, an injection tool developed at the SystemC abstraction level may lack the capability to inject faults into the interior of imported IP components, such as a CPU or DSP. To fulfill this need, we exploit the software-implemented fault injection scheme [Sieh, 1993; Kanawati et al., 1995] to supplement the injection capability. The software-implemented fault injection scheme, which uses the system calls of a Unix-type operating system to implement fault injection, allows us to inject faults into storage elements in processors, such as the CPU register file, and into memory systems. As discussed, a complete IP-based SoC system-level fault injection tool should therefore consist of both software-implemented and simulation-based fault injection schemes.
Because CoWare Platform Architect lacks support for a Unix-type operating system, the current version of the safety verification platform cannot provide the software-implemented fault injection function. Instead, we employed a physical system platform built on an ARM-embedded SoC running the Linux operating system to validate the developed software-implemented fault injection mechanism. We note that if CoWare Platform Architect supported a Unix-type operating system in the SystemC design environment, our software-implemented fault injection concept could be brought into the SystemC design platform. Under those circumstances, we could implement a so-called hybrid fault injection approach, comprising both the software-implemented and simulation-based fault injection methodologies, in the SystemC design environment to provide a greater variety of injection functions.

Case study
An ARM926EJ-based SoC platform provided by CoWare Platform Architect [CoWare, 2006] was used to demonstrate the feasibility of our risk model. The illustrated SoC platform was modeled at the timed functional TLM abstraction level. This case study investigates three important components, namely the register file in the ARM926EJ, the AMBA Advanced High-performance Bus (AHB), and the memory sub-system, to assess their risk scales for the SoC-controlled system. We used the safety verification platform to perform the fault injection process associated with the risk model presented in Section 3 and to obtain the risk-related parameters for these components. The potential SoC failure modes classified from the fault injection process are fatal failure (FF), silent data corruption (SDC), correct data/incorrect time (CD/IT), and infinite loop (IL). In the following, we summarize the data used in this case study.

AMBA AHB experimental results
The system bus, such as the AMBA AHB, provides the interconnection platform for an IP-based SoC, so the robustness of the system bus plays an important role in SoC reliability. Faults arising in the bus signals lead to data transaction errors and can ultimately cause system failures. From Table 1, it is evident that the susceptibility of the SoC to bus faults is benchmark-dependent, and the rank of system bus vulnerability over the benchmarks is JPEG > M-M > FFT > QS. However, all benchmarks exhibit the same trend: the probabilities of FF show no substantial difference, and when a fault arises in the bus signals, SDC and FF occupy the top two ranks of occurrence probability. The last row gives the average statistics over the four benchmarks employed in the fault injection process. Since the probabilities of the SoC failure modes are benchmark-variant, the average results in Table 1 give the expected probabilities for the system bus vulnerability of the developing SoC, which are valuable for gauging the robustness of the system bus and the probability distribution of failure modes. The robustness measure of the system bus is only 26.78%, as shown in Table 1, meaning that when a fault occurs in the system bus, the SoC has only a 26.78% probability of surviving it.
Table 2 shows the probability distribution of failure modes with respect to the various bus signal errors for the benchmarks used. From the data in the NE column, we observe that the most vulnerable part is the address bus HADDR[31:0]. From the FF column, faults occurring in the address bus have a probability between 38.9% and 42.3% of causing a serious fatal failure for the benchmarks used. HSIZE and HDATA signal errors mainly cause SDC failures. In summary, our results reveal that the address bus HADDR should be protected first in the design of the system bus, and that SDC is the most common failure mode of the demonstrated SoC in response to bus faults or errors.

Table 2. Probability distribution of failure modes with respect to various bus signal errors for the used benchmarks (1, 2, 3, and 4 represent the jpeg, m-m, fft, and qs benchmarks, respectively).

Memory sub-system experimental results
The memory sub-system can be affected by radiation particles, which may cause bit-flipped soft errors. However, the bit errors will not damage the system operation if one of the following situations occurs:

Situation 1: The benchmark program never reads the affected words after the bit errors happen.

Situation 2: The first access to the affected words after the occurrence of the bit errors is a 'write'.
Otherwise, the bit errors can damage the system operation. Clearly, if the first access to the affected words after the occurrence of the bit errors is a 'read', the bit errors will be propagated and can finally lead to failures of the SoC operation. Whether the bit errors become fatal thus depends on the time at which they occur, the locations of the affected words, and the benchmark's memory access patterns after the errors occur.
From the above discussion, two interesting issues arise: the propagation probability of bit errors and the failure probability of propagated bit errors. We define the propagation probability of bit errors as the probability that bit errors are read out and propagated to influence the execution of the benchmark. The failure probability of propagated bit errors is the probability that propagated bit errors finally result in failures of the SoC operation.
Initially, we tried performing the fault injection campaigns in CoWare Platform Architect to collect the simulation data. After a number of fault injection and simulation campaigns, we realized that the experimental time would be a problem, because a huge number of fault injection and simulation campaigns must be conducted for each benchmark, and several benchmarks are required for the experiments. From the analysis of the campaigns, we observed that many of the bit-flip errors injected into the memory sub-system fell into Situation 1 or 2, so a very large number of fault injection campaigns would be needed to guarantee the validity of the statistical data.
To solve this dilemma, we decided to perform two types of experiment, termed Type 1 and Type 2 (together, the hybrid experiment), to assess the propagation probability and the failure probability of bit errors, respectively. As explained below, the Type 1 experiment uses a software tool to emulate the fault injection and simulation campaigns, quickly yielding the propagation probability of bit errors and the set of propagated bit errors. The set of propagated bit errors is then used in the Type 2 experiment to measure the failure probability of propagated bit errors.
Type 1 experiment: we developed the experimental process described below to measure the propagation probability of bit errors. The following notation is used in the experimental process.

Experimental Process:
We injected a bit-flipped error into a randomly chosen memory address at a random read/write transaction time in each injection campaign. As stated earlier, this bit error either propagates to the system or does not. If it does, we add one to the counter C_p-b-err. The parameter N_p-b-err is set by the user and serves as the termination condition for the current benchmark's experiment: when C_p-b-err reaches N_p-b-err, the experiment for the current benchmark terminates. P_p-b-err is then derived as N_p-b-err divided by N_inj. The values of N_bench, S_m, and N_p-b-err are given before the experimental process is performed.
for j = 1 to N_bench {

  Step 1: Run the j-th benchmark on the experimental SoC platform under CoWare Platform Architect and collect the desired bus read/write transaction information, including the address, data, and control signals of each data transaction, into an operational profile during program execution. The value of N_d-t is obtained in this step.

  Step 2: C_p-b-err = 0; N_inj(j) = 0;
  While C_p-b-err < N_p-b-err do {
    T_error is decided by randomly choosing a number x between one and N_d-t; that is, T_error is the time of the x-th data transaction occurring in the memory sub-system. Similarly, A_error is determined by randomly choosing an address between one and S_m. A bit is randomly picked from the word pointed to by A_error, and the selected bit is flipped. Here we assume that the probability of fault occurrence is the same for each word in the memory sub-system.
    If ((Situation 1 occurs) or (Situation 2 occurs)) then
      {the injected bit error does not damage the system operation;}
    else
      {C_p-b-err = C_p-b-err + 1; record the related information of this propagated bit error, including T_error, A_error, and the bit location, to S_p-b-err(j);}
    // Situations 1 and 2 are described at the beginning of this section. The operational profile generated in Step 1 is used to determine the situation caused by the current bit error: from the profile, we check the memory access patterns from the time the bit error occurs onward to identify which situation the injected bit error leads to. //
    N_inj(j) = N_inj(j) + 1;
  }
}

For each benchmark, Step 1 of the Type 1 experimental process is performed once to obtain the operational profile used in Step 2. We then created a software tool to implement Step 2. Note that this software tool emulates the fault injection campaigns required in Step 2 and checks the consequences of the injected bit errors with the support of the operational profile derived from Step 1. Thus the Type 1 experimental process does not use the simulation-based fault injection tool of the safety verification platform described in Section 4; the reason is time efficiency. A comparison of the simulation time required by the hybrid experiment and by the pure simulation-based fault injection approach implemented in CoWare Platform Architect is given later.
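Step 2 of the Type 1 process can be emulated in a few lines of software. The sketch below is a minimal, hypothetical version: the operational-profile format (a list of read/write transactions) and all names are assumptions for illustration, not the chapter's actual tool.

```python
import random

def type1_step2(profile, s_m, n_target, rng):
    """Emulate Step 2: inject random bit errors against a recorded profile.

    profile: list of (op, addr) bus transactions, op is 'R' (read) or 'W' (write).
    s_m: memory size in words; n_target: N_p-b-err, the number of propagated
    bit errors to collect. Returns (n_inj, propagated_errors).
    """
    n_inj, propagated = 0, []
    while len(propagated) < n_target:
        t_error = rng.randrange(len(profile))   # random transaction index
        a_error = rng.randrange(s_m)            # random word address
        n_inj += 1
        # Scan accesses after the error; the FIRST access to the affected
        # word decides the outcome.
        for op, addr in profile[t_error:]:
            if addr != a_error:
                continue
            if op == "R":                        # first access is a read:
                propagated.append((t_error, a_error))  # the error propagates
            break   # a 'W' overwrites the error (Situation 2): no damage
        # If the word is never accessed again (Situation 1), the loop falls
        # through without appending: the injected bit error causes no damage.
    return n_inj, propagated
```

P_p-b-err is then simply n_target / n_inj, and the propagated list plays the role of S_p-b-err(j) for the Type 2 experiment.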
The Type 1 experimental process was carried out to estimate P p-b-err , with N bench , S m and N p-b-err set to 4, 524288 and 500, respectively. Table 3 shows the propagation probability of bit errors for the four benchmarks, derived from a large number of fault injection campaigns to guarantee statistical validity. The propagation probability is evidently benchmark-variant: a bit error in memory has a probability between 0.866% and 3.551% of propagating from the memory to the system. The results imply that most bit errors cause no damage to the system. We should emphasize that the size of the memory space and the characteristics of the used benchmarks (such as the amount of memory space used and the number of memory reads/writes) affect P p-b-err ; the data in Table 3 therefore reflect the results for the selected memory space and benchmarks.
Type 2 experiment: From the Type 1 experimental process, we collect N p-b-err bit errors for each benchmark into the set S p-b-err (j). These propagated bit errors were used to assess the failure probability of propagated bit errors. Therefore, N p-b-err simulation-based fault injection campaigns were conducted under CoWare Platform Architect, each campaign injecting a bit error into the memory according to the error scenarios recorded in the set S p-b-err (j). In this way, we can examine the SoC behavior for each injected bit error.

Benchmark   N inj    N p-b-err   P p-b-err
M-M         14079    500         3.551%
QS          23309    500         2.145%
JPEG        27410    500         1.824%
FFT         57716    500         0.866%
Table 3. Propagation probability of bit errors.
As can be seen from Table 3, an enormous number of fault injection campaigns must be conducted to reach the expected number of propagated bit errors. Without the Type 1 experiment, the simulation-based fault injection approach alone would have to assess the propagation probability and failure probability of bit errors illustrated in Tables 3, 5 and 6, requiring a huge number of simulation-based fault injection campaigns and, consequently, an enormous amount of simulation time. Instead, we developed a software tool implementing the Type 1 experimental process to quickly identify which situation an injected bit error leads to. With this approach, the number of simulation-based fault injection campaigns performed in the Type 2 experiment decreases dramatically. Since the software tool used in the Type 1 experiment runs much faster than a simulation-based fault injection campaign in the Type 2 experiment, a considerable amount of simulation time is saved.
The data of Table 3 indicate that without the help of the Type 1 experiment, tens of thousands of simulation-based fault injection campaigns would have to be carried out in the Type 2 experiment; with its assistance, only five hundred injection campaigns are required. Table 4 gives the experimental time of the Type 1 plus Type 2 approach and of the pure simulation-based fault injection approach, where the ratio column is the experimental time of the Type 1 plus Type 2 approach divided by that of the pure simulation-based approach. The experimental environment consists of four machines to speed up the validation, each equipped with an Intel® Core™2 Quad Q8400 CPU, 2 GB of RAM and CentOS 4.6. In both approaches, each machine performs the simulation task for one benchmark. According to the simulation results, the average execution time of one simulation-based fault injection experiment is 14.5 seconds. The Type 1 plus Type 2 approach is clearly far more efficient than the pure simulation-based approach, because its software tool reduces the number of simulation-based fault injection experiments to five hundred, compared with tens of thousands for the pure simulation-based approach.
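The saving can be checked from the figures quoted above (14.5 s per simulation-based injection, the N inj counts of Table 3, one benchmark per machine). The short calculation below ignores the running time of the Step 1 software tool, which the text describes as comparatively small.

```python
SEC_PER_CAMPAIGN = 14.5  # average time of one simulation-based injection
n_inj = {'M-M': 14079, 'QS': 23309, 'JPEG': 27410, 'FFT': 57716}  # Table 3
N_P_B_ERR = 500          # campaigns actually simulated in the Type 2 experiment

# One benchmark per machine, so each approach's wall-clock time per machine
# is governed by its own campaign count for that benchmark.
pure_hours = {b: n * SEC_PER_CAMPAIGN / 3600 for b, n in n_inj.items()}
hybrid_hours = N_P_B_ERR * SEC_PER_CAMPAIGN / 3600  # ~2 hours per benchmark

# Reduction factor in the number of simulation-based campaigns.
speedup = {b: n / N_P_B_ERR for b, n in n_inj.items()}
```

For FFT, for example, the pure approach would need roughly 230 machine-hours of injection simulation, against about two hours for the hybrid approach.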
Given N p-b-err and S p-b-err (j), i.e. five hundred simulation-based fault injection campaigns, the Type 2 experimental results are illustrated in Table 5. From Table 5, we can identify the potential failure modes and their distribution for each benchmark. The susceptibility of the system to memory bit errors is clearly benchmark-variant, and according to Table 5, M-M is the most critical of the four adopted benchmarks.
We then combined the data of Tables 3 and 5 to obtain Table 6, which shows the probability distribution of failure modes when a bit error occurs in the memory sub-system. Each datum in the 'Avg.' row is the arithmetic mean of the benchmarks' data in the corresponding column. This table offers the following valuable information: the robustness of the memory sub-system, the probability distribution of failure modes, and the impact of the benchmark on SoC dependability. The probability of SoC failure for a bit error occurring in the memory lies between 0.738% and 3.438%. We also found that SDC is the most likely failure mode for a memory bit error. In addition, the vulnerability rank of the benchmarks for memory bit errors is M-M > QS > JPEG > FFT. Table 7 illustrates the statistics of memory reads/writes for the adopted benchmarks, and its results confirm the vulnerability rank observed in Table 6. As mentioned at the beginning of this section, the probability of Situation 2 occurring rises with the probability of performing a memory write operation; consequently, the robustness of a benchmark rises with an increase in the probability of Situation 2.
Table 7. The statistics of memory read/write for the used benchmarks.
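The derivation of a Table 6 row is mechanical: each benchmark's failure-mode distribution conditioned on a propagated error (Table 5) is scaled by that benchmark's P p-b-err (Table 3). Table 5 is not reproduced in this excerpt, so the mode counts below are hypothetical placeholders; only M-M's propagation probability (3.551%) is taken from Table 3.

```python
P_PROP = 0.03551  # P p-b-err of benchmark M-M, from Table 3

# Hypothetical Table 5 row: of the 500 propagated errors, how many ended in
# each outcome (NE = no effect; the others are failure modes). Placeholders.
mode_counts = {'SDC': 300, 'CD/IT': 100, 'FF': 50, 'NE': 50}
total = sum(mode_counts.values())

# Unconditional probability of each outcome for a random memory bit error,
# i.e. one row of Table 6.
table6_row = {m: P_PROP * c / total for m, c in mode_counts.items()}
```

By construction the row sums back to P p-b-err , and the SoC failure probability is the row sum minus the 'NE' entry, which is how the 3.438% figure for M-M relates to its 3.551% propagation probability.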

Register file experimental results
The ARM926EJ CPU used in the experimental SoC platform is an IP provided by CoWare Platform Architect. The proposed simulation-based fault injection approach therefore cannot inject faults into the register file inside the CPU. This problem can be solved by the software-implemented fault injection methodology described in Section 4; however, we currently cannot perform fault injection campaigns in the register file under CoWare Platform Architect due to the lack of operating system support. Previous studies [Leveugle et al., 2009; Bergaoui et al., 2010] have pointed out that the register file is vulnerable to radiation-induced soft errors, so we believe the register file should be taken into account in the vulnerability analysis and risk assessment. Once the critical registers are located, SEU-resilient flip-flop and register designs can be exploited to harden the register file. In this experiment, we employed a similar physical system platform, built on an ARM926EJ-embedded SoC running Linux 2.6.19, to derive the experimental results for the register file.
The register set of the ARM926EJ CPU used in this experiment is R0 ~ R12, R13 (SP), R14 (LR), R15 (PC), R16 (CPSR) and R17 (ORIG_R0). A fault injection campaign injects a single bit-flip fault into the target register to investigate its effect on the system behavior. For each benchmark, we performed one thousand fault injection campaigns per target register, randomly choosing the time instant of fault injection within the benchmark simulation duration and randomly choosing the target bit for the 1-bit flip. Thus, eighteen thousand fault injection campaigns were carried out for each benchmark to obtain the data shown in Table 8. From Table 8, it is evident that the susceptibility of the system to register faults is benchmark-dependent, and the rank of system vulnerability over the benchmarks is QS > FFT > M-M. However, all benchmarks exhibit the same trend: when a fault arises in the register set, CD/IT and FF are the two most likely failure modes. The robustness measure of the register file is around 74%, as shown in Table 8.
Table 9. Statistics of SoC failure probability for each target register with various benchmarks.
Table 9 illustrates the statistics of the SoC failure probability for each target register under the used benchmarks. From this table, we can observe the vulnerability of each register for different benchmarks. The vulnerability of the registers clearly depends on the characteristics of the benchmarks, which affect the read/write frequency and read/write syndrome of the target registers. A bit error causes no damage to the system operation if one of the following situations occurs:
Situation 1: The benchmark never uses the affected registers after the bit error happens.
Situation 2: The first access to the affected registers after the occurrence of the bit error is a 'write' action.
It is apparent that the utilization and read frequency of R4 ~ R8 and R14 for benchmark M-M are much lower than for FFT and QS, so the SoC failure probability caused by errors occurring in R4 ~ R8 and R14 for M-M is significantly lower than for FFT and QS, as illustrated in Table 9. We observe that the usage and write frequency of the registers, which reflect the features and programming style of a benchmark, dominate the soft error sensitivity of the registers. Without a doubt, the susceptibility of register R15 (program counter) to faults is 100%, indicating that R15 is the most vulnerable register to be protected in the register set. Fig. 2 illustrates the average SoC failure probabilities for registers R0 ~ R17, derived from the per-benchmark data of Table 9. According to Fig. 2, the top three vulnerable registers are R15 (100%), R14 (68.4%) and R13 (31.1%); the SoC failure probabilities for all other registers are below 30%.
Fig. 2. The average SoC failure probability from the data of the used benchmarks.
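The register-file campaign described above (one thousand single-bit flips per register at random instants) follows the same randomized pattern as the memory experiment. A minimal sketch of the campaign planning step, with illustrative names rather than the authors' tool, might look like this:

```python
import random

# R0..R12 plus R13 (SP), R14 (LR), R15 (PC), R16 (CPSR), R17 (ORIG_R0).
REGISTERS = [f'R{i}' for i in range(18)]
CAMPAIGNS_PER_REG = 1000

def plan_campaigns(duration, seed=0):
    """Draw (register, injection time, bit) triples for one benchmark run.

    duration is the benchmark simulation duration; each campaign flips one
    randomly chosen bit of one 32-bit register at a random instant.
    """
    rng = random.Random(seed)
    plan = []
    for reg in REGISTERS:
        for _ in range(CAMPAIGNS_PER_REG):
            t = rng.uniform(0, duration)  # random instant within the run
            bit = rng.randrange(32)       # random bit of a 32-bit register
            plan.append((reg, t, bit))
    return plan
```

With 18 registers and 1000 campaigns each, one benchmark yields the eighteen thousand campaigns mentioned in the text.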

SoC-level vulnerability analysis and risk assessment
According to IEC 61508, if a failure has a critical effect on the system and endangers human life, it is identified as a dangerous failure or hazard. IEC 61508 characterizes a system's safety integrity level (SIL) by the probability of the occurrence of a dangerous failure per hour (PFH). For the continuous mode of operation (high demand rate), the four levels of SIL are given in Table 10 [IEC, 1998-2000].
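For the continuous/high-demand mode, the IEC 61508 bands reproduced in Table 10 run from SIL 4 (10^-9 ≤ PFH < 10^-8) up to SIL 1 (10^-6 ≤ PFH < 10^-5). A simple lookup helper makes the mapping from an estimated PFH to a SIL explicit:

```python
def sil_from_pfh(pfh):
    """Map a probability of dangerous failure per hour (PFH) to an IEC 61508
    SIL for continuous / high-demand mode. Returns None if the PFH falls
    outside the SIL 1-4 bands (e.g. too high to claim any SIL)."""
    bands = [
        (1e-9, 1e-8, 4),  # SIL 4: most stringent band
        (1e-8, 1e-7, 3),
        (1e-7, 1e-6, 2),
        (1e-6, 1e-5, 1),  # SIL 1: least stringent band
    ]
    for lo, hi, sil in bands:
        if lo <= pfh < hi:
            return sil
    return None
```

Under the simplifying assumption stated below that every SoC failure is dangerous, the SFR values of Table 11 could be fed to such a helper directly as PFH approximations.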
In this case study, three components, the ARM926EJ CPU, the AMBA AHB system bus and the memory sub-system, were utilized to demonstrate the proposed risk model for assessing the scales of failure-induced risks in a system. The following data are used to show the vulnerability analysis and risk assessment for the selected components {C(1), C(2), C(3)}. According to the expressions presented in Section 3 and the results shown in Sections 5.1 to 5.3, the SoC failure rate, SIL and RPN were obtained, as illustrated in Tables 11, 12 and 13.
We should note that the components' error rates used in this case study serve only to demonstrate the proposed robustness/safety validation process; more realistic error rates for the considered components should be determined by the process and circuit technology [Mukherjee et al., 2003]. Given the components' error rates, the SFR data in Table 11 can be used to assess the safety integrity level of the system. It should be pointed out that a SoC failure may or may not have a dangerous effect on the system and on human life; consequently, a SoC failure can be classified as either a safe failure or a dangerous failure. To simplify the demonstration, we assume in this assessment that the SoC failures caused by faults occurring in the components are always dangerous failures or hazards. Therefore, the SFR in Table 11 is used to approximate the PFH, and the SIL can then be derived from Table 10.
With respect to the safety design process, if the current design does not meet the SIL requirement, a risk reduction procedure must be performed to lower the PFH until the SIL requirement is reached. The vulnerability analysis and risk assessment can be exploited to identify the most critical components and failure modes to be protected, so that system safety can be improved efficiently and economically.
Based on the results of RPN_C(i) in Table 12, for i = 1, 2, 3, it is evident that an error in the AMBA AHB is more critical than errors in the register set or the memory sub-system; the AHB system bus is thus a more urgent protection target than the register set and the memory. Moreover, the data of RPN_FM(k) in Table 13, for k from one to four, indicate that SDC is the most crucial failure mode in this illustrated example. Through the above vulnerability and risk analyses, we can identify the critical components and failure modes, which are the major targets for design enhancement. In this demonstration, if the system reliability/safety is inadequate, the top priority of design enhancement is to raise the robustness of the AHB HADDR bus signals so as to significantly reduce the rate of SDC and the scale of the system risk.
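The ranking argument above is mechanical once the RPNs are known. The chapter's risk model is defined in Section 3, which is not part of this excerpt; the sketch below therefore assumes a common FMEA-style product of occurrence and severity, with made-up scores, purely to illustrate how RPN_C(i) values pick out the protection priority.

```python
# Hypothetical occurrence and severity scores for the three components;
# the real values follow the Section 3 risk model and Tables 11-13.
components = {
    'AMBA AHB':     {'occurrence': 0.6, 'severity': 9},
    'register set': {'occurrence': 0.3, 'severity': 7},
    'memory':       {'occurrence': 0.2, 'severity': 6},
}

# FMEA-style RPN: higher means more urgent to protect.
rpn = {c: v['occurrence'] * v['severity'] for c, v in components.items()}
priority = sorted(rpn, key=rpn.get, reverse=True)
```

With these placeholder scores the priority list puts the AMBA AHB first, mirroring the conclusion drawn from Table 12; a per-failure-mode RPN_FM(k) ranking works the same way over the failure modes.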

Conclusion
Validating the functional safety of a system-on-chip (SoC) in compliance with international standards, such as IEC 61508, is imperative to guarantee the dependability of a system before it is put to use. Assessing SoC robustness in the early design phase is beneficial because it significantly reduces the cost and time of re-design. To fulfill these needs, this study has presented a valuable SoC-level safety validation and risk reduction (SVRR) process for hazard analysis and risk assessment, and has exploited an ARM-based SoC platform to demonstrate its feasibility and usefulness. The main contributions of this study are: first, to develop a useful SVRR process and risk model for assessing the scales of robustness and failure-induced risks in a system; second, to raise the level of dependability validation to the untimed/timed functional TLM and to construct a SoC-level system safety verification platform, including an automatic fault injection and failure mode classification tool, on the SystemC CoWare Platform Architect design environment, which dramatically increases the efficiency of the validation process; and third, to conduct a thorough vulnerability analysis and risk assessment of the register set, AMBA bus and memory sub-system based on a real ARM-embedded SoC.
The analyses help us measure the robustness of the target components and the system safety, and locate the critical components and failure modes to be guarded. Such results can be used to examine whether the safety of the investigated system meets the safety requirement; if not, the most critical components and failure modes are protected by effective risk reduction approaches to enhance the safety of the system. The vulnerability analysis provides a guideline for the prioritized use of robust components. Therefore, resources can be invested in the right place, and a fault-robust design can quickly achieve the safety goal with less impact on cost, die area, performance and power.

Acknowledgment
The author acknowledges the support of the National Science Council, R.O.C., under Contract No. NSC 97-2221-E-216-018 and NSC 98-2221-E-305-010. Thanks are also due to the