A Petri Net-Based Approach to the Quantification of Data Center Dependability

Data center availability and reliability have become a growing concern due to the increased dependence on Internet services (e.g., the Cloud computing paradigm, social networks and e-commerce). For companies that heavily depend on the Internet for their operations, service outages can be very expensive, easily running into millions of dollars per hour [15]. A widely used design principle in fault tolerance is to introduce redundancy to enhance availability. However, since redundancy leads to additional use of resources and energy, it is expected to have a negative impact on sustainability and the associated cost.


Introduction
Data center designers need to assess several trade-offs and select a feasible solution considering dependability metrics. In this context, formal models (e.g., stochastic Petri nets and reliability block diagrams) are important for providing estimates before the data center system is implemented. Additionally, a growing concern of data center designers is the identification of components that may cause system failure, as well as of system parts that must be improved before the architecture is implemented.
In this work, we propose a set of formal models for quantifying dependability metrics for data center power infrastructures. The adopted approach is a hybrid modeling technique that combines the advantages of both stochastic Petri nets (SPN) [22] and reliability block diagrams (RBD) [10] to evaluate system dependability. An integrated environment, namely ASTRO [20], has been developed as one of the results of this work to automate the dependability evaluation of data center architectures.

Preliminaries
This section briefly presents some fundamental concepts as a basis for a better understanding of this work. The class of Petri nets (PN) considered here has two kinds of nodes, called places (P), represented by circles, and transitions (T), represented by bars, such that $P \cap T = \emptyset$ and $P \cup T \neq \emptyset$. Figure 1 depicts the basic elements of a simple PN. The set of arcs F is used to denote the places connected to a transition (and vice versa). W is a weight function for the set of arcs. In this case, each arc is said to have multiplicity k, where k represents the respective weight of the arc. Figure 2 shows how multiple arcs connecting places and transitions are represented in a compact way by a single arc labeled with its weight or multiplicity k. Places and transitions may have several interpretations. Using the concept of conditions and events, places represent conditions and transitions represent events, such that an event may have several pre-conditions and post-conditions. Table 1 shows other interpretations for places and transitions [14]. It is important to note that there is another way to represent the elements of a PN. As an example, the sets of input and output transitions of a given place are shown in Definition 2.2. Similarly, the sets of input and output places of a transition are shown in Definition 2.3.

Definition 2.2. (Input and Output Transitions of a place)
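The set of input transitions (also called pre-set) of a place $p_i \in P$ is $\bullet p_i = \{t_j \in T \mid (t_j, p_i) \in F\}$, and the set of output transitions (also called post-set) is $p_i \bullet = \{t_j \in T \mid (p_i, t_j) \in F\}$.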

Definition 2.3. (Input and output places of a transition)
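The set of input places of a transition $t_j \in T$ is $\bullet t_j = \{p_i \in P \mid (p_i, t_j) \in F\}$, and the set of output places of a transition $t_j \in T$ is $t_j \bullet = \{p_i \in P \mid (t_j, p_i) \in F\}$.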

Marked Petri nets
The marking (a distribution of tokens) is a primitive concept in PNs, as are places and transitions. Tokens are information attributed to places; the number of tokens and their distribution over the places constitute the state of the net at a given moment. A marked Petri net contains tokens, which reside in places and travel along arcs, and whose flow through the net is regulated by transitions. A particular distribution (M) of the tokens in the places represents a specific state of the system. Tokens are denoted by black dots inside the places, as shown in Figure 1 (d).

Transition enabling and firing
The behavior of many systems can be described in terms of system states and their changes. In order to simulate the dynamic behavior of a system, a state (or marking) in a Petri net is changed according to the following firing rule:
1. A transition t is said to be enabled if each input place p of t is marked with at least as many tokens as the multiplicity of the arc connecting p to t. In mathematical notation, a transition t enabled in a given marking $m_i$ is denoted by $m_i[t\rangle$, which holds if $m_i(p) \geq W(p, t)$, $\forall p \in P$ (with $W(p, t) = 0$ when there is no arc from p to t).
2. An enabled transition may or may not fire (depending on whether or not the respective event takes place).
3. The firing of an enabled transition t removes tokens (equal to the multiplicity of the input arc) from each input place p, and adds tokens (equal to the multiplicity of the output arc) to each output place p. In mathematical notation, the firing of a transition is represented by the equation $m_j(p) = m_i(p) - W(p, t) + W(t, p)$, $\forall p \in P$.
Figure 3 (a) shows the mathematical representation of a Petri net model with three places ($p_0$, $p_1$, $p_2$) and one transition ($t_0$). There is one arc connecting the place $p_0$ to the transition $t_0$ with weight two, one arc from the place $p_1$ to the transition $t_0$ with weight one, and one arc connecting the transition $t_0$ to the place $p_2$ with weight two. The initial marking ($m_0$) is represented by three tokens in the place $p_0$ and one token in the place $p_1$. Figure 3 (b) outlines the respective graphical representation, and Figure 3 (c) provides the same graphical representation after the firing of $t_0$. For this example, the set of reachable markings is $\{m_0 = (3, 1, 0), m_1 = (1, 0, 2)\}$. The marking $m_1$ was obtained by firing $t_0$, such that $m_1(p_0) = 3 - 2 + 0 = 1$, $m_1(p_1) = 1 - 1 + 0 = 0$, and $m_1(p_2) = 0 - 0 + 2 = 2$. There are two particular cases in which the firing rule applies differently (Figure 4). The first one is a transition without any input place, which is called a source transition, and the other one is a transition without any output place, named a sink transition. A source transition is unconditionally enabled, and the firing of a sink transition consumes tokens but does not produce any. Figure 6 shows a self-loop net.
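The firing rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the dictionary-based encoding of arcs and the function names `is_enabled` and `fire` are assumptions made here, not part of the original models. The net is the one described for Figure 3.

```python
from typing import Dict, Tuple

Marking = Dict[str, int]
Arcs = Dict[Tuple[str, str], int]  # (source node, target node) -> arc weight


def is_enabled(marking: Marking, arcs: Arcs, t: str) -> bool:
    """t is enabled if every input place p holds at least W(p, t) tokens."""
    return all(marking[src] >= w for (src, dst), w in arcs.items() if dst == t)


def fire(marking: Marking, arcs: Arcs, t: str) -> Marking:
    """Return the new marking m'(p) = m(p) - W(p, t) + W(t, p)."""
    if not is_enabled(marking, arcs, t):
        raise ValueError(f"transition {t} is not enabled")
    new = dict(marking)
    for (src, dst), w in arcs.items():
        if dst == t:      # input arc p -> t: consume tokens
            new[src] -= w
        elif src == t:    # output arc t -> p: produce tokens
            new[dst] += w
    return new


# Net of Figure 3, as described in the text.
arcs = {("p0", "t0"): 2, ("p1", "t0"): 1, ("t0", "p2"): 2}
m0 = {"p0": 3, "p1": 1, "p2": 0}
m1 = fire(m0, arcs, "t0")
print(m1)  # {'p0': 1, 'p1': 0, 'p2': 2}
```

Note that a transition with no input arcs passes the enabling test trivially, which matches the statement that a source transition is unconditionally enabled.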

Elementary structures
Elementary nets are used as building blocks in the specification of more complex applications.

Sequence
The sequence structure represents the sequential execution of actions, provided that a condition is satisfied. After the firing of a transition, another transition is enabled to fire. Figure 7(a) depicts an example of this structure, in which a mark in place $p_0$ enables the transition $t_0$. The firing of transition $t_0$ enables the transition $t_1$ ($p_1$ is marked).

Fork
Figure 7(b) shows an example of a fork structure that allows the creation of parallel processes.

Join
Generally, concurrent activities need to synchronize with each other. This net (Figure 7(c)) combines two or more nets, allowing another process to continue its execution only after the predecessor processes have ended.

Choice
Figure 7(d) depicts a choice model, in which the firing of the transition $t_0$ disables the transition $t_1$. This building block is suited for modeling an if-then-else statement, for instance.

Merging
Merging is an elementary net that allows the enabling of the same transition by two or more processes. Figure 7(e) shows a net with two independent transitions ($t_0$ and $t_1$) that have an output place in common ($p_2$). Therefore, the firing of either of these two transitions creates a condition ($p_2$ is marked) which allows the firing of another transition (not shown in the figure).

Confusions
A mixture of conflict and concurrency is called confusion. While conflict is a local phenomenon, in the sense that only the pre-sets of the transitions with common input places are involved, confusion involves firing sequences. Figure 8 depicts two types of confusion: (a) symmetric confusion, where two transitions $t_1$ and $t_3$ are concurrent while each one is in conflict with transition $t_2$; and (b) asymmetric confusion, where $t_1$ is concurrent with $t_2$, but will be in conflict with $t_3$ if $t_2$ fires first.

Petri nets modeling examples
In this section, several simple examples are given in order to introduce how to model some basic concepts, such as parallel processes and mutual exclusion, in Petri nets.

Parallel processes
In order to represent parallel processes, a model may be obtained by composing the model of each individual process with fork and synchronization models. Two transitions are said to be parallel (or concurrent) if they are causally independent, i.e., one transition may fire before, after, or in parallel with the other. Figure 9 depicts an example of parallel processes, where transitions $t_1$ and $t_2$ represent parallel activities. When transition $t_0$ fires, it creates marks in both output places ($p_0$ and $p_1$), representing concurrency. When $t_1$ and $t_2$ are enabled for firing, each one may fire independently. The firing of $t_3$ depends on two pre-conditions, $p_2$ and $p_3$, implying that the system only continues if both $t_1$ and $t_2$ have fired. Figure 9 presents a net in which each place has exactly one incoming arc and exactly one outgoing arc. Thus, such a model represents a sub-class of Petri nets known as marked graphs.
Marked graphs allow representation of concurrency but not decisions or conflicts.
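As a rough illustration of the fork and synchronization behavior described above, the following Python sketch encodes a net with the transitions $t_0$ to $t_3$ and places $p_0$ to $p_3$ named in the text. The place `p_start`, and the arc from $t_3$ back to it, are assumptions introduced here so that every place has one incoming and one outgoing arc; the actual net of Figure 9 may differ.

```python
from typing import Dict, List

Marking = Dict[str, int]

# transition -> (input places, output places); all arc weights are 1.
# "p_start" is a hypothetical place closing the cycle (an assumption).
NET = {
    "t0": ({"p_start"}, {"p0", "p1"}),  # fork
    "t1": ({"p0"}, {"p2"}),             # parallel activity 1
    "t2": ({"p1"}, {"p3"}),             # parallel activity 2
    "t3": ({"p2", "p3"}, {"p_start"}),  # join (synchronization)
}


def enabled(m: Marking, t: str) -> bool:
    return all(m[p] >= 1 for p in NET[t][0])


def fire(m: Marking, t: str) -> Marking:
    if not enabled(m, t):
        raise ValueError(f"{t} is not enabled")
    new = dict(m)
    for p in NET[t][0]:
        new[p] -= 1
    for p in NET[t][1]:
        new[p] += 1
    return new


def run(m: Marking, sequence: List[str]) -> Marking:
    for t in sequence:
        m = fire(m, t)
    return m


m0 = {"p_start": 1, "p0": 0, "p1": 0, "p2": 0, "p3": 0}
after_fork = fire(m0, "t0")
# t1 and t2 are concurrent: both interleavings lead to the same marking.
a = run(after_fork, ["t1", "t2"])
b = run(after_fork, ["t2", "t1"])
print(a == b, enabled(a, "t3"))
```

In this sketch, $t_3$ becomes enabled only after both $t_1$ and $t_2$ have fired, regardless of their order.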

Mutual exclusion
The sharing of resources and/or data is common in many system applications, in which most resources and data must be accessed in a mutually exclusive way. Resources (or data variables) may be modeled by a place whose tokens represent the amount of resources. This place is seen as a pre-condition for all transitions that need such a resource. After a resource is used, it must be released. Figure 10 depicts an example of a machine that is accessed in a mutually exclusive way.

Dataflow computation
Petri nets can be used to represent not only the control flow but also the data flow. The net shown in Figure 11 is a Petri net representation of a dataflow computation. A dataflow is characterized by concurrent instruction execution (or transition firing) as soon as the operands (pre-conditions) are available. In the Petri net representation, tokens may denote values of current data as well as the availability of data. The instructions are represented by transitions, such as Add and Subtract, which can be executed in parallel. After that, if the activity Subtract has computed a result different from zero, the pre-conditions to perform the divide operation are satisfied. Afterwards, when the transition Divide fires, the dataflow computation is completed.
Figure 11. Dataflow example.

Petri nets properties
The PN properties allow a detailed analysis of the modeled system. For this, two types of properties have been considered in a Petri net model: behavioral and structural properties. Behavioral properties are those which depend on the initial marking. Structural properties, on the other hand, are those that are marking-independent.

Behavioral properties
This section, based on [14], describes some behavioral properties, since such properties are very important when analyzing a given system.

Reachability
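The firing of an enabled transition changes the token marking in a Petri net, and a sequence of firings results in a sequence of markings. A marking $M_n$ is said to be reachable from a marking $M_0$ if there exists a sequence of firings that transforms $M_0$ into $M_n$.
A firing (or occurrence) sequence is denoted by $\sigma = t_1, t_2, \ldots, t_n$. In this case, $m_i$ is reachable from $m_0$ by $\sigma$. The set of all markings reachable from $m_0$ is denoted by $R(m_0)$, and the set of all firing sequences from $m_0$ is denoted by $L(m_0)$.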

Boundedness
A Petri net is said to be bounded if the number of tokens in each place does not exceed a finite number k for any marking reachable from $M_0$. Formally, $M(p) \leq k$, $\forall p \in P$ and $\forall M \in R(M_0)$.

Safe
When the number of tokens in each place does not exceed the number "1" (one), such Petri net is said to be safe. It is important to state that if a net is bounded or safe, it is guaranteed that there will be no overflows in any place, no matter the firing sequence adopted.

Deadlock freedom
A PN is said to be deadlock free if there is no reachable marking such that no transition is enabled.
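For small nets, reachability, boundedness, safeness and deadlock freedom can be checked by enumerating the reachable markings. The sketch below is a minimal, assumed implementation (the tuple-based net encoding, the function names and the exploration `limit` are choices made here for illustration); it applies a breadth-first search to the net described for Figure 3 and to a hypothetical two-place cycle.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

Marking = Tuple[int, ...]
# transition name -> (tokens consumed per place, tokens produced per place)
Net = Dict[str, Tuple[Marking, Marking]]


def enabled(m: Marking, pre: Marking) -> bool:
    return all(tokens >= need for tokens, need in zip(m, pre))


def reachability_set(net: Net, m0: Marking,
                     limit: int = 10_000) -> Optional[Set[Marking]]:
    """Breadth-first enumeration of the markings reachable from m0.

    Returns None when more than `limit` markings are found, which
    suggests (but does not prove) that the net is unbounded.
    """
    seen = {m0}
    queue = deque([m0])
    while queue:
        m = queue.popleft()
        for pre, post in net.values():
            if enabled(m, pre):
                nxt = tuple(x - a + b for x, a, b in zip(m, pre, post))
                if nxt not in seen:
                    if len(seen) >= limit:
                        return None
                    seen.add(nxt)
                    queue.append(nxt)
    return seen


def bound(markings: Set[Marking]) -> int:
    """Smallest k such that M(p) <= k for every place and marking."""
    return max(max(m) for m in markings)


def dead_markings(net: Net, markings: Set[Marking]) -> List[Marking]:
    """Reachable markings in which no transition is enabled."""
    return sorted(m for m in markings
                  if not any(enabled(m, pre) for pre, _ in net.values()))


# Net of Figure 3: places (p0, p1, p2), one transition t0.
fig3 = {"t0": ((2, 1, 0), (0, 0, 2))}
r = reachability_set(fig3, (3, 1, 0))
print(sorted(r), bound(r), dead_markings(fig3, r))

# Hypothetical two-place cycle: safe (bound 1) and deadlock free.
cycle = {"ta": ((1, 0), (0, 1)), "tb": ((0, 1), (1, 0))}
rc = reachability_set(cycle, (1, 0))
print(bound(rc) == 1, dead_markings(cycle, rc) == [])
```

Under these assumptions, the Figure 3 net is bounded with k = 3 but not deadlock free, since no transition is enabled in (1, 0, 2).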

Liveness
Informally, a Petri net is said to be live if it is guaranteed that, no matter what firing sequence is chosen, it continues in deadlock-free operation. Formally, a Petri net $(N, M_0)$ is said to be live if, no matter what marking has been reached from $M_0$, it is possible to ultimately fire any transition of the net.
Liveness is an ideal property for many real systems. However, it is very strong and too costly to verify. Thus, the liveness condition is relaxed into different levels. A transition t is said to be live at the following levels:
• L0-Live (dead), if t can never be fired in any firing sequence in $L(m_0)$.
• L1-Live (potentially firable), if it can be fired at least once in some firing sequence in $L(m_0)$.
• L2-Live, if, given any positive integer k, t can be fired at least k times in some firing sequence in $L(m_0)$.
• L3-Live, if there is an infinite-length firing sequence in $L(m_0)$ in which t is fired infinitely often.
• L4-Live (or simply live), if it is L1-Live for every marking m in $R(m_0)$.

Persistence
A Petri net is said to be persistent if, for any two enabled transitions, the firing of one transition will not disable the other. Once a transition is enabled in a persistent net, it continues to be enabled until it fires. Persistence is closely related to conflict-free nets. It is worth noting that all marked graphs are persistent, but not all persistent nets are marked graphs. Persistence is a very important property when dealing with parallel system design and speed-independent asynchronous circuits.

Structural liveness
A PN N is said to be structurally live if there is a live initial marking for N.

Structural boundedness
A PN N is said to be structurally bounded if it is bounded for any finite initial marking $M_0$.

Structural conservativeness
A PN in which the weighted sum of tokens is constant over all reachable markings, for any initial marking, is said to be structurally conservative.

Structural repetitiveness
A PN is classified as repetitive if there is an initial marking $m_0$ and an enabled firing sequence $\sigma$ from $m_0$ such that every transition of the net is fired infinitely often. On the other hand, if only some of the transitions are fired infinitely often in the sequence $\sigma$, the net is called partially repetitive.

Consistence
A net is classified as consistent if there is an initial marking $m_0$ and an enabled firing sequence $\sigma$ from $m_0$ back to $m_0$ such that every transition of the net is fired at least once. If only some of the transitions are fired in the sequence $\sigma$, the net is called partially consistent.

Stochastic Petri nets
Petri nets [17] are a classic tool for modeling and analyzing discrete event systems which are too complex to be described by automata or queueing models. Time (stochastic delays) and probabilistic choices are essential aspects of a performance evaluation model. In this paper, we adopt the usual association of delays and weights with transitions [11], and an extended stochastic Petri net definition similar to [9], for a net with n places and m transitions:
• $O \in (\mathbb{N}^n \to \mathbb{N})^{n \times m}$ is a matrix of marking-dependent multiplicities of output arcs, where the entry $o_{jk}$ of O specifies the possibly marking-dependent multiplicity of the output arc from transition $t_j$ to place $p_k$. When a transition fires, it removes the number of tokens specified by the input arcs from its input places, and adds the number of tokens given by the output arcs to all its output places.
• $H \in (\mathbb{N}^n \to \mathbb{N})^{n \times m}$ is a matrix of marking-dependent multiplicities describing the inhibitor arcs, where the entry $h_{jk}$ of H returns the possibly marking-dependent multiplicity of an inhibitor arc from place $p_j$ to transition $t_k$. In the presence of inhibitor arcs, a transition is enabled to fire only if every place connected to it by an inhibitor arc contains fewer tokens than the multiplicity of the arc.
• $\Pi \in \mathbb{N}^m$ is a vector that assigns a priority level to each transition. Whenever several transitions are fireable at one point in time, the one with the highest priority fires first and leads to a state change.
• $M_0 \in \mathbb{N}^n$ is a vector that contains the initial marking of each place (initial state).
• $Atts : (Dist, W, G, Policy, Concurrency)^m$ comprises a set of attributes for the m transitions, where
  • $Dist \in \mathbb{N}^m \to \mathcal{F}$ is a possibly marking-dependent firing probability distribution function. In a stochastic timed Petri net, time has to elapse between the enabling and the firing of a transition. The actual firing time is a random variable, whose distribution is specified by $\mathcal{F}$. We distinguish between immediate transitions ($\mathcal{F} = 0$) and timed transitions, for which the domain of $\mathcal{F}$ is $(0, \infty)$.
  • $W \in \mathbb{R}^+$ is the weight function, which represents a firing weight $w_t$ for immediate transitions or a rate $\lambda_t$ for timed transitions. The latter is only meaningful for the standard case of timed transitions with exponentially distributed firing delays. For immediate transitions, the value specifies the relative probability of firing the transition when several immediate transitions are enabled in a marking and all have the same priority. A random choice is then made using the probabilities derived from $w_t$.
  • $G \in \mathbb{N}^n \to \{true, false\}$ is a function that assigns to each transition a guard condition related to place markings. Depending on the current marking, transitions may not fire (they are disabled) when the guard function returns false. This is an extension of inhibitor arcs.
  • $Policy \in \{prd, prs\}$ is the preemption policy (prd, preemptive repeat different, means that when a preempted transition becomes enabled again, the previously elapsed firing time is lost; prs, preemptive resume, means that the firing time related to a preempted transition is resumed when the transition becomes enabled again).
  • $Concurrency \in \{ss, is\}$ is the concurrency degree of transitions, where ss represents single-server semantics and is represents infinite-server semantics, in the same sense as in queueing models. Transitions with concurrency degree is can be understood as having an individual transition for each set of input tokens, all running in parallel.
In many circumstances, it might be suitable to represent the initial marking as a mapping from the set of places to the natural numbers ($m_0 : P \to \mathbb{N}$), where $m_0(p_i)$ denotes the initial marking of place $p_i$, and $m(p_i)$ denotes a reachable marking (reachable state) of place $p_i$. In this work, the notation $\#p_i$ has also been adopted for representing $m(p_i)$.
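To make the roles of input arcs, inhibitor arcs, guards, priorities and weights more concrete, the following Python sketch resolves a conflict among immediate transitions. It is a simplified illustration with constant (not marking-dependent) multiplicities; the `Transition` class, its field names and the example transitions are hypothetical and do not correspond to any particular tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Marking = Dict[str, int]


@dataclass
class Transition:
    name: str
    inputs: Dict[str, int] = field(default_factory=dict)      # place -> multiplicity
    inhibitors: Dict[str, int] = field(default_factory=dict)  # place -> multiplicity
    priority: int = 0
    weight: float = 1.0
    guard: Callable[[Marking], bool] = lambda m: True


def is_enabled(t: Transition, m: Marking) -> bool:
    """Input arcs need enough tokens, inhibitor arcs need fewer tokens
    than their multiplicity, and the guard must evaluate to true."""
    return (all(m[p] >= k for p, k in t.inputs.items())
            and all(m[p] < k for p, k in t.inhibitors.items())
            and t.guard(m))


def fireable(transitions: List[Transition], m: Marking) -> List[Transition]:
    """Enabled transitions that share the highest priority level."""
    candidates = [t for t in transitions if is_enabled(t, m)]
    if not candidates:
        return []
    top = max(t.priority for t in candidates)
    return [t for t in candidates if t.priority == top]


def firing_probabilities(transitions: List[Transition], m: Marking) -> Dict[str, float]:
    """Normalize the weights of the fireable transitions into probabilities."""
    candidates = fireable(transitions, m)
    total = sum(t.weight for t in candidates)
    return {t.name: t.weight / total for t in candidates}


net = [
    Transition("a", inputs={"p": 1}, weight=1.0),
    Transition("b", inputs={"p": 1}, weight=3.0),
    Transition("c", inputs={"p": 1}, inhibitors={"stop": 1}, priority=1),
    Transition("d", inputs={"p": 1}, guard=lambda m: m["p"] >= 2),
]
print(firing_probabilities(net, {"p": 1, "stop": 1}))  # {'a': 0.25, 'b': 0.75}
print(firing_probabilities(net, {"p": 1, "stop": 0}))  # {'c': 1.0}
```

In the first marking, the inhibitor arc disables `c` and the guard disables `d`, so the choice between `a` and `b` is made by weight; in the second, `c` is enabled and wins because of its higher priority.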

Dependability
Dependability of a computer system must be understood as the ability to deliver services, with respect to some agreed-upon specifications of the desired service, that can be fully trusted [1,13]. Indeed, dependability is related to disciplines such as fault tolerance and reliability. Reliability is the probability that the system will deliver a set of services for a given period of time, whereas a system is fault tolerant when it does not fail even when there are faulty components. Availability is another important concept, which quantifies the combined effect of both the failure and the repair processes in a system. In general, availability and reliability are related concepts, but they differ in the sense that the former may consider the maintenance of failed components [8] (e.g., a failed component is restored to a specified condition).
In many situations, modeling is the method of choice either because the system might not yet exist or due to the inherent complexity for creating specific scenarios under which the system should be evaluated. In a very broad sense, models for dependability evaluation can be classified as simulation and mathematical models. However, this does not mean that mathematical models cannot be simulated. Indeed, many mathematical models, besides being analytically tractable, may also be evaluated by simulation. Mathematical models can be characterized as being either state-based or non-state-based.
Dependability metrics (e.g., availability, reliability and downtime) might be calculated either by using RBD or SPN (to mention only the models adopted in this work). RBDs allow one to represent component networks and provide closed-form equations, so the results are usually obtained faster than by SPN simulation. Nevertheless, when faced with representing maintenance policies and redundancy mechanisms, particularly those based on dynamic redundancy methods, such models have drawbacks concerning the thorough handling of failure and repair dependencies. On the other hand, state-based methods can easily consider those dependencies, thus allowing the representation of complex redundancy mechanisms as well as sophisticated maintenance policies. However, they suffer from state-space explosion. Some of those formalisms allow both numerical analysis and stochastic simulation, and SPN is one of the most prominent models of this class.
If one is interested in calculating the availability (A) of a given device or system, one might need either the uptime and downtime or the time to failure (TTF) and time to repair (TTR). When the uptime and downtime are not available, the latter option is the one to adopt. If the evaluator needs only the mean value, the metrics commonly adopted are the Mean Time To Failure (MTTF) and the Mean Time To Repair (MTTR) (other central values might also be adopted). However, if one is also interested in the availability variation, the standard deviation of the time to failure (sd(TTF)) and the respective standard deviation of the time to repair (sd(TTR)) allow one to estimate it.
The availability (A) is obtained by steady-state analysis or simulation, and the following equation expresses its relation to MTTF and MTTR:
$$A = \frac{MTTF}{MTTF + MTTR} \quad (1)$$
Through transient analysis or simulation, the reliability (R) is obtained, and then the MTTF can be calculated, as well as the standard deviation of the time to failure (TTF):
$$MTTF = \int_0^\infty R(t)\,dt$$
$$sd(TTF) = \sqrt{2\int_0^\infty t\,R(t)\,dt - MTTF^2}$$
Considering a given period t, R(t) is the probability that the time to failure is greater than or equal to t. Regarding exponential failure distributions, reliability is computed as follows:
$$R(t) = \exp\left(-\int_0^t \lambda(t')\,dt'\right)$$
where $\lambda(t')$ is the instantaneous failure rate.
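The relations among availability, reliability and MTTF can be checked numerically. The sketch below is only an illustration under stated assumptions: a constant failure rate, hypothetical values MTTF = 1000 h and MTTR = 10 h, a year of 8760 hours for the downtime, and a simple trapezoidal rule truncated at a finite horizon for the integral of R(t).

```python
import math
from typing import Callable


def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def reliability_exponential(t: float, failure_rate: float) -> float:
    """R(t) = exp(-lambda * t) for a constant failure rate lambda."""
    return math.exp(-failure_rate * t)


def downtime_hours_per_year(a: float, hours_per_year: float = 8760.0) -> float:
    """Expected yearly downtime, assuming a year of 8760 hours."""
    return (1.0 - a) * hours_per_year


def mttf_from_reliability(reliability: Callable[[float], float],
                          horizon: float, steps: int = 100_000) -> float:
    """Trapezoidal approximation of the integral of R(t) over [0, horizon]."""
    h = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    total += sum(reliability(i * h) for i in range(1, steps))
    return total * h


mttf, mttr = 1000.0, 10.0  # hypothetical values, in hours
a = availability(mttf, mttr)
mttf_estimate = mttf_from_reliability(
    lambda t: reliability_exponential(t, 1.0 / mttf), horizon=20.0 * mttf)
print(round(a, 6), round(downtime_hours_per_year(a), 2), round(mttf_estimate, 2))
```

For the exponential case, the numerical integral should recover the MTTF used to define the failure rate, which is a convenient sanity check on the formulas.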
One should bear in mind that, for computing the reliability of a given system service, the repair activity of the respective service must not be represented. Besides, taking into account UA = 1 − A (unavailability) and Equation 1, the following equation is derived:
$$MTTR = MTTF \times \frac{UA}{A}$$
Likewise, the standard deviation of the time to repair (TTR) can be calculated as follows:
$$sd(TTR) = sd(TTF) \times \frac{UA}{A}$$
Next, $MTTF/sd(TTF)$ (and $MTTR/sd(TTR)$) are computed for choosing the expolynomial distribution that best fits the TTF and TTR distributions [6,22].
Figure 12 depicts the generic simple component model using SPN, which provides a high-level representation of a subsystem. One should notice the trapezoidal shape of the transitions (high-level transitions named s-transitions). This shape means that the delays of such transitions are not exponentially distributed; instead, they should be refined by subnets. The delay assigned to s-transition f is the TTF, and the delay of s-transition r is the TTR. If the TTF and TTR are exponentially distributed, the shape of the transitions should be the regular one (white rectangles), and TTF and TTR should be summarized by the respective MTTF and MTTR.
A well-established method that considers expolynomial distribution random variables is based on distribution moment matching. The moment matching process presented in [6] takes into account that hypoexponential and Erlangian distributions have an average delay (μ) greater than the standard deviation (σ), i.e., μ > σ, and that hyperexponential distributions have μ < σ, in order to represent an activity with a generally distributed delay as an Erlangian or a hyperexponential subnet referred to as an s-transition. One should note that, in cases where these distributions have μ = σ, they are indeed equivalent to an exponential distribution with parameter equal to $1/\mu$. Therefore, according to the coefficient of variation associated with an activity's delay, an appropriate s-transition implementation model can be chosen.
For each s-transition implementation model (see Figure 13), a set of parameters should be configured for matching its first and second moments. In other words, the delay distribution of the original activity (which might have been obtained by a measuring process) is matched with the first and second moments of the s-transition (expolynomial distribution). According to the aforementioned method, an activity with μ < σ is approximated by a two-phase hyperexponential distribution with parameters
$$r_1 = \frac{2\mu^2}{\mu^2 + \sigma^2}, \quad r_2 = 1 - r_1 \quad \text{and} \quad \lambda = \frac{2\mu}{\mu^2 + \sigma^2},$$
where λ is the rate associated with phase 1, $r_1$ is the probability related to this phase, and $r_2$ is the probability assigned to phase 2. In this particular model, the rate assigned to phase 2 is assumed to be infinite, that is, the related average delay is zero.
Figure 13. Hyperexponential Model
Activities with coefficients of variation less than one might be mapped either to hypoexponential or Erlangian s-transitions. If $\mu/\sigma \notin \mathbb{N}$, $\mu/\sigma \neq 1$ ($\mu, \sigma \neq 0$), the respective activity is represented by a hypoexponential distribution with parameters $\lambda_1$, $\lambda_2$ (exponential rates) and γ, the integer representing the number of phases with rate equal to $\lambda_2$, whereas the number of phases with rate equal to $\lambda_1$ is one. In other words, the s-transition is represented by a subnet composed of two exponential transitions and one immediate transition. The average delay assigned to the exponential transition $t_1$ is equal to $\mu_1$ ($\lambda_1 = 1/\mu_1$), and the respective average delay assigned to the exponential transition $t_2$ is $\mu_2$ ($\lambda_2 = 1/\mu_2$). γ is the integer value considered as the weight assigned to the output arc of transition $t_1$, as well as the input arc weight of the immediate transition $t_3$ (see Figure 14).
These parameters are calculated by the moment matching expressions given in [6]. If $\mu/\sigma \in \mathbb{N}$, $\mu/\sigma \neq 1$ ($\mu, \sigma \neq 0$), the activity is represented by an Erlangian s-transition with two parameters: $\gamma = (\mu/\sigma)^2$, an integer representing the number of phases of this distribution, and $\mu_1 = \mu/\gamma$, where $\mu_1$ ($= 1/\lambda_1$) is the average delay of each phase. The Erlangian model is a particular case of the hypoexponential model, in which every individual phase rate has the same value. The reader should refer to [6] for details regarding the representation of expolynomial distributions using SPN. For the sake of simplicity, the SPN models presented in the next sections consider only exponential distributions. Depending on the system characteristics, an RBD model (Figure 15) could be adopted instead of the SPN counterpart, whenever the former is more suitable.
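The selection of an s-transition model from μ and σ can be sketched as follows. This is an illustrative sketch, not the procedure of [6] verbatim: the hyperexponential and Erlangian parameters follow the description above, whereas the hypoexponential parameters are derived here under the assumption that the subnet behaves as one phase with mean $\mu_1$ followed by γ sequential phases with mean $\mu_2$ each, so that $\mu = \mu_1 + \gamma\mu_2$ and $\sigma^2 = \mu_1^2 + \gamma\mu_2^2$, with γ chosen as the largest integer strictly below $(\mu/\sigma)^2$. The function name and the returned dictionary layout are hypothetical.

```python
import math
from typing import Dict, Union

Params = Dict[str, Union[str, int, float]]


def match_moments(mu: float, sigma: float) -> Params:
    """Pick a distribution whose mean is mu and standard deviation is sigma."""
    if mu <= 0 or sigma <= 0:
        raise ValueError("mu and sigma must be positive")
    ratio = mu / sigma
    if math.isclose(ratio, 1.0):
        return {"type": "exponential", "rate": 1.0 / mu}
    if ratio < 1.0:  # mu < sigma: two-phase hyperexponential
        r1 = 2.0 * mu ** 2 / (mu ** 2 + sigma ** 2)
        return {"type": "hyperexponential", "r1": r1, "r2": 1.0 - r1,
                "rate": 2.0 * mu / (mu ** 2 + sigma ** 2)}
    if math.isclose(ratio, round(ratio)):  # mu/sigma is an integer: Erlangian
        gamma = round(ratio) ** 2
        return {"type": "erlang", "gamma": gamma, "mu1": mu / gamma}
    # Hypoexponential (assumed per-phase interpretation, see lead-in).
    gamma = math.ceil(ratio ** 2) - 1
    root = math.sqrt(max(0.0, gamma * (gamma + 1) * sigma ** 2 - gamma * mu ** 2))
    mu2 = (gamma * mu + root) / (gamma * (gamma + 1))  # one of the two solutions
    mu1 = mu - gamma * mu2
    return {"type": "hypoexponential", "gamma": gamma, "mu1": mu1, "mu2": mu2}


hyper = match_moments(10.0, 20.0)
erlang = match_moments(10.0, 5.0)
hypo = match_moments(10.0, 6.0)
print(hyper["type"], erlang["type"], hypo["type"])
```

A quick check of any result is to recompute the mean and the variance from the returned parameters and compare them with μ and σ².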

Related works
In the last few years, some works have been developed to perform dependability analysis of data center systems [24][26][27]. Reliability (which encompasses both the durability of the data and its availability for access) corresponds to the primary property that data center users desire [2].
Robidoux [28] proposes the Dynamic RBD (DRBD) model, an extension to RBD which supports the reliability analysis of systems with dependence relationships. The additional blocks (in relation to RBD) used to model dependence make the DRBD model complex. The DRBD model is automatically converted to a CPN model in order to perform the analysis of behavioral properties, which may certify the correctness of the model [18]. It seems that an interesting alternative would be to model the system directly using CPN or any other formalism (e.g., SPN) which is able to perform dependability analysis as well as to model dependencies between components.
Wei [25] presents a hierarchical method to model and analyze virtual data centers (VDC). The approach combines the advantages of both RBD and General SPN (GSPN) for quantifying availability and reliability. Data center power architectures are not the focus of that research, and the proposed models are specific to VDC modeling.
Additionally, redundancy of components to increase system reliability is costly. In [7], an approach is proposed for reliability evaluation and risk analysis of dynamic process systems using stochastic Petri nets. Unlike previous works, this paper proposes a set of models for the quantification of dependability metrics in the context of data center design. Furthermore, the adopted methodology for the quantification of those values takes into account a hybrid modeling approach, which utilizes RBD and SPN wherever each is best suited. The idea of mixing state-based (SPN) and non-state-based (RBD) models is not new (e.g., [23]), but, as far as we know, there is no similar work that applies such a technique to the evaluation of data center infrastructures. Besides, a tool is proposed to automate several activities.

Dependability models
The following sections present the adopted dependability models.

RBD Models
Reliability Block Diagram (RBD) [8] is a combinatorial model that was initially proposed as a technique for calculating reliability of systems using intuitive block diagrams. Such a technique has also been extended to calculate other dependability metrics, such as availability and maintainability [10]. Figure 16 depicts two examples, in which independent blocks are arranged through series (Figure 16(a)) and parallel (Figure 16(b)) compositions. In the series arrangement, if a single component fails, the whole system is no longer operational.
Assuming a system with n independent components, the reliability (instantaneous availability or steady-state availability) is obtained by
$$P_s = \prod_{i=1}^{n} P_i$$
where $P_i$ is the reliability $R_i(t)$ (instantaneous availability $A_i(t)$ or steady-state availability $A_i$) of block $b_i$.
For a parallel arrangement (see Figure 16(b)), if a single component is operational, the whole system is also operational. Assuming a system with n independent components, the reliability (instantaneous availability or steady-state availability) is obtained by
$$P_p = 1 - \prod_{i=1}^{n} (1 - P_i)$$
where $P_i$ is the reliability $R_i(t)$ (instantaneous availability $A_i(t)$ or steady-state availability $A_i$) of block $b_i$.
A k-out-of-n system functions if and only if k or more of its n components are functioning. Let p be the success probability of each of those blocks. The system success probability (reliability or availability) is given by
$$\sum_{i=k}^{n} \binom{n}{i} p^i (1 - p)^{n-i}$$
For other examples and closed-form equations, the reader should refer to [10].
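The three closed-form RBD compositions can be written directly as small Python functions. This is a minimal sketch assuming independent blocks, as in the text; the function names and the example probabilities are illustrative only.

```python
import math
from typing import Sequence


def series(ps: Sequence[float]) -> float:
    """Series arrangement: every block must work."""
    return math.prod(ps)


def parallel(ps: Sequence[float]) -> float:
    """Parallel arrangement: at least one block must work."""
    return 1.0 - math.prod(1.0 - p for p in ps)


def k_out_of_n(k: int, n: int, p: float) -> float:
    """At least k of n identical, independent blocks must work."""
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i)
               for i in range(k, n + 1))


# Hypothetical block availabilities.
print(series([0.9, 0.9]), parallel([0.9, 0.9]), k_out_of_n(2, 3, 0.9))
```

As a consistency check, a 1-out-of-n system coincides with the parallel arrangement and an n-out-of-n system with the series arrangement.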

SPN Models
This section presents two proposed SPN building blocks for obtaining dependability metrics.
Simple Component. The simple component has two states: functioning or failed. To compute its availability, the MTTF and MTTR should be represented. Figure 17 shows the SPN model of the "simple component", which has two parameters (not depicted in the figure), namely X_MTTF and X_MTTR, representing the delays associated with the transitions X_Failure and X_Repair, respectively. Besides, although the simple component model has been presented using the exponential distribution, other expolynomial distributions that best fit the TTF and TTR may be adopted following the techniques presented in [22].
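The behavior of the simple component can be imitated by a short stochastic simulation that alternates exponentially distributed failure and repair delays. The sketch below is an assumption-laden illustration rather than the evaluation performed by the tool: the values of X_MTTF and X_MTTR, the number of cycles and the seed are hypothetical.

```python
import random


def simulate_simple_component(mttf: float, mttr: float,
                              cycles: int, seed: int = 0) -> float:
    """Estimate availability as the fraction of time spent in the ON state."""
    rng = random.Random(seed)
    up = down = 0.0
    for _ in range(cycles):
        up += rng.expovariate(1.0 / mttf)    # delay of X_Failure (mean X_MTTF)
        down += rng.expovariate(1.0 / mttr)  # delay of X_Repair (mean X_MTTR)
    return up / (up + down)


x_mttf, x_mttr = 1000.0, 10.0  # hypothetical values, in hours
estimate = simulate_simple_component(x_mttf, x_mttr, cycles=20_000)
analytic = x_mttf / (x_mttf + x_mttr)
print(abs(estimate - analytic) < 1e-3)
```

With exponential delays the estimate should approach MTTF/(MTTF + MTTR) as the number of cycles grows.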
Cold standby. A cold standby redundant system is composed of a non-active spare module that waits to be activated when the main active module fails. Figure 18 depicts the SPN model of this system, which includes four places, namely X_ON, X_OFF, X_Spare1_ON and X_Spare1_OFF, that represent the operational and failure states of the main and spare modules, respectively. The spare module (Spare1) is initially deactivated, hence no tokens are initially stored in places X_Spare1_ON and X_Spare1_OFF. When the main module fails, the transition X_Activate_Spare1 is fired to activate the spare module. Table 2 presents the attributes of each transition of the model. When reliability is evaluated (number of tokens (#) in the place X_Rel_Flag = 1), the X_Repair, X_Activate_Spare1 and X_Repair_Spare1 transitions are assigned a very large delay (many times larger than the associated MTTF or MTActivate) to represent the absence of repair. MTActivate corresponds to the mean time to activate the spare module. Besides, when considering reliability, the weight of the arc that connects the place X_Wait_Spare1 to the X_Activate_Spare1 transition is two; otherwise, it is one. Both availability and reliability may be computed by the probability P{#X_ON = 1 OR #X_Spare1_ON = 1}.

Applications
This section presents the applicability of the proposed models to the dependability analysis of real-world data center power architectures (from HP Labs Palo Alto, U.S. [12]). The ASTRO environment was adopted to conduct the case study. ASTRO was validated in our previous work [5] [3] [4].

Architectures
The data center power infrastructure is responsible for providing uninterrupted, conditioned power at the correct voltage and frequency to the IT equipment. Figure 19 (a) depicts a real-world power infrastructure. From the utility feed (i.e., AC Source), the power typically goes through voltage panels, uninterruptible power supply (UPS) units, power distribution units (PDUs) (composed of transformers and electrical subpanels), junction boxes, and, finally, to rack PDUs (rack power distribution units). The power infrastructure (and, thus, the system) fails whenever neither of the paths depicted in Figure 19 is able to provide the power demanded (500 kW) by the IT components (50 racks). A path should be understood as a set of redundant interconnected components inside the power infrastructure. Another architecture is analyzed with an additional electricity generator (Figure 19 (b)) that supports the system when both AC sources are not operational.
This work adopts a hierarchical methodology for conducting dependability evaluation of data center architectures. In general, the methodology groups related components in order to generate subsystem models, which are adopted to mitigate the complexity of the final system model evaluation. Thus, the final model is a simpler approximation of a more intricate system model. The detailed model could be adopted instead, but at the expense of complexity.

Models
Following the adopted methodology, systems with no failure dependencies between components have been evaluated through RBD models. For instance, Figure 20 depicts the RBD model that represents the architecture A1.
In architecture A2, the generator is only activated when both AC sources are not available. Therefore, a model that deals with dependencies must be adopted. Figure 21 shows the SPN model that uses cold standby redundancy to represent the subsystem composed of the generator and the two AC sources. We also assume that the UPS batteries support the system during the generator activation. The reliability or availability is computed by the probability P{#ACSource1_ON = 1 OR #ACSource2_ON = 1 OR #Generator_ON = 1}.
The other components of architecture A2 are modeled using an RBD, as shown in Figure 22.
Once the results of both models (the RBD and the SPN model with dependencies) have been obtained, an RBD model with two blocks (one for the result of each model) in a series arrangement is created. The evaluation of this RBD provides the dependability results of architecture A2. The adopted MTTF and MTTR values for the power devices were obtained from [21] [29] [19] and are shown in Table 3.
The generator has increased the reliability of architecture A2. The availability results show similar behavior: the availability has increased from 5.47 to 7.96 (in number of 9's).
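The final composition step and the "number of 9's" metric can be sketched as follows. The sketch assumes the common convention that the number of nines equals -log10(1 - A) and that the blocks in series are independent; the function names and the subsystem availabilities in the usage lines are hypothetical and are not results from this work.

```python
import math


def series_availability(*block_availabilities: float) -> float:
    """Availability of independent blocks in series (all must be up)."""
    result = 1.0
    for a in block_availabilities:
        result *= a
    return result


def number_of_nines(availability: float) -> float:
    """Availability expressed in number of 9's, i.e. -log10(1 - A)."""
    return -math.log10(1.0 - availability)


def availability_from_nines(nines: float) -> float:
    """Inverse conversion: availability that corresponds to a nines value."""
    return 1.0 - 10.0 ** (-nines)


if __name__ == "__main__":
    # Hypothetical subsystem results: one from the SPN model, one from the RBD.
    a_spn, a_rbd = 0.999999, 0.99999
    a_system = series_availability(a_spn, a_rbd)
    print("system availability:", a_system)
    print("number of 9's      :", number_of_nines(a_system))

    # Convert the two reported nines values back into expected yearly downtime.
    for nines in (5.47, 7.96):
        downtime_h = (1.0 - availability_from_nines(nines)) * 8760.0
        print(nines, "nines ->", downtime_h * 3600.0, "seconds of downtime per year")
```

Because the series result is a product, the system availability is always bounded by the weaker of the two blocks, which is why improving the AC source subsystem with the generator has a visible effect on the overall figure.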

Conclusion
This work combines the advantages of both the Stochastic Petri Net (SPN) and Reliability Block Diagram (RBD) formalisms to analyze data center infrastructures. The approach is supported by an integrated environment, ASTRO, which allows data center designers to estimate dependability metrics before implementing the architectures. The methodology proposes that the system be evaluated piecewise, so that simpler models can be composed to represent a data center infrastructure appropriately. Moreover, experiments in which different architectures for a data center power infrastructure have been evaluated demonstrate the feasibility of the environment.