Input parameters of SRN models.
Availability quantification and prediction of IT infrastructure in data centers are of paramount importance for online business enterprises. In this chapter, we present comprehensive availability models for practical case studies in order to demonstrate a state-space stochastic reward net model for typical data center systems for quantitative assessment of system availability. We present stochastic reward net models of a virtualized server system, a data center network based on DCell topology, and a conceptual data center for disaster tolerance. The systems are then evaluated against various metrics of interest, including steady state availability, downtime and downtime cost, and sensitivity analysis.
- virtualized servers system
- data center system
- disaster tolerant data center
Data centers (DCs) have been at the core of modern ICT ecosystems in recent decades. Computing resources and crucial telecommunications are centralized in a data center to constantly facilitate online business and to connect people from distant parts of the world through the internet. Giant internet companies such as Facebook, Amazon, and Google have built huge state-of-the-art centers to house their own IT infrastructure. According to a study by the Ponemon Institute regarding the cost of data center outages at 63 DCs located in the United States over a 12-month period, the average cost of an unplanned outage in 2016 was US$ 740,357, an increase of 46% from US$ 505,502 when it was first studied in 2010. Specifically, a minute of downtime costs around US$ 7900 on average. However, online businesses actually face more severe revenue losses due to IT service downtime. In early 2016, Amazon suffered a business loss of US$ 66,240/minute due to server downtime over a period of approximately 15 minutes. The causes of system outages in DCs range from uncertain failures of IT parts/blocks to natural disasters. Therefore, quantifying IT infrastructure availability in DCs under various scenarios in advance of system development is of paramount importance for big tech companies.
Availability assessment approaches are primarily based on measurement and modeling methods. Model-based approaches are fast and relatively inexpensive methods for system availability analysis in comparison with measurement-based methods. System modeling can be accomplished using discrete-event simulation [2, 3], analytical models, or a hybrid of both approaches. Analytical models fall into four main categories [4, 5, 6, 7]: (i) non-state-space models (reliability graph (RelGraph), reliability block diagram (RBD), or fault tree (FT)), (ii) state-space models (Markov chains, stochastic Petri net (SPN), stochastic reward net (SRN), etc.), (iii) hierarchical models, and (iv) fixed-point iterative models. Non-state-space modeling paradigms provide a relatively quick evaluation of basic metrics for a system (reliability, availability, MTTF) with a proper capture of overall system architecture. State-space models, on the other hand, can capture sophisticated behaviors and operations of a system. This approach can handle failure/repair dependencies and complex interactions between system components. To avoid the largeness problem (or state-space explosion problem) in state-space models, one can use hierarchical modeling techniques that combine non-state-space and state-space models at the upper and lower levels, as well as fixed-point iterative models. In this chapter, we focus on studying complex system operations in DCs captured by using an SRN.
The structure of this chapter is organized into six sections. Section 2 provides preliminary concepts of availability modeling and analysis of data center systems (DCS). Subsequently, several case studies are presented. Section 3 offers an availability model of a unit system of the virtualized server (VSS) in DCs. In Section 4, we present availability modeling of a data center network (DCN) based on DCell topology. We present an SRN model for a DC in order to study disaster tolerance in Section 5. Finally, we present conclusions in Section 6.
2. Availability quantification of data center systems: basic concepts
Availability A(t) of a DCS represents the probability that the system is in a correct operating state at an instant t, regardless of the number of failures and repairs during the interval (0,t). Instantaneous/point availability A(t) is related to the system reliability, as defined in Eq. (1).
R(t) is the instantaneous reliability at t of the system, which is defined in Eq. (2):
f(x) is the probability density function of a random variable X, which represents the system’s lifetime or time to failure.
g(x) is a renewal process rate in the interval (0,t), as defined in Eq. (3).
m(x)dx is the probability that a renewal process cycle will be completed in the time interval [x,x + dx]. R(t-x) is the probability that the system works properly for the remaining time interval t-x. R(t-x)m(x)dx is the probability that a fault has occurred and that, after the repair/renewal (which occurred at the instant x, 0 < x < t), the system resumed functioning with no further faults. If a system is not repairable, A(t) is identical to the reliability R(t).
The failure rate (λ) is the frequency of system failure, determined as the total number of failures within an item population divided by the total time expended by that population during a particular measurement interval under the stated conditions. The repair rate (μ) is the frequency of system repair, determined as the average number of repairs over a period of maintenance time. Mean time to failure (MTTF) represents the expected time for which a system functions correctly before its first failure. Mean time to repair (MTTR) represents the expected time required for system repair. In the case where failure/repair events follow exponential distributions, MTTF and MTTR are the reciprocals of the failure and repair rates, as shown in Eq. (6). The steady-state availability (SSA) can be computed from Eq. (5).
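The relationship between rates, mean times, and SSA can be illustrated with a short sketch. It assumes the standard two-state (UP/DOWN) result, SSA = MTTF / (MTTF + MTTR), with exponentially distributed failure and repair times. The function names are hypothetical, and the host values (1 year to failure, 3 days to repair) are the defaults listed later in the chapter, with 1 year taken as 365 days.

```python
HOURS_PER_DAY = 24.0
DAYS_PER_YEAR = 365.0  # assumption: a non-leap year


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """SSA of a two-state repairable component: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Default host parameters: mean time to failure 1 year, mean time to repair 3 days.
host_mttf = DAYS_PER_YEAR * HOURS_PER_DAY
host_mttr = 3 * HOURS_PER_DAY

# With exponential distributions, the rates are the reciprocals of the mean times.
failure_rate = 1.0 / host_mttf  # lambda, per hour
repair_rate = 1.0 / host_mttr   # mu, per hour

ssa_from_times = steady_state_availability(host_mttf, host_mttr)
ssa_from_rates = repair_rate / (failure_rate + repair_rate)

print(f"host SSA from mean times: {ssa_from_times:.6f}")
print(f"host SSA from rates:      {ssa_from_rates:.6f}")
```

Both expressions give the same value, since mu / (lambda + mu) is simply MTTF / (MTTF + MTTR) rewritten in terms of rates.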
In industry, system administrators are usually concerned with system downtime (measured in minutes per year) and downtime cost (with a cost unit C per minute of system downtime). These values can be computed with Eq. (7) and (8).
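As a worked illustration of these two measures, the sketch below assumes the usual definitions: yearly downtime is (1 - SSA) multiplied by the number of minutes in a year, and downtime cost is that downtime multiplied by the per-minute cost C. The function names are hypothetical and a 365-day year is assumed. The inputs are the SSA of the three-cell DCN and the penalty of 16,000 USD per minute reported in Section 4.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def downtime_minutes_per_year(ssa: float) -> float:
    """Expected downtime in minutes per year for a given steady-state availability."""
    return (1.0 - ssa) * MINUTES_PER_YEAR


def downtime_cost_per_year(ssa: float, cost_per_minute: float) -> float:
    """Expected yearly downtime cost, given a cost C per minute of downtime."""
    return downtime_minutes_per_year(ssa) * cost_per_minute


def number_of_nines(ssa: float) -> float:
    """Availability expressed as a 'number of nines': -log10(1 - SSA)."""
    return -math.log10(1.0 - ssa)


ssa_dcn = 0.999950276761      # SSA of the three-cell DCN (Section 4)
penalty_usd_per_min = 16_000  # SLA penalty assumed in Section 4

downtime = downtime_minutes_per_year(ssa_dcn)
cost = downtime_cost_per_year(ssa_dcn, penalty_usd_per_min)
nines = number_of_nines(ssa_dcn)

print(f"downtime: {downtime:.1f} min/year")
print(f"cost:     {cost:,.0f} USD/year")
print(f"nines:    {nines:.2f}")
```

Under these assumptions the computed values agree with the figures reported for that case (about 26.1 minutes and 418,152 USD per year, 4.30 nines).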
Sensitivity analysis is performed to assess the importance of system parameters by two techniques. (i) Repeatedly substitute specific parameter values in one range at a time while the others remain constant, and observe system behaviors in accordance with the variation of the selected parameter. This approach studies the system responses upon a broad range of the parameters under consideration. (ii) Differential sensitivity analysis: compute partial derivatives of the measure of interest with respect to each system parameter as determined in Eq. (9) or (10) to yield a scaled sensitivity.
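Both techniques can be sketched on the simplest possible model, a two-state component with SSA = μ / (λ + μ). The sketch assumes the common definitions S = ∂A/∂θ for the sensitivity and SS = (θ/A)·∂A/∂θ for the scaled sensitivity; the derivative is approximated by a central finite difference, which is an illustrative choice rather than the method used by any particular tool. All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def availability(lam: float, mu: float) -> float:
    """SSA of a two-state component with failure rate lam and repair rate mu."""
    return mu / (lam + mu)


def parameter_sweep(values: List[float], mu: float) -> List[Tuple[float, float]]:
    """Technique (i): vary one parameter (lam) over a range, keep the others fixed."""
    return [(lam, availability(lam, mu)) for lam in values]


def partial_derivative(f: Callable[..., float], params: Dict[str, float],
                       name: str, rel_step: float = 1e-4) -> float:
    """Central finite-difference estimate of df/d(params[name])."""
    h = params[name] * rel_step
    upper = dict(params, **{name: params[name] + h})
    lower = dict(params, **{name: params[name] - h})
    return (f(**upper) - f(**lower)) / (2.0 * h)


def scaled_sensitivity(f: Callable[..., float], params: Dict[str, float],
                       name: str) -> float:
    """Technique (ii): (theta / A) * dA/dtheta, a unit-free importance measure."""
    return params[name] / f(**params) * partial_derivative(f, params, name)


# Illustrative values: mean time to failure 1 year, mean time to repair 3 days.
params = {"lam": 1.0 / 8760.0, "mu": 1.0 / 72.0}

sweep = parameter_sweep([params["lam"] * k for k in (0.5, 1.0, 2.0, 4.0)], params["mu"])
ss_lam = scaled_sensitivity(availability, params, "lam")
ss_mu = scaled_sensitivity(availability, params, "mu")

print(sweep)
print(f"scaled sensitivity w.r.t. lam: {ss_lam:.6f}")
print(f"scaled sensitivity w.r.t. mu:  {ss_mu:.6f}")
```

For this model the scaled sensitivities have the closed forms -λ/(λ + μ) and +λ/(λ + μ), so the numerical estimates can be checked directly; a negative sign means that increasing the parameter lowers availability.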
The stochastic reward net (SRN) has proven to be an appropriate modeling paradigm for capturing operational complexities in industrial hardware and software systems [9, 10, 11, 12, 13, 14]. Given a specific description of system operations, one can model system behaviors using places, transitions, and arcs, the three main components of an SRN model. To represent an entity of the system under consideration, we use tokens (normally denoted by a dot, or by an integer giving the number of corresponding entities), which reside in the places of the SRN model. To capture variations in operational state, we use input arcs to connect places to transitions and output arcs to connect transitions to places. A transition fires when a certain condition on the system state is met; its firing removes tokens from one place and deposits them in another. The movement of tokens in an SRN model captures the system's operations, while the residence of tokens in places represents the system's operational state at a given time, which is called a marking. The Boolean condition attached to a transition to enable or disable it is called a guard. A set of guard functions can be defined to articulate state dependencies between parts of the system. Marking dependence (denoted by a # sign attached to a transition) is incorporated when the transition's rate depends on the current marking of the SRN model. Other features of SRN, including inhibitor arcs, multiplicities, and input arcs, can simplify the construction of SRN models.
The SRN-based availability quantification framework is presented in Figure 1. It consists of three stages: (i) requirement specification, (ii) SRN-based system modeling, and (iii) system analysis. The service level agreement (SLA) [15, 16] between system owner and customer details the system specification and requirements. In stage (i), taking into account a literature review of prior art and the contemporary development of the system, one can define the problem statements to be modeled and observed. In stage (ii), the person in charge of modeling and evaluating the system can take default values of system parameters from previous work, and can propose the architecture design and the detailed system behaviors to be considered. The SRN is used to capture the pre-defined system operations. In stage (iii), the SRN system model is analyzed, and the system availability is evaluated with regard to various output measures of interest via different analysis approaches, such as steady-state availability and/or sensitivity analysis.
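The notions of marking, enabling, firing, and guard can be made concrete with a minimal sketch. It is not an SRN solver and does not model timing or rates; it only shows how tokens move between places. The class and the place name PH1up are hypothetical; the remaining names are patterned on the host and VMM models described in Section 3.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

Marking = Dict[str, int]  # place name -> number of tokens in that place


def always_enabled(marking: Marking) -> bool:
    """Default guard: imposes no extra condition."""
    return True


@dataclass
class Transition:
    name: str
    inputs: Dict[str, int] = field(default_factory=dict)   # place -> arc multiplicity
    outputs: Dict[str, int] = field(default_factory=dict)  # place -> arc multiplicity
    guard: Callable[[Marking], bool] = always_enabled

    def enabled(self, marking: Marking) -> bool:
        """Enabled if every input place holds enough tokens and the guard is true."""
        has_tokens = all(marking.get(p, 0) >= k for p, k in self.inputs.items())
        return has_tokens and self.guard(marking)

    def fire(self, marking: Marking) -> Marking:
        """Return the new marking: remove input tokens, deposit output tokens."""
        if not self.enabled(marking):
            raise ValueError(f"transition {self.name} is not enabled")
        new_marking = dict(marking)
        for place, k in self.inputs.items():
            new_marking[place] = new_marking.get(place, 0) - k
        for place, k in self.outputs.items():
            new_marking[place] = new_marking.get(place, 0) + k
        return new_marking


# Host failure: moves the host token from its up place to its failed place.
t_host_fail = Transition("TH1f", inputs={"PH1up": 1}, outputs={"PH1f": 1})

# Guarded transition: the VMM goes down only once its host has failed.
t_vmm_updn = Transition(
    "tVMM1updn",
    inputs={"PVMM1up": 1},
    outputs={"PVMM1dn": 1},
    guard=lambda m: m.get("PH1f", 0) == 1,
)

m0 = {"PH1up": 1, "PH1f": 0, "PVMM1up": 1, "PVMM1dn": 0}
m1 = t_host_fail.fire(m0)  # host fails
m2 = t_vmm_updn.fire(m1)   # guard now holds, so the VMM follows its host down
print(m0, m1, m2, sep="\n")
```

In the initial marking the guarded transition is disabled even though its input place holds a token; this is the kind of dependency between subsystems that guards are used to express.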
3. Case study I: a virtualized server system
3.1. System architecture
Figure 2 shows a general VSS architecture. A VSS is a computing unit in a DC which consists of a number of physical servers (also called hosts H1, H2, …, Hn). Each server is in turn virtualized using bare-metal virtualization technology [17, 18, 19]. Thus, each server hosts its own hypervisor (hereinafter, called the virtual machine monitor (VMM)). The physical server is capable of running a number of virtual machines (VM) on top of its VMM. For the sake of fault tolerance and data storage of VMMs and VMs, the physical servers are interconnected via a network pipeline to each other, and to a shared storage area network (SAN).
To focus on modeling complex behaviors of a virtualized system in a detailed manner, we consider a small-size VSS consisting of two hosts (H1 and H2) connected to a shared SAN. The hosts run their own virtual machine monitors, VMM1 and VMM2, respectively. Two VMs are also created on each host, denoted VM1 for host H1 and VM2 for host H2. In the next section, we will present SRN models of the above-mentioned subsystems. The models capture in detail various failure modes and recovery methods, including hardware failures in physical hosts and SAN [20, 21], failures due to non-aging related Mandelbugs on both VMM and VM subsystems, and software aging-related failures and corresponding time-interval software rejuvenation techniques for VMM and VM subsystems [23, 24]. Furthermore, we incorporate hierarchically complex dependencies between subsystems, including the dependencies of a VM on its VMM, a VM on the shared SAN, and a VMM on its host. Without loss of generality, the proposed SRN model represents the sophisticated operations of, and interactions between, subsystems in a typical virtualized system serving as a computing unit in a practical DC. The model can be further extended in the future by incorporating a large-scale cloud system as in .
3.2. SRN models of VSS
The SRN system model is presented in Figure 3. We use a two-state SRN model to capture the operational state (UP) and failed state (DOWN) of the physical parts, including host 1 (H1), host 2 (H2), and SAN, as shown in Figure 3(a)–(c), respectively.
The VMM subsystem models are shown in Figure 3(d) and (f) for VMM1 and its clock, respectively, and in Figure 3(e) and (g) for VMM2 and its clock, respectively. Without loss of generality, a model of a VMM (either VMM1 or VMM2) subsystem consists of six states (represented by shaded places): (i) normally running state (PVMMup), (ii) failure state due to non-aging Mandelbugs (PVMMf), (iii) down-state due to a failure of its underlying host (PVMMdn), (iv) failure-probable state due to aging problems (PVMMfp), (v) aging-failure state due to software aging (PVMMaf), and (vi) rejuvenation-process state (PVMMrej). Initially, there is a token in PVMMup to represent a running VMM. If it fails due to a non-aging Mandelbug, the timed transition TVMMf is fired to move the token into PVMMf. Recovery is captured by TVMMrepair. After running for a long time, the VMM suffers a high failure probability while remaining operational. Therefore, it goes to the failure-probable state PVMMfp as TVMMfp is fired. Failure due to aging occurs when TVMMaf is fired, and the VMM goes to the aging-failure state PVMMaf. Its recovery is represented by the firing of TVMMar. If the VMM’s underlying host goes down (i.e., a token is deposited in PHf in Figure 3(a) or (b), respectively) while the VMM is in one of the UP states (normal PVMMup or failure-probable PVMMfp), the VMM immediately enters the down-state PVMMdn through the immediate transitions tVMMupdn or tVMMfpdn. A reset is necessary for the VMM to go up (captured by TVMMreset) after its host is recovered. In the meantime, the VMM clock is initiated by a token in PVMMclock, which counts time by firing a timed transition TVMMclockinterval that follows a cVMM-stage Erlang distribution. The end of every software rejuvenation interval on a VMM is represented by a firing of TVMMclockinterval, upon which the token in PVMMclock is removed and deposited in PVMMpolicy.
Thus, rejuvenation is triggered if there is a VMM in PVMMup or PVMMfp by firing the immediate transitions tVMMuprej or tVMMrej. Also, the token in PVMMpolicy of the VMM clock model is moved to PVMMtrigger. The VMM represented by a token in PVMMrej is then rejuvenated and returned to the normal state PVMMup as TVMMrej is fired. The VMM clock is reset as tVMMclockreset is fired to start a new interval of time-based software rejuvenation on a VMM. The modeling of VMM1 on host H1 and that of VMM2 on host H2 are identical, following the general model description above.
Modeling of VM subsystems is shown in Figure 3(h) and (j) for the VM1 subsystem and its clock, respectively, and in Figure 3(i) and (k) for the VM2 subsystem and its clock, respectively. The models start with two tokens in PVMup, representing two VMs on each host. In general, the SRN model of a VM subsystem also consists of six states, as the VMM subsystem model does: (i) normal state (PVMup), (ii) failure state due to non-aging Mandelbugs (PVMf), (iii) down-state due to a failure of the underlying VMM (PVMdn), (iv) failure-probable state due to aging problems (PVMfp), (v) aging-failure state due to software aging (PVMaf), and (vi) rejuvenation-process state (PVMrej). The operations of the VM subsystem, in correspondence with the transitions of tokens in the SRN model, are described similarly to those of the VMM subsystem. However, the SRN model of the VM subsystem is further extended by incorporating (i) marking dependence, represented by a “#” mark near selected timed transitions (TVMfp, TVMf, TVMreset), to capture the cases in which two VMs in the same state compete with each other to transit to a new state, and (ii) dependence between the VM subsystem and the SAN. The second dependence is captured by the immediate transitions tVMupo, tVMfo, tVMdno, tVMfpo, tVMafo, and tVMrejo in the VM model, and tVMclocko, tVMpolicyo, and tVMtriggero in the VM clock model. When the SAN fails (depicted by a token in PSANf), these transitions are fired to remove the tokens in the VM model and VM clock model, regardless of their locations, representing the loss of VM images on the SAN and of VM clock functionalities. Nevertheless, as soon as the SAN is recovered, two VMs are immediately created on the SAN, and they are booted onto the VMM of the corresponding host. The creation of multiple VMs is captured by tVMstop, whereas the booting of the VMs in sequence is captured by TVMboot with marking dependence.
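The role of the Erlang-distributed clock can be illustrated numerically. The sketch assumes that the c stages are parameterized so that the mean of the Erlang distribution equals the clock interval τ, i.e., each stage is exponential with rate c/τ; this is a common way to approximate a nearly deterministic timer, but the chapter does not state the parameterization explicitly. The function names are hypothetical; the values τ = 1 week and c = 10 are the VMM defaults listed in the parameter table.

```python
import math
import random


def erlang_stage_rate(mean_interval_hours: float, stages: int) -> float:
    """Rate of each exponential stage so that the Erlang mean equals the interval."""
    return stages / mean_interval_hours


def erlang_coefficient_of_variation(stages: int) -> float:
    """Std/mean of a c-stage Erlang distribution: 1/sqrt(c)."""
    return 1.0 / math.sqrt(stages)


def sample_erlang(mean_interval_hours: float, stages: int, rng: random.Random) -> float:
    """Draw one clock interval as the sum of `stages` exponential stage times."""
    rate = erlang_stage_rate(mean_interval_hours, stages)
    return sum(rng.expovariate(rate) for _ in range(stages))


tau_vmm_hours = 7 * 24.0  # VMM clock interval: 1 week
c_vmm = 10                # number of Erlang stages

stage_rate = erlang_stage_rate(tau_vmm_hours, c_vmm)
cv = erlang_coefficient_of_variation(c_vmm)

rng = random.Random(2024)  # fixed seed so the sketch is reproducible
samples = [sample_erlang(tau_vmm_hours, c_vmm, rng) for _ in range(20_000)]
sample_mean = sum(samples) / len(samples)

print(f"stage rate: {stage_rate:.5f} per hour")
print(f"coefficient of variation: {cv:.3f}")
print(f"sample mean interval: {sample_mean:.1f} hours")
```

A single exponential stage (c = 1) has a coefficient of variation of 1, so increasing the number of stages makes the rejuvenation trigger more regular around its mean interval.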
The VM clock is also started after the recovery of a SAN, as captured by PVMclockstop and two immediate transitions tVMclockstop and tVMclockstart.
3.3. Availability analysis scenarios and results
Steady-state availability: We conducted numerical experiments in seven case studies with regard to different rejuvenation combinations. The case studies are described along with analysis results of SSA of VMM and SSA of VM in Table 2. The reward functions used to compute SSAs are defined as
where #PX is the number of tokens in place PX. The results show the following:
Time-based rejuvenation techniques with default parameters, when implemented on both VMM and VM subsystems in combination, do not yield the highest SSA for the virtualized system. When a VMM undergoes a rejuvenation process, it pulls down all VMs running on top of the VMM.
Rejuvenation on the VMM is more effective in achieving higher SSA than rejuvenation on the VM.
An appropriate rejuvenation combination implemented on either a VMM or VM with proper clock intervals can actually enhance system availability.
Sensitivity analysis of SSA: The sensitivity analysis is carried out in five case studies with respect to the variation of: (i) only VMM1 clock’s interval; (ii) only VM1 clock’s interval; (iii) both VMM1 and VMM2 clocks’ intervals; (iv) both VM1 and VM2 clocks’ intervals; and (v) all clock intervals with the same duration, as shown in Figure 4. The findings are as follows:
Figure 4(a) and (b) shows that rejuvenation processes on the VMM reduce the SSA of the VM, whereas those on the VM can improve it. A proper combination of rejuvenation processes on the VMM and VM can help maintain high values of SSA of the VM.
Figure 4(c) and (d) shows that no dependence of a VMM on its VMs has been incorporated in the modeling of the proposed VSS yet. Also, rejuvenation implemented on the VMM subsystems of both hosts yields a higher SSA of VMM than rejuvenation implemented on only one of the VMM subsystems.
|Parameter||Description||Transitions||Value||Parameter||Description||Transitions||Value|
|μhr||Host repair||TH1r, TH2r||3 days||λhf||Host fail||TH1f, TH2f||1 year|
|λvmmf||VMM non-aging failure||TVMM1f, TVMM2f||2654 hours||λvmf||VM non-aging failure||TVM1f, TVM2f||2893 hours|
|μvmmr||VMM reset||TVMM1reset, TVMM2reset||1 min||δvmr||VM repair||TVM1repair, TVM2repair||30 min|
|δvmmr||VMM repair||TVMM1repair, TVMM2repair||100 min||μvmr||VM restart||TVM1reset, TVM2reset||50 s|
|βvmmfp||VMM failure-probable||TVMM1fp, TVMM2fp||2 months||βvmfp||VM failure-probable||TVM1fp, TVM2fp||1 month|
|λvmmaf||VMM aging-failure||TVMM1af, TVMM2af||2 weeks||λvmaf||VM aging-failure||TVM1af, TVM2af||1 week|
|μvmmar||VMM aging recovery||TVMM1ar, TVMM2ar||120 min||μvmar||VM aging recovery||TVM1ar, TVM2ar||120 min|
|τvmm||VMM clock interval||TVMM1clockinterval, TVMM2clockinterval||1 week||τvm||VM clock interval||TVM1clockinterval, TVM2clockinterval||3.5 days|
|βvmmrej||VMM rejuvenation||TVMM1rej, TVMM2rej||2 min||βvmrej||VM rejuvenation||TVM1rej, TVM2rej||1 min|
|1 year 3 days||ηvmb||VM booting after VMM rejuvenation||TVM1boot, TVM2boot||50 s|
|cVMM||cVMM-stage Erlang distribution||x||10||cVM||cVM-stage Erlang distribution||x||10|
|Cases||Description||SSA of VMM||SSA of VM|
|I||Rejuvenation is applied on all VMM and VM subsystems in both hosts.||0.999912470996||0.991769547666|
|II||Rejuvenation is not applied on one of the two VMM subsystems but is applied on both VM subsystems in the two hosts.||0.999908948744||0.991766082049|
|III||Rejuvenation is applied on both VMM subsystems in the two hosts but is not applied on one of the two VM subsystems.||0.999912470996||0.991770317258|
|IV||Rejuvenation is not applied on one half of the system (the VMM1 and VM1 subsystems) but is applied on the VMM2 and VM2 subsystems.||0.999908948744||0.991766912872|
|V||Rejuvenation is not applied on either VMM subsystem in the two hosts but is applied on both VM subsystems.||0.999905284754||0.991763344539|
|VI||Rejuvenation is applied on both VMM subsystems in two hosts, but not applied on both VM subsystems.||0.999912470996||0.991771080172|
|VII||Rejuvenation is not applied on VMM and VM subsystems in both hosts.||0.999905284754||0.99176419998|
4. Case study II: a DCell-based data center network
4.1. A typical DCN architecture
In this section, the system under consideration is expanded to a network of virtualized servers complying with a DCell topology. A DCell is recursively constructed based on the most basic element DCell0 as follows:
A DCell0 consists of n physical servers connected to an n-port switch.
A DCell1 is composed of n + 1 DCell0s. Each server of a DCell0 in a DCell1 has two links. One connects to its switch; the other connects to the corresponding server in another DCell0, complying with a predetermined DCell routing algorithm. Consequently, every pair of DCell0s in a DCell1 is connected by exactly one link.
A DCellk (level k) is built recursively from multiple DCellk-1s in the same manner.
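The recursive construction can be sketched in a few lines. The sketch assumes the usual DCell construction rule: a level-k DCell is built from (t + 1) DCells of level k-1, where t is the number of servers in one level-(k-1) DCell, and at level 1 server j-1 of cell i is linked to server i of cell j for every i < j. The function names and the (cell, server) index convention are illustrative assumptions.

```python
from typing import List, Tuple

Server = Tuple[int, int]  # (cell index, server index within the cell)


def dcell_server_counts(n: int, k: int) -> List[int]:
    """Number of servers in DCell0 .. DCellk for n servers per DCell0."""
    counts = [n]
    for _ in range(k):
        servers_below = counts[-1]
        # A level-l DCell consists of (servers_below + 1) DCells of level l-1.
        counts.append((servers_below + 1) * servers_below)
    return counts


def dcell1_links(n: int) -> List[Tuple[Server, Server]]:
    """Server-to-server links of a DCell1: (i, j-1) <-> (j, i) for all i < j."""
    links = []
    for i in range(n):                # cells 0 .. n-1 on the lower-index side
        for j in range(i + 1, n + 1):  # cells i+1 .. n on the higher-index side
            links.append(((i, j - 1), (j, i)))
    return links


counts = dcell_server_counts(n=2, k=2)
links = dcell1_links(n=2)

print(counts)  # servers in DCell0, DCell1, DCell2 for n = 2
for a, b in links:
    print(f"server {a} <-> server {b}")
```

For n = 2 this gives a DCell1 of three cells and six servers, the size of the network studied in this section, and each pair of cells shares exactly one direct server-to-server link.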
To apply the proposed modeling approach using SRN, we focus on studying a special case of DCell-based DCN at level 1 (DCell1). Particularly, a cell DCell0 consists of two physical servers and one shared switch. DCell1 is composed of three DCell0s, as shown in Figure 5. We assume that each server has two NICs, one for connecting to the switch in the same cell, and the other for direct connection between the server in a cell and the corresponding server in another cell, which complies with DCell network routing topology. The system architecture is detailed as follows: (i) DCell0 consists of switch S0, two hosts H00 and H01, a number of VMs (n00 of VM00 and n01 of VM01) on the hosts H00 and H01, respectively; (ii) the description of other cells goes in the same manner.
4.2. Proposed SRN model
The SRN system model of the DCell-based DCN is presented in Figure 6. To simplify the modeling and to focus on sophisticated interactions between VMs and servers in a cell and in different cells of the network, we use two-state SRN models (consisting of UP and DOWN states) for the physical parts of the system, including hosts and switches, as shown in Figure 6(a)–(j). Initially, there is a token in the UP state of each physical-part model, depicted by a black dot, which represents the initial normal working state of the physical hosts and switches. In contrast to the VSS case study in Section 3, we do not model the VMM subsystem separately. Instead, we combine host and VMM in a single model by taking the equivalent mean time to failure (MTTFeq) and equivalent mean time to repair (MTTReq) of the VMM subsystem as input parameters of the two-state host models. We also simplify the modeling of the VM subsystem by using only two-state SRN models, as shown in Figure 6(g) (VM subsystem model). In the general case, there is an initial number of VMs on each host, represented by tokens in the UP states. Specifically, in the first cell (switch S0, hosts H00 and H01), there are n00 VMs in PVM00up and n01 VMs in PVM01up. In the second cell, the numbers of VMs initially running in a normal state are n10 of VM10 and n11 of VM11, which are hosted on H10 and H11, respectively. The corresponding numbers in the third cell are n20 of VM20 and n21 of VM21. Unlike the SRN model of a single unit of VSS in Figure 3, the SRN system model here captures VM live migration within a cell and between different cells for the sake of fault tolerance and improvement of system availability.
VM migration is performed between the two hosts in a cell when one of the hosts experiences downtime due to a failure. In the first cell (hosts H00 and H01), for instance, VM live migration is triggered to migrate all running VMs from host H00 to host H01 immediately when host H00 fails (represented by a token in PH00dn). The immediate transition tH00f is triggered to remove all tokens in PVM00up and deposit them in PVM01mig. As the timed transition TVM01mig is fired, the tokens in PVM01mig are removed and deposited in PVM01up, representing the completion of the VM live migration from H00 to H01. If host H01 fails (i.e., a token is placed in PH01dn), VM live migration is performed from H01 to H00 and is captured by the immediate transition tH01f (to trigger the VM live migration), the place PVM00mig (the state of a VM in migration), and the timed transition TVM00mig (to represent the time the migration takes to complete). VM live migration within a cell is modeled in the same manner for the other two cells.
In the case of a failed switch in a cell, VM live migration is performed between two hosts in two different cells via a peer-to-peer connection. For instance, if switch S0 fails, the connection between the two hosts H00 and H01 in the first cell and the connections of both hosts to outside users are disrupted. However, the VMs VM00 and VM01 are still running on hosts H00 and H01, respectively. It is necessary to migrate these VMs to other cells in order to enhance the overall availability of the system. The VM migration processes from the first cell to the other two cells are triggered by the two immediate transitions tVM01m (to migrate VMs from the first cell to the second cell) and tVM02m (to migrate VMs from the first cell to the third cell). After that, the tokens in PVM00up are removed and deposited in PVM01m, and are then deposited in PVM10up in the second cell as TVM01m is fired. The movement of tokens from PVM00up in the first cell to PVM10up in the second cell captures the migration of the VMs on host H00 between the two cells after a failure of switch S0. Likewise, the tokens in PVM01up are removed and deposited in PVM02m, and are then deposited in PVM20up in the third cell. This represents the migration of the VMs on host H01 from the first cell to the third cell after the failure of switch S0.
Without loss of generality, the VM live migration techniques within a cell and between two cells are described in detail above for the first cell (switch S0, hosts H00 and H01). These migrations apply similarly to the other two cells.
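One plausible way to obtain the mean firing times of the migration transitions is to divide the VM image size by the available network bandwidth, since both quantities appear among the default input parameters. The sketch below works under that assumption; the chapter does not state this derivation explicitly. It further assumes decimal units, reads "GB" as gigabytes and "Mb" as megabits, and uses hypothetical function names.

```python
BITS_PER_BYTE = 8
GIGA = 1e9  # assumption: decimal (SI) prefixes
MEGA = 1e6


def migration_time_seconds(image_size_bits: float, bandwidth_bits_per_s: float) -> float:
    """Time to transfer a VM image over a link, ignoring protocol overhead."""
    if bandwidth_bits_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return image_size_bits / bandwidth_bits_per_s


def migration_rate_per_hour(image_size_bits: float, bandwidth_bits_per_s: float) -> float:
    """Reciprocal of the mean migration time, expressed per hour."""
    return 3600.0 / migration_time_seconds(image_size_bits, bandwidth_bits_per_s)


# Default parameters, under the unit interpretation stated above.
vm_image_bits = 10 * GIGA * BITS_PER_BYTE    # VM image size: 10 GB
intra_cell_bw = 1 * GIGA * BITS_PER_BYTE     # within a DCell0: "1 GB/s" (gigabytes/s)
inter_cell_bw = 256 * MEGA                   # between two DCell0s: "256 Mb/s" (megabits/s)

t_intra = migration_time_seconds(vm_image_bits, intra_cell_bw)
t_inter = migration_time_seconds(vm_image_bits, inter_cell_bw)

print(f"intra-cell migration: {t_intra:.1f} s "
      f"({migration_rate_per_hour(vm_image_bits, intra_cell_bw):.1f} per hour)")
print(f"inter-cell migration: {t_inter:.1f} s "
      f"({migration_rate_per_hour(vm_image_bits, inter_cell_bw):.2f} per hour)")
```

Under this reading, migration between cells is markedly slower than migration within a cell; a different interpretation of the units would change the numbers but not the structure of the calculation.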
4.3. Availability evaluation
The proposed SRN models are all implemented in SPNP. The default input parameters are listed in Table 3. To reduce the complexity of model analysis, we initiate only one VM on each host H00 and H01 in cell DCell0 in the default case, and there are no other VMs in the other cells. However, we also evaluate the impact of the number of VMs in the DCN on the overall system availability. In this case-study, we consider two different evaluation scenarios: (I) a standalone DCell0 (with two hosts and one switch), and (II) the proposed three-cell DCN (as modeled above). The reward rates used to compute SSA of the two cases are defined as follows:
We first evaluate SSA and downtime of the two scenarios as shown in Table 4. We assume that a minute of system downtime incurs a penalty of 16,000 USD for the system owner according to the SLA signed with customers. The results clearly show that the proposed three-cell DCN achieves much higher availability, and thus fewer downtime minutes and a lower downtime cost penalty per year, than a standalone cell with only two physical servers.
We also evaluate the impact of the initial number of VMs in a DCN on the system’s overall availability, as shown in Table 5. The results show that as we increase the initial number of VMs, the overall system availability also increases. SSA also increases faster in the proposed three-cell DCN than in the standalone DCell0. However, if the initial number of VMs (represented by the total number of tokens in the proposed SRN system model) becomes large, computing the system availability causes a memory error due to the largeness problem of the SRN model.
Sensitivity analysis of SSA: We observe the variation of SSA in accordance with changes in the selected input parameters, including MTTF and MTTR of hosts, VMs and switches, and VM migration rate between two hosts in a cell or in two different cells, as shown in Figure 7. The results show that:
SSA is improved as we increase MTTFs and VM migration rates, and as we decrease MTTRs.
In Figure 7(a), we see that the switch is an important component of the network: when its MTTF is small, the SSA drops almost vertically, far more sharply than it does for the MTTFs of other components. Furthermore, the MTTF of a host is a significant parameter in the long run, since increasing it improves the overall availability more than increasing the other MTTFs.
In Figure 7(b), we clearly find that the repair time of a switch does not affect the SSA because we perform VM migration between cells to tolerate the failures of switches. This ensures that VMs can be migrated to other cells, regardless of the failure/recovery of a certain switch. However, we can see that the recovery of a VM has a greater impact on SSA than that of a host.
In Figure 7(c), increasing the migration rate of VMs between cells enhances SSA more clearly than increasing the migration rate within a cell. However, a low VM migration rate within a cell severely reduces the system’s availability.
|Parameter||Description||Value||Parameter||Description||Value|
|λH||Host failure rate||800 hours||μH||Host repair rate||9.8 hours|
|λVM||VM failure rate||4 months||μVM||VM repair rate||30 min|
|λS||Switch failure rate||1 year||μS||Switch repair rate||24 hours|
|ωmig||Network bandwidth within a DCell0||1 GB/s||ωm||Network bandwidth between two DCell0s||256 Mb/s|
|SVM||VM image size||10 GB||n00, n01||No. of initial VMs in DCell0||1|
|Case||Description||SSA||No. of nines||Downtime (min/year)||Downtime cost (USD/year)|
|II||Proposed three-cell DCN||0.999950276761||4.30||26.1||418,152|
5. Case study III: a Disaster Tolerant Data Center (DTDC)
5.1. A typical system architecture of a DTDC
This case study considers disaster tolerance of cloud computing in a DCS. The system is composed of two different DCs (DC1 and DC2), which are geographically located in two distant regions, as shown in Figure 8. In each DC, we place a VSS of two physical servers (H1 and H2 in DC1, and H3 and H4 in DC2). All physical machines are assumed to be identical. Each server is initially capable of running a VM (VM1–VM4 run on H1–H4, respectively). Each DC is equipped with shared network-attached storage (NAS) to provide distributed storage and a VM migration mechanism between the two hosts in the same DC. To implement disaster tolerance and recovery strategies between DCs, a back-up server is incorporated to provide VM data backup. The back-up server allows periodic synchronization of VM data between DCs. This allows the most recently updated VM data to be recovered onto an operational DC after a disaster strikes the other DC.
Furthermore, to enhance the system’s overall availability, we use the (active-standby) fail-over technique and a VM switching mechanism. Specifically, when a VM on a certain host fails, a standby VM on the same host wakes up and takes over the operations of the failed VM. If there is no standby VM on the same host, a standby VM on the remaining host goes up and takes the place of the failed VM.
If a host in a DC fails, its VMs in the standby state are switched on and loaded onto the remaining host. Various VM migration mechanisms are also taken into account in this system. VM live migration is performed between the two hosts in a DC when one of the hosts fails. VM migration between two DCs is triggered when a DC undergoes a system failure, that is, when both of its hosts are down at the same time. When a disaster devastates a DC, VM migration between the back-up server (in a safe zone) and the remaining operational DC is implemented as a means of disaster recovery.
5.2. Availability modeling of a DTDC
The SRN system model for availability quantification of the studied DTDC is shown in Figure 9. We use simplified two-state SRN models (UP and DOWN) to capture general failure and recovery behaviors of physical parts in the system, including the physical hosts H1–H4 (Figure 9(a), (b), (j), and (i), respectively), NAS1 in DC1, and NAS2 in DC2 (Figure 9(c) and (h), respectively). We use immediate transitions tHupo, tHdowno, tNASupo, and tNASdowno to remove tokens in the up and down places of the host and NAS models in order to represent the entire operational termination of a DC when a disaster strikes. When the disaster passes and the reconstructed DC starts a new operational cycle, the immediate transitions tHupin and tNASupin are used to deposit new tokens in the up states of the host and NAS models. The occurrence of a disaster at a site is also represented by using a two-state model as shown in Figure 9(d) and (g) for the occurrence of a disaster at DC1 and DC2, respectively. The two-state SRN model in Figure 9(f) captures the operational and failure states of the back-up server.
The models of the VM subsystems in DC1 and DC2 are shown in Figure 9(e) and (k), respectively. Since we initially assume that all hosts and VMs are identical, the models of the two DCs are also identical. The model is initialized with N tokens in PVM1up and N tokens in PVM2std, representing N operational VMs and their N standby VMs. Each VM sub-model has four main states: the operational state (PVMup), failure state (PVMfail), standby state (PVMstd), and synchronization state (PVMsync). If a VM fails, it moves from the up state PVMup to the failure state PVMfail. When the failed VM is repaired, it moves to the standby state PVMstd. The active-standby fail-over mechanism of VMs is captured as follows. When a VM fails, a standby VM (represented by a token in PVMstd) on the same host (before the disaster) or on the remaining host (after the disaster) moves to PVMsync in order to synchronize with the most recent data of the failed VM on the NAS of that DC. It then moves to PVMup and takes the place of the failed VM. Dependence marks are placed near the timed transitions TVMfail and TVMrepair to represent the competition between failure and repair of VMs on the same host. VM live-migration is triggered when a host fails, which is captured by an immediate transition tVMm, a place PVMSm, and a timed transition TVMmigrate. For instance, when host H1 fails, live-migration is triggered to migrate the running VMs from the failed H1 to the running H2. Thus, tVMm12 is enabled and fires. A number of tokens are removed from PVM1up and deposited in PVMS1m12, where they wait for migration. The timed transition TVM2migrate then fires to depict the migration of the VMs onto host H2: the tokens in PVMS1m12 are removed and deposited in PVM2up. The reverse migration from host H2 to H1 is captured by tVMm21, PVMS1m12, and TVM1migrate in the same manner.
The places PVMS1m and PVMS2m represent the storage of VMs on NAS1 and NAS2. When the two hosts in a DC enter downtime, all tokens in the VM sub-models of VM1 and VM2 are removed by the immediate transitions tVMupo, tVMfailo, tVMstdo, and tVMsynco (attached to the four main states of the VM sub-models) and deposited in PVMS1m via tVMS1min. However, if a disaster strikes, all tokens are removed from the places in the VM sub-models via the outgoing immediate transitions tVMupo, tVMfailo, tVMstdo, tVMsynco, tVMSmo, and tVMSmo. Once the failed data center is reconstructed, a predefined number of VMs are created on the NAS, which is captured by depositing tokens in PVMSm via tVMSmin. The VMs are then assigned to hosts via the timed transition TVMSmin.
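The token movement of the live-migration example (H1 fails, its VMs move to H2) can be traced with a small token-game sketch. This is only an illustration of the firing rule, not the SPNP model: the `fire` helper is hypothetical, guard functions (such as "H1 is down") are omitted, N = 2 is chosen arbitrarily, and all tokens in PVM1up are assumed to migrate together.

```python
from collections import Counter


def fire(marking, inputs, outputs):
    """Fire one transition and return the new marking.

    `inputs` and `outputs` map place names to token counts. A ValueError is
    raised when an input place does not hold enough tokens (not enabled).
    """
    for place, count in inputs.items():
        if marking[place] < count:
            raise ValueError(f"transition not enabled: {place} lacks tokens")
    new_marking = Counter(marking)
    new_marking.subtract(inputs)
    new_marking.update(outputs)
    return +new_marking  # unary plus drops places that hold zero tokens


# Initial marking: N running VMs on H1 and N standby VMs on H2 (N = 2 here).
marking = Counter({"PVM1up": 2, "PVM2std": 2})

# Host H1 fails: tVMm12 moves the running VMs into the waiting place.
n = marking["PVM1up"]
after_tvmm12 = fire(marking, {"PVM1up": n}, {"PVMS1m12": n})

# TVM2migrate completes the migration onto host H2.
after_migrate = fire(after_tvmm12, {"PVMS1m12": n}, {"PVM2up": n})

print(dict(after_tvmm12))
print(dict(after_migrate))
```

The total number of tokens is conserved across both firings, which mirrors the fact that live-migration relocates VMs without creating or destroying them.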
The VM migration techniques between the two DCs, and between the back-up server and the two DCs, are modeled in Figure 9(l). The place PVMB represents the storage of VMs in the back-up server. When a DC is destroyed by a disaster, its VMs are stored in the back-up server, which is represented by creating new tokens in PVMB via the timed transition TVMBin. When a remaining DC is in its operational state, the tokens in PVMB are moved to the corresponding PVMSmig via the timed transition TVMSpre. The tokens are then deposited in PVMSm via the timed transition TVMSm of the respective DC model with an imperfect coverage factor CBmig. If this process fails, with probability (1-CBmig), the tokens are moved to PVMS2mf via TVMSmf and returned to PVMB via TVMSmfrec. This movement of tokens captures the VM migration from the back-up server to the operational DC. When the back-up server fails, the immediate transitions tVMBo, tVMSmigo, and tVMSmfo remove all tokens in PVMB, PVMSmig, and PVMSmf to represent the loss of VM image files on the back-up server. The VMs are recreated on the back-up server as soon as it is recovered. VM migration between the two DCs is triggered when the two hosts in a DC enter downtime simultaneously. As an example, suppose that the two hosts H1 and H2 in DC1 are down at the same time. A number of VMs of DC1 are still stored in NAS1, represented by tokens in PVMS1m, and it is necessary to migrate these VMs onto the running DC2. The tokens are moved to PVMS12mig after a pre-migration process (TVMS12pre). The VM migration process is completed with an imperfect coverage factor Cmig when the transition TVMS12mig fires. If this migration process fails, with probability (1-Cmig), the tokens are moved to PVMS12migfo and returned to NAS1 in the original DC1 via TVMS12migrec.
The VM migration from DC2 to DC1 is performed similarly and is captured by the places PVMS21mig and PVMS21migf and the timed transitions TVMS21pre, TVMS21mig (with imperfect coverage factor Cmig), TVMS21migf (with probability 1-Cmig), and TVMS21migrec.
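The effect of an imperfect coverage factor can be approximated with a back-of-the-envelope calculation. The sketch below assumes that a failed migration is retried until it succeeds, that attempts are independent, and that mean times simply add up; none of this is stated in the model description. The function names and the 3-minute transfer time are hypothetical, whereas the coverage factors (0.95 and 0.85), the 5-minute pre-migration time, and the 1-minute return time are the default values from Table 6.

```python
def expected_attempts(coverage):
    """Mean number of attempts until success (geometric distribution)."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    return 1.0 / coverage


def expected_migration_minutes(coverage, pre_min, transfer_min, rec_min):
    """Mean time to complete a migration when failures are retried.

    Every attempt pays the pre-migration and transfer times; every failed
    attempt additionally pays the time to return the VM to its origin.
    """
    attempts = expected_attempts(coverage)
    failures = attempts - 1.0
    return attempts * (pre_min + transfer_min) + failures * rec_min


TRANSFER_MIN = 3.0  # hypothetical transfer time, for illustration only

for label, coverage in [("back-up server to DC (CBmig)", 0.95),
                        ("DC to DC (Cmig)", 0.85)]:
    minutes = expected_migration_minutes(coverage, 5.0, TRANSFER_MIN, 1.0)
    print(f"{label}: {expected_attempts(coverage):.3f} attempts, "
          f"{minutes:.2f} min on average")
```

A lower coverage factor lengthens the average migration both through repeated attempts and through the recovery step after each failure, which is the mechanism by which the coverage factors influence availability in the full model.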
5.3. Availability evaluation
The SRN system model is implemented in SPNP. Default input parameter values are shown in Table 6. We assume that the number of VMs on a host is only one in order to reduce complexity in model computation and analysis.
Steady state availability: We evaluate the availability of the DTDC in eight operational scenarios by varying the imperfect coverage factors of VM migration between the back-up server and the DCs, and the frequency of disaster occurrence, as follows: (I) a system of two standalone DCs without DT, facing disasters with a mean time to occurrence of 100 years (default value); (II) the system with default parameters; (III)-(V) the network connection has a high probability of failure (i.e., a low probability of success in VM migration processes) and the system is located in an area with the mean time to disaster occurrence set to 100, 200, and 300 years, respectively; (VI)-(VIII) in contrast to cases (III)-(V), migration between distant parts succeeds with high probability, and the mean time to disaster occurrence at the DC locations is again set to 100, 200, and 300 years, respectively. The results of the SSA and downtime evaluation are shown in Table 7 and support the following observations:
- Safer DC locations (longer mean time to disaster occurrence) result in a higher system SSA.
- DCs should be placed in isolated areas to avoid severe damage from disastrous events, even though the network connection between distant parts of the system may then experience more failures during VM migration processes.
- Higher SSA values are obtained with more reliable network connections, i.e., connections that guarantee a higher success rate for transmission between distant parts of the system.
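The derived metrics reported alongside SSA in Table 7 (number of nines, downtime per year, and downtime cost per year) follow directly from the SSA value. The sketch below shows the standard conversions; the function names are illustrative, the 365-day year is an assumption, and the default cost of US$ 7900 per minute is the average figure quoted in the introduction, which may differ from the rate used to produce Table 7. The SSA value in the example is hypothetical and is not taken from the table.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year


def number_of_nines(ssa):
    """Number of nines of availability, -log10(1 - SSA)."""
    if not 0.0 <= ssa < 1.0:
        raise ValueError("SSA must be in [0, 1)")
    return -math.log10(1.0 - ssa)


def downtime_minutes_per_year(ssa):
    """Expected downtime in minutes per year for a given SSA."""
    return (1.0 - ssa) * MINUTES_PER_YEAR


def downtime_cost_per_year(ssa, cost_per_minute=7900.0):
    """Expected yearly downtime cost in USD (assumed cost per minute)."""
    return downtime_minutes_per_year(ssa) * cost_per_minute


example_ssa = 0.9999  # hypothetical value, not a result from Table 7
print(f"nines:    {number_of_nines(example_ssa):.2f}")
print(f"downtime: {downtime_minutes_per_year(example_ssa):.2f} min/year")
print(f"cost:     {downtime_cost_per_year(example_ssa):,.0f} USD/year")
```

Because downtime scales linearly with (1 - SSA), each additional nine reduces the yearly downtime and its cost by a factor of ten.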
Sensitivity analysis: As shown in Figure 10, we analyzed the sensitivity of the system’s SSA with respect to different parameters, including the imperfect coverage factors of VM migration (CBmig and Cmig), the time to disaster occurrence (λDCoccur), the VM image size (SVM), and the network bandwidth (ωNET). The combined impact of SVM and ωNET is shown in Figure 10(f). The results show that: (i) the disaster tolerance solution with a back-up center improves SSA, even when the connections between the back-up center and the DCs make VM migration imperfect; (ii) increasing imperfection in the VM migration processes between DCs has only a slight impact on SSA; (iii) the system’s SSA improves considerably if the DCs are located in safe areas with a lower frequency of disaster occurrence; (iv) larger VMs can reduce the overall availability of the system; (v) a faster network connection between distant locations boosts the system’s availability, especially for network speeds in the range of 0-20 Mb/s, whereas increasing the speed much further yields results that differ little from those obtained with the default parameters; (vi) varying ωNET and SVM together confirms that a higher network speed and smaller VM sizes result in a clearly higher SSA, whereas a slower network and larger VMs severely reduce the system’s availability.
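The diminishing return of network speed noted in (v) can be illustrated with the time needed to transfer one VM image. This is a minimal sketch, assuming that the transfer time is simply the image size divided by the network speed and that 1 GB equals 1000 MB; how the SRN model actually derives its migration rates from SVM and ωNET is not specified here. The defaults of 4 GB and 20 MB/s follow the parameter table, which states the speed in MB/s.

```python
def transfer_seconds(image_gb, speed_mb_per_s, mb_per_gb=1000.0):
    """Time to move a VM image of `image_gb` at `speed_mb_per_s` (assumed model)."""
    if image_gb < 0 or speed_mb_per_s <= 0:
        raise ValueError("size must be non-negative and speed positive")
    return image_gb * mb_per_gb / speed_mb_per_s


DEFAULT_SIZE_GB = 4.0      # SVM from the parameter table
DEFAULT_SPEED_MB_S = 20.0  # network speed from the parameter table

# Sweep the network speed around its default value.
for speed in (5.0, 10.0, 20.0, 40.0, 80.0):
    seconds = transfer_seconds(DEFAULT_SIZE_GB, speed)
    print(f"{speed:5.1f} MB/s -> {seconds:6.1f} s")

# Sweep the image size at the default speed.
for size in (1.0, 2.0, 4.0, 8.0):
    seconds = transfer_seconds(size, DEFAULT_SPEED_MB_S)
    print(f"{size:4.1f} GB  -> {seconds:6.1f} s")
```

Under this assumption, doubling the speed from 5 to 10 MB/s saves 400 s per image, while doubling it from 40 to 80 MB/s saves only 50 s, which is consistent with the flattening of the SSA curve at higher speeds.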
|Parameter|Description|Transitions|Value (mean time unless noted)|
|---|---|---|---|
|λHf|Host failure rate|TH1f, TH2f, TH3f, TH4f|800 hours|
|μHr|Host recovery rate|TH1r, TH2r, TH3r, TH4r|9.8 hours|
|λNASf|NAS failure rate|TNAS1f, TNAS2f|45 years|
|μNASr|NAS recovery rate|TNAS1r, TNAS2r|4 hours|
|λDCoccur|Time to disaster occurrence at a DC|TDC1occur, TDC2occur|100 years|
|μDCr|DC recovery rate after a disaster|TDC1r, TDC2r|1 year|
|λBf|Backup DC failure rate|TBf|50,000 hours|
|μBr|Backup DC recovery rate|TBr|30 min|
|λVMfail|VM failure rate|TVM1fail, TVM2fail, TVM3fail, TVM4fail|4 months|
|μVMrepair|VM repair rate|TVM1repair, TVM2repair, TVM3repair, TVM4repair|30 min|
|δVMsync|VM synchronization rate|TVM1sync, TVM2sync, TVM3sync, TVM4sync|5 min|
|ωVMmigrate|VM migration rate between hosts|TVM1migrate, TVM2migrate, TVM3migrate, TVM4migrate|5 s|
|γVMSmin|VM loading rate into a host|TVMS1min1, TVMS1min2, TVMS2min3, TVMS2min4|1 s|
|ηVMSpre|VM pre-migration rate between DCs and backup server|TVMS12pre, TVMS21pre, TVMS1pre, TVMS2pre|5 min|
|θVMSmigrec|VM return rate to NAS after a migration failure|TVMS12migrec, TVMS21migrec|1 min|
|θVMSmfsync|VM synchronization rate with backup DC after a migration failure|TVMS1mfsync, TVMS2mfsync|1 min|
|CBmig|Imperfect factor of VM migration from backup DC|-|0.95|
|Cmig|Imperfect factor of VM migration between DCs|-|0.85|
|N|Number of VMs in a host|-|1|
|SVM|Size of VM image and related data|-|4 GB|
|ωNET|Network speed|-|20 MB/s|
|Case||CBmig||Cmig||λDCoccur||SSA||No. of nines||Downtime (min/year)||Downtime cost (USD/year)|
This chapter presented a set of availability models based on stochastic reward nets for comprehensive availability evaluation of data center systems. The scale of the evaluated systems was increased from a system of two virtualized servers (considered as a unit block in data centers) in Section 3, to a typical network of virtualized servers complying with a DCell topology in Section 4, and finally to a two-site data center for disaster tolerance with a back-up center. A variety of fault and disaster tolerance techniques were incorporated in the systems in order to achieve high availability. The systems were evaluated under various case studies with regard to different metrics of interest, including steady state availability and its sensitivity with respect to a number of impact factors. The analysis results characterize the behavior of each system and show how the incorporated techniques improve its availability.
This research was supported by the Ministry of Science and ICT (MSIT), Korea, under the Information Technology Research Center (ITRC) support program (IITP-2018-2016-0-00465) supervised by the Institute for Information & communications Technology Promotion (IITP).