Open access peer-reviewed chapter

Condition-Based Maintenance for Data Center Operations Management

Written By

Montri Wiboonrat

Submitted: 30 May 2020 Reviewed: 09 September 2020 Published: 26 October 2020

DOI: 10.5772/intechopen.93945

From the Edited Volume

Operations Management - Emerging Trend in the Digital Era

Edited by Antonella Petrillo, Fabio De Felice, Germano Lambert-Torres and Erik Bonaldi


Abstract

This chapter presents data center operations management through four case studies of power distribution systems (PDS) for data centers of Tier I, Tier II, Tier III, and Tier IV. The four PDS topologies are defined by their single points of failure and their redundant equipment and systems. The concepts of Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR) are applied during PDS design to reduce system downtime, and both measures are used to estimate the system availability of each Tier classification. Human factors are treated as a critical part of data center operations and need to be quantified and qualified in terms of knowledge and skills, for example through certification levels. For sustainable data center operations, Data Center Infrastructure Management (DCIM) software is deployed to monitor and control the entire operation, interacting with the system of systems and with human interfaces, and applying condition-based maintenance (CBM) for both preventive and predictive purposes. Moreover, CBM delivers long-term savings in total cost of ownership (TCO) and improves energy efficiency.

Keywords

  • data center
  • condition-based maintenance
  • DCIM
  • power distribution systems
  • operations management

1. Introduction

Any system fault in a data center degrades total system performance relative to the minimum requirements of the system specification. A fault may arise from many causes, such as design errors, erroneous installation, machine malfunctions, defective devices, misoperation, human error, operation outside rated conditions, or a combination of these incidents. If the error is not detected and corrected in a timely manner, a system failure may follow. Most data center downtime results from cascading failures that propagate from devices to sub-systems and then to the whole system. Consequently, preventive and predictive mechanisms probe for errors before they become failures. In best-practice data center operations, purely corrective maintenance is not acceptable: when the New York Stock Exchange went down for almost four hours in 2015 after a failed upgrade, the consequences were estimated at no less than $2.5 million per hour. Data center downtime is not only costly in financial compensation but also damages reputation in ways that sometimes cannot be valued. The Ponemon Institute [1] reports that the average total cost of an unplanned data center outage soared by 38 percent, from $505,502 in 2010 to $740,357 in 2016. To avoid these downtime costs, operators need to deploy more intensive training and operating procedures, modern maintenance strategies, and experienced data center operators.

Downtime costs are part of operating expenditure (OPEX) and may be compounded by lawsuit or penalty costs resulting from an incident. Such legal penalties can be avoided through a PPM approach, which can be regarded as an insurance investment that helps reduce TCO in long-term operations. TCO comprises the sum of the operational and capital expenses involved in building and maintaining a data center. The PPM approach not only protects against downtime costs but also prevents reputational costs to the company that may be impossible to estimate.

The traditional approach to avoiding downtime is time-based maintenance (TBM). The maintenance team plans maintenance or system upgrades, monitoring and acting on a schedule of weeks, months, or years based on the supplier's recommendations. TBM tries to prevent system downtime by following these maintenance schedules: inspections are regular, deployment is easy, no condition monitoring is needed, and the decision variable is maintenance age or MTBF, with maintenance performed when the device reaches its MTBF. The condition-based maintenance (CBM) approach, in contrast, relies on online/offline data collection and continuous measurement of the condition of devices or systems while they are running. Sensor devices and tools gather information that populates a database for trend analysis, prediction, and estimation of the remaining useful life (RUL) of a device or system. CBM takes action when a measured condition crosses the point at which system performance is directly degrading or failure becomes likely. A prognostic approach based on online performance monitoring is needed throughout the degradation process, from system design and installation through operations and up to system failure. This differs from the scheduled intervals recommended with preventive maintenance.

Since the start of the 21st century, technological advancement has made a data-driven approach to PDS both predictable and precise. For this reason, many data center outages can be avoided or mitigated with proper maintenance approaches and sensing technologies. Predictive maintenance complements preventive maintenance: it focuses on the working condition of the device and tracks the operating environment before a breakdown happens. With an online condition monitoring system, predictive maintenance takes action when the deterioration reaches level M (decision variable: M, the threshold deterioration level).
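
To make the contrast concrete, the short sketch below places a TBM trigger (elapsed time against a fixed interval) next to a CBM trigger (a measured deterioration level against the threshold M). The interval, threshold, and readings are illustrative assumptions, not values from the case studies.

```python
# Illustrative only: a TBM trigger fires on elapsed calendar time, while a CBM
# trigger fires when the monitored deterioration level reaches the threshold M.
TBM_INTERVAL_HOURS = 4380   # assumed six-month maintenance schedule
M_THRESHOLD = 0.70          # assumed deterioration level M (0 = new, 1 = failed)

def tbm_due(hours_since_last_maintenance: float) -> bool:
    """Time-based maintenance: act when the scheduled interval has elapsed."""
    return hours_since_last_maintenance >= TBM_INTERVAL_HOURS

def cbm_due(deterioration_level: float) -> bool:
    """Condition-based maintenance: act when the measured condition reaches M."""
    return deterioration_level >= M_THRESHOLD

print(tbm_due(3000))   # False: the schedule has not yet been reached
print(cbm_due(0.75))   # True: the condition has crossed the threshold M
```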

In this research, the author proposes preventive and predictive maintenance (PPM), which adopts CBM as the systematic strategy for data center operations and maintenance. Use cases of data center PDS were examined to ensure their proper functionality and to reduce their deterioration rate. The PPM approach can ensure that devices, sub-systems, and systems operate safely, maintain their functional reliability and efficiency, reduce failure rates, and prevent unscheduled downtime.


2. Background

2.1 Preventive and predictive maintenance

Preventive maintenance refers to regular maintenance, or TBM, that keeps devices and systems up and operating in normal condition, prevents unplanned downtime and the uneconomical costs of unpredicted system failure, and preserves operating efficiency and effectiveness.

CBM can be understood as predictive maintenance. It is a useful strategic mechanism for preventive maintenance that combines monitoring and control of the conditions and parameters of critical devices and equipment. The process is run in order to predict device failure, assess the RUL, and avoid the system risks that could arise if minimum conditions are exceeded. This strategy yields economic savings over lesson-based observation or time-based preventive maintenance, because effort is expended only when warranted.

The RUL is defined based on the maintenance policy for a single-unit deteriorating system in which all conditions are continuously monitored, applying A-B-C analysis to device criticality built on early, successful diagnostics. The A-B-C analysis diagnoses and categorizes system maintenance into three groups: reactive maintenance with excessive repairs and failures; proactive maintenance; and excessive PM with no failures and no repairs [2], as presented in Figure 1 (a small illustrative grouping sketch follows the figure).

Figure 1.

Total maintenance related costs.
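
As a rough illustration of the A-B-C grouping described above, the sketch below sorts devices into the three maintenance groups from their recorded failure and repair counts. The cut-off counts are assumptions made for illustration only, not figures from the chapter.

```python
# Hypothetical A-B-C grouping of devices by recorded failures and repairs.
# The cut-off counts below are illustrative assumptions only.
def abc_group(failures: int, repairs: int) -> str:
    if failures > 2 or repairs > 5:
        return "A: reactive maintenance - excessive repairs and failures"
    if failures == 0 and repairs == 0:
        return "C: excessive PM - no failures and no repairs"
    return "B: proactive maintenance"

print(abc_group(failures=4, repairs=7))   # group A
print(abc_group(failures=1, repairs=2))   # group B
print(abc_group(failures=0, repairs=0))   # group C
```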

2.2 Condition-based maintenance

CBM serves as the predictive maintenance strategy, executing device or system maintenance based on configured conditions, performance and parameter monitoring, and the subsequent actions taken before a device or system failure occurs. CBM is a maintenance pattern in which maintenance decisions are based on the data and information collected by the condition monitoring processes. During operation, CBM runs as a monitoring appliance using sensing devices that gauge parameters across various monitored attributes, for example temperature, humidity, vibration, noise levels, contaminants, CO2 and CO concentrations, and lubricating oil concentration. The usefulness of CBM lies in the condition monitoring process, in which signals and data are monitored online through many types of sensors over wired and wireless technologies. The core of CBM is a real-time assessment of device and system conditions, analyzing all data to support decision analysis for maintenance conditions and solutions, while reducing planned and unplanned downtime, eliminating unnecessary maintenance, and cutting related costs. Maintenance activities are therefore required only when the decision analysis calls for them, such as repairs or replacements before the failure [3].

There are various techniques and technologies for data collection, processing, diagnostics, and prognostics in performing CBM across system operations. Lee (1998) [4] describes the CBM strategic approach in three scenarios: data-driven, model-based, and knowledge-based.

First, the data-driven scenario applies historical and statistical data to build a numerical model of systemic determinants such as mean time between failures (MTBF), mean time to repair (MTTR), and maximum tolerable period of disruption (MTPD) [5]. However, this scenario depends on the accuracy of the sensing devices, the operational data, the data interpretation, and the perceived condition under stressful situations.

Second, the model-based scenario deploys an analytical algorithm, such as simulation modeling, to demonstrate system reliability, system degradation, and system efficiency. This model-based approach usually requires high-level application software for the simulated models, such as MATLAB or a reliability block diagram (RBD).

Last, the knowledge-based scenario depends on human experience, applying past real-case analysis or deriving data from past project information related to data collection, gathering, analysis, decision, and execution. It is a systematic application of engineering knowledge and maintenance attention to system facilities to guarantee their proper function and reduce their deterioration rate. In the future, the knowledge-based scenario may also be performed through machine learning or AI.

CBM approaches provide a load or trend profile for the earliest possible prediction of device or system failure, with the benefits of reduced maintenance time, labor and inventory costs, eliminated downtime, increased device or system life, and reduced capital expenditure. The P-F curve in Figure 2 depicts the performance condition of a device or system, which declines over time; this decline leads to potential failure and then functional failure. A CBM system provides online monitoring, control, and inspection that yields the longest P-F intervals, which are far less disruptive than traditional TBM, and helps the inspector plan downtime inspections. Because the inspection process and routine are defined over different lengths of time, they create the utility of the P-F interval. Avoiding off-line inspections, which are a frequent cause of data center downtime and reputational damage, makes CBM methods economically attractive. The most commonly applied CBM monitoring techniques are:

  • Lubricant Sampling and Analysis

  • Corrosion Monitoring

  • Motor Current Analysis

  • Acoustic Emissions Detection (e.g., ultrasound)

  • Vibration Measurement and Analysis

  • IR Thermography

  • Process Parameter Trending (e.g., flows, rates, pressures, temperatures, etc.)

  • Process Control Instrumentation (measurement and trending)

  • Visual Inspection (look, listen and feel).

Figure 2.

Optimization the P-F interval under CBM method [6].

2.3 Data center reliability

Data center reliability is reinforced by creating redundant topology for each system, such as utility supplies, backup power supplies (generators and UPSs), fiber optic communication connections, network connectivity, environmental controls, and security devices. The report from Emerson [7], presented in Figure 3, describes the critical devices most often related to system failure. The top three ranked incidents are UPS battery failure, UPS over-capacity, and human error.

Figure 3.

Root causes and failure analysis inside data center operations.

In the prognostics method, the condition monitoring process can be performed either continuously or periodically. Sensing devices and data collection systems may be required for continuous monitoring through DCIM [8, 9]. Figure 4 demonstrates graphically how the prognostics method performs (a short numerical sketch follows the figure). The deterioration trend of the device condition is represented on the horizontal and vertical axes, which show the operating time, trend monitoring, condition levels, and forecast point. The failure limit line marks the boundary between the operating and failure zones. If the forecast trend line reaches or exceeds the failure limit, appropriate maintenance can be planned and scheduled ahead of the forecast point [10]. The ability to predict the future deterioration trend is the core of the prognostics method within the preventive maintenance strategy.

Figure 4.

The principle of the prognostics method.
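
A minimal sketch of the prognostics idea in Figure 4: fit a linear trend to recent condition readings and extrapolate it to the failure limit to estimate when maintenance should be scheduled. The readings and failure limit below are assumed values used only to illustrate the calculation.

```python
# Minimal prognostics sketch: extrapolate a linear deterioration trend to the
# failure limit to estimate remaining useful life (RUL). All values are assumed.
import numpy as np

hours = np.array([0, 500, 1000, 1500, 2000], dtype=float)   # operating time
condition = np.array([0.05, 0.12, 0.22, 0.30, 0.41])        # monitored deterioration
FAILURE_LIMIT = 0.80                                         # assumed failure limit line

slope, intercept = np.polyfit(hours, condition, deg=1)       # linear trend fit
rul_hours = (FAILURE_LIMIT - condition[-1]) / slope          # hours until the limit
print(f"Estimated RUL: {rul_hours:.0f} operating hours")
# Maintenance would be planned before this forecast point is reached.
```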

PPM can be defined as a strategic approach to improving the availability and reliability of a particular data center device or system. CBM is one type of PPM that extrapolates and predicts device or system condition over time, using probability equations to assess and predict downtime risks.

How can these causes of data center failure be prevented? First, redundant system design is the primary solution to prevent primary failures, while selecting devices and systems with the highest MTBF is another strong option. The Uptime Tier Classification [11] and BICSI-002 [12] classify the solutions that protect against the causes of failure. Figure 5 presents the Uptime levels of prevention, where Tier 4 is the highest and Tier 1 the lowest level of system protection, while Table 1 presents the BICSI 002 levels, where Class F0 is the lowest and Class F4 the highest. The annually allowable planned maintenance is a crucial factor in preventing data center downtime. To reinforce system reliability, Class F4 and Class F3 are designed with PDS reliability of 2(N + 1) and 2N or N + 1 topology respectively, which makes CBM more robust by tolerating maintenance operations with minimal downtime effect on the entire system (the downtime hours implied by each availability figure are sketched after Table 1).

Figure 5.

Uptime data center tier classification.

System/class | Class F0 | Class F1 | Class F2 | Class F3 | Class F4
Description | Single path without any one of the following: alternative power source, UPS, proper IT grounding | Single path | Redundant component/single path | Concurrently maintainable and operable | Fault tolerant
Utility | Single feed | Single feed | Single feed | 1 source with 2 inputs, or 1 source with a single input electrically diverse from the backup generator input | Dual feed from different utility substations
Topology | N or <N | N | N + 1 | N + 1 | 2N, 2(N + 1)
Redundancy | No requirement | N | N | N + 1 | Greater than N + 1
Generator fuel run time | No requirement | 8 hrs. | 24 hrs. | 72 hrs. | 96 hrs.
Impact of downtime | Sub-local | Local | Regional | Multi-regional | Enterprise wide
Annual allowable planned maintenance (hours) | >400 | 100–400 | 50–99 | 0–49 | 0
Availability (%) | <99.00 | 99.00–99.90 | 99.90–99.99 | 99.99–99.999 | 99.999–99.9999

Table 1.

BICSI 002 system reliability classification.
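
The availability percentages in Table 1 translate directly into allowable downtime per year. The short calculation below reproduces that conversion for the lower bound of each class, assuming 8,760 hours in a year; it is a sketch, not part of the BICSI standard.

```python
# Convert an availability percentage into expected downtime hours per year.
HOURS_PER_YEAR = 8760

def annual_downtime_hours(availability_percent: float) -> float:
    return (1 - availability_percent / 100) * HOURS_PER_YEAR

for cls, avail in [("F1", 99.00), ("F2", 99.90), ("F3", 99.99), ("F4", 99.999)]:
    print(f"Class {cls}: {annual_downtime_hours(avail):.2f} h/year of downtime")
# Class F1: 87.60, F2: 8.76, F3: 0.88, F4: 0.09 hours per year
```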

Second, one must understand the consequences of device and system protection in the power distribution system. The failure mitigation map illustrates, for each primary failure, the extent to which that failure is mitigated by functional redundancy (or another design consideration) to prevent it from acting as a single point of failure [13]. A protective design for data center system reliability can be classified into three stages, corresponding to the sources of power protection, as demonstrated in Figure 6.

Figure 6.

Condition failure mode of power distribution systems in data center.

Stage 1: Under normal conditions, the data center operates on the power utility sources as primary power.

Stage 2: Under a utility outage of short duration, from less than 0.5 millisecond up to 15 seconds, a UPS with a flywheel system can carry the critical IT loads immediately, while a UPS with battery systems continues to protect the critical IT equipment after the flywheel has discharged, typically within 30 seconds. The designed battery capacity depends on the critical IT applications and equipment needs; designers or consultants usually size it for 15 to 30 minutes. This information must be given to the IT team and the data center consultant so that the predicted solution for the critical loads can be calculated [14].

Stage 3: While Stage 2 is in operation, the generator starts after detecting the utility outage, within 12–15 seconds. If the utility has not recovered to normal operation, then once the generator control sensor has confirmed the outage for 15 seconds, the standby power system is ready to take over the load from the Stage 2 UPSs.
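
A small sketch of the Stage 2 to Stage 3 handover: the UPS ride-through time (flywheel plus battery) must exceed the generator start and transfer window, otherwise the critical load is exposed. The timing values follow the figures quoted in the text; the comparison itself is an illustrative check, not the chapter's design tool.

```python
# Sketch: verify that UPS ride-through covers the generator start/transfer window.
# Timing figures follow the stages described above; treat them as nominal values.
FLYWHEEL_RIDE_THROUGH_S = 30           # flywheel discharge window (seconds)
BATTERY_RIDE_THROUGH_S = 15 * 60       # battery autonomy sized for 15-30 minutes
GENERATOR_START_S = 15                 # generator confirms outage and takes the load

ups_ride_through = FLYWHEEL_RIDE_THROUGH_S + BATTERY_RIDE_THROUGH_S
if ups_ride_through > GENERATOR_START_S:
    print(f"OK: {ups_ride_through} s of UPS autonomy covers the "
          f"{GENERATOR_START_S} s generator start window")
else:
    print("Risk: critical IT load exposed before generator takeover")
```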

Last, the power distribution system of the data center is designed to isolate and divide CBM into five zones: Zone 0, Zone I, Zone II, Zone III, and Zone IV, as presented in Figure 7:

Figure 7.

Zone preventive approach for CBM.

Zone 0: Utilities (2N) preventive approach; CBM can be applied through the utility service level agreement (SLA) and remote monitoring and control.

Zone I: Generators 2(N + 1) preventive approach; CBM can be applied through DCIM software and the main contractor SLA or third-party SLA contracts.

Zone II: UPSs 2(N + 1) preventive approach; CBM can be applied through DCIM software and the main contractor SLA or third-party SLA contracts.

Zone III: Dual power paths (2N) preventive approach; CBM can be applied through DCIM software and the main contractor SLA or third-party SLA contracts.

Zone IV: Load shedding preventive approach; CBM can be applied through DCIM software and in-house training to handle load shedding (within 10 minutes), plus the main contractor SLA or third-party SLA contracts.


3. Research methodology

The power distribution system (PDS) of the data center was examined through case studies for this research: the four Uptime topology prototypes (Tier I, Tier II, Tier III, and Tier IV) and the five BICSI topology prototypes (Class F0, F1, F2, F3, and F4), demonstrating operations and maintenance management. Plan-Do-Check-Act (PDCA) was applied through the PPM model. Each cycle of this process accumulates more data than the earlier cycles, and at the same time the process validates training data for fault diagnostics and prognostics. Fault diagnostics are performed through auto-discovery in the DCIM software. StruxureWare [15] was deployed for auto-discovery, subject to its ability to detect a device, model it, and measure the relevant data points of that equipment. The PPM approach was examined through a system flow diagram (SFD), as depicted in Figure 8.

Figure 8.

PPM system flow diagram of data center operations management.

The SFD begins with data collection from sensing devices in the condition monitoring state, followed by data processing and analytics, and feature selection to form a statistical model before passing through fault diagnostics and prognostics. The output of the prognostic process constructs a data set used to estimate the RUL, which in turn is input data for predictive maintenance [16]. Predictive maintenance and CBM are synchronized processes that share the same data set from the RUL, and the data set is looped back to the start of data collection and condition monitoring as a continuous Plan-Do-Check-Act (PDCA) process. In the first round, PPM produces the data set for the CBM database, and subsequent rounds generate training data for fault diagnostics, prognostics, and predictive maintenance. CBM can be leveraged as the strategic approach to guarantee the availability of the entire PDS of the data center by monitoring from the device level, including transformers, generators, transfer switches, breakers and switches, UPSs, batteries, PDUs, and PSUs. CBM operates as a recursive function over the data collection process.
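
A minimal sketch of the SFD loop in Figure 8, under simplifying assumptions: a toy deterioration signal stands in for the monitored condition, and each stage is reduced to a small function. None of the names or values below come from the chapter's software; they only illustrate the PDCA loopback of the data set.

```python
# Minimal sketch of the PPM loop in Figure 8 (PDCA over condition data).
# All values and stage functions are illustrative assumptions, not the chapter's tooling.
import random

FAILURE_LIMIT = 1.0
DETERIORATION_PER_CYCLE = 0.12

def collect_condition(level: float) -> float:
    """Condition monitoring: read the (noisy) deterioration level from a sensor."""
    return level + random.uniform(-0.01, 0.01)

def prognose_rul(reading: float) -> float:
    """Prognostics: cycles remaining until the failure limit at the current rate."""
    return max((FAILURE_LIMIT - reading) / DETERIORATION_PER_CYCLE, 0)

cbm_database = []            # grows every cycle and feeds the next round (PDCA loopback)
level = 0.0
for cycle in range(1, 11):
    level += DETERIORATION_PER_CYCLE
    reading = collect_condition(level)           # Plan / Do: collect and process
    rul = prognose_rul(reading)                  # Check: diagnostics and prognostics
    if rul < 3:                                  # Act: predictive maintenance decision
        print(f"cycle {cycle}: schedule maintenance (RUL ~ {rul:.1f} cycles)")
        level = 0.0                              # device restored after maintenance
    cbm_database.append((cycle, reading, rul))   # data set looped back as training data
```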

The PDS of a Tier IV data center was chosen as the maintenance model management (MMM) case for constructing the CBM of the PDS, as illustrated in the single line diagram of Figure 9. The data for the critical devices and systems simulated in the MMM are derived from the IEEE 493 Gold Book [17] and the author's earlier research models [18, 19].

Figure 9.

Single line diagram of PDS of datacenter tier IV.

The device and system list in Table 2 presents the quantified characteristics: units per year, number of failures, failure rate per year, MTBF, and MTTR. The list of power devices in Table 2 (active and supporting distribution paths) concentrates on the online monitoring data required as input for CBM and for the prognostic process that estimates the RUL [20] (a short availability calculation based on these figures follows the table).

Category | Class | Unit/year | Failures | Failure rate (failures/year) | MTBF (hrs.) | MTTR (hrs.)
E38-113 | Transformer, dry, air cooled, >1500 kVA <=3000 kVA | 840.20 | 0.00 | 0.00 | 14,432,242.40 | 0.00
E36-230 | Switchgear, insulated bus, >5 kV, all cabinets, ckt. bkrs. not included | 732.50 | 3.0 | 0.00 | 2,139,024.00 | 37.33
E34-110 | Switch, automatic transfer, >600 A | 690.30 | 22.00 | 0.03 | 274,853.50 | 1.64
E18-121 | Diesel engine generator, packaged, 250 kW to 1.5 MW, continuous | 266.00 | 115.00 | 0.58 | 15,033.80 | 25.74
E39-200 | UPS, small computer room floor | 426.40 | 4.00 | 0.01 | 933,708.00 | 2.00
E2-120 | Battery, lead acid, strings | 3215.30 | 24.00 | 0.01 | 1,173,590.30 | 32.13
E36-210 | Switchgear, insulated bus, <=600 V, all cabinets, ckt. bkrs. not included | 322.70 | 0.00 | 0.00158 | 5,543,247.10 | 0.00

Table 2.

IEEE 493 active equipment MTBF.
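
Using the MTBF and MTTR figures in Table 2, the inherent availability of a single device can be estimated as A = MTBF / (MTBF + MTTR). The sketch below applies this to three of the listed categories; it is a per-device simplification that ignores the 2(N + 1) redundancy of the Tier IV topology.

```python
# Estimate single-device availability from Table 2: A = MTBF / (MTBF + MTTR).
# Per-device figure only; redundancy in the Tier IV topology is not modeled here.
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

devices = {
    "Automatic transfer switch (E34-110)": (274_853.5, 1.64),
    "Diesel engine generator (E18-121)":   (15_033.8, 25.74),
    "UPS, small computer room (E39-200)":  (933_708.0, 2.00),
}
for name, (mtbf, mttr) in devices.items():
    print(f"{name}: A = {availability(mtbf, mttr) * 100:.5f}%")
```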

The power reliability assessment of the PDS needs to measure the overall status of the PDS devices and systems of the data center, which comprise the following [21]:

  • Transformer

  • Entrance switchgear

  • Automatic transfer switch (ATS)

  • Diesel generator

  • Uninterruptible power supply (UPS)

  • Lead-acid batteries

  • Distribution switchgear

  • Power distribution unit (PDU)

  • Rack-Power supply unit (PSU)

The capacity analysis of the power systems was investigated to diagnose and analyze all the power devices and systems listed above. The MMM is designed to perform PPM for the PDS of the Tier IV data center. For all critical devices, MTBF and MTTR data sets were derived from IEEE 493 [17] for each category, as represented in Table 2. This method defines the set points on the P-F curve according to the point where failure starts to occur and the point where operators can detect that a device or system has reached potential failure; because CBM moves point P (potential failure) to the earliest possible time, the condition is to maximize the P-F interval [22].


4. Preventive and predictive maintenance

For data center operations and maintenance under PPM, online condition monitoring systems deployed through DCIM software are the best scenario. A DCIM design for the PDS is an option for reducing long-term operating costs and complexity. An efficient DCIM evolves toward automated processes as the critical success factor for minimizing downtime. Through DCIM self-diagnosis, PDS devices and systems can be tracked for age, operating hours, working status, warning alarms, MTBF, MTTR, and the last modification or upgrade, by whom and when.

In this study, the author installed StruxureWare [15], a DCIM software package from Schneider Electric, as the sensing instrument for data collection. StruxureWare provides points of online data collection by measuring all values at set points on the devices or systems, as shown in Figure 10. These data are verified online and in real time against predetermined data from the CBM database, which defines the critical levels as the basic criteria. Control levels (set before the critical level) are normally used to raise automatic warnings before a system shutdown. The type of automatic warning depends on the severity of the consequences of the cascading failure, and a configurable process sends warning messages to individual mobile devices or e-mail. The foundation of StruxureWare is the transducers, sensors, networking, and intelligent electronic devices (IED) that collect data throughout the PDS of the data center [23].
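
A simplified sketch of the control-level versus critical-level logic described above, with hypothetical set points and a placeholder notification function; it is not the StruxureWare API.

```python
# Sketch of control-level vs. critical-level alarms on a monitored value.
# Set points and the notify() helper are hypothetical; this is not the StruxureWare API.
CONTROL_LEVEL = 0.80      # warn here, before the critical level is reached
CRITICAL_LEVEL = 0.95     # highest-severity alarm, ahead of shutdown

def notify(channel: str, message: str) -> None:
    print(f"[{channel}] {message}")       # stand-in for e-mail / mobile notification

def evaluate(measured: float, rated: float) -> None:
    load = measured / rated
    if load >= CRITICAL_LEVEL:
        notify("mobile", f"CRITICAL: load at {load:.0%} of rating")
    elif load >= CONTROL_LEVEL:
        notify("e-mail", f"Warning: load at {load:.0%} of rating")

evaluate(measured=3.9, rated=4.0)   # critical alarm
evaluate(measured=3.3, rated=4.0)   # warning only
```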

Figure 10.

Data collection from PDS of data center tier IV.

4.1 CBM model for StruxureWare (DCIM)

Tracking the increasing probability of future failure of a device or system is the primary function of CBM. Extrapolating and predicting system condition over time helps to identify particular devices that could have defects requiring repair. A CBM method also diagnoses, through statistics and data, which devices or systems are most likely to remain in acceptable condition without requiring maintenance.

The Uptime Institute [11] and BICSI [12] define data center Tier IV and Class F4 as the standard design for data center site availability of 99.995 percent. Investigating the system reliability of the PDS of such a data center is the objective of this research model. The author designed 12 sensor points with installed IED devices as data collection points throughout the PDS of the data center [24]. StruxureWare was installed and the CBM concept applied to verify the PDS of the data center in a single one-line diagram. Each device and system differs in its electrical and mechanical design purpose, so each needs a different location for installing and collecting data at the level of physical contact. All IED data collection must be measured as both instantaneous values and trends of all electrical statuses, such as voltage, amperage, phase, and total harmonic distortion (THD); mechanical statuses, such as alarms, vibration, noise, temperature, leakage, and oil level; and other statuses, such as equipment aging, run time, failure history, degradation percentage, and abnormal events [25], as presented in Figure 10.

This CBM design aims to extend the P-F interval. StruxureWare shows the data collected from the last point in the critical application server zone, the rack PSU, as depicted in Figure 11.

Figure 11.

Data collection from rack PSU of data center tier IV by StruxureWare.

This helps the data center administrator to see the current power conditions when comparing the PSU loads (left PSU 0.5 kW and right PSU 1.3 kW) with the maximum power capacity of each rack (4 kW), together with voltage, amperage, frequency, phase balance, rack temperature, available rack space, and the time of the last audit. Moreover, monitoring from the device level up, from the PSUs of each server, helps discover idle servers that are quietly draining power and taking up space.
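
The rack-level comparison above amounts to a simple headroom calculation, sketched below with the figures quoted in the text (0.5 kW and 1.3 kW against a 4 kW rack capacity).

```python
# Rack power headroom from the PSU readings quoted above.
left_psu_kw, right_psu_kw, rack_capacity_kw = 0.5, 1.3, 4.0

used_kw = left_psu_kw + right_psu_kw
headroom_kw = rack_capacity_kw - used_kw
print(f"Rack load: {used_kw:.1f} kW of {rack_capacity_kw:.1f} kW "
      f"({used_kw / rack_capacity_kw:.0%}), headroom {headroom_kw:.1f} kW")
# Rack load: 1.8 kW of 4.0 kW (45%), headroom 2.2 kW
```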

The research presuppositions are:

  1. If the failure occurs after device aging or MTBF, StruxureWare detects it, and the administrator team can repair it before component failure, then system failure cannot occur.

  2. If the failure occurs before device aging or MTBF, StruxureWare detects it, and the administrator team can repair it before component failure, then system failure cannot occur.

  3. Replacement of parts, changing lubrication, or changing spare parts can be executed during operations, per the supplier's recommendations for critical devices, without interrupting system operations (concurrent maintenance).

  4. Extending the life of non-critical devices yields benefits when point P (potential failure) is moved to the earliest possible time, maximizing the P-F interval before functional failure.

4.2 Value and status of data collections

4.2.1 Value and status from condition monitoring systems

Field data collection is the beginning of the CBM process. For the single line diagram of the PDS of the data center, 12 pieces of equipment were appointed for StruxureWare installation, with set-point values as specified in Table 2 and status monitoring as specified in Table 3.

Components/systems Infrared thermography Precise timing and trending Visual inspection Insulation resistance Motor circuit analysis Polarization index/dissipation factor Cable condition monitoring Oil and gas levels Vibration monitoring Lubricant analysis Wear particle analysis Bearing temperature analysis Leakage detection Performance monitoring Ultrasonic monitoring
1 Transformer
2 Entrance switchgear
3 Automatic transfer switch (ATS)
4 Diesel generator
5 Uninterruptable power supply (UPS)
6 Leaded acid batteries
7 Distribution switchgear
8 Power distribution unit (PDU)
9 Rack-PDU

Table 3.

Values and status of data collection from condition monitoring systems.

The maintenance set-point values at the beginning are taken from IEEE 493 MTBF data plus the P-F interval condition. Device status conditions mostly come from the supplier's maintenance data sheets. Both data collection sources feed StruxureWare, which handles the subsequent condition monitoring and data collection, and data and signal processing. The DCIM executes feature selection according to the operator's requirements and creates the statistical model for fault diagnostics and prognostics used to calculate the RUL. All collected data are then fed through the predictive maintenance function to set the new values and statuses for the start of the next condition monitoring cycle, following the PDCA process, as represented in Table 3. During almost 12 months of data collection with StruxureWare and the PPM model, there was no blackout in the PDS of the Tier IV data center. No blackout does not imply that no device or system failed: the Tier IV topology is designed with full 2(N + 1) redundancy, so some devices or systems can fail while the others keep operating without system interruption. StruxureWare can detect and report such events so that the administrator team can repair them within the MTTR condition. Because data center Tier III is designed as a 2N topology and Tier IV as 2(N + 1), more fault tolerance to device and system failure is allowed. System warnings occurred a few times, but the data center administrators could fix the problems using the warning instructions from the StruxureWare monitoring guides. StruxureWare is designed to make it easy to understand and predict any device or system failure and resolve it before the failure occurs, which implies that CBM helps decrease planned and unplanned downtime, labor hours, and spare parts inventory, while increasing system productivity. Moreover, CBM supports provisioning and early warning for all device and system failure functions; StruxureWare can control inventory levels much more effectively, without the need for as many emergency spare parts [26].

4.2.2 Value and status from idle servers

An idle server is a physical server that is still running but performing no computing work or transaction processing; it consumes power while serving no useful purpose. An Uptime Institute survey reports that around 30 percent of global data center servers are either underutilized or completely idle. Such a server can consume an impressive 175 watts in idle mode. A survey of server PSUs [27] reports the range of PSU efficiency as a function of load, as illustrated in Figure 12.

Figure 12.

Power supply efficiency.

In the red zone, where the PSU load is below 20 percent, efficiency drops off precipitously. In the yellow zone, 20–40 percent, PSU efficiency begins to drop but typically exceeds 70 percent. In the green zone, the PSU operates above 40 percent load, where efficiency is at or above 80 percent. In idle mode, current servers still draw about 60 percent of peak-load electricity. In normal data center operations, average server utilization is only 20–30 percent [27]. With data center operators now facing growing cost constraints and energy efficiency goals, identifying and eliminating these servers promptly becomes a primary objective. Table 4 shows the annual savings available from the idle power draw of each server against a range of electricity costs per kWh (the calculation is reproduced in a short sketch after the table).

Power supply size (Watts) | 400 | 400 | 400 | 400 | 400
Idle power draw (fraction of peak) | 0.6 | 0.6 | 0.6 | 0.6 | 0.6
Power waste (Watts) | 240 | 240 | 240 | 240 | 240
Hours per year | 8760 | 8760 | 8760 | 8760 | 8760
Cost of electricity ($/kWh) | 0.08 | 0.10 | 0.12 | 0.14 | 0.15
Savings ($) | 168.19 | 210.24 | 252.29 | 294.34 | 315.36

Table 4.

Idle server and electricity costs.
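
The savings figures in Table 4 follow from multiplying the idle power waste by the hours in a year and the electricity tariff; the short sketch below reproduces them for the 400 W supply drawing 60 percent of peak while idle.

```python
# Reproduce the Table 4 savings figures for an idle 400 W server drawing 60% of peak.
PSU_WATTS = 400
IDLE_DRAW_FRACTION = 0.6          # idle servers still draw about 60% of peak power
HOURS_PER_YEAR = 8760

waste_watts = PSU_WATTS * IDLE_DRAW_FRACTION             # 240 W wasted continuously
waste_kwh_per_year = waste_watts * HOURS_PER_YEAR / 1000 # 2,102.4 kWh per year

for tariff in (0.08, 0.10, 0.12, 0.14, 0.15):            # $ per kWh
    print(f"${tariff:.2f}/kWh -> ${waste_kwh_per_year * tariff:.2f} saved per year")
# $0.08/kWh -> $168.19 ... $0.15/kWh -> $315.36, matching Table 4
```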

Locating and identifying idle servers is performed through the DCIM solution. The DCIM applies the database from field data collection, the beginning of the CBM process, at the device level of the PSUs and PDUs. The DCIM together with intelligent PDUs gives the data center operator the insight and data needed to gain complete control of power usage, the load profile or utilization of servers, and a cost-efficient IT environment.


5. Results and discussion

After designing the single line diagram of the PDS, all main devices and systems were monitored through IT sensing devices, as shown in Figure 10: the transformer, entrance switchgear, automatic transfer switch (ATS), diesel generator, uninterruptible power supply (UPS), lead-acid batteries, distribution switchgear, power distribution unit (PDU), and rack power supply unit (PSU). The measurements covered the instantaneous values and trends of all electrical statuses (voltage, amperage, phase, total harmonic distortion (THD)) and mechanical statuses (alarms, vibration, noise, temperature, leakage, oil level, and other statuses). All collected data were recorded through the DCIM system to define the set points, or condition-based maintenance (CBM) thresholds, of each critical device and system in order to prevent potential failure along the P-F curve. The results from installing and operating the data center with StruxureWare software show that the DCIM system warnings reduce the day-to-day time the data center operator spends finding the root causes of problems, such as the location of devices or systems, the operating history and condition of a device, and which device broke first and into which system the failure cascaded, making it easier for the operator to make decisions with complete information for future provisioning.


6. Conclusions

Total cost of ownership (TCO) is an excellent measure of the value of data center uptime. System uptime is crucial to the success of a mission-critical data center business: more uptime means lower operating costs and higher customer satisfaction and trust. Data center downtime leads to high TCO through increased penalty costs, data and system recovery costs, and reputational costs. The Tier IV data center targets high system reliability by applying a fault-tolerant topology, the fully redundant 2(N + 1) strategy; consequently, its operations and maintenance require full protection against system failure. Preventive and predictive maintenance (PPM) is therefore used to monitor and detect all potential device and system failures before a data center failure happens. In this research chapter, StruxureWare, a DCIM software package, was deployed within the PPM model to eliminate PDS downtime and trace idle servers. The benefits of data center system maintenance with a properly deployed DCIM are reduced downtime costs, increased uptime productivity, easier online and real-time management, reduced inventory costs, and reduced fixed costs in long-term operations and maintenance. Condition-based maintenance (CBM) has the advantage of addressing two crucial determinants: detecting errors or faults before devices or systems fail (MTBF) and predicting the time between maintenance processes and the time to repair (MTTR). In doing so it saves downtime penalty costs, labor hours, and inventory costs, increases data center uptime, and reduces overall TCO.

References

  1. Ponemon Institute. Cost of Data Center Outages: Data Center Performance Benchmark Series. Sponsored by Vertiv; January 2016
  2. Vitucci F. Predictive Maintenance in a Mission Critical Environment. Emerson: White Paper, CSI 2130 Machinery Health Analyzer; 2017
  3. Keizer O, Teunter R, Veldman J, Babai Z. Condition-based maintenance for systems with economic dependence and load sharing. International Journal of Production Economics. 2018;195:319-327
  4. Lee J. Teleservice engineering in manufacturing: Challenges and opportunities. International Journal of Machine Tools & Manufacture. 1998;38(8):901-910
  5. Sobral J, Soares G. Preventive maintenance of critical assets based on degradation mechanisms and failure forecast. IFAC-PapersOnLine. 2016;49-28:97-102
  6. Blann R. Maximizing the P-F Interval Through Condition-Based Maintenance. 2018. Available from: http://www.maintworld.com/Applications/Maximizing-the-P-F-Interval-Through-Condition-Based-Maintenance
  7. Emerson. Addressing the Leading Root Causes of Downtime. White Paper SL-24656-R10-10. Liebert Corporation; 2010
  8. Shin J, Jun H. On condition based maintenance policy. Journal of Computational Design and Engineering. 2015;2:119-127
  9. Compare M, Bellani L, Zio E. Reliability model of a device equipped with PHM capabilities. Reliability Engineering and System Safety. 2017;168:4-11
  10. de Jonge B, Teunter R, Tinga T. The influence of practical factors on the benefits of condition-based maintenance over time-based maintenance. Reliability Engineering and System Safety. 2017;158:21-30
  11. Uptime Institute. Data Center Site Infrastructure Tier Standard: Topology. Uptime Institute Professional Services, LLC; 2014
  12. BICSI. Data Center Design and Implementation Best Practices. BICSI 002-2014; 2014
  13. Martorell P, Marton I, Sanchez I, Martorell S. Unavailability model for demand-caused failures of safety devices addressing degradation by demand-induced stress, maintenance effectiveness and test efficiency. Reliability Engineering and System Safety. 2017;168:18-27
  14. Keizer O, Teunter R, Veldman J. Joint condition-based maintenance and inventory optimization for systems with multiple devices. European Journal of Operational Research. 2017;257:209-222
  15. StruxureWare. StruxureWare Data Center Operation 8. Schneider Electric; 2018
  16. Vogl G, Weiss B, Helu M. A review of diagnostic and prognostic capabilities and best practices for manufacturing. Journal of Intelligent Manufacturing. 2016:1-17
  17. IEEE 493. IEEE Std 493-2007 (Revision of IEEE Std 493-1997), Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems (Gold Book); 2007
  18. Wiboonrat M. Transformation of system failure life cycle. International Journal of Management Science and Engineering Management. 2008;4(2):143-152
  19. Wiboonrat M. An empirical study on data center system failure diagnosis. In: The Third International Conference on Internet Monitoring and Protection; 2008. p. 103-108
  20. Shin J, Jun H. On condition based maintenance policy. Journal of Computational Design and Engineering. 2015;2:119-127
  21. Cole D. Developing a Preventive Maintenance Plan for Your Data Center. White Paper. PTS Data Center Solutions; Rev 2010-1.1; 2010
  22. Vogl G, Weiss B, Helu M. A review of diagnostic and prognostic capabilities and best practices for manufacturing. Journal of Intelligent Manufacturing; 2016
  23. Zavoda F. Sensors and IEDs required by smart distribution applications: Smart grid and distribution automation. In: The First International Conference on Smart Grids; 2011. p. 120-125
  24. Hor C, Crossley A. Knowledge Extraction from Intelligent Electronic Devices. Berlin Heidelberg: Springer-Verlag; 2005
  25. Bayle T. Preventive Maintenance Strategy for Data Centers. Schneider Electric, White Paper 124, Rev 1; 2011
  26. ECOS, EPRI. Efficient Power Supplies for Data Center. ECOS and EPRI, Technical Report; February 2008
  27. Barroso L, Hölzle U. The case for energy-proportional computing. IEEE Computer; 2007
