Effective and efficient maintenance is a significant factor in the operation of today’s complex computer systems. Selecting the optimal maintenance strategy must take numerous issues into account, and among them reliability and economic factors are often of equal importance. On the one hand, successful system operation requires that failures be avoided, which argues for extensive and frequent maintenance activities. On the other, superfluous maintenance may result in very large and unnecessary costs. Finding a reasonable balance between these two is a key to efficient system operation.
This text describes Asset Risk Manager (ARM) – a computer software package provided as a decision support tool for a person selecting maintenance activities. Its main task is to help in evaluation of risks and costs associated with choosing different maintenance strategies. Rather than searching for a solution to a problem: “what maintenance strategy would lead to the best dependability parameters of system operation”, in our approach different maintenance scenarios can be examined in “what-if” studies and their reliability and economic effects can be estimated.
The main idea of the approach is based on the concept of a life curve and discounted cost used to study the effect of equipment ageing under different maintenance policies. First, the deterioration process in the presence of maintenance activities is described by a Markov model and then its various characteristics are used to develop the equipment life curve and to quantify other reliability parameters. Based on these data, effects of various “what-if” maintenance scenarios can be visualized and their efficiency compared. Simple life curves computed from the model can be combined to represent equipment deterioration undergoing diverse maintenance actions, while computing other parameters of the model allows evaluating additional factors, such as probability of equipment failure.
Special attention is paid to one particular problem: given a model that describes the deterioration of an element that undergoes some maintenance policy with particular repair frequencies, it is often necessary to create a model representing the same element subjected to a new policy that differs only in repair frequencies. The method proposed for creating such a model adjusts the initial one by fine-tuning the probabilities of the repair states in an iterative process that converges to the desired goal. A discussion of the different approximation methods that can be applied during the adjustment is included, and the effectiveness of this approach is illustrated with practical examples.
The ARM system itself was initially presented in (Anders & Sugier, 2006). This text extends that presentation with additional discussion of the method for Markov model adjustment and its impact on new results that can be included in the studies (Sugier & Anders, 2007).
2. Modelling the ageing process in the presence of maintenance activities
In the proposed approach it is assumed that the equipment will deteriorate in time and, if not maintained, will eventually fail. If the deterioration process is discovered, preventive maintenance is performed, which can often restore the condition of the equipment. Such a maintenance activity will return the system to a specific state of deterioration, whereas repair after failure will restore it to “as new” condition (Hughes & Russell, 2005, Anders & Endrenyi, 2004).
Markov models, which form the underlying structure of the models investigated here, have been applied during planning and operation of large networks (IEEE/PES Task Force, 2001). Equipment aging processes with non-exponential time of sojourn in the states can be represented by several series of stages (Li & Guo, 2006). Each stage can be represented as a state in the Markov process so that the non-Markovian processes can be transformed into Markovian processes (IEEE/PES Task Force, 2001; Singh & Billinton, 1997; Tomasevicz & Asgarpoor, to be published). Fuzzy Markov models have also been developed in which uncertainties in transition rates / probabilities are represented by fuzzy values (Mohanta et al., 2005, Duque & Morinigo, 2004, Cugnasca et al., 1999, Ge et al., 2007). In these models, fuzzy arithmetic was applied to mimic the crisp Markov process calculations which are computationally tedious and even more so when the number of states increases.
2.1. The life curves
A convenient way to represent the deterioration process is by
2.2. The ageing process
There are three major factors that contribute to the ageing behaviour of equipment: physical characteristics, operating practices, and the maintenance policy. Of these three aspects the last one relates to events and actions that should be properly incorporated in the model.
The maintenance policy components that must be recognized in the model are: monitoring or inspection (how is the equipment state determined), the decision process (what determines the outcome of the decision), and finally, the maintenance actions (or possible decision outcomes).
In practical circumstances, an important requirement for determining the remaining life of the equipment is establishing its current state of deterioration. Even though at the present state of development no perfect diagnostic test exists, monitoring and testing techniques may permit approximate quantitative evaluation of the state of the system. It is assumed that four deterioration states can be identified with reasonable accuracy: (a) normal state, (b) minor deterioration, (c) significant (or major) deterioration, and (d) equipment failure. Furthermore, the state identification is accomplished through the use of scheduled inspections. Decision events generally correspond to inspection events, but can be triggered by observations acquired through continuous monitoring. The decision process will be affected by what state the equipment is in, and also by external factors such as economics, current load level of the equipment, its anticipated load level and so on.
2.3. The model
All of the above assumptions about the ageing process and maintenance activities can be incorporated in an appropriate state-space (Markov) model. It consists of the states the equipment can assume in the process, and the possible transitions between them. In a Markov model the rates associated with the transitions are assumed to be constant in time.
The development described in this paper uses the Asset Maintenance Planner (AMP) model (Anders & Maciejewski, 2006, Anders & Leite da Silva, 2000). The AMP model is designed for equipment exposed to deterioration but undergoing maintenance at prescribed times. It computes the probabilities, frequencies and mean durations of the states of such equipment. The basic ideas in the AMP model are the probabilistic representation of the deterioration process through discrete stages, and the provision of a link between deterioration and maintenance.
For the structure of a typical AMP model see Fig. 2. In most situations, it is sufficient to represent deterioration by three stages: an initial (D1), a minor (D2), and a major (D3) stage. This last is followed, in due time, by equipment failure (F), which requires extensive repair or replacement.
In order to slow deterioration and thereby extend equipment lifetime, the operator will carry out maintenance according to some pre-defined policy. In the model of Fig. 2, regular inspections (I
The choice probabilities (at the points of decision making) and the probabilities associated with the various possible outcomes are based on user input and can be estimated e.g. from historical records or operator expertise. For the needs of further tuning of the model the probabilities linked to transitions to the maintenance states M
Mathematically, the model in Fig. 2 can be represented by a Markov process and solved by well-known procedures. The solution will yield all the state probabilities, frequencies and mean durations. Another technique, employed for computing the so-called first passage times (FPT) between states, will provide the average times for first reaching any state from any other state. If the end-state is F, the FPTs are the mean remaining lifetimes from any of the initiating states.
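As an illustration of these standard computations, the following minimal sketch solves a small continuous-time Markov chain with NumPy. It is not the AMP model of Fig. 2: the four-state cycle D1, D2, D3, F without inspection or maintenance states, the transition rates, and all function names are assumptions made only for this example.

```python
import numpy as np

# Hypothetical rates (per time unit) for a simplified chain D1 -> D2 -> D3 -> F,
# with repair after failure returning the equipment to D1 ("as new").
states = ["D1", "D2", "D3", "F"]
rates = {
    ("D1", "D2"): 0.5,
    ("D2", "D3"): 0.5,
    ("D3", "F"): 1.0,
    ("F", "D1"): 10.0,
}


def generator_matrix(states, rates):
    """Build the generator Q: off-diagonal entries are rates, rows sum to zero."""
    idx = {s: i for i, s in enumerate(states)}
    q = np.zeros((len(states), len(states)))
    for (src, dst), rate in rates.items():
        q[idx[src], idx[dst]] = rate
    np.fill_diagonal(q, -q.sum(axis=1))
    return q


def steady_state(q):
    """Solve pi Q = 0 with sum(pi) = 1 by replacing one balance equation."""
    n = q.shape[0]
    a = q.T.copy()
    a[-1, :] = 1.0          # normalisation row
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(a, b)


def mean_first_passage_times(q, target):
    """Mean time to first reach state index `target` from every state."""
    n = q.shape[0]
    keep = [i for i in range(n) if i != target]
    sub = q[np.ix_(keep, keep)]
    times = np.linalg.solve(-sub, np.ones(len(keep)))
    result = np.zeros(n)    # zero for the target state itself
    result[keep] = times
    return result


q = generator_matrix(states, rates)
pi = steady_state(q)                     # state probabilities
durations = -1.0 / np.diag(q)            # mean duration of a visit to each state
frequencies = pi * (-np.diag(q))         # frequency of leaving each state
fpt_to_failure = mean_first_passage_times(q, states.index("F"))
```

With these illustrative rates, `fpt_to_failure[0]` is the mean lifetime counted from D1, and the remaining entries are the mean remaining lifetimes from D2 and D3.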
3. Adjusting model parameters
Preparing the Markov model for some specific equipment is not an easy task and requires the participation of an expert. The goal is to create a model that closely represents the real-life deterioration process known from the records, which usually describe average equipment operation under a regular maintenance policy with some specific frequencies of inspections and repairs. Compliance of the model behaviour with these frequencies is a very desirable feature that verifies its trustworthiness.
This section describes a method of model adjustment that aims at reaching such compliance (Sugier & Anders, 2007). It can also be used for a different task: fully automatic generation of a model for a new maintenance policy with modified frequencies of repairs.
3.1. The method
Given an initial Markov model
Typically, the vector FG represents observed historical values of the frequencies of various repairs. In the proposed solution, a sequence of tuned models
1 For the current model
2 Evaluate an error of
3 If the error is within the user-defined limit consider
4 Create model
5 Go to step 1 and proceed with the next iteration.
The error computed in step 2 can be expressed in many ways. As the frequencies of repairs may vary in a broad range within one vector F
The latter formula is more restrictive and was used in examples of this work.
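The overall shape of such an iterative tuning loop can be sketched as follows. This is only a schematic under stated assumptions: the error measure (largest relative deviation), the function names `tune`, `evaluate` and `update`, and the toy model used to exercise the loop are all hypothetical and are not taken from the ARM implementation.

```python
def max_relative_error(current, goal):
    """Largest relative deviation over all repairs (an assumed error measure).

    Assumes every goal frequency is non-zero.
    """
    return max(abs(f - g) / g for f, g in zip(current, goal))


def tune(model, goal, evaluate, update, tol=1e-3, max_iter=50):
    """Generic tuning loop: evaluate, measure the error, stop or build the next model."""
    for iteration in range(max_iter):
        freqs = evaluate(model)                  # step 1: frequencies of the current model
        err = max_relative_error(freqs, goal)    # step 2: error against the goal
        if err <= tol:                           # step 3: accept the model
            return model, iteration, err
        model = update(model, freqs, goal)       # step 4: create the next model
    raise RuntimeError("tuning did not converge within max_iter iterations")


# Toy stand-in for a Markov model: one repair probability p per repair,
# with a made-up nonlinear frequency response f(p) = 3p / (1 + p).
def toy_evaluate(probs):
    return [3.0 * p / (1.0 + p) for p in probs]


# Toy update rule chosen only to make the example runnable:
# scale each probability by the ratio of goal to current frequency.
def toy_update(probs, freqs, goal):
    return [p * g / f for p, f, g in zip(probs, freqs, goal)]


tuned, iterations, final_error = tune([0.9, 0.2], [1.0, 1.0], toy_evaluate, toy_update)
```

In this toy setting both probabilities approach 0.5, where the made-up response equals the goal frequency of 1.0.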
3.2. Approximation of model probabilities
Of all the steps outlined in the previous section, it is clear that adjusting probabilities
In general, the probabilities represent
for all repairs (
This assumption also significantly reduces dimensionality of the problem, as now only a vector of
Moreover, although frequency of a repair
With these assumptions generation of a new model in step 4 is reduced to finding roots of
For the needs of the development described in this work, the following three approximation algorithms have been implemented and verified on practical examples: (A) Newton method working on linear approximation of
(A) Newton method On Linear Approximation (NOLA)
In this solution it is assumed that
Applying these factors to all repair probabilities creates the next model
This method is very simple and may seem primitive but its noteworthy advantage lies in the fact that no other point than the current frequency
(B) The secant method
In this standard technique the function is approximated by the secant defined by the last two approximations in points
To start the procedure two initial points are needed. In this method it is proposed to choose the initial frequency of the model
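For reference, the textbook secant iteration on a scalar function can be sketched as below. The function `g` is a made-up stand-in for the difference between a model's repair frequency and its goal value as a function of one repair probability; it is an assumption for illustration, not the frequency function of an actual AMP model.

```python
def secant(f, x0, x1, tol=1e-9, max_iter=50):
    """Find a root of f using the secant through the last two approximations."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if abs(f1) <= tol:
            return x1
        if f1 == f0:
            raise ZeroDivisionError("secant is horizontal; choose other starting points")
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)   # root of the secant line
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    raise RuntimeError("secant method did not converge")


# Hypothetical response: frequency 3p / (1 + p) compared with a goal frequency of 1.0.
def g(p):
    return 3.0 * p / (1.0 + p) - 1.0


secant_root = secant(g, 0.9, 0.8)
```

Note that nothing in the plain secant step keeps the new point inside a valid range, so in a probability-tuning setting each new approximation would still have to be checked before use.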
(C) The false position (
In this approach
As in (B), to begin the iteration the two initial points are needed but now they must lie on both sides of the root, i.e.
Choosing such points may pose some difficulty. To avoid multiple sampling, it is proposed to select
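The standard false position iteration, which keeps the root bracketed between two points at every step, can be sketched in the same way. As before, `g` is a made-up response function and the starting points are arbitrary; both are assumptions for illustration only.

```python
def false_position(f, a, b, tol=1e-9, max_iter=200):
    """Find a root of f in [a, b]; f(a) and f(b) must have opposite signs."""
    fa, fb = f(a), f(b)
    if fa == 0.0:
        return a
    if fb == 0.0:
        return b
    if fa * fb > 0.0:
        raise ValueError("starting points must lie on both sides of the root")
    for _ in range(max_iter):
        c = b - fb * (b - a) / (fb - fa)   # root of the chord through (a, fa), (b, fb)
        fc = f(c)
        if abs(fc) <= tol:
            return c
        if fa * fc < 0.0:
            b, fb = c, fc                  # root lies in [a, c]
        else:
            a, fa = c, fc                  # root lies in [c, b]
    raise RuntimeError("false position method did not converge")


# Hypothetical response: frequency 3p / (1 + p) compared with a goal frequency of 1.0.
def g(p):
    return 3.0 * p / (1.0 + p) - 1.0


bracketed_root = false_position(g, 0.1, 0.9)
```

Because the bracket is preserved, every approximation stays between the two starting points, at the price of needing a valid bracket before the iteration can begin.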
3.3. Comparison of the methods
We shall now discuss effectiveness of the above three approximation methods using a sample Markov model tuned for four different repair frequencies. The final result presented to the user - life curves that were obtained from the tuned models - is shown in Fig. 3.
The model of the equipment consisted of 3 deterioration states and 3 repairs (
case 2: frequencies of all repairs were reduced by half, FG = ½ F0
case 3: all repairs but major (
case 4: frequencies of all repairs were reduced to 25%, FG = ¼ F0
case 5: all repairs were removed, FG = [ 0, 0, 0 ].
All the three approximation methods (NOLA, secant and
What can also be seen for this specific case in Fig. 4 is that the three approximation methods, although significantly different from the mathematical point of view, yield very similar results during the first iterations 1 to 3. The difference becomes visible starting from iteration no. 4 when, apparently, the secant method generated an approximating point that did not meet condition (6) and, effectively, lost this iteration reaching accuracy of the
Comparing the effectiveness of the methods, it should be noted that although the simplifications of the NOLA solution may seem critical, in practice it works quite well. As noted before, this method has one advantage over its more sophisticated rivals: since it does not depend on previous approximations, selection of the starting point is not so important and the accuracy during the first iterations is often better than in the secant or
4. Asset Risk Manager
The Asset Risk Manager (ARM) is a software package which uses the concept of a life curve and discounted cost to study the effect of equipment ageing under different hypothetical maintenance strategies (Anders & Sugier, 2006). The curves generated by the program are based on Markov models that were presented in the two previous sections.
For the program to generate the life curves automatically, a default Markov model for the equipment has to be built and stored in the computer database. This is done by an expert user running the AMP program beforehand. The AMP and ARM programs are therefore closely related and should usually be run consecutively.
Implementation details of the Markov models, the tuning of their parameters and all other internal particulars should not be visible to the non-expert end user. All final results are visualized either through the easy-to-comprehend idea of a life curve or through other well-known concepts of financial analysis. Still, prior to running the analysis some expert involvement is needed, largely in preparing, importing and adjusting AMP models.
4.1. User input
A typical study is described through a comprehensive set of parameters that are supplied by a non-expert end user. They fall into three broad categories.
continue as before (i.e. do not change the present policy),
do nothing (i.e. stop all the repairs),
Apart from the first type, every action can be delayed for a defined amount of time. Additionally, for “non-empty” actions (i.e. any of the last two types) the user must specify what to do in the period after the action; the choices are:
(a) to change type of equipment and / or
(b) to change maintenance policy.
For every action the user must also specify what to do in case of failure: whether to repair or replace the failed equipment, its condition afterwards, the cost of this operation etc. Thanks to these options a broad range of maintenance situations can be described and then analysed.
The first action on the list is always “Continue as before” and this is the base of reference for all the others. The ARM can be directed to compute life curves, cost curves, or probabilities of failure – for each action independently – and then to visualize computed data in many graphical forms to assist the decision-maker in effective action assessment.
It should be noted that while the need for some action (e.g., overhaul or change in maintenance policy) is identified at the present moment, the actual implementation will usually take place only after a certain delay during which the original maintenance policy is in effect. Using ARM it is possible to analyze the effect of that delay on the cost and reliability parameters.
4.2. Life curves
As pointed out before, computing the average first passage time (FPT) from the first deterioration state (D1) to the failure state (F) in the Markov model yields the average lifetime of the equipment, i.e. the length of its life curve. On the other hand, solving the model for the state probabilities of all consecutive deterioration states makes it possible to compute state durations, which in turn determine the shape of the curve. Simple life curves obtained for different maintenance policies are later combined into composite life curves which describe various maintenance scenarios.
For the sake of simplicity and consistency, always exactly three deterioration states, or levels, are presented to the end user: minor, medium and major, with adjustable AC ranges. In case of Markov models which have more than three D
Fig. 6 shows example life curves computed by ARM for typical maintenance situations. In each case the action is delayed for 3 time units (months, for example) and the analysis is performed for a time horizon of 10 time units. In case of the failure seen in the “Do nothing” action, the equipment is repaired and its condition is restored to 85%.
4.3. Probability of failure
For a specific action, probability of failure within the time horizon (
(1) For initial asset condition, find from the life curve the current deterioration state DSn; compute also state progress (SP, %), i.e. estimate how long the equipment has been in the DSn state.
(2) Running FPT analysis on the model, find distributions Dn and Dn+1 of first passage time from DSn and DSn+1 to the failure state F.
(3) Taking state progress into account, probability of failure is evaluated as
For better visualization, rather than finding a single
4.4. Cost curves
In many financial evaluations, the costs are expressed as present value (PV) quantities. The present value approach is also used in ARM because maintenance decisions on ageing equipment include timing, and the time value of money is an important consideration in any decision analysis. The cost difference is often referred to as the Net Present Value (NPV). In the case of maintenance, the NPV can be obtained for several re-investment options which are compared with “Continue as before” policy.
Cost computations involve calculation of the following cost components:
cost of maintenance activities,
cost of the action selected (e.g. refurbishment or replacement),
cost associated with failures (cost of repairs, system cost, penalties, etc.).
To compute the PV, inflation and discount rates are required for a specified time horizon. The cost of maintenance over the time horizon is the sum of the maintenance costs incurred by the original maintenance policy for the duration of the delay period, and the costs incurred by the new policy for the remainder of the time horizon. The costs associated with equipment failure over the time horizon can be computed similarly except that the failure costs before and after the action are multiplied by the respective probabilities of failures (
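The maintenance part of this computation can be illustrated with a minimal sketch. It assumes one cost per period quoted in today's prices and paid at the end of the period, a common textbook convention that is not necessarily the one used in ARM; the function names and all figures are hypothetical, and the cost of the action itself and the failure costs are left out for brevity.

```python
def present_value(costs, discount_rate, inflation_rate=0.0):
    """Present value of per-period costs quoted in today's prices.

    costs[t] is assumed to be paid at the end of period t + 1.
    """
    growth = (1.0 + inflation_rate) / (1.0 + discount_rate)
    return sum(cost * growth ** (t + 1) for t, cost in enumerate(costs))


def maintenance_cost_stream(old_cost, new_cost, delay, horizon):
    """Original policy cost during the delay period, new policy cost afterwards."""
    return [old_cost if t < delay else new_cost for t in range(horizon)]


# Hypothetical figures: horizon of 10 periods, action delayed by 3 periods.
baseline = maintenance_cost_stream(100.0, 100.0, delay=3, horizon=10)
changed = maintenance_cost_stream(100.0, 60.0, delay=3, horizon=10)

pv_baseline = present_value(baseline, discount_rate=0.05, inflation_rate=0.02)
pv_changed = present_value(changed, discount_rate=0.05, inflation_rate=0.02)
difference = pv_changed - pv_baseline   # negative: the new policy is cheaper
```

Comparing each option against the unchanged policy in this way mirrors the use of “Continue as before” as the base of reference.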
The purpose of the ARM tool is to help in choosing an effective maintenance policy. Based on Markov models representing maintenance actions and deterioration processes, life curves and other reliability parameters can be evaluated. Once a database of equipment models is prepared, the end user can perform various studies of different maintenance strategies and compare expected outcomes. Since the equipment condition is visualized through the relatively simple concept of a life curve, no detailed expert knowledge about internal reliability parameters or configuration is required.
The system can also automatically adjust the model to requested repair frequencies and thus provides for fully automatic computation of dependability parameters in cases when the maintenance policy needs to be modified within some range. This also reduces the model preparation time that requires the involvement of a reliability expert and allows for a broader range of studies that can be done fully automatically by the end user.