## 1. Introduction

Effective and efficient maintenance is a significant factor in the operation of today’s complex computer systems. Selecting the optimal maintenance strategy requires taking numerous issues into account, among which reliability and economic factors are often of equal importance. On the one hand, successful system operation obviously requires that failures be avoided, which argues for extensive and frequent maintenance activities. On the other hand, superfluous maintenance may result in very large and unnecessary cost. Finding a reasonable balance between the two is key to efficient system operation.

This text describes Asset Risk Manager (ARM) – a computer software package provided as a decision support tool for a person selecting maintenance activities. Its main task is to help in evaluation of risks and costs associated with choosing different maintenance strategies. Rather than searching for a solution to a problem: “what maintenance strategy would lead to the best dependability parameters of system operation”, in our approach different maintenance scenarios can be examined in “what-if” studies and their reliability and economic effects can be estimated.

The main idea of the approach is based on the concept of a life curve and discounted cost used to study the effect of equipment ageing under different maintenance policies. First, the deterioration process in the presence of maintenance activities is described by a Markov model and then its various characteristics are used to develop the equipment life curve and to quantify other reliability parameters. Based on these data, effects of various “what-if” maintenance scenarios can be visualized and their efficiency compared. Simple life curves computed from the model can be combined to represent equipment deterioration undergoing diverse maintenance actions, while computing other parameters of the model allows evaluating additional factors, such as probability of equipment failure.

Special attention is paid to one particular problem: given a model that describes the deterioration of an element under some maintenance policy with particular repair frequencies, it is often necessary to create a model representing the same element subjected to a new policy that differs only in repair frequencies. The method proposed for creating such a model adjusts the initial one by fine-tuning the probabilities of the repair states in an iterative process that converges to the desired goal. A discussion of different approximation methods applied during the adjustment is included, and the effectiveness of this approach is illustrated with practical examples.

The ARM system itself was initially presented in (Anders & Sugier, 2006). This text extends that presentation with additional discussion of the method for Markov model adjustment and its impact on new results that can be included in the studies (Sugier & Anders, 2007).

## 2. Modelling the ageing process in the presence of maintenance activities

In the proposed approach it is assumed that the equipment will deteriorate in time and, if not maintained, will eventually fail. If the deterioration process is discovered, preventive maintenance is performed which can often restore the condition of the equipment. Such a maintenance activity will return the system to a specific state of deterioration, whereas repair after failure will restore to “as new” condition (Hughes & Russell, 2005, Anders & Endrenyi, 2004).

Markov models, which form the underlying structure of the models investigated here, have been applied during planning and operation of large networks (IEEE/PES Task Force, 2001). Equipment aging processes with non-exponential time of sojourn in the states can be represented by several series of stages (Li & Guo, 2006). Each stage can be represented as a state in the Markov process so that the non-Markovian processes can be transformed into Markovian processes (IEEE/PES Task Force, 2001; Singh & Billinton, 1997; Tomasevicz & Asgarpoor, to be published). Fuzzy Markov models have also been developed in which uncertainties in transition rates / probabilities are represented by fuzzy values (Mohanta et al., 2005, Duque & Morinigo, 2004, Cugnasca et al., 1999, Ge et al., 2007). In these models, fuzzy arithmetic was applied to mimic the crisp Markov process calculations which are computationally tedious and even more so when the number of states increases.

### 2.1. The life curves

A convenient way to represent the deterioration process is by *a life curve* of the equipment (Anders & Endrenyi, 2004). Such a curve shows the relationship between asset condition, expressed in either engineering or financial terms, and time. Since there are many uncertainties related to the prediction of equipment life, probabilistic analysis must be applied to construct and evaluate life curves. Fig. 1 (a) shows an example of a simple life curve of some equipment that models its continuous deterioration up to the point of failure. Fig. 1 (b) illustrates application of this curve in a case study of some specific scenario in which equipment refurbishment and equipment failure occur.

### 2.2. The ageing process

There are three major factors that contribute to the ageing behaviour of equipment: physical characteristics, operating practices, and the maintenance policy. Of these three aspects the last one relates to events and actions that should be properly incorporated in the model.

The maintenance policy components that must be recognized in the model are: monitoring or inspection (how is the equipment state determined), the decision process (what determines the outcome of the decision), and finally, the maintenance actions (or possible decision outcomes).

In practical circumstances, an important requirement for determining the remaining life of the equipment is establishing its current state of deterioration. Even though at the present state of development no perfect diagnostic test exists, monitoring and testing techniques may permit approximate quantitative evaluation of the state of the system. It is assumed that four deterioration states can be identified with reasonable accuracy: (a) normal state, (b) minor deterioration, (c) significant (or major) deterioration, and (d) equipment failure. Furthermore, the state identification is accomplished through the use of scheduled inspections. Decision events generally correspond to inspection events, but can also be triggered by observations acquired through continuous monitoring. The decision process will be affected by what state the equipment is in, and also by external factors such as economics, current load level of the equipment, its anticipated load level and so on.

### 2.3. The model

All of the above assumptions about the ageing process and maintenance activities can be incorporated in an appropriate state-space (Markov) model. It consists of the states the equipment can assume in the process, and the possible transitions between them. In a Markov model the rates associated with the transitions are assumed to be constant in time.

The development described in this paper uses the model of the Asset Maintenance Planner (AMP) (Anders & Maciejewski, 2006, Anders & Leite da Silva, 2000). The AMP model is designed for equipment exposed to deterioration but undergoing maintenance at prescribed times. It computes the probabilities, frequencies and mean durations of the states of such equipment. The basic ideas in the AMP model are the probabilistic representation of the deterioration process through discrete stages, and the provision of a link between deterioration and maintenance.

The structure of a typical AMP model is shown in Fig. 2. In most situations, it is sufficient to represent deterioration by three stages: an initial (D1), a minor (D2), and a major (D3) stage. This last is followed, in due time, by equipment failure (F) which requires extensive repair or replacement.

In order to slow deterioration and thereby extend equipment lifetime, the operator will carry out maintenance according to some pre-defined policy. In the model of Fig. 2, regular inspections (I*s*) are performed which result in decisions to continue with minor (M*s*1) or major (M*s*2) maintenance or do nothing (with the state number *s* = 1, 2 or 3). The expected result of all maintenance activities is a single-step improvement in the deterioration chain; however, allowances are made for cases where no improvement is achieved or even where some damage is done through human error in carrying out the maintenance resulting in the next stage of deterioration.

The choice probabilities (at the points of decision making) and the probabilities associated with the various possible outcomes are based on user input and can be estimated e.g. from historical records or operator expertise. For the needs of further tuning of the model the probabilities linked to transitions to the maintenance states M*si* are the most important ones as they are directly related to the repair frequencies. These probabilities will be denoted as $P^{sr}$ ($P^{11}, P^{12}, \ldots, P^{32}$), where *s* = state number and *r* = repair index.

Mathematically, the model in Fig. 2 can be represented by a Markov process, and solved by well-known procedures. The solution will yield all the state probabilities, frequencies and mean durations. Another technique, employed for computing the so-called first passage times (FPT) between states, will provide the average times for first reaching any state from any other state. If the end-state is F, the FPTs are the mean remaining lifetimes from any of the initiating states.
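As an illustration of these "well-known procedures", the sketch below solves a deliberately simplified continuous-time chain D1 → D2 → D3 → F → D1 for its steady-state probabilities, state frequencies, mean durations and mean first passage times to F. The transition rates, the omission of inspection and maintenance states, and the function names are assumptions made for the example only; they are not taken from the AMP model.

```python
import numpy as np

# Hypothetical transition rates (per time unit); the F -> D1 rate stands for
# repair after failure that restores the "as new" condition.
states = ["D1", "D2", "D3", "F"]
Q = np.array([
    [-0.5,  0.5,  0.0,  0.0],
    [ 0.0, -1.0,  1.0,  0.0],
    [ 0.0,  0.0, -2.0,  2.0],
    [ 4.0,  0.0,  0.0, -4.0],
])

def steady_state(Q):
    """Solve pi @ Q = 0 together with sum(pi) = 1."""
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def mean_first_passage(Q, target):
    """Mean time to first reach state `target` from every other state."""
    keep = [i for i in range(Q.shape[0]) if i != target]
    Q_transient = Q[np.ix_(keep, keep)]
    times = np.linalg.solve(-Q_transient, np.ones(len(keep)))
    return dict(zip(keep, times))

pi = steady_state(Q)
durations = -1.0 / np.diag(Q)      # mean sojourn time in each state
frequencies = pi / durations       # entries into each state per time unit
fpt = mean_first_passage(Q, states.index("F"))
```

With these rates the mean first passage time from D1 to F is simply the sum of the mean durations of D1, D2 and D3, which is a convenient sanity check for the linear solve.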

## 3. Adjusting model parameters

Preparing the Markov model for some specific equipment is not an easy task and requires participation of an expert. The goal is to create the model representing closely real-life deterioration process known from the records that usually describe average equipment operation under regular maintenance policy with some specific frequencies of inspections and repairs. Compliance with these frequencies in behaviour of the model is a very desirable feature that verifies its trustworthiness.

This section describes a method of model adjustment that aims at reaching such a compliance (Sugier & Anders, 2007). It can be used also for a different task: fully automatic generation of a model for a new maintenance policy with modified frequencies of repairs.

### 3.1. The method

Let *K* represent the number of deterioration states and *R* the number of repairs in the model under consideration. Also, let $P^{sr}$ = probability of selecting maintenance *r* in state *s* (assigned to the decision after state I*s*) and $P^{s0}$ = probability of returning to state D*s* from inspection I*s* (the situation when no maintenance is scheduled as a result of the inspection). Then for all states *s* = 1 … *K*:

$$P^{s0} + \sum_{r=1}^{R} P^{sr} = 1 \tag{1}$$

Let $F^r$ represent the frequency of repair *r* obtained by solving the model. The problem of model tuning can be formulated as follows:

Given an initial Markov model $M_0$, constructed as above and producing frequencies of repairs $F_0$, adjust the probabilities $P^{sr}$ so that some goal frequencies $F_G$ are achieved.

Typically, the vector $F_G$ represents observed historical values of the frequencies of various repairs. In the proposed solution, a sequence of tuned models $M_0, M_1, M_2, \ldots, M_N$ is evaluated, with each consecutive model approximating the desired goal with better accuracy. The tuning procedure begins with the initial model $M_0$ and then in each iteration the following steps are performed:

1. For the current model $M_i$ compute the vector of repair frequencies $F_i$.
2. Evaluate the error of $M_i$ as a distance between vectors $F_G$ and $F_i$.
3. If the error is within the user-defined limit, consider $M_i$ the final tuned model and stop the procedure (*N* = *i*); otherwise proceed to the next step.
4. Create model $M_{i+1}$ through tuning the values of $P^{sr}$.
5. Go to step 1 and proceed with the next iteration.

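The error computed in step 2 can be expressed in many ways. As the frequencies of repairs may vary in a broad range within one vector $F_i$, yet the values of all of them are significant in model interpretation, relative measures work best in practice:

$$\|F_G - F_i\| = \frac{1}{R} \sum_{r=1}^{R} \left| \frac{F_i^r}{F_G^r} - 1 \right| \quad \text{or} \quad \|F_G - F_i\| = \max_{r} \left| \frac{F_i^r}{F_G^r} - 1 \right| \tag{2}$$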

The latter formula is more restrictive and was used in examples of this work.
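A minimal sketch of two such relative measures follows, assuming they are the mean and the maximum of the per-repair relative deviation $|F_i^r / F_G^r - 1|$. The function name and the sample frequencies are hypothetical, and the sketch requires non-zero goal frequencies, so goals in which a repair is removed entirely would need separate handling.

```python
import numpy as np

def relative_errors(f_goal, f_current):
    """Return (mean, max) relative deviation of current from goal frequencies."""
    f_goal = np.asarray(f_goal, dtype=float)
    f_current = np.asarray(f_current, dtype=float)
    rel = np.abs(f_current / f_goal - 1.0)
    return rel.mean(), rel.max()

# Hypothetical frequencies of three repairs (events per time unit).
mean_err, max_err = relative_errors([0.5, 0.1, 0.02], [0.55, 0.1, 0.03])

# Goal set to half of the initial frequencies: both measures start at 100%.
start_mean, start_max = relative_errors([0.25, 0.05, 0.01], [0.5, 0.1, 0.02])
```

The maximum is never smaller than the mean, which is why it is the more restrictive stopping criterion.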

### 3.2. Approximation of model probabilities

Of all the steps outlined in the previous section, it is clear that adjusting the probabilities $P^{sr}$ in step 4 is the most difficult task.

In general, the probabilities represent *K*·*R* free parameters and their uncontrolled modification could lead to serious deformation of the model. To avoid this, a restrictive assumption is made: if the probability of some particular maintenance must be altered, it is modified proportionally in all deterioration states, so that at all times

$$P_i^{1r} : P_i^{2r} : \ldots : P_i^{Kr} \sim P_0^{1r} : P_0^{2r} : \ldots : P_0^{Kr} \tag{3}$$

for all repairs (*r* = 1…*R*).
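The proportionality assumption can be sketched as one scaling factor per repair applied to the whole column of probabilities. In the sketch below the array layout (rows = deterioration states, columns = repairs), the function name and the numbers are hypothetical; the "no maintenance" probability is taken as the remainder to 1, which is an assumption that follows from the normalization of the decision probabilities in each state.

```python
import numpy as np

def scale_repair_probabilities(p_initial, factors):
    """Scale each repair's probability by the same factor in every state.

    p_initial: array of shape (K, R) with the initial repair probabilities.
    factors:   length-R vector of scaling factors, one per repair.
    Returns the scaled (K, R) array and the length-K vector of
    "no maintenance" probabilities.
    """
    scaled = np.asarray(p_initial, dtype=float) * np.asarray(factors, dtype=float)
    p_none = 1.0 - scaled.sum(axis=1)
    if (p_none < 0).any():
        raise ValueError("scaled repair probabilities exceed 1 in some state")
    return scaled, p_none

# Hypothetical model with K = 3 states and R = 2 repairs.
p0 = np.array([[0.10, 0.02],
               [0.20, 0.05],
               [0.30, 0.10]])
p1, p_none = scale_repair_probabilities(p0, factors=[0.5, 1.0])
```

Because every state uses the same factor for a given repair, the ratios between states are preserved and only *R* numbers have to be tuned.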

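This assumption also significantly reduces the dimensionality of the problem, as now only a vector of *R* *scaling factors* $X_{i+1} = [X_{i+1}^1, X_{i+1}^2, \ldots, X_{i+1}^R]$ must be found to generate the probabilities of the model $M_{i+1}$:

$$P_{i+1}^{sr} = X_{i+1}^r \, P_0^{sr}$$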

Moreover, although the frequency of a repair *r* actually depends on the probabilities of all repairs (modifying the probability of one repair changes, among others, state durations in the whole model, and thus the frequencies of all states), it can be assumed that in the case of a single-step small adjustment its dependence on repairs other than *r* can be neglected and

$$F^r(X_i) \approx F^r(X_i^r) \tag{4}$$

With these assumptions the generation of a new model in step 4 is reduced to finding roots of *R* non-linear equations in the form of

$$F^r(X_{i+1}^r) = F_G^r \tag{5}$$

For the needs of the development described in this work the following three approximation algorithms have been implemented and verified on practical examples: (A) Newton method working on a linear approximation of $F^r(X^r)$, (B) the secant method and (C) the false position (*falsi*) method.

#### (A) Newton method On Linear Approximation (NOLA)

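In this solution it is assumed that $F^r(X^r)$ is a linear function defined by the origin and the current point $(X_i^r, F_i^r)$ (with $F_i^r$ computed for $M_i$ in step 1), and the new scaling factor for repair *r* is taken simply as:

$$X_{i+1}^r = X_i^r \, \frac{F_G^r}{F_i^r}$$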

Applying these factors to all repair probabilities creates the next model *M* _{i+1}.
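The complete tuning loop of section 3.1 with this kind of update can be sketched as below. The function `toy_frequencies` is a hypothetical stand-in for solving the Markov model (each frequency grows with its own scaling factor and is mildly coupled to the others); the update rule, which multiplies every scaling factor by the ratio of goal to current frequency, is what a straight line through the origin and the current point implies and should be read as an assumption about the method's details.

```python
import numpy as np

def toy_frequencies(x):
    """Hypothetical replacement for 'solve the model and read repair frequencies'."""
    x = np.asarray(x, dtype=float)
    base = np.array([0.8, 0.3, 0.1])
    return base * x / (1.0 + 0.2 * x.sum())

def tune_linear(solve, f_goal, x_start, tol=0.01, max_iter=50):
    """Iterate steps 1 to 5 with a proportional update of the scaling factors."""
    x = np.asarray(x_start, dtype=float)
    for iteration in range(max_iter):
        f = solve(x)                               # step 1: repair frequencies
        err = np.max(np.abs(f / f_goal - 1.0))     # step 2: maximum relative error
        if err <= tol:                             # step 3: stop when accurate enough
            return x, err, iteration
        x = x * f_goal / f                         # step 4: linear-through-origin update
    raise RuntimeError("tuning did not converge")

f_initial = toy_frequencies(np.ones(3))
x_tuned, final_err, n_iter = tune_linear(toy_frequencies, 0.5 * f_initial, np.ones(3))
```

Only the current frequencies enter the update, so no history of earlier models has to be kept between iterations.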

This method is very simple and may seem primitive but its noteworthy advantage lies in the fact that no other point than the current frequency $F_i^r$ is needed to compute the next approximation.

#### (B) The secant method

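In this standard technique the function is approximated by the secant defined by the last two approximations in points $X_{i-1}^r$ and $X_i^r$.

After that the new scaling factor is taken as the point in which the secant reaches the goal frequency:

$$X_{i+1}^r = X_i^r - \left(F_i^r - F_G^r\right) \frac{X_i^r - X_{i-1}^r}{F_i^r - F_{i-1}^r}$$

To start the procedure two initial points are needed. In this method it is proposed to choose the initial frequency of the model $M_0$ ($F_0^r$) as one of them.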
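A minimal sketch of the secant iteration for a single repair is given below. The function `toy_frequency` is a hypothetical stand-in for the frequency obtained by solving the model with a given scaling factor, and the two starting points (1.0 for the initial model and 0.5 as a first guess) are assumptions made for the example rather than the starting rule of the method.

```python
def toy_frequency(x):
    """Hypothetical repair frequency as a saturating function of the scaling factor."""
    return 0.8 * x / (1.0 + 0.6 * x)

def secant_step(x_prev, f_prev, x_cur, f_cur, f_goal):
    """Scaling factor at which the secant through the last two points reaches f_goal."""
    return x_cur - (f_cur - f_goal) * (x_cur - x_prev) / (f_cur - f_prev)

f_goal = 0.5 * toy_frequency(1.0)      # goal: half of the initial frequency
x_prev, x_cur = 1.0, 0.5               # assumed starting points
for _ in range(20):
    f_prev, f_cur = toy_frequency(x_prev), toy_frequency(x_cur)
    if abs(f_cur / f_goal - 1.0) <= 1e-6:
        break
    x_prev, x_cur = x_cur, secant_step(x_prev, f_prev, x_cur, f_cur, f_goal)
```

Nothing in the update keeps the two points on opposite sides of the goal, which is the property that distinguishes this method from the false position variant.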

#### (C) The false position (*falsi*) method

In this approach the function is approximated by a secant drawn through two points that always lie on opposite sides of the root (the *regula falsi* method).

As in (B), two initial points are needed to begin the iteration, but now they must lie on both sides of the root, i.e.

$$\left(F_0^r - F_G^r\right)\left(F_1^r - F_G^r\right) < 0 \tag{6}$$

Choosing such points may pose some difficulty. To avoid multiple sampling, it is proposed to select the initial model $M_0$ as the first point and to generate the second point, $M_1$, with a deliberate overshoot of the goal controlled by a parameter *α* (rule (7)).

The parameter *α* > 1 controls the overshoot effect. The overshoot must be sufficient to ensure (6) but, on the other hand, should not produce too much of an error because that would degrade the convergence process during the initial steps and would add extra iterations. In practice values of *α* = 1.5 to 2.5 work well. Even if the initial value of *α* turns out to be insufficient to ensure (6), then (7) can be re-applied with *α* increased, although it should be noted that each such correction requires solving a new $M_1$ model and in effect this is an extra computational cost almost equal to that of a whole iteration.
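The false position iteration for a single repair can be sketched as follows. The function `toy_frequency` is again a hypothetical stand-in for the solved model, and the overshoot rule used to create the second starting point (the ratio of goal to initial frequency raised to the power *α*) is an illustrative assumption, not the rule proposed in the text.

```python
def toy_frequency(x):
    """Hypothetical repair frequency as a saturating function of the scaling factor."""
    return 0.8 * x / (1.0 + 0.6 * x)

def false_position(f, f_goal, x_a, x_b, tol=1e-6, max_iter=100):
    """Find x with f(x) = f_goal, keeping the goal bracketed at every step."""
    g_a, g_b = f(x_a) - f_goal, f(x_b) - f_goal
    if g_a * g_b >= 0:
        raise ValueError("starting points do not bracket the goal")
    for _ in range(max_iter):
        x_new = x_b - g_b * (x_b - x_a) / (g_b - g_a)
        g_new = f(x_new) - f_goal
        if abs(g_new) <= tol * abs(f_goal):
            return x_new
        if g_new * g_a < 0:
            x_b, g_b = x_new, g_new    # root lies between x_a and x_new
        else:
            x_a, g_a = x_new, g_new    # root lies between x_new and x_b
    raise RuntimeError("no convergence")

f_goal = 0.5 * toy_frequency(1.0)
alpha = 2.0
x_overshoot = (f_goal / toy_frequency(1.0)) ** alpha   # assumed overshoot rule
x_root = false_position(toy_frequency, f_goal, x_overshoot, 1.0)
```

If the bracket check fails, the sketch raises an error; in the procedure described above this is the situation in which the overshoot would be repeated with a larger *α*.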

### 3.3. Comparison of the methods

We shall now discuss effectiveness of the above three approximation methods using a sample Markov model tuned for four different repair frequencies. The final result presented to the user - life curves that were obtained from the tuned models - is shown in Fig. 3.

The model of the equipment consisted of 3 deterioration states and 3 repairs (*K* = *R* = 3), with *Ms*1 representing minor, *Ms*2 medium and *Ms*3 major repair. The life curve estimated from model *M* _{0} is shown as case 1 and serves as a point of reference. Cases 2 to 5 were created through adjusting *M* _{0} to modified maintenance as follows:

case 2: frequencies of all repairs were reduced by half, F_{G} = ½ F_{0}

case 3: all repairs but major (*Ms*3) were removed, F_{G} = [0, 0, F_{0}^{3}]

case 4: frequencies of all repairs were reduced to 25%, F_{G} = ¼ F_{0}

case 5: all repairs were removed, F_{G} = [ 0, 0, 0 ].

All three approximation methods (NOLA, secant and *falsi*) converged properly to the same set of probabilities that gave the desired goal frequencies. As an example, Fig. 4 compares the convergence rate of the methods in case 2, i.e. during adjustment towards repair frequencies decreased by 50%. The value of the relative error (2) was reduced from the initial 100% to 1% after just 3 iterations, demonstrating the high effectiveness of the proposed model tuning. This also shows that simplifications (3) and (4) from section 3.2 are justifiable and do not degrade the approximation process.

What can also be seen for this specific case in Fig. 4 is that the three approximation methods, although significantly different from the mathematical point of view, yield very similar results during the first iterations 1 to 3. The difference becomes visible starting from iteration no. 4 when, apparently, the secant method generated an approximating point that did not meet condition (6) and, effectively, lost this iteration, reaching the accuracy of the *falsi* method one step later. The same situation happened also in iterations no. 6 and 7. Compared to this, the NOLA method showed no such fluctuations and produced steady improvement in every step, although at a rate not as high as that of the *falsi* method. Fig. 5 presents the convergence rate in the other cases.

Comparing the effectiveness of the methods, it should be noted that although the simplifications of the NOLA solution may seem critical, in practice it works quite well. As noted before, this method has one advantage over its more sophisticated rivals: since it does not depend on previous approximations, selection of the starting point is not so important and the accuracy during the first iterations is often better than in the secant or *falsi* methods. For example, in case 2 (Fig. 4) the NOLA method reached an accuracy of 4.4% after only 2 iterations, while for the secant and *falsi* methods the errors after two iterations were, respectively, 11% and 9.2%. The superiority of the latter methods, especially of the *falsi* algorithm, becomes indisputable in the later stages of approximation when the potential problems with the initial selection of the starting points have diminished.

## 4. Asset Risk Manager

The Asset Risk Manager (ARM) is a software package which uses the concept of a life curve and discounted cost to study the effect of equipment ageing under different hypothetical maintenance strategies (Anders & Sugier, 2006). The curves generated by the program are based on Markov models that were presented in the two previous sections.

For the program to generate the life curves automatically, a default Markov model for the equipment has to be built and stored in the computer database. This is done through the prior running of the AMP program by an expert user. Therefore, the AMP and ARM programs are closely related and usually should be run consecutively.

Implementation details of the Markov models, the tuning of their parameters and all other internal particulars should not be visible to the non-expert end user. All final results are visualized either through the easy-to-comprehend idea of a life curve or through other well-known concepts of financial analysis. Still, prior to running the analysis some expert involvement is needed, largely in preparing, importing and adjusting AMP models.

### 4.1. User input

A typical study is described through a comprehensive set of parameters that are supplied by a non-expert end user. They fall into three broad categories. Among them is the list of maintenance actions to be examined, each being one of the following types:

- continue as before (i.e. do not change the present policy),
- do nothing (i.e. stop all the repairs),
- refurbish,
- replace.

Apart from the first type, every action can be delayed for a defined amount of time. Additionally, for “non-empty” actions (i.e. any of the last two types) the user must specify what to do in the period after the action; the choices are:

(a) to change type of equipment and / or

(b) to change maintenance policy.

For every action the user must also specify what to do in case of failure: whether to repair or replace the failed equipment, its condition afterwards, the cost of this operation etc. Thanks to these options a broad range of maintenance situations can be described and then analysed.

The first action on the list is always “Continue as before” and this is the base of reference for all the others. The ARM can be directed to compute life curves, cost curves, or probabilities of failure – for each action independently – and then to visualize computed data in many graphical forms to assist the decision-maker in effective action assessment.

It should be noted that while the need for some action (e.g., overhaul or change in maintenance policy) is identified at the present moment, the actual implementation will usually take place only after a certain delay during which the original maintenance policy is in effect. Using ARM it is possible to analyze effect of that delay on the cost and reliability parameters.

### 4.2. Life curves

As has been pointed out before, computing the average first passage time (FPT) from the first deterioration state (D1) to the failure state (F) in the Markov model yields the average lifetime of the equipment, i.e. the length of its life curve. On the other hand, solving the model for the state probabilities of all consecutive deterioration states makes it possible to compute state durations, which in turn determine the shape of the curve. Simple life curves obtained for different maintenance policies are later combined in constructing composite life curves which describe various maintenance scenarios.

For the sake of simplicity and consistency, always exactly three deterioration states, or levels, are presented to the end user: minor, medium and major, with adjustable asset condition (AC) ranges. In the case of Markov models which have more than three D*s* states, the expert decides how to assign Markov states to the three levels when importing the model.

Fig. 6 shows example life curves computed by ARM for typical maintenance situations. In each case the action is delayed for 3 time units (months, for example) and the analysis is performed for a time horizon of 10 time units. In the case of the failure seen in the “Do nothing” action, the equipment is repaired and its condition is restored to 85%.

### 4.3. Probability of failure

For a specific action, the probability of failure within the time horizon (*PoF* _{TH}) combines two probabilities: of failure taking place before (*PoF* _{B}) and after (*PoF* _{A}) the moment of action. It is assumed that failures in these two periods making up the time horizon are independent, so

$$PoF_{TH} = PoF_B + PoF_A - PoF_B \cdot PoF_A$$

To compute *PoF*( *T* ) within some time period *T*, the Markov model for the equipment and the life curve are required. The procedure is as follows:

(1) For initial asset condition, find from the life curve the current deterioration state DS_{n}; compute also state progress (SP, %), i.e. estimate how long the equipment has been in the DS_{n} state.

(2) Running FPT analysis on the model, find distributions D_{n} and D_{n+1} of first passage time from DS_{n} and DS_{n+1} to the failure state F.

(3) Taking state progress into account, the probability of failure is evaluated as

$$PoF(T) = D_n(T)\,(1 - SP) + D_{n+1}(T)\,SP$$

For better visualization, rather than finding a single *PoF* _{TH} value for the action defined by the user in the input parameters, ARM computes a curve which shows *PoF* _{TH} as a function of action delay varying in a range of 0 to 200% of the user-specified initial value. An example is demonstrated in Fig. 7 for the “Do nothing” action (user-defined delay = 3 time units), where the two probability components *PoF* _{B} and *PoF* _{A} are also shown.
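The combination of the two probability components introduced at the start of this section can be sketched with the standard rule for the union of two independent events. The function name and the sample values are hypothetical.

```python
def pof_time_horizon(pof_before, pof_after):
    """Probability of at least one failure in two independent periods."""
    return pof_before + pof_after - pof_before * pof_after

# Hypothetical components: 10% before the action, 25% after it.
pof_th = pof_time_horizon(0.10, 0.25)

# Equivalent form: one minus the probability of surviving both periods.
pof_th_alt = 1.0 - (1.0 - 0.10) * (1.0 - 0.25)
```

The product term removes the double counting of the case in which failures occur in both periods, so the result never exceeds 1.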

### 4.4. Cost curves

In many financial evaluations, the costs are expressed as present value (PV) quantities. The present value approach is also used in ARM because maintenance decisions on ageing equipment include timing, and the time value of money is an important consideration in any decision analysis. The cost difference is often referred to as the Net Present Value (NPV). In the case of maintenance, the NPV can be obtained for several re-investment options which are compared with “Continue as before” policy.

Cost computations involve calculation of the following cost components:

- cost of maintenance activities,
- cost of the action selected (e.g. refurbishment or replacement),
- cost associated with failures (cost of repairs, system cost, penalties, etc.).

To compute the PV, inflation and discount rates are required for a specified time horizon. The cost of maintenance over the time horizon is the sum of the maintenance costs incurred by the original maintenance policy for the duration of the delay period, and the costs incurred by the new policy for the remainder of the time horizon. The costs associated with equipment failure over the time horizon can be computed similarly, except that the failure costs before and after the action are multiplied by the respective probabilities of failure (*PoF* _{B} and *PoF* _{A}), and the two products are added. As in the case of probability of failure, ARM presents the end user with a curve which shows the cost as a function of action delay varying in a range of 0 to 200% of the user-specified value.
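The cost computation described above can be sketched as follows. Everything specific in this block is an assumption made for illustration: costs are quoted in today's prices and incurred at the start of each time unit, a cost in period *t* is escalated by inflation and discounted back to the present, and the failure costs are supplied as already discounted amounts. The function and parameter names are hypothetical and do not come from ARM.

```python
def present_value(amount, period, discount_rate, inflation_rate):
    """PV of a cost quoted in today's prices and incurred `period` time units from now."""
    return amount * ((1.0 + inflation_rate) / (1.0 + discount_rate)) ** period

def expected_cost(horizon, delay, maint_old, maint_new, action_cost,
                  fail_cost_before, fail_cost_after, pof_before, pof_after,
                  discount_rate, inflation_rate):
    """Expected present-value cost of one action over the time horizon."""
    def pv(amount, period):
        return present_value(amount, period, discount_rate, inflation_rate)

    # Old policy applies during the delay, the new policy afterwards.
    maintenance = sum(pv(maint_old if t < delay else maint_new, t)
                      for t in range(horizon))
    action = pv(action_cost, delay)
    # Failure costs weighted by the probabilities of failure before and after the action.
    failures = pof_before * fail_cost_before + pof_after * fail_cost_after
    return maintenance + action + failures

# Hypothetical study: horizon of 10 time units, action delayed by 3.
undiscounted = expected_cost(10, 3, 100.0, 50.0, 1000.0, 5000.0, 5000.0,
                             0.1, 0.2, discount_rate=0.0, inflation_rate=0.0)
discounted = expected_cost(10, 3, 100.0, 50.0, 1000.0, 5000.0, 5000.0,
                           0.1, 0.2, discount_rate=0.08, inflation_rate=0.02)
```

Evaluating `expected_cost` for a range of `delay` values would give the kind of cost-versus-delay curve that is presented to the end user.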

## 5. Conclusions

The purpose of the ARM tool is to help in choosing an effective maintenance policy. Based on Markov models representing maintenance actions and deterioration processes, life curves and other reliability parameters can be evaluated. Once a database of equipment models is prepared, the end user can perform various studies of different maintenance strategies and compare expected outcomes. Since the equipment condition is visualized through the relatively simple concept of a life curve, no detailed expert knowledge about internal reliability parameters or configuration is required.

The system can also automatically adjust the model to requested repair frequencies and thus provides for fully automatic computation of dependability parameters in cases when the maintenance policy needs to be modified within some range. This also reduces the model preparation time that requires involvement of the reliability expert and allows for a broader range of studies that can be done fully automatically by the end user.