Maintenance data categories (based on (, Table 8)).
Reliability and maintenance data is important for predictive analysis related to equipment downtime in the oil and gas industry. For example, downtime data together with equipment reliability data is vital for improving system designs, for optimizing maintenance and for estimating the potential for hazardous events that could harm both people and the environment. The quality of such data is largely influenced by the repair time taxonomy, i.e., the measures used to define downtime linked to equipment failures. However, although high-quality maintenance data is an important part of this picture, it often seems to receive less focus than the reliability aspects. Literature and experiences from, e.g., the OREDA project point to several challenging issues, which we discuss in this chapter, such as the interpretation of “MTTR.” Another challenge relates to the duration of maintenance activities. For example, while performing corrective maintenance on an item, one could also be working on several other items while on site. This opens for different ways of recording the mobilization time and repair time, which may then influence the data used for predictive analysis. Examples are included to illustrate some of the challenges posed, and some remedial actions are proposed.
- data collection
- detection time
Equipment reliability and maintenance (RM) data are widely collected in the oil and gas industry and are needed for predictive analysis, e.g., for oil and gas production systems, to achieve safe and cost-efficient solutions. An important benefit of such activity is optimized maintenance, for example, when finding the optimal inspection intervals for pipelines (see e.g., [2–4]) or when deciding upon the appropriate testing modes for safety instrumented systems (see e.g., ).
In the current chapter, we focus on the maintenance part of the RM data collection and mainly on the measures linked to maintenance activities associated with failed items, i.e., equipment with downtime. Thus, attention in this chapter is primarily on the time interval from when a failure of a reparable item occurs to the time when it is back in an upstate.
Updated information about equipment downtime is important and relevant for various types of analysis within the oil and gas industry to inform decision-making. In particular, the measure “mean time to repair,” commonly abbreviated as “MTTR,” is widely used within this industry for the purpose of reliability, availability and maintainability analysis (see e.g., [6–9]). It is also widely used in, for example, design planning (e.g., [10, 11]) and in relation to safety integrity level verification for safety instrumented systems (e.g., [5, 12, 13]).
However, there are several obstacles related to the collection of such data. One of the challenges is simply to get sufficiently high quality in the data for downtime predictions, for example, for assessment of the expected repairing time, i.e., the “MTTR.” The problem is typically twofold: the relevant population is too small (e.g., technology is changing, and old data, which has taken years to collect, may not be relevant anymore), and the taxonomy used for the data collection may not be sufficiently clear, which leaves room for interpreting information differently.
This is why the international standard ISO 14224 is such an important document. It is partly a result of industry field feedback accumulated over more than 25 years within the OREDA project (see [1, 15]), which has led to the standard. It represents a main guidance document for the collection of RM data within the oil and gas industry, applicable on a par with the dependability standard IEC 60300-3-2, and outlines principles and a taxonomy for achieving quality data. It is a way of ensuring data quality, e.g., through the use of a consistent taxonomy (see e.g., [18, 19]).
A much-ignored issue when collecting downtime data is the difficulty of measuring the time to detect the failure, i.e., of establishing the exact time of failure. This time is normally referred to as the “detection time” (TD) (see e.g., ). It is a key value needed, for example, when assessing the expected time to achieve the repair of the failed item, i.e., the so-called mean time to repair, widely referred to by the abbreviation “MTTR.” The main problem is that, in situations where failures are hidden and not evident before some demand occurs, it is very difficult for a data collector to assess or specify the exact time of the failures. Besides, it may not be possible to confirm whether the recorded values are in fact true. Often, one instead attempts to ignore the issue by claiming that the precision of this value is negligible, as TD ≪ MTTR, or MTTR ≪ MTTF, where MTTF is the common abbreviation for the “mean time to failure.” In other words, the numeric value of TD is assumed to be of low importance to the predictive analysis.
In addition, when using the data on repair times from a database, one normally mixes together situations where the failure is detected at once (being in continuous mode) and those situations where the failure is hidden (or dormant) for a while until a demand occurs (e.g., revealed from a functional test of the equipment). Consequently, to guide for more consistent data, the newly issued  has tried to limit the use of MTTR, despite the strong position and frequent use of this measure in the oil and gas industry.
Similarly, the time to mobilize or carry out the repair may also be subject to uncertainty, depending on how the data collector interprets these measures. Having a consistent interpretation of the key terms used to predict downtime is crucial in order to obtain high-quality maintenance data.
An objective of this chapter is to critically examine key terms and associated measures used in maintenance data collection, by studying the repair time taxonomy defined in Ref. , i.e., the different terms used in data collection to define downtime in relation to failed items (see Section 2).
The remaining part of the chapter is structured as follows. Section 2 gives a brief description and definitions of key measures, mostly based on the terminology in [14, 20], such as the MTTR. Section 3 provides two example cases illustrating the effects of using different interpretations of the MTTR. Section 4 links data collection to decision-making and presents findings from data collection experiences identified from published literature, addressing several challenging issues that could compromise the maintenance data quality and use. Then Section 5 provides a discussion on to what extent the industry is in line with ISO 14224 taxonomy and how to handle the issues identified. In Section 6 we give some conclusions.
2. Key downtime terms and measures in maintenance data collection
As indicated in the previous section, ISO 14224  gives guidance both on how to collect and how to analyze downtime data and is a document strongly linked to ISO/TR 12489  on reliability modeling and also the two dependability documents [17, 20].
The recently issued  specifies four different maintenance data categories; see Table 1, where details are given on what is the minimum data that must be collected. For example, any corrective maintenance action shall be paired with the associated failure event (i.e., failure record). Downtime data is also labeled as minimum, meaning that the data collector must specify the total length of the downtime interval.
| Data category | Examples | Minimum data |
|---|---|---|
| Identification | Equipment tag number, failure reference | Unique maintenance identification, equipment identification/location, failure record |
| Maintenance data | Maintenance category = preventive or corrective | Date of maintenance, maintenance category |
| Maintenance resources | Type of main resource(s) and number of days used, e.g., drilling rig, diving vessel, service vessel | (No data specified as minimum) |
| Maintenance times | Time duration for active maintenance work being done on the equipment | Active maintenance time, downtime |
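To make the minimum-data requirement in Table 1 concrete, the following is a minimal sketch of a maintenance record with a simple completeness check. The class name, field names and the helper `missing_minimum_data` are hypothetical and chosen for illustration only; they are not taken from ISO 14224 or from any particular database.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class MaintenanceRecord:
    """Hypothetical record holding the minimum data listed in Table 1."""
    maintenance_id: str                 # unique maintenance identification
    equipment_id: str                   # equipment identification/location
    failure_record_id: Optional[str]    # associated failure record, if any
    maintenance_date: date              # date of maintenance
    maintenance_category: str           # "preventive" or "corrective"
    active_maintenance_time_h: float    # active maintenance time, hours
    downtime_h: float                   # total downtime, hours


def missing_minimum_data(rec: MaintenanceRecord) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if rec.maintenance_category not in ("preventive", "corrective"):
        problems.append("maintenance category must be preventive or corrective")
    # A corrective maintenance action shall be paired with a failure record.
    if rec.maintenance_category == "corrective" and not rec.failure_record_id:
        problems.append("corrective maintenance must reference a failure record")
    if rec.active_maintenance_time_h < 0 or rec.downtime_h < 0:
        problems.append("durations must be non-negative")
    return problems


example = MaintenanceRecord("M-001", "SCM-17", None, date(2016, 5, 3),
                            "corrective", 6.0, 30.0)
print(missing_minimum_data(example))
```

The check deliberately does not require the active maintenance time to be shorter than the downtime, since maintenance work may partly take place while the equipment is in an upstate.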
It is common to split the downtime interval into three main parts, i.e., the active repair time, and the activities before and after (such as waiting, delay, start-up). Obtaining accurate information regarding these downtime segments is challenging. One such example is already mentioned, i.e., the detection time.
Another example is the “active repair time,” which could easily be confused with the “active maintenance time.”  distinguishes between similar terms such as the “active maintenance time,” “active repair time” and “overall repairing time” and uses these when addressing measures, i.e., expected values, such as the “mean active repair time,” “mean overall repairing time,” “mean time to restoration” and “mean time to repair.” We will further address and discuss the meaning of these terms and measures in the following sections.
2.1. Mobilization time
The mobilization time is normally a main part of the repair preparations. It includes mobilization of all types of resources required, such as vessels, personnel and ROVs. It includes all activities carried out to get the necessary resources available to execute the active repair of the failed items.
In Ref. , there is also a relevant note to the entry associated to the definition, which states that “time spent before starting the maintenance is dependent on access to resources, e.g., spare parts, tools, personnel, subsea intervention and support vessels.” The mobilization time is therefore sometimes difficult to distinguish from delays caused by manufacturing time and transportation.
In practical terms, the mobilization of intervention vessels is often described using the term “opportunity maintenance,” meaning that the intervention vessel is on site or called for in relation to other activities. For example, the vessel could already be on site when the failure is detected, or some critical failure is somehow making the mobilization more urgent and prioritized. To deal with such situations, it is important to have clear procedures for how to collect the mobilization data. Typically, the mobilization time TM is specified as TM = 0, if other items are mainly responsible for the vessel order. However, for analysis purposes, it is important to be aware of which maintenance activities are included in the order and which are not.
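The recording convention described above (TM = 0 when other items are mainly responsible for the vessel order) can be sketched as follows. The class and field names are hypothetical; the point is only that the actual vessel mobilization and the reason for the order are kept alongside the recorded TM, so that an analyst can later see which maintenance activities were included in the order.

```python
from dataclasses import dataclass


@dataclass
class MobilizationEntry:
    """Hypothetical log entry for one item served by an intervention vessel."""
    item_id: str
    vessel_mobilization_h: float   # time it actually took to get the vessel on site
    item_triggered_order: bool     # True if this item was the main reason for the order

    @property
    def recorded_tm_h(self) -> float:
        # Convention from the text: TM = 0 if other items are mainly
        # responsible for the vessel order.
        return self.vessel_mobilization_h if self.item_triggered_order else 0.0


entries = [
    MobilizationEntry("valve-A", 240.0, True),    # the vessel was ordered for this item
    MobilizationEntry("sensor-B", 240.0, False),  # repaired opportunistically
]
for e in entries:
    print(e.item_id, e.recorded_tm_h)
```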
Information about an intervention vessel can also be found in long-term schedules, where the time when the vessel is on site largely depends on the planned route. For example, the vessel could plan for 30 days at a “Site A,” then 30 days at a “Site B” (which could then be operated by a different company), then at “Site C,” etc. Mobilization will then depend on both whether the item is critical for production and safety and whether the intervention vessel by chance is at the site or soon is coming to this site.
Furthermore, it is not straightforward how to cope with the issue of multiple maintenance and mobilization activities. Often, several maintenance calls are issued, and the allocation of a maintenance vessel needs to be cost-efficient. Thus, one may find that there is a need to remobilize the vessel, as it was not able to complete the maintenance actions on site before moving to the next location.
2.2. Active maintenance time
The active maintenance time is, as defined in Ref. , “Duration of the maintenance action, excluding logistic delays.” Per the definition, other delays such as technical delays are thus included in the active maintenance time. The measure is often referred to as “the effective time to repair” (see e.g., ). This is regardless of whether the repair is performed in one go.
However, there is nothing stating that the maintenance activity must be completed during the downtime. Hence, parts of the maintenance, or in some situations (and especially for preventive maintenance) most of the maintenance, may take place while the equipment is in an upstate. As mentioned, several separate activities may be performed before the maintenance is completed. For example, one may try different options and run checks on whether the performance is satisfactory, or one might have to wait for equipment parts to arrive, etc. However, the time used to run down or start up the equipment is considered part of the uptime and not the downtime.
In practice, the active maintenance time could include various testing of the equipment, to check its condition. Depending on the urgency of getting the item back into operation, the duration of this activity might be significant. This also relates to, e.g., wells that are alternating in production, which usually makes the maintenance activity less urgent and allows for more “experimentation.” The most cost-efficient solution is then perhaps not the most time-efficient one. For example, one could opt to use time-demanding repairing tools with lower cost and lower efficiency.
The active maintenance time is, by including technical delay, interpreted in Ref.  as, “the calendar time during which maintenance work on the item is actually performed.” Although normally not the case, it is possible for the active maintenance time interval to be larger than the total downtime. This would be true in situations where the equipment is running in operation while the maintenance activity is ongoing.
2.3. Active repair time and mean active repair time
The effective time needed to achieve the repair of an item is called the “active repair time” (see ) and is a key part of the downtime as shown in Table 2, in Phase No. 3. It consists of, per the IEC 60050-192 , three possible activities, i.e., the fault localization time, the fault correction time and the function checkout time. See also (, Figure 5), which compares repair time taxonomies provided in [20–22] (currently also available in the International Association of Drilling Contractors (IADC) Lexicon definition of “mean active repair time” ).
| Phase No. | Description | State of equipment |
|---|---|---|
| 1 | Time of failure and then run-down | Uptime |
| 2 | Preparations and delays | Downtime |
| 3 | Active repair time | Downtime |
| 4 | Waiting and delays | Downtime |
The active repair time is different from the active maintenance time, as the preparations and delays are not included (Phase No. 2 and No. 4). But, it is not necessarily the same as the number of man-hours needed to achieve the repair. The number of man-hours also relates to the amount of personnel working with the repair and is thus not directly linked to the active repair time. Similarly, the time of resource use may only capture a part of, and give a misleading picture of, the overall active repair time. For example, an ROV is often used for subsea maintenance operations, but the use of the actual ROV time may be split on several simultaneous maintenance operations. There may be several other ROV activities carried out, while the ROV is subsea.
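The distinction between man-hours and active repair time can be illustrated with a short calculation. This is a simplified sketch that assumes every person in the crew works for the full duration of the repair; the function name and the numbers are illustrative only.

```python
def man_hours(active_repair_time_h: float, crew_size: int) -> float:
    """Man-hours spent, assuming the whole crew works for the full repair."""
    return active_repair_time_h * crew_size


# The same 8-hour active repair time gives very different man-hour figures
# depending on crew size, so man-hours cannot stand in for repair duration.
for crew in (1, 3, 5):
    print(crew, "persons:", man_hours(8.0, crew), "man-hours for an 8-hour repair")
```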
The expectation of the active repair time is a relevant key performance indicator (KPI) for downtime, i.e., the mean active repair time (MART). It is listed in (, Annex E), to have purpose and value through “indication of the productivity and work content of repair activities.” It is noted that if one is also interested in the preparation and delay times, then the mean downtime (MDT), comprising Phase Nos. 2–4, is a relevant KPI or measure.
2.4. Overall repairing time and mean overall repairing time
ISO/TR 12489:2013 defines the mean overall repairing time (MRT) as the “expected time to achieve the following actions:
- The time spent before starting the repair; and,
- The effective time to repair; and,
- The time before the item is made available to be put back into operation.”
 gives the same understanding of the elements included but instead refers to MRT as the “mean repair time,” which although somewhat similar introduces a variation of the term that is found in practice but which is distinct from the “mean time to repair” (see Section 2.6).
By including the time spent before starting the repair and the time to prepare the item for operation, the measure is synonymous with the MDT when fault detection time is equal to zero. The MDT is simply the expectation of the downtime, i.e., the mean time the equipment is not in a standby or operating mode (upstate).
2.5. Detection time
Detection time is the period from when a fault occurs to the time when this is detected, where fault in this context refers to the equipment being unable to perform as required due to some internal state, in line with the definition in Ref. . The term “fault detection time” is sometimes used for more specificity, as, e.g., in Ref. . However, this term is not the same as the “fault localization time,” i.e., the time taken to complete fault localization, although the two terms appear similar. Fault localization takes place after the fault is detected and during the period of corrective maintenance action. Fault localization often includes the activity of diagnosing at what time the fault occurred (see e.g., ).
Fault detection may be achieved through manual or automatic operations, depending on modes of operation and system characteristics. Faults of safety systems with possible long detection time, e.g., revealed through functional testing, often make it challenging to identify the exact detection time. Assumptions and estimates are then normally made based on the testing intervals.
The expectation of the fault detection time is called the “mean fault detection time” and is abbreviated as the “MFDT” (see e.g., ). For immediately revealed failures, the value of the MFDT is equal to zero, and for failures with short detection time, it is negligible. Otherwise, this value strongly depends on the test policy for the equipment. For hidden failures, the detection time may represent the main part of the downtime.
Sometimes, for assessment of reliability and maintenance performance, the abbreviation “MFDT” is also used for the “mean fractional dead time.” This term has a completely different meaning, i.e., a measure for the average unavailability expressing the expected fraction of time in a nonfunctional condition (refer to use in, e.g., (, p. 428, )). Obviously, the two terms should not be mixed.
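As an illustration of how the test policy drives the MFDT, the sketch below uses the common simplifying assumption that a hidden failure is equally likely to occur at any point within the test interval and that the test always reveals it. Under that assumption the expected detection time is half the test interval. The function name is hypothetical, and the assumption is ours rather than a requirement stated in the text.

```python
def mean_fault_detection_time(test_interval_h: float,
                              immediately_revealed: bool = False) -> float:
    """Expected time from fault occurrence to detection, in hours.

    Assumes the failure time is uniformly distributed over the test interval
    and that the periodic test is perfect (always reveals the fault).
    """
    if immediately_revealed:
        return 0.0
    return test_interval_h / 2.0


# A six-month test interval, taken as 4320 hours (180 days):
print(mean_fault_detection_time(4320.0))                             # 2160.0
print(mean_fault_detection_time(4320.0, immediately_revealed=True))  # 0.0
```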
2.6. Restoration time, mean time to restoration and mean time to repair
Restoration time (or time to restoration; see ) also includes the fault detection time, in addition to the elements comprised by the overall repairing time (see above). The expectation of this value, i.e., the mean time to restoration (MTTRes), thus includes the full picture of:
- Fault detection time.
- Preparation and delays (administrative, logistic and technical delays).
- Active repair time.
- Delays after the item is repaired (mainly administrative).
The variation in the meaning of MTTR as the “mean time to restoration” versus the “mean time to repair” makes it unclear whether all four elements, i.e., the full picture above, are captured. It is often challenging to tell which definition is used in practice by only looking at the abbreviation “MTTR.” Since the change of the meaning of MTTR in 1999 (see ), the engineering population has remained divided between those using the present definition, “mean time to restoration,” and those keeping the old definition, “mean time to repair,” with a reluctance to change .
 also defines the mean time to repair (MTTR) as “expected time to achieve repair of a failed item.” The problem with this definition is the fault detection time, which is either zero when the failure is revealed immediately, or it is unknown. If it is possible to include the detection time, the MTTR is equal to the MTTRes; otherwise MRT is considered a more appropriate measure. When using the MTTR, the meaning could be all of the three measures above, as illustrated in Figure 1, depending on the length of fault detection, active repair time and preparations and delays, which is causing unnecessary confusion to data collection and analysis.  therefore avoids the use of the term MTTR.
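The relationship between the measures discussed in Sections 2.3 to 2.6 can be sketched in code. The record layout and names below are hypothetical, and the two records are made-up numbers; the sketch only shows how MART, MRT and MTTRes follow from the same four downtime segments, and why a bare “MTTR” is ambiguous.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class DowntimeRecord:
    """Hypothetical downtime record, all durations in hours."""
    fault_detection_h: float     # from fault occurrence to detection
    prep_and_delay_h: float      # administrative, logistic and technical delays
    active_repair_h: float       # fault localization, correction, function checkout
    post_repair_delay_h: float   # mainly administrative delay after repair

    @property
    def overall_repairing_time_h(self) -> float:
        return self.prep_and_delay_h + self.active_repair_h + self.post_repair_delay_h

    @property
    def restoration_time_h(self) -> float:
        return self.fault_detection_h + self.overall_repairing_time_h


def downtime_measures(records: List[DowntimeRecord]) -> Dict[str, float]:
    """Sample means of the three candidate interpretations of 'MTTR'."""
    return {
        "MART": mean(r.active_repair_h for r in records),
        "MRT": mean(r.overall_repairing_time_h for r in records),
        "MTTRes": mean(r.restoration_time_h for r in records),
    }


records = [
    DowntimeRecord(0.0, 12.0, 6.0, 2.0),     # failure revealed immediately
    DowntimeRecord(48.0, 20.0, 10.0, 4.0),   # hidden failure, found after 48 h
]
measures = downtime_measures(records)
print(measures)
```

With these illustrative records the three measures differ by a factor of more than six, which is the kind of spread that a single unlabeled “MTTR” value can hide.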
2.7. Availability measures: intrinsic availability
The length of downtime and associated measures are important in computations of availability, where availability is often estimated (e.g., by Monte Carlo simulations) by the use of terms such as the MTTR and MTTF. This is the case when calculating the intrinsic (or inherent) availability for some component (AI), where one considers the corrective maintenance downtime of the system:

$$A_I = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$$

where MTTF is the mean time to failure. However, as already mentioned  notes that it is more appropriate to refer to the active maintenance time observed in the field, i.e., the MTTRes, which represents a more meaningful term compared with, e.g., the MTTR. The formula should therefore instead be:

$$A_I = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTRes}}$$
The MTTRes here is not the same as the mean downtime, MDT, although the two measures may have the same value. Replacing MTTRes with MDT, and MTTF with mean uptime (MUT), would give the operational availability instead of the AI, which is perhaps a more relevant measure from a maintenance perspective. Both are used to express the proportion of time that the equipment or system is in an upstate, but the MDT is generally not considered an intrinsic property. The duration of MDT could in practice include a variety of delays (e.g., detection, isolation, spare parts, standby, repair duration, reinstatement, etc.; see (, Annex C)). The mean up- and downtime are measures that depend strongly on the system performance, i.e., reliability, maintainability and maintenance support, and are therefore dependent on the context in which they are used. The MTTRes, by focusing on the maintenance resources and disregarding external resources, thus allows for a more intrinsic analysis.
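The two availability measures discussed in this section can be written as small functions. This is a minimal sketch with hypothetical function names and made-up input values, assuming the usual ratio form in which availability is the mean up time divided by the sum of mean up time and mean down time.

```python
def intrinsic_availability(mttf_h: float, mttres_h: float) -> float:
    """Intrinsic availability based on MTTF and MTTRes (both in hours)."""
    return mttf_h / (mttf_h + mttres_h)


def operational_availability(mut_h: float, mdt_h: float) -> float:
    """Operational availability based on mean uptime and mean downtime."""
    return mut_h / (mut_h + mdt_h)


# Illustrative values only: the same item can show different figures
# depending on which up time and down time measures are used.
print(intrinsic_availability(mttf_h=50_000.0, mttres_h=60.0))
print(operational_availability(mut_h=48_000.0, mdt_h=400.0))
```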
3. Significance of taxonomical differences
To illustrate the effects of different interpretations of maintenance times, two example cases are provided that illustrate the challenges related to situations where failures are detected with short-time versus long-time intervals. In the first case, we use a data set for subsea control modules (SCMs) obtained from the OREDA database. In the second case, we have constructed a data set for downhole safety valves (DHSVs) based on reliability data collected annually by the Petroleum Safety Authority Norway (PSA) to analyze and map the risk level on the Norwegian Continental Shelf, the so-called RNNP project (see e.g., ). Both data sets have been randomized and anonymized for confidentiality reasons and are thus solely for illustration purposes.
3.1. Example I: short detection time
The data set consists of 375 SCMs, with a total time in service of 1.44 × 10⁷ hours. During this period, a total of 255 failures have occurred (counting any type of failure severity, i.e., critical, degraded or incipient). An estimator for the mean time to failure, MTTF, is thus given as 1.44 × 10⁷/255 = 56,471 hours. We now consider the implications of varying definitions of the repair time, in this context referred to simply as MTTR. Let us assume that there is uncertainty on all of the components that together form the MTTRes, i.e., fault detection time, administrative delay prior to repair, logistic delay, technical delay, active repair time and administrative delay post repair. For simplicity, we will assume that each of these parameters is represented by a triangular distribution denoted T1:
T1 (1; 8; 20) hours, i.e., minimum = 1 hour, peak value = 8 hours and maximum value = 20 hours.
All of the components of the MTTRes are assumed to be independent of each other.
This distribution is not based on any real data set, but delays and repair times on an SCM could realistically be at least 20 hours.
Recalling from Section 2, we have:
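The numbers above can be checked with a few lines of arithmetic. The sketch below computes the MTTF estimator from the figures given in the text and the mean of the triangular distribution T1, using the standard result that the mean of a triangular distribution is the average of its minimum, peak and maximum. The function name is our own.

```python
def triangular_mean(minimum: float, peak: float, maximum: float) -> float:
    """Mean of a triangular distribution."""
    return (minimum + peak + maximum) / 3.0


time_in_service_h = 1.44e7   # total time in service for the 375 SCMs
failures = 255               # failures of any severity

mttf_h = time_in_service_h / failures
t1_mean_h = triangular_mean(1.0, 8.0, 20.0)

print(f"MTTF estimate: {mttf_h:,.0f} hours")                  # 56,471 hours
print(f"Mean of T1:    {t1_mean_h:.2f} hours per component")  # 9.67 hours
```

With six independent components, each following T1, the expected sum of all six is about 58 hours, which is small compared with the MTTF.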
MART = active repair time (fault localization + correction + function checkout).
MRT (overall repair time) = MART + administrative delay prior to repair + logistical delay + technical delay + administrative delay post repair.
MTTRes (time to restoration) = MRT + fault detection time.
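Thus, we can establish three definitions for the intrinsic (inherent, technical) system availability, A, one for each interpretation of the repair time:

$$A_1 = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTRes}}, \qquad A_2 = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MRT}}, \qquad A_3 = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MART}}$$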
For now, we assume that the MTTF is deterministic, and thus all uncertainty is placed on the MTTR component. Running a Monte Carlo simulation using N = 10,000 gives the distributions for A1, A2 and A3 shown in Figure 2.
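A Monte Carlo run of this kind can be sketched with the Python standard library. This is an illustrative reimplementation under the stated assumptions (deterministic MTTF, six independent T1-distributed components), not the code used to produce Figure 2. The labeling follows the usage in Example II, where A1 is based on the MTTRes and A2 on the MRT; A3 is then taken to be based on the active repair time. The seed and function names are arbitrary choices.

```python
import random
import statistics

MTTF_H = 1.44e7 / 255   # deterministic MTTF estimate, hours
N = 10_000              # number of Monte Carlo samples


def draw_t1(rng: random.Random) -> float:
    # random.triangular takes its arguments in the order (low, high, mode).
    return rng.triangular(1.0, 20.0, 8.0)


def simulate(n: int = N, seed: int = 1):
    rng = random.Random(seed)
    a1, a2, a3 = [], [], []
    for _ in range(n):
        detection = draw_t1(rng)
        # Administrative delay prior, logistic delay, technical delay,
        # administrative delay post repair.
        delays = sum(draw_t1(rng) for _ in range(4))
        active_repair = draw_t1(rng)
        overall_repair = active_repair + delays
        restoration = overall_repair + detection
        a1.append(MTTF_H / (MTTF_H + restoration))     # repair time = time to restoration
        a2.append(MTTF_H / (MTTF_H + overall_repair))  # repair time = overall repairing time
        a3.append(MTTF_H / (MTTF_H + active_repair))   # repair time = active repair time
    return a1, a2, a3


a1, a2, a3 = simulate()
for name, sample in (("A1", a1), ("A2", a2), ("A3", a3)):
    print(name, f"mean={statistics.mean(sample):.6f}",
          f"stdev={statistics.stdev(sample):.6f}")
```

The widest definition of repair time yields the lowest mean availability and the largest spread, since it sums the largest number of uncertain components.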
As Figure 2 shows, widening the definition of repair time to include delays and fault detection lowers the system availability and increases the standard deviation, as more uncertain components are included. The deviations between the three definitions of repair time are not significant in this particular case, as the MTTF is much larger than any of the repair times. Consider, however, Figure 3, where a sensitivity analysis is run for the MTTF, showing the expected absolute and relative difference between A1 and A3 (blue line) and between A2 and A3 (red line). The active repair time underlying A3 is a main element of the repair times underlying both A1 and A2, so the gap between the two lines indicates the contribution of detection time. As expected, the lower the MTTF, the greater the significance of the varying interpretation of repair time.
Figure 3 also shows the expected relative difference between the two measures, which decreases over the approximate MTTF interval [500; 10,000] hours. This indicates that, especially when the MTTF is higher than 10,000 hours, i.e., around 1 year, it is fully acceptable to ignore the contribution of detection time in such calculations.
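The shape of such a sensitivity analysis can be reproduced approximately without simulation by inserting the expected repair times into the availability formula. The sketch below is a first-order approximation under the assumptions of Example I (six independent T1-distributed components); it is not the calculation behind Figure 3, and the chosen MTTF grid is arbitrary. The labels A1, A2 and A3 are assumed to correspond to MTTRes, MRT and the active repair time, respectively.

```python
MEAN_T1_H = (1 + 8 + 20) / 3   # mean of T1 (1; 8; 20), hours

MART_H = 1 * MEAN_T1_H         # active repair time only
MRT_H = 5 * MEAN_T1_H          # active repair time plus four delay components
MTTRES_H = 6 * MEAN_T1_H       # overall repairing time plus fault detection time


def availability(mttf_h: float, repair_h: float) -> float:
    return mttf_h / (mttf_h + repair_h)


def differences(mttf_h: float):
    """Absolute differences (A3 - A1, A3 - A2) for a given MTTF."""
    a1 = availability(mttf_h, MTTRES_H)
    a2 = availability(mttf_h, MRT_H)
    a3 = availability(mttf_h, MART_H)
    return a3 - a1, a3 - a2


sweep = {mttf: differences(mttf) for mttf in (500, 1_000, 5_000, 10_000, 50_000)}
for mttf, (d13, d23) in sweep.items():
    print(f"MTTF={mttf:>6} h  A3-A1={d13:.4%}  A3-A2={d23:.4%}  "
          f"detection contribution={d13 - d23:.4%}")
```

The contribution of detection time shrinks steadily as the MTTF grows, which matches the qualitative behavior described for Figure 3.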
While there is the obvious point that MTTR generally plays a more important part of availability the lower the MTTF is, there are also other reasons why the MTTR generally is skewed towards the right of Figure 2. When collecting subsea data, for example, for OREDA, this process is often time-demanding and costly; see e.g.,  when done manually, which is often the case. For this reason, the priority is often to collect failures and any maintenance related directly to these. This comes at the cost of sometimes disregarding opportunity maintenance or other types of no-failure maintenance, where the equipment in question is actually in a downtime state, thus overestimating the total time in service and consequently also the MTTF. Essentially, the actual MTTF is bound to be lower than what is often used as an estimate, and thus the difference in repair time definition becomes greater. In addition to component-related maintenance, there are also at times planned shutdowns of wells or even fields, which are not always captured for the same reasons and which also emphasize the point made. Furthermore, as mentioned in Section 2.1, there is the difficulty of distinguishing mobilization time from delays caused by manufacturing time and transportation, meaning there is a possibility of potentially both longer or shorter times added to, or subtracted from, extended definitions of MTTR. According to Ref. , quality checks on various equipment items (not necessarily subsea items) yielded wrong interpretations or coding used during data collection in 39% of the cases. Such errors could also swing both the MTTF and the MTTR in either direction but will certainly give rise to variations in the expected system availability.
3.2. Example II: hidden failures
A main objective of this second example case is to address the important issue of hidden failures (also called dormant failures). These are failures that, according to Ref. , are not immediately evident to operations and maintenance personnel. This means that it may take some time before detection, as it is not possible to detect these failures unless some specific action, such as a periodic test, is performed.
For the second data set, we refer to a population of 8714 DHSV tests collected from 73 facilities operating on the Norwegian Continental Shelf during the period 2012–2016 from the RNNP project, which in contrast to the SCM population is taken from a wide range of oil and gas operators. The availability requirements refer to an industry standard (see Table 3) based on the requirements set by the Norwegian oil and gas operator Statoil. These requirements are also referred to as the “failure fraction,” FF, the ratio between the number of safety critical failures revealed from periodic testing, x, and the corresponding number of tests performed, n:

$$\mathrm{FF} = \frac{x}{n}$$

| Barrier element: DHSV | Value |
|---|---|
| Number of facilities where tests were performed in 2016 | 73 |
| Average number of tests for facilities where tests were performed in 2016 | 119 |
| Number of facilities with percentage failures in 2016 greater than the industry standard | 25 |
| Total (mean) percentage of failures in 2016 | 0.023 (0.026) |
| Total (mean) percentage failures 2012–2016 | 0.021 (0.021) |
| Industry standard for availability (Statoil value) | 0.02 |
For simplicity, we assume that the valves are tested at the maximum interval, i.e., twice a year (as defined in Ref. ), meaning that if a failure is detected from testing, the valve failure has occurred at some point in time within the interval = [0, 6] months. The data set then corresponds to an estimated 4357 DHSVs. These valves are associated with a total number of 200 failures recorded from this population.
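The failure fraction for this data set follows directly from the numbers given in the text. The sketch below computes it for the 200 failures and 8714 tests and compares it with the industry standard value of 0.02 from Table 3; the function name is our own.

```python
def failure_fraction(failures: int, tests: int) -> float:
    """Ratio of safety critical failures revealed by testing to tests performed."""
    if tests <= 0:
        raise ValueError("number of tests must be positive")
    return failures / tests


INDUSTRY_STANDARD = 0.02   # Statoil value from Table 3

ff = failure_fraction(failures=200, tests=8714)
exceeds_standard = ff > INDUSTRY_STANDARD

print(f"FF = {ff:.3f}")                          # FF = 0.023
print("Exceeds the industry standard:", exceeds_standard)
```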
Furthermore, we assume that, except for the detection time TD, each of the parameters is representable by the triangular distribution defined in Example I, T1 (1; 8; 20) hours. In addition, another triangular probability distribution is made for the time to detect the failure, denoted T2 (1; 2160; 4320) hours, i.e., a peak (mean) value of 3 months. This corresponds to a total downtime of 4.32 × 10⁵ hours.
Although the various delay times may be correlated, we assume the parameters are independent of each other. For the time to failure, we calculate this based on the RNNP data in Table 3, showing the percentage of critical failures, i.e., exceeding the acceptance criteria of the barrier testing. The calculations based on PSA data  yield a total time in service of 7.49 × 10⁷ hours. An estimator for the mean time to failure, MTTF, is thus given as 7.49 × 10⁷/200 = 3.74 × 10⁵ hours. We now consider the implications of varying definitions of the repair time, in this context referred to simply as MTTR. We assume that there is uncertainty on all of the components which together form the MTTRes.
Figure 4 shows, similarly to Figure 2 for Example I, that widening the definition of repair time to include delays and fault detection will lower the system availability and increase the standard deviation. For this example, in contrast to the previous one, the detection time is significant. Also for this example, the MTTF is relatively high; however, now the difference between using MTTRes and MRT is much greater: the deviation in the mean availability between A1 (i.e., MTTR = MTTRes) and A2 (i.e., MTTR = MRT) is equal to 0.547%, versus a deviation of 0.061% for Example I. And although there is significant uncertainty related to the value of MTTF, the relevance of detection time is significant.
Consider also the value of the MTTF in Figure 5, where a sensitivity analysis is run, showing both the expected absolute and relative difference between A1 and A3 (blue line) and A2 and A3 (red line). As expected, the lower the MTTF, the greater the significance of the varying interpretation of repair time, also in this example.
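The order of magnitude of the effect of detection time in Example II can be checked with a point estimate based on expected values. This sketch inserts the means of T1 and T2 into the availability formula instead of running a simulation, so it is a rough approximation under the stated distributional assumptions and is not expected to reproduce the simulated figures exactly. All variable names are our own.

```python
def triangular_mean(minimum: float, peak: float, maximum: float) -> float:
    return (minimum + peak + maximum) / 3.0


failures = 200
time_in_service_h = 7.49e7
mttf_h = time_in_service_h / failures                   # MTTF estimate, hours

mean_t1_h = triangular_mean(1.0, 8.0, 20.0)             # each delay or repair component
mean_detection_h = triangular_mean(1.0, 2160.0, 4320.0) # detection time, T2

mrt_h = 5 * mean_t1_h                  # active repair time plus four delay components
mttres_h = mrt_h + mean_detection_h    # overall repairing time plus detection time

a_mttres = mttf_h / (mttf_h + mttres_h)   # A1 in the text
a_mrt = mttf_h / (mttf_h + mrt_h)         # A2 in the text
deviation = a_mrt - a_mttres

print(f"MTTF     = {mttf_h:,.0f} hours")
print(f"A1 (MTTRes) = {a_mttres:.5f}")
print(f"A2 (MRT)    = {a_mrt:.5f}")
print(f"Deviation   = {deviation:.3%}")
```

The point estimate gives a deviation of roughly half a percentage point, the same order of magnitude as the simulated deviation reported above, and far larger than the effect seen for the SCM data.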
Example II indicates that the MTTF must be significantly higher before the availability deviations converge towards zero, at least a factor of 10 higher than the SCM situation (Example I). Only when the MTTF is in the region of 500,000 hours (60 years or more) can the contribution from detection time be considered negligible in this example.
Besides, significant variations can be seen in this value. For example, in the 2010 edition of the PDS handbook , which presents recommended data for safety instrumented systems, DHSVs are assigned overall failure rates in the range between 2.0 and 6.7 per 10⁶ hours depending on the data source, corresponding to MTTF values of about 58 and 17 years, respectively. Knowing also that there are significant differences between companies and operating conditions, the estimated value of detection time and the measure selected (i.e., MRT or MTTRes) could significantly influence the availability calculations.
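The conversion from a failure rate to an MTTF in years can be sketched as follows, assuming a constant failure rate so that the MTTF is the reciprocal of the rate, and using 8760 hours per year. The function name is hypothetical.

```python
HOURS_PER_YEAR = 8760


def mttf_years(failures_per_million_hours: float) -> float:
    """MTTF in years for a constant failure rate given per 10^6 hours."""
    mttf_hours = 1e6 / failures_per_million_hours
    return mttf_hours / HOURS_PER_YEAR


for rate in (2.0, 6.7):
    print(f"{rate} per 10^6 hours -> MTTF of about {mttf_years(rate):.0f} years")
```

A factor of about three between the lowest and highest quoted failure rate thus translates directly into a factor of about three in the MTTF used in the availability calculation.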
4. Maintenance data collection link to decision-making and challenges
Information about downtime is highly important for decision-making in the oil and gas industry, including for subsea systems. For example, such data is needed to track maintenance KPIs and to achieve so-called maintenance excellence (see [33, 34]).
In general, having useful information about RM is important for high-quality decision-making, and it is one of the six key dimensions that are used to evaluate decision quality :
- Helpful frame (what is it that I am deciding?)
- Creative alternatives (what are my choices?)
- Useful information (what do I know?)
- Clear values (what consequences do I care about?)
- Sound reasoning (am I thinking straight about this?)
- Commitment to follow through (will I really take action?)
The above six dimensions can be visualized as a chain of decision links (see also (, p. 55)), where the decision is no stronger than its weakest link. This simply means that poor information (i.e., the RM data in this context) degrades the decision quality. It is also pointed out in Ref.  that the information or data should be “useful” and hence should be judged against its area of use, in this case the area of RM data applicability and how such data may create business value by contributing to good decisions.
Despite the link to business value, and the broad consensus that use of RM data strongly depends on its quality, collecting such data about RM performance in the industry has typically received considerably less attention compared with the use of the information.
To some extent, it is inevitable that data is not always suited to the decision at hand, as system requirements, design and operations change over time. Nor are the databases flexible enough to adapt to changing data needs. In many situations, the data sources are largely fixed at the time when they are needed. For example, when analyzing and predicting downtime for some subsea safety valve, one typically uses the data already collected and available at the time from some database or source, such as OREDA (see e.g., [1, 37]), WellMaster (see e.g., ) or some internal database. There may be limited room at that point for collecting new and more relevant information. Data collection requires time, personnel and software resources. It could take years to build a quality database, and strategic decisions should be taken about what equipment data and format will be relevant in the future. At the time of analysis and decision-making, there may not be sufficient resources (e.g., time) to collect additional information.
RM data collection is often an issue of resources, such as cost. In Ref. , it is mentioned that “collecting RM data is costly and therefore it is necessary that this effort is balanced against the intended use and benefits.” The RM data collection activity is considered an investment, but it is an issue that strongly depends on both how and why the data are collected. Besides, it is not always clear what time perspective one is considering. Several years of data collection could be required for the data to have significant population and value in decision-making.
A similar point is made in Ref. , with reference to the Well Integrity Management System (WIMS). This is a software application for data collection, and the authors claim that it is important to obtain information that is both accurate and reliable, at least from a user perspective. Nowadays data is shared with, and should be compatible with, other systems used by the oil and gas companies, such as the data management system SAP, to synchronize data quality. The taxonomy used in the RM database should to a large extent also be reflected in the sources where the data is first recorded, so that the transfer of failure data does not have to go through several interpretation steps where essential information is missing.
Another issue is cost and ownership, which is a main reason why the data collection process is often delayed. Experience shows that data collection is often performed at a quite late stage compared to when the actual failure occurred. The main information is typically captured in, e.g., reports from the maintenance provider. It is then the task of some data collector to transfer the relevant information into an RM database at a later stage, including identification of missing and low-quality information. For example, the time to mobilize is normally given with low accuracy in such reports. The same applies to data from typical replacements, for example, choke replacements. The number of hours needed to replace such an item is too often given as an estimate, for example, 24 hours, as the detailed information may not be available from the maintenance reports.
The information provided by the MTTR, which is a common and much used KPI in the oil and gas industry (see e.g., [40, 41]), is also challenging. The KPI has a strong position within this industry, especially for availability calculations, and it is very difficult to avoid the term by replacing it with MRT and MTTRes, as suggested in Ref. . A quick search in Google Scholar (January 2016) for the period 2013–2016 confirms that the term “mean time to repair” (MTTR) is used far more often than the term “mean time to restoration” (MTTRes). The search results also show that, in several cases, the term used is MTTR while the actual basis for estimating the value corresponds to the “mean time to restoration.” The two IEC documents [20, 22] use this term with the abbreviation MTTR, but with the meaning of MTTRes. The abbreviation is also used for the mean time to recover (or recovery), synonymous with the “mean time to repair.”
RM data is used for a vast array of applications, both operational and engineering, and data use today is as relevant as it was 30 years ago. For example, according to Ref. , reliability data is used in the operational phase for evaluating the operational performance of equipment, adjusting maintenance intervals, optimizing test intervals, establishing failure probability distributions, optimizing spare parts and logistics and scheduling job priorities. Of similar importance, in the engineering of equipment components, reliability data provides input to analyses such as safety integrity level analyses, RAM studies, required maintainability, selection of equipment and parts based on reliability experience, choice of maintenance strategy and qualification testing.
The OREDA database, which has a taxonomy based directly on ISO 14224, contains more than 39,000 failures and 73,000 maintenance records, as well as over 2000 years of operating experience from subsea fields. Its current member list includes BP, Engie E&P Norge, Eni, Gassco, Petrobras, Shell, Statoil and Total. Expert members from the OREDA project are also involved in the development of ISO 14224, to make sure that the challenges experienced in practice are captured when revising the standard. When the latest edition was issued , it was possible to include several definitions of key measures relevant for downtime predictions, such as the new definition of the mean time to restoration (MTTRes).
Considering the cost of subsea maintenance operations, downtime and mobilization of vessels, the range of affected decisions, operations and procedures and, not least, the extent to which safety systems are used in today’s industry, the importance of RM data quality cannot be emphasized enough. We will next discuss some challenges relating to the practical use of the ISO 14224 taxonomy, the impact these may have on decisions, operations and design and some suggestions for improvement.
5.1. Data separation
One of the key challenges when collecting RM data is that the RM database in many cases exists as a separate entity, which may communicate only partially, or not at all, with other relevant systems (e.g., SAP). It is not rare to observe that data stored in an RM system such as OREDA or WellMaster is essentially data extracted from another system or database, converted into an appropriate format and then re-entered. This creates several challenges relating to both data integrity and quality:
- It creates unnecessary overhead. Data could simply be stored in one place, possibly supplemented by automatic or manual conversion if a special format or interpretation is required, as is often the case for standards that use a specific taxonomy.
- Requiring access to multiple data systems means spending more time on interviews and on data interpretation, since the different systems are likely to have different data formats, requiring multiple “reformatting” efforts to reach taxonomical compliance. The data are not even necessarily found within the same company, but may be located with suppliers or sub-suppliers.
- Since crucial data, such as maintenance or vessel data, may not be stored directly in an RM system, there is also the risk that the original records do not contain all required fields. If the persons responsible for the operations cannot be contacted or do not recall specifics, the required data will be impaired, severely inaccurate or lost forever.
5.2. Mobilization time
The mobilization time, as previously mentioned, includes all resources, including personnel. The source for obtaining such data varies. Sometimes, such as in the event of unexpected, significant failures, specific detailed maintenance reports may outline the mobilization time of the resources used. In other cases, such data are available from vessel log reports, and the data stored in the RM database is typically an interpretation of these. Such reports, however, seldom contain records of resources other than the vessels themselves, such as personnel. Maintenance records for which no information exists in either vessel logs or other systems have unknown mobilization times, which must therefore be estimated. This is typically done based on the expected time needed to bring the vessel to its destination. Such estimates will thus be optimistic in cases where there were shortages of spare parts or other delays that should have been considered, but were not. Conversely, in some cases, due to a lack of accurate maintenance records or other source documents, the time allocated to mobilization becomes too large if the same maintenance campaign covers several maintenance jobs but the mobilization is not split across all maintenance records. There is also a fair chance that the data collection will not capture remobilization related to maintenance jobs, if this occurs after the maintenance data was collected or if the remobilization is not logically mapped to the original mobilization activity.
It may be challenging to use statistical data to estimate the mobilization time due to the issues discussed above. In many situations, one must rely on experts having insights into the planning of maintenance activities, to assess and recommend representative values of mobilization time. For example, the mobilization is often highly area and company sensitive, where the mobilization time could be significantly shorter within one specific geographic area, e.g., due to a higher number of cooperating vessels operating within the area.
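The over-allocation issue described above, where one campaign covers several maintenance jobs, can be made concrete with a small sketch. It shares the mobilization time of a campaign equally across all maintenance records belonging to that campaign. The equal split is only one possible allocation rule and is an assumption made here for illustration, as are the record fields and names; none of this is prescribed by ISO 14224 or taken from any RM database.

```python
from dataclasses import dataclass


@dataclass
class MaintenanceRecord:
    # Hypothetical record layout, for illustration only.
    item_id: str
    campaign_id: str
    repair_hours: float
    mobilization_hours: float = 0.0


def split_mobilization(records, campaign_mobilization_hours):
    """Share each campaign's mobilization time equally across its records."""
    by_campaign = {}
    for record in records:
        by_campaign.setdefault(record.campaign_id, []).append(record)
    for campaign_id, group in by_campaign.items():
        share = campaign_mobilization_hours[campaign_id] / len(group)
        for record in group:
            record.mobilization_hours = share
    return records


if __name__ == "__main__":
    # Hypothetical campaign: one vessel mobilization of 240 h, three jobs.
    jobs = [
        MaintenanceRecord("choke-A", "C1", repair_hours=24),
        MaintenanceRecord("valve-B", "C1", repair_hours=12),
        MaintenanceRecord("sensor-C", "C1", repair_hours=6),
    ]
    for job in split_mobilization(jobs, {"C1": 240.0}):
        print(job.item_id, job.mobilization_hours)
```

If the full 240 hours were instead assigned to each of the three records, the total recorded mobilization time would be three times the time actually spent. Other rules, such as splitting in proportion to repair hours, could be substituted where they better reflect how the campaign was planned.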
5.3. Detection time
The abbreviation for “mean fault detection time” (i.e., MFDT) is, as already mentioned in Section 2.5, an abbreviation that may have two distinct meanings, both of which are relevant for assessment of reliability and maintenance performance. However, in practice, this discrepancy is not widely problematic, as the specific meaning of both terms is considered well known.
In many situations, it may be reasonable to ignore the contribution of detection time when making assessments involving downtime measures. However, as we may see from the two examples given in the previous section, the significance partly depends on the equipment dealt with and obviously the type of assessment. It is especially important to distinguish between the two types of “MTTR,” i.e., “MRT” and “MTTRes,” when the failure rate is high and the time of failure is uncertain.
The detection time may naturally be subject to high uncertainty when dealing with hidden failures. However, that is not always the situation. Detection time is closely linked to equipment degradation modeling, such as the use of so-called P-F interval models (see e.g., (, p. 394)). A P-F interval is the time from when there is a potential for failure (TP), i.e., when the equipment is in a condition where some fault can be revealed (e.g., by periodic functional testing, condition monitoring or inspection), to the time when the failure occurs (TF); see Figure 6. If the item is subject to periodic testing, and the test interval is shorter than the P-F interval, then one may have a situation where a fault is always detected from testing before any failure occurs. By using such models, it is possible to assess the time from when the fault is revealed to the time of the fault by comparing the condition of the item and the P-F interval model.
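The relation between the test interval and the P-F interval can be illustrated with a minimal sketch. It assumes that tests are carried out at fixed multiples of the test interval, starting at time zero, and that every test reveals any fault present (perfect testing). Both assumptions, as well as the function names and the numbers in the usage note, are ours and serve only to illustrate the argument.

```python
import math


def first_test_at_or_after(time_hours, test_interval_hours):
    """Time of the first scheduled test at or after the given time."""
    return math.ceil(time_hours / test_interval_hours) * test_interval_hours


def detected_before_failure(potential_failure_time, pf_interval, test_interval):
    """True if a test falls between TP and TF = TP + P-F interval.

    Assumes perfect tests at 0, test_interval, 2 * test_interval, ...
    """
    test_time = first_test_at_or_after(potential_failure_time, test_interval)
    return test_time < potential_failure_time + pf_interval


if __name__ == "__main__":
    # Hypothetical values: P-F interval of 1000 h, fault detectable from 100 h.
    print(detected_before_failure(100, pf_interval=1000, test_interval=720))
    print(detected_before_failure(100, pf_interval=1000, test_interval=8760))
```

Under these assumptions, a test interval shorter than the P-F interval guarantees that a test falls inside the interval, whereas a longer test interval only catches the fault if a test happens to be scheduled between TP and TF.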
Detection time should also be seen in relation to the probability of detection, which relates both to the quality of testing and inspection activities and to the incentives for accepting versus failing tests. For functional testing of equipment exposed to hidden failures, there may be strong incentives to pass tests that are close to the acceptance criterion. For example, if failing a valve leakage test leads to a more frequent test schedule, documentation work, etc., then the test personnel might pass a test even though some initial result shows that the leakage is 2.1 bar and thus just over the acceptance criterion of 2.0 bar. One could find it convenient to extend the test or make some adjustments on site to achieve a time interval where the leakage result is found acceptable. The incentives for failing or accepting tests are likely to influence both the testing schedules and the statistics on how many of the results close to the acceptance criterion are reported as “failures,” and thus also the estimated MTTF value.
5.4. Mean time to repair (MTTR)
As with MFDT, MTTR is an abbreviation that may have several different meanings. Within the oil and gas industry, the letters used in the abbreviation could refer to several terms which all make sense within the area of analysis, and may thus cause analytic ambiguity. The “TT” consistently refers to “time to.” However, the two other letters are used inconsistently. “M” could refer to “minimum,” “mean” or “maximum,” and likewise “R” could refer to different meaningful words, such as “repair,” “recovery,” “respond” or “restore/restoration.” The abbreviation is therefore confusing when it is not explicitly defined.
The use of the term “mean time to restoration,” with the abbreviation MTTRes, as suggested in ISO 14224  clearly reduces the chance of making wrong assumptions regarding use of MTTR.
The use of MTTR is very much an issue of quality. Because it may cover both MTTRes and MRT, it is a measure that fails to give consistent information. In some situations, MTTR will provide the MTTRes information, and in some situations, it will provide a combination of the two. Using MTTRes makes it clear that the intended meaning is captured. If MTTR is to be kept as a useful term, one must ensure that the “R” refers to “restore” and not “repair.” For decision-making purposes, the MTTR measure is better avoided, as also recommended in the international standard .
5.5. Failure fraction (FF)
The failure fraction, while being a measure that is simple to understand and use, also has some limitations with respect to time aspects. Using the FF as an availability measure, for example, makes it difficult to draw conclusions when the population varies significantly in detection time and number of demands. The value obtained with this measure does not distinguish between equipment tested monthly and equipment tested once a year, which makes it difficult to use the values for estimating the MTTF.
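The dependence on the test interval can be illustrated with a minimal sketch. It assumes exponentially distributed times to failure with a constant rate λ, perfect tests, failures that are revealed only by the tests and an item that is as good as new after each test. Under these assumptions, which are ours and not part of the FF definition, the probability of finding a failure at a test is 1 − exp(−λτ) for a test interval τ, so that λ = −ln(1 − FF)/τ. The function names and numbers are illustrative.

```python
import math


def failure_rate_from_ff(failure_fraction, test_interval_hours):
    """Constant failure rate implied by a failure fraction and a test interval.

    Assumes FF = 1 - exp(-rate * test_interval).
    """
    return -math.log(1.0 - failure_fraction) / test_interval_hours


def mttf_hours_from_ff(failure_fraction, test_interval_hours):
    """MTTF implied by the assumed exponential model."""
    return 1.0 / failure_rate_from_ff(failure_fraction, test_interval_hours)


if __name__ == "__main__":
    ff = 0.02  # hypothetical: 2 % of tests reveal a failure
    for label, tau in (("monthly", 730), ("yearly", 8760)):
        print(f"{label}: MTTF of about {mttf_hours_from_ff(ff, tau):.0f} h")
```

The same failure fraction thus implies an MTTF that is twelve times higher for the item tested yearly than for the item tested monthly, which is why the FF alone is insufficient for estimating the MTTF.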
Failure fraction is typically used for equipment linked to hidden failures. Assuming that the fault does not occur during the testing, the measure provides relevant information about the number of tests that reveal hidden failures. However, such an assumption strongly depends on the testing interval, as there is a possibility that the fraction of test-induced faults may be high, and one could then argue that the FF, and consequently also the MTTF, would be far higher if the testing interval was reduced.
A main challenge is that the failure fraction ignores the number of demands or faults that have occurred since the last test, which is important information regarding the availability of the system. Hence, it is possible that, e.g., two failures should have been recorded instead of one, where a shorter test interval could have detected two distinct faults initiated at different points in time instead of only the one that was registered at the point of testing. Furthermore, some of the failures could be observed from a demand and repaired prior to the functional testing and thus not be included in the FF statistics.
5.6. Decision-making quality
Typically, the quality of data influences the quality of decision-making. The process also comprises other elements, as listed in Section 4. Part of this quality relates to consistency in the use of the downtime measures for, e.g., availability calculations, which we exemplify in the current chapter.
Based on the above discussion, the downtime measures are found to be ambiguous in the sense that they can be given different interpretations, as now appears to be the situation with the term “mean time to repair” and, in particular, the abbreviation “MTTR.” They may thus contribute to reduced data and decision-making quality. The attempt of  to reduce ambiguity by defining “MTTR” as “mean time to restore,” and thereby including the detection time, is welcome in that sense (see also ). However, as “MTTR” is still widely used with previous definitions, the “mean time to repair” and the “mean time to restore” could be difficult to separate. The ISO 14224  term “mean time to restoration” with the abbreviation “MTTRes” is considered less ambiguous.
In this chapter, we have looked at different terms used to describe downtime. Different terms are used in data collection to define downtime, and it may be questioned whether some of them provide the quality needed for the associated analyses and decision-making where the RM data is used.
The “mean time to repair” is a term that is well established and widely used for, e.g., availability calculations. The meaning of this term has shifted over time, and we find examples where MTTR refers to different meanings, which makes it a challenging term to use. The solution proposed in Ref. , to avoid the use of this term and instead use MTTRes and MRT, is considered an acceptable way to deal with this challenge. Data collection experience indicates that completing a change of the MTTR meaning is difficult, as the term has such a strong position within the oil and gas industry.
Another challenge discussed in this chapter relates to the mobilization of maintenance resources. Experience shows that it is difficult to both interpret and quantify mobilization times in practice. Part of this problem is that resources may be linked to several maintenance activities on site, which provides an opening for different ways of recording the actual time used to mobilize and to repair the specific item. It becomes an interpretation issue, which may influence the values used for prediction of time needed to achieve repair of a failed item. Limited guidance, except for adequate definition, is given in Ref.  on this issue.
In general, we recommend that data collection be given more attention than it receives today. Typically, investments are focused on building models and using the data rather than on obtaining the data and ensuring high quality. In particular, more focus should be placed on achieving high-quality data from maintenance operations.
The current book chapter is based on the paper: “Maintenance data collection for subsea systems: A critical look at terms and information used for prediction of down time” , presented at the European Safety and Reliability (ESREL) conference in Portorož, Slovenia, June 18–22, 2017.