The abilities of “smart systems” to process information, adapt to conditions of uncertainty, and perform scientifically proven preventive actions in real time are analyzed. Basic probabilistic models and technologies for analyzing complex systems that use “smart systems” are proposed, together with ways of generating probabilistic models for prognostic research on systems that are newly designed, modernized, or transformed. The proposed methods make it possible to predict the risk of losing the integrity of complex structures over a given prognostic time, to justify preventive measures with regard to admissible risk, to estimate “smart system” operation quality, and to predict in real time the mean residual time before the next parameter abnormalities. The methods and technologies are implemented at the level of remote monitoring systems. The application is illustrated by examples from the joint-stock company “Siberian Coal Energy Company.”
- smart system
The coming years and decades form an epoch of using smart systems. How useful are smart systems for prediction and for the rationale of preventive measures against possible threats? To answer this question, we turn to some definitions.
According to ISO Guide 73, in the general case, risk is defined as the effect of uncertainty on objectives. An effect is a deviation from the expected, positive and/or negative. Objectives can have different aspects (such as financial, health and safety, and environmental goals) and can be applied at different levels (such as strategic, organization-wide, project, product, and process). Risk is often characterized by reference to potential events and consequences or a combination of these. Risk may be estimated by the probability of potential events leading to effects, considering consequences. This chapter, including its examples, is focused on events leading to losses of system integrity (often with negative consequences). This does not limit the generality of the proposed approaches.
According to ISO/IEC/IEEE 15288 “Systems and software engineering - System life cycle processes,” a system is a combination of interacting elements organized to achieve one or more stated purposes. An enabling system is a system that supports a system of interest during its life cycle stages but does not necessarily contribute directly to its function during operation. A system of systems (SoS) is a system of interest whose elements are themselves systems. A SoS brings together a set of systems for a task that none of the systems can accomplish on its own. Each constituent system keeps its own management, goals, and resources while coordinating within the SoS and adapting to meet SoS goals. The research covers systems that are themselves defined as “smart” systems or that use “smart” systems (see Figure 1).
For a modern or prospective system or system of systems, from the point of view of prediction and rationale of preventive measures against possible threats, “smart” systems are and will be used as systems in themselves, as system elements, or as enabling systems. In the general case, “smart” is a mnemonic acronym giving criteria to guide the setting of objectives, and “smart systems” are defined as miniaturized devices that incorporate functions of sensing, actuation, and control (www.wikipedia.org, www.thefullwiki.org).
Developing existing research [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17], this manuscript includes a correct probabilistic interpretation of risk prediction that makes effective use of “smart” systems, some original basic probabilistic models for risk prediction, an improvement of the existing risk control concept, and approaches for solving some problems of industrial safety in the coal branch.
2. Probabilistic interpretation of risk prediction for effective use of “smart” systems
Because “smart” capabilities make it possible to forecast the future, we should review the probabilistic vision of event prediction, its scientific interpretation, and, unfortunately, some existing illusory visions. From the scientific point of view, when anticipating a dangerous development of events, it is difficult to construct an adequate probability distribution function (PDF) [1, 2, 3, 4] of the time between losses of system integrity. Damage may be estimated to some extent in practice (we will assume that the deviations in such estimations can reach 100%). Therefore, leaving the estimation of possible damage outside the scope of this work, we concentrate on the probabilistic component of risk. What deviations in risk predictions are possible here? To answer this question, it is necessary to understand the typical metrics and engineering methods of risk prediction, the definition and use of the concept of “admissible risk,” and then to compare the various variants.
In practice, probabilistic estimations of system integrity losses are quite often based on the frequency of emergencies or other adverse events. For example, with reference to safety, these can be the frequencies of different threat influences leading to damage. That is, frequency replaces the estimation of probability (the risk of losing system integrity during the prognostic period). Is this correct? From probability theory it is known that one of the characteristics of a defined PDF is its mathematical expectation. For the PDF of time between losses of system integrity, the mathematical expectation is the mean time between neighboring losses of system integrity Texp., and moreover the frequency λ of system integrity losses is λ = 1/Texp.
Today, engineers often prefer the exponential PDF: R(t, λ) = 1 − exp(−λ∙t). If, for example, for a prognostic period of 1 year λ is set to about 10⁻³ per year or less, then by Taylor expansion R(t, λ) ≈ λ∙t, with an error of order λ²∙t². Thus, for t = 1 year, the absolute value of the frequency practically coincides with the value of the probability. But as the value of λ∙t increases, it can exceed 1 and, by definition, can no longer be interpreted as a probability. In summary: focusing on probability is correct from the point of view of a universal risk metric, whereas focusing on frequency may be incorrect if λ∙t is greater than approximately 10⁻³.
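The divergence between frequency and probability can be checked numerically. The following is a minimal sketch, assuming the exponential PDF R(t, λ) = 1 − exp(−λ∙t); the function name `risk_exponential` and the sample values of λ are illustrative assumptions, not taken from the cited works.

```python
import math


def risk_exponential(lam: float, t: float) -> float:
    """Risk of at least one loss of integrity in [0, t] for an exponential PDF.

    Uses expm1 so that small values of lam * t are computed without
    cancellation error: 1 - exp(-x) == -expm1(-x).
    """
    return -math.expm1(-lam * t)


# Illustrative (assumed) frequencies per year, prognostic period of 1 year.
t = 1.0
for lam in (1e-3, 1e-2, 1e-1, 1.0, 2.0):
    frequency_term = lam * t           # what a frequency-based estimate uses
    risk = risk_exponential(lam, t)    # the probability itself, always < 1
    gap = frequency_term - risk        # grows roughly as (lam * t)**2 / 2
    print(f"lam*t={frequency_term:g}  risk={risk:.6f}  gap={gap:.6f}")
```

For small λ∙t the two columns agree closely, while for λ∙t above 1 the frequency term is no longer a valid probability although the risk stays below 1.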
The concept of “admissible risk” has special importance. It should be the result of the consent of all parties involved in an unsafe business, on condition that it does not ruin the business, that it is unequivocally estimated and interpreted by all parties (not excluding emergencies), and that it is scientifically proven. In practice, “admissible risk” is frequently interpreted as a “border strip”; that is, it is supposed that as long as this “border strip” is not crossed, system integrity cannot be lost. But in reality this is not so! Residual risk always remains. In operations research, similar restrictions are considered a starting point for solving synthesis problems connected with the search for effective preventive measures to maintain system integrity during the life cycle. The combined use of these measures helps to keep the risk at the admissible level. This is the typical approach, which should work correctly. But how does it work in practice?
Here it is pertinent to address the developed form of the quantitative requirements connected with the level of admissible risks. The elementary forms of requirements are:
“A frequency λ of system integrity losses should not exceed admissible level λadm.”
“Probability to lose integrity of system during time Treq should not exceed admissible level Radm. (Treq).”
What engineering explanations occur in practice? They are as follows:
If a limitation on the admissible level of probability Radm.(Treq) is set, it means that crossing the “border strip” should not occur on the time interval from 0 to Treq. For exponential approximations there is an unequivocal functional dependence: λadm. = −ln(1 − Radm.(Treq))/Treq. That is, a given value of admissible probability Radm.(Treq) corresponds unequivocally to a value of the maximum frequency of system integrity losses.
If a limitation on the admissible level of the maximum frequency of system integrity losses λadm. is set, it means that for exponential approximations the probability is considered as a function of time t: R(t, λadm.) = 1 − exp(−λadm.∙t). That is, this is the same “border strip,” but in the form of a function of t and without an explicit binding to the value Treq. It is logical to interpret the level of this function also as “admissible” for the period of time from 0 to t. The admissible risk given at one point, as the probability Radm.(Treq) for time Treq, may thus be extended to the level of a PDF by an exponential distribution with the admissible frequency of system integrity losses λ = −ln(1 − Radm.(Treq))/Treq. This is convenient, but is it adequate? In reality, the assumption of an exponential PDF for the behavior of a “smart” system may be grossly erroneous (see Figure 3).
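The two engineering explanations above can be sketched in code. This is a minimal illustration under the exponential assumption only: the admissible frequency is obtained by inverting R(t, λ) = 1 − exp(−λ∙t) at t = Treq (for Treq equal to one time unit it reduces to −ln(1 − Radm.(Treq))). The function names and the sample value Radm. = 0.01 are hypothetical.

```python
import math


def admissible_frequency(r_adm: float, t_req: float) -> float:
    """Invert R = 1 - exp(-lam * t_req) to get the admissible frequency."""
    if not 0.0 <= r_adm < 1.0:
        raise ValueError("admissible risk must lie in [0, 1)")
    if t_req <= 0.0:
        raise ValueError("prognostic period must be positive")
    # log1p(-r) == ln(1 - r), accurate for small r
    return -math.log1p(-r_adm) / t_req


def admissible_risk_curve(lam_adm: float, t: float) -> float:
    """Exponential 'border strip' extended to an arbitrary time t."""
    return -math.expm1(-lam_adm * t)


# Assumed example: admissible risk 0.01 over a 1-year prognostic period.
lam_adm = admissible_frequency(0.01, 1.0)
for t in (0.5, 1.0, 2.0, 5.0):
    print(f"t={t:g} years  R_adm(t)={admissible_risk_curve(lam_adm, t):.5f}")
```

The curve passes through the single required point (Treq, Radm.) by construction; everything away from that point is a consequence of the exponential assumption, which is exactly what the text questions.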
Despite the obvious incompleteness of the elementary forms of requirements on “admissible risks” (in reality, only limitations at one or several points) and the absence of interrelations with the kind of real PDF of time between losses of system integrity (which depends on many parameters: the structure of the system, the heterogeneity of threats, the different measures of counteraction to threats, etc.), these forms have been accepted by the engineering community. In the further statement of the work, we will be guided by these elementary forms of requirements on “admissible risks.” They also make it possible to extract latent knowledge from the results of adequate probabilistic modeling.
Today, specifications of safety in different fields characterize a frequency
Accordingly, an important question arises: what frequencies of system integrity losses should be used for risk predictions, and where do they come from? If these are only the frequencies of emergencies, the predicted risks will be substantially underestimated! These final frequencies are output rather than input data for modeling. Consider: if we were guided by these frequencies and took into account that 50–70% of failures are the result of the “human factor,” it would mean that the frequency of critical “human factor” errors in systems is about once in a thousand years! However, this is not so in real life! Errors are committed much more often, but they are under control, and the majority of them are corrected in due time. As a consequence of these counteraction measures, the required system integrity (including safety) is achieved. The answer is evident: the frequency
Consideration of “smart” system capabilities for proactive diagnostics of system integrity, monitoring of conditions, and recovery of lost integrity makes it possible to create a more adequate PDF for risk predictions. Figure 4 shows the limitations on admissible risks, a fragment of the exponential PDF, and an adequate PDF of time between losses of system integrity with an identical frequency of system integrity losses. The errors in comparison with the vision in Figure 3 are noted.
An example in which all requirements on admissible risk are met is presented in Figure 5. It represents the correct understanding of the probabilistic vision of event prediction, with a scientific interpretation that takes into account the situations in Figure 4.
3. Some basic probabilistic models for risk prediction
Considering the capabilities of “smart” systems, two general technologies of providing protection in different spheres are described: proactive periodic diagnostics of system integrity (technology 1) and, additionally, monitoring between diagnostics (technology 2), including recovery of integrity [2, 3, 6, 7, 8, 9, 10]. These models make it possible to create a more adequate PDF of the time before the next loss of integrity.
3.1. The models for the systems that are presented as one element (“black box”)
Technology 1 is based on proactive diagnostics of system integrity that are carried out periodically to detect danger occurrences in a system or the consequences of negative influences. Lost system integrity can be detected only as a result of diagnostics, after which the recovery of integrity is started. A dangerous influence acts on the system step by step: first a danger occurs in the system, and then, after its activation, it begins to influence the system. System integrity cannot be lost before an occurred danger is activated. A danger is considered to be realized only after it has been activated and has influenced the system. Otherwise, the danger will be detected and neutralized during the next diagnostic.
Note: it is assumed that the diagnostic tools used make it possible to recover system integrity once danger occurrences in the system or the consequences of influences are revealed.
Technology 2, unlike the previous one, implies that operators, alternating with each other, track system integrity between diagnostics. When a danger is detected, an operator recovers system integrity (the ways of removing dangers and recovering the system are the same as for technology 1). Faultless operator actions ensure the neutralization of a danger. Operators are alternated each time a complex diagnostic is periodically performed. A danger can occur only if an operator makes an error, and a dangerous influence occurs if the danger is activated before the next diagnostic.
The probability of system operation with the required safety and quality within the given prognostic period (i.e., the probability of success) may be estimated using the following models for technologies 1 and 2. Assumption: probability distribution functions exist for all time input characteristics. The risk R(Treq) of losing integrity (safety, quality, or a separate property, e.g., reliability), i.e., of being at least once in the “red” range during the period Treq, is the complement to 1 of the probability P(Treq) of providing system integrity (the “probability of success,” i.e., of being in the “green” or “yellow” ranges for the whole period Treq): R(Treq) = 1 − P(Treq), considering consequences.
The next variants for technologies 1 and 2 are possible: variant 1—the given prognostic period Treq is less than established period between neighboring diagnostics (Treq < Tbetw. + Tdiag); variant 2—the prognostic period Treq is more than or equals to established period between neighboring diagnostics (Treq ≥ Tbetw. + Tdiag). Here, Tbetw. is the time between the end of diagnostic and the beginning of the next diagnostic, and Tdiag is the diagnostic time.
where Ωoccur(t) is the PDF of the time between neighboring occurrences of danger (from the “green” to the “yellow” range), with mathematical expectation Toccur = σ⁻¹; Ωactiv(t) is the PDF of the activation time of an occurred danger (threat: from the first entry into the “yellow” range to the first entry into the “red” range), with mathematical expectation β. The PDFs Ωoccur(t) and Ωactiv(t) may be exponential (see rationale in ). For different threats, the frequency of dangers for these PDFs is the sum of the frequencies of every kind of threat.
where N = [Treq/(Tbetw. + Tdiag)] may be real (for the PDF) or the integer part (for estimation of deviations).
The probability of providing system integrity within the given time P(1)(Tgiven) is defined by Eq. (1).
Here, A(τ) is the PDF of time between operator’s errors. A(τ) may be exponential PDF (see rationale in ).
where the probability of providing system integrity within the given time P(1)(Treq.) is defined by Eq. (3).
The final explicit analytical formulas for modeling are obtained by Lebesgue integration of expression (3).
The models are applicable to a system presented as one element. The main result of such system modeling is the probability of providing system integrity, or of losing system integrity, during the given period of time. If this probability is calculated for all points Treq from 0 to ∞, a trajectory of the PDF for each combined element, depending on threats, periodic control, monitoring, and recovery time, is automatically synthesized.
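As an illustration of the two-step process described for technology 1 (a danger first occurs and only then activates), the following minimal sketch evaluates the risk for variant 1 under loudly stated assumptions that go beyond the text: no diagnostic falls inside the prognostic period, only the first danger is considered, the occurrence and activation times are independent and exponential, and integrity is lost when their sum does not exceed Treq. The closed form used is the standard distribution of a sum of two independent exponential times; it is not quoted from Eq. (1), and all function names are hypothetical.

```python
import math


def risk_variant1(t_req: float, sigma: float, beta: float) -> float:
    """P(tau_occur + tau_activ <= t_req) for independent exponential times.

    sigma -- assumed frequency of danger occurrence (Toccur = 1 / sigma)
    beta  -- assumed mean activation time of an occurred danger
    """
    a = sigma          # rate of occurrence
    b = 1.0 / beta     # rate of activation
    if math.isclose(a, b):
        # Equal rates: the sum follows an Erlang distribution of order 2.
        return 1.0 - math.exp(-a * t_req) * (1.0 + a * t_req)
    return 1.0 - (b * math.exp(-a * t_req) - a * math.exp(-b * t_req)) / (b - a)


def risk_variant1_numeric(t_req: float, sigma: float, beta: float,
                          steps: int = 20_000) -> float:
    """Midpoint-rule convolution used only as a cross-check of the closed form."""
    h = t_req / steps
    total = 0.0
    for k in range(steps):
        s = (k + 0.5) * h
        occur_density = sigma * math.exp(-sigma * s)
        activated_by_t_req = 1.0 - math.exp(-(t_req - s) / beta)
        total += occur_density * activated_by_t_req * h
    return total


# Assumed inputs: one danger per 2 years on average, mean activation 0.2 year.
print(risk_variant1(1.0, sigma=0.5, beta=0.2))
print(risk_variant1_numeric(1.0, sigma=0.5, beta=0.2))
```

The probability of providing integrity in this simplified setting is the complement 1 − risk; periodic diagnostics and operator monitoring, which the chapter's models account for, are deliberately left out of the sketch.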
3.2. Probabilistic approach to estimate “smart” system operation quality
In the general case, “smart” system operation always aims to produce complete, valid, and, if needed, confidential information reliably and in a timely manner for its proper further pragmatic use, including the incorporated functions of sensing, actuation, and control. Potential threats to “smart” system operation act on the information used (Figure 6).
In general case a probabilistic space (
“The model of function performance by a complex system in conditions of unreliability of its components” (the measures: TMTBF is the mean time between failures; Prel.(Tgiven) is the probability of reliable operation of the information system (IS), composed of subsystems and system elements, during the given period Tgiven; and Pman(Tgiven) is the probability of faultless human actions during the given period Tgiven).
“The model complex of call processing” (the measures for the different dispatcher technologies (for unpriority call processing in a consecutive order for single-tasking processing mode, in a time-sharing order for multitasking processing mode; for priority technologies of consecutive call processing with relative and absolute priorities; for batch call processing; for combination of technologies above): the mean wait time in a queue; the mean full processing time, including the wait time; Ptim is the probability of well-timed processing during the given time; the relative portion of all well-timed processed calls; the relative portion of well-timed processed calls of those types for which the customer requirements are met Ctim).
“The model of entering into IS current data concerning new objects of application domain” (the measure: Pcompl is the probability that IS contains complete current information about the states of all objects and events).
“The model of information gathering” (the measure: Pactual. is the probability of IS information actuality on the moment of its use).
“The model of information analysis” (the measures: Pcheck is the probability of error absence after checking; the fraction of errors in information after checking; Pprocess is the probability of correct analysis results obtained; the fraction of unaccounted essential information).
“The model complex of dangerous influences on a protected system” (the measures: Pinf.l.(Тgiven) is the probability of required counteraction to dangerous influences from threats during the given period Тgiven).
“The model complex of an authorized access to system resources” (the measures: Pprot is the probability of providing system protection from an unauthorized access by means of barriers; Pconf. (Тgiven) is the probability of providing information confidentiality by means of all barriers during the given period Тgiven).
These models, supported by different versions of software Complex for Evaluation of Information Systems Operation Quality, patented by Rospatent №2,000,610,272 (CEISOQ+), may be applied and improved for solving such system problems in “smart” system life cycle as rationale of quantitative system requirements to hardware, software, users, staff, and technologies; requirement analysis; estimation of project engineering decisions and possible danger; detection of bottlenecks; investigation of problems concerning potential threats to system operation and information security; testing, verification, and validation of “smart” system operation quality; rational optimization of “smart” system technological parameters; and rationale of projects and directions for effective system improvement and development.
3.3. The generation of new models for complex systems
The basic ideas of correct integration of probabilistic metrics are based on a combination and development of the offered models [2, 3, 6, 7, 8, 9, 10]. For the estimation of a complex system with a parallel or series structure, new models can be generated by the methods of probability theory. For this purpose, by analogy with reliability, it is necessary to know the mean time between losses of integrity for each element. Let us consider the elementary structure of two independent elements connected in parallel, which means the logical connection “OR,” or in series, which means the logical connection “AND” (see Figure 7).
Let the PDF of the time between neighboring losses of integrity of the ith element be Bi(t) = P(τi ≤ t); then:
(1) The time between losses of integrity for a system composed of independent elements connected in series is equal to the minimum of the two times τi: failure of the first or second element (i.e., the system goes into a state of lost integrity when the integrity of either the first or the second element is lost). For this case the PDF of time between losses of system integrity is defined by the expression
\[ B(t) = P(\min(\tau_1, \tau_2) \le t) = 1 - P(\tau_1 > t)\,P(\tau_2 > t) = 1 - [1 - B_1(t)]\,[1 - B_2(t)] \tag{5} \]
(2) The time between losses of integrity for a system composed of independent elements connected in parallel (hot reservation) is equal to the maximum of the two times τi: failure of the first and second elements (i.e., the system goes into a state of lost integrity when the integrity of both the first and the second element is lost). For this case the PDF of time between losses of system integrity is defined by the expression
\[ B(t) = P(\max(\tau_1, \tau_2) \le t) = P(\tau_1 \le t)\,P(\tau_2 \le t) = B_1(t)\,B_2(t) \tag{6} \]
Note: The same approach was also studied by Prof. E. Ventcel (Russia) in the 1980s, who formulated training tasks for students.
Thus, the adequacy of probabilistic models is achieved by considering the real processes of control, monitoring, and element recovery for a complex structure. By applying expressions (5)–(6) recurrently, it is possible to create the PDF of time between losses of integrity for any complex system with a parallel and/or series structure.
Knowing this more adequate PDF makes it possible to define the mean time between neighboring losses of system integrity Texp. (calculated from this PDF as its mathematical expectation) and the frequency of system integrity losses λ = 1/Texp.
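The recurrent composition can be sketched as follows. This is a minimal illustration that relies only on standard probability for independent elements: for a series connection the system time is the minimum of the element times, so B(t) = 1 − Π[1 − Bi(t)], and for a parallel connection it is the maximum, so B(t) = Π Bi(t). The helper names, the exponential element distributions, and the example structure are assumptions made for illustration.

```python
import math
from typing import Callable

Cdf = Callable[[float], float]  # B(t) = P(tau <= t)


def series(*cdfs: Cdf) -> Cdf:
    """Integrity is lost when ANY element loses integrity (minimum of times)."""
    def combined(t: float) -> float:
        all_keep_integrity = 1.0
        for b in cdfs:
            all_keep_integrity *= 1.0 - b(t)
        return 1.0 - all_keep_integrity
    return combined


def parallel(*cdfs: Cdf) -> Cdf:
    """Integrity is lost only when ALL elements lose integrity (maximum of times)."""
    def combined(t: float) -> float:
        all_lost = 1.0
        for b in cdfs:
            all_lost *= b(t)
        return all_lost
    return combined


def exponential(lam: float) -> Cdf:
    """Assumed element model: exponential time between losses of integrity."""
    return lambda t: -math.expm1(-lam * t)


# Hypothetical structure: element 1 in series with a hot-reserved pair (2, 3).
system = series(exponential(0.1), parallel(exponential(0.5), exponential(0.5)))
for t in (0.5, 1.0, 2.0):
    print(f"t={t:g}  risk of lost integrity = {system(t):.5f}")
```

Because `series` and `parallel` return functions of the same type as their arguments, they can be nested to any depth, which mirrors the recurrent application of the two composition rules.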
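The step from a PDF to the mean time and the frequency can be sketched numerically. The example below is a minimal illustration, assuming that the mathematical expectation of a nonnegative time equals the integral of 1 − B(t) over t from 0 to ∞, truncated here at a finite `t_max`; the function names, the truncation point, and the test distributions are assumptions.

```python
import math
from typing import Callable


def mean_time_between_losses(cdf: Callable[[float], float],
                             t_max: float, steps: int = 100_000) -> float:
    """Approximate Texp = integral of (1 - B(t)) dt on [0, t_max] (midpoint rule).

    t_max must be large enough for 1 - B(t_max) to be negligible.
    """
    h = t_max / steps
    return sum(1.0 - cdf((k + 0.5) * h) for k in range(steps)) * h


def loss_frequency(cdf: Callable[[float], float],
                   t_max: float, steps: int = 100_000) -> float:
    """Frequency of system integrity losses, lambda = 1 / Texp."""
    return 1.0 / mean_time_between_losses(cdf, t_max, steps)


def single_element(t: float) -> float:
    """Assumed element: exponential time between losses with rate 0.5."""
    return -math.expm1(-0.5 * t)


def reserved_pair(t: float) -> float:
    """Assumed hot-reserved pair of two such independent elements."""
    return single_element(t) ** 2


print(mean_time_between_losses(single_element, t_max=100.0))
print(mean_time_between_losses(reserved_pair, t_max=100.0))
print(loss_frequency(reserved_pair, t_max=100.0))
```

For the reserved pair the resulting PDF is no longer exponential, so the frequency obtained this way summarizes the distribution but does not determine the risk at a given Treq.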
The risk of losing integrity (safety, quality, or a separate property, e.g., reliability) is the complement to 1 of the probability of providing system integrity (correct system operation, or the “probability of success”): R = 1 − P. The formulas for probabilistic modeling of technologies 1 and 2 and their proofs are given in [2, 3, 6].
All these ideas are implemented by the software technologies of risk prediction for complex systems, for example, the “mathematical modeling of system life cycle processes,” “know-how” (registered by Rospatent №2,004,610,858), and “complex for evaluating quality of production processes” (patented by Rospatent №2,010,614,145) [8, 9].
4. The improvement of existing risk control concept
Creation and perfection of probabilistic models for problem decision
Automatic combination and generation of new probabilistic models
Forming the storehouse of risk prediction knowledge
For storehouse, dozens of variants of the decision of typical industrial problems for risk control
Thus, the proposed improved risk control concept can be useful for effectively performing the following functions: risk prediction; rationale of quantitative system requirements for hardware, software, users, staff, and technologies; requirement analysis; estimation of project engineering decisions and possible danger; detection of bottlenecks; investigation of problems concerning potential threats to the operation of complex systems; validation of system operation quality; rational optimization of system parameters; and rationale of plans, projects, and directions for effective system utilization, improvement, and development. The expected pragmatic effect is as follows: based on the rational application of the improved concept, it is possible to provide an essential rise in system quality and safety and/or avoid wasted expenses in the system life cycle.
5. About some problems of industrial safety for coal branch
As an example of effectively solving the problems of industrial safety, we consider an experience of the joint-stock company “Siberian Coal Energy Company (SUEK)” (see www.suek.com). SUEK is one of the world’s largest coal companies with production assets in Russia and a global trading network. SUEK delivers long-term value to shareholders at every stage of the value chain—mining, processing, transportation, and shipment—through port facilities, sales, and distribution (Figure 10). This value chain includes different SoS. In practice many SoS of SUEK use “smart” systems [11, 14].
Below are the aspects researched:
Probabilistic analysis of the capabilities of the remote monitoring system (RMS) for increasing critical infrastructure safety (CIS).
Real-time estimation of the mean residual time before the next parameter abnormalities, considering the results of real-time control of the conformity of equipment and technological processes with the set norms.
5.1. Probabilistic analysis of the remote monitoring system possibilities
For the coal branch, mine workings, buildings, and constructions should be equipped with a complex of systems and means that provide the organization and implementation of safe coal mining and of technological and production control in normal and emergency conditions. This complex of systems and means should be integrated into a multifunctional safety system (MFSS) with the following main functions:
Monitoring and prevention of conditions of occurrence of geodynamic, aerologic, and technogenic danger.
Control of the conformity of the technological process with the set norms in real time.
Application of counteremergency protection systems.
The usual approaches to critical infrastructure safety (CIS) developed over the last dozen years, which are based in many respects on subjective “on-site” safety estimations, have reached a high but not sufficient level of efficiency. With regard to the interests of all parties and further business development, the possibilities of applied information technologies for increasing safety and extracting innovative effects have not been fully used until now and need to be rethought.
The search for cardinal directions of improving CIS that are favorable to business and the state has led to an understanding of the acute necessity and expediency of creating and implementing a remote monitoring system (RMS), which is an important part of MFSS. RMS transforms the internal information support of a separate CIS into a mode of the needed transparency and wide availability of the CIS state in real time for all interested and responsible parties. In addition, on the basis of rational RMS implementation, the transition from the existing subjective expert approach to a risk-based approach to critical infrastructure safety receives the necessary information content.
The proposed probabilistic analysis of RMS operation and its influence on the integral risks of losing system integrity is based on research into real remote monitoring systems implemented in Russia for oil and gas CIS. For composed and integrated CIS with and without RMS, the earlier models developed by the authors are used [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. The results obtained are applicable to an analytical rationale of system requirements for RMS and to a systematic definition of balanced preventive measures that support system, subsystem, and element integrity under limitations on resources and admissible risks.
Requirements for monitoring and prognosis for critical systems are established at the level of many international standards, for example, ISO 17359, ISO 13381–1, ISO 13379, IEC 61508–1 [18, 19, 20, 21], etc. Today, parameter conditions are monitored in order to increase the reliability and industrial safety of critical systems, improve their health management, and provide predictive maintenance and operation efficiency. Here, critical systems are understood as objects of dangerous manufacture and their equipment, energy objects, power and transport systems, and others. Different data about the current conditions of parameters become accessible in real time. Thus, for a coal mine, some of the many dozens of heterogeneous parameters are, for ventilation equipment (VE), the temperature of the rotor and engine bearings, the phase currents, and the stator voltage, and, for modular decontamination equipment (MDE), the vacuum in the pipeline, the flow rate and temperature of the methane-air mixture in the pipeline before the equipment, the pressure in the compressed air system, etc. Effects from RMS may be achieved on the basis of gathering and analytically processing, in real time, information on the controllable parameters of the monitored objects (see Figure 11). RMS is intended to enable prediction, the prevention of possible emergencies, and minimization of the role of the human factor in control and supervising functions. The role of RMS is defined by its functions, the main ones being:
Remote continuous monitoring of CIS condition in real time (gathering data about key parameters of technological processes; gathering and processing data of industrial inspection, the information of technical condition and equipment diagnostics, and the information on the presence of failures and incidents; and results of system recovery, etc.)
Analytical data processing
Prediction of risks to lose object integrity
Display of parameter conditions and predictions with the necessary level of details
In this subsection analytical decomposition and the subsequent integration of complex systems are used according to propositions above in Sections 1–4. Admissible conditions (ranges) of traced parameters for each element, the reservation possibilities, implemented technologies of the control, and recovery of integrity are considered.
For oil and gas CIS, for example, the objects monitored by RMS are the technological equipment and processes of extraction, transportation, and refining, as well as the personnel, systems, and means of safety support.
Unlike the usual control carried out at enterprises (when the state supervisory body in the field of industrial safety, and frequently also the industrial safety control bodies of the enterprise or holding, receive information only upon an incident or failure and do not possess actual information about deviations at the initial stage, when it is still possible to prevent failure), RMS transfers the control, the transparency of CIS conditions, the important real-time information (about facts and predictions), and also the necessity of a proper response to critical deviations to an entirely new time scale: the scale of real time, measured in seconds and minutes.
Effects from remote control can be achieved only when the quality of RMS operation is ensured. This means that RMS reliably and in a timely manner produces complete, valid, and, if needed, confidential information.
Generally, system analysis of RMS operation consists of evaluating the reliability, timeliness, completeness, validity, and confidentiality of the information used. In special cases, for compound subsystems and system elements, not all measures need be used. For example, for an information security subsystem it is enough to use the measures that evaluate protection from unauthorized access and information confidentiality during the given time period. Depending on the purposes of the research, RMS can be decomposed to the level of compound subsystems and separate elements (see Figure 6).
In this case according to the system engineering principles, the operation quality of every component should be evaluated.
For evaluating integral RMS operation quality, the next measure is proposed: the probability of providing reliable and timely representation of the complete, valid, and confidential information during the given time (РRMS(Тgiven)).
In general case
where all measures are calculated by the models proposed in Section 2.
For complex structures, the idea of combining the models has been proposed. It allows new models to be generated automatically, making it possible to evaluate the measures above.
When RMS capabilities do not cover all system elements and subsystems, two subsystems operating on different time scales coexist in the CIS. The part of the CIS covered by RMS is served in real time, and the other part on a usual time scale (with information gathered on the principle “as it is possible to receive”). In many critical situations, this usual time scale is not adequate to reality. With the proposed approach, the system with usual control (UC), i.e., the CIS without RMS, can also be evaluated. Generally, the analyzed critical infrastructure is presented as a combination of “System + RMS” and the usual “System without RMS,” and “System + RMS” is in turn a combination of “Structure for RMS” and “RMS” (see Figure 12). For both systems, some measures of information delivery may fail to meet real-time requirements: for “System + RMS” because RMS operation quality is inadequate, and for “System without RMS” because there is no RMS.
The whole set of factors characterizing threats to the analyzed critical infrastructure is taken as 100%, and the total frequency of dangerous deviations is denoted λ∑. The frequency of potentially dangerous deviations traced by “System + RMS” is denoted λRMS. The frequency of the other potentially dangerous deviations, which are not traced by RMS (i.e., for “System without RMS”), is λ∑ − λRMS.
For “System + RMS,” the RMS operation quality during the prediction time Tgiven is evaluated by the probability PRMS(Tgiven), and the risk of a safety-critical deviation during Tgiven, denoted RRMS(Tgiven), can be evaluated by the earlier methods [2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. For the usual “System without RMS,” the same measures PUC(Tgiven) and RUC(Tgiven) can be used with the corresponding input values for probabilistic modeling.
Then, in general form, the risk R(Tgiven) to lose integrity for the analyzed critical infrastructure during the prediction time Tgiven can be evaluated by the formula:
where the expression in square brackets is the probability of successful operation of the analyzed critical infrastructure. Depending on the risk definition adopted, in special cases it can be interpreted as the probability of safe or reliable operation, or as the probability that critical equipment parameters stay within norms under the associated potential threats. The case λ∑ = λRMS means full coverage of the critical infrastructure by RMS capabilities.
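As an illustration, the combination described above can be sketched in Python. The sketch assumes one plausible form of the expression in square brackets: a mixture of the two parts of the infrastructure, each weighted by its share of the threat frequency (λRMS/λ∑ and (λ∑ − λRMS)/λ∑), where a part operates successfully when information is delivered properly (P) and no critical deviation occurs (1 − R). This form, the function name, the argument names, and all numeric inputs are assumptions made for illustration only.

```python
def integrity_risk(lam_total, lam_rms, p_rms, r_rms, p_uc, r_uc):
    """Risk to lose integrity over T_given as a frequency-weighted mixture.

    ASSUMED form (not taken verbatim from the chapter):
    R = 1 - [share_rms * P_RMS * (1 - R_RMS) + share_uc * P_UC * (1 - R_UC)]
    """
    if lam_total <= 0 or not 0 <= lam_rms <= lam_total:
        raise ValueError("need lam_total > 0 and 0 <= lam_rms <= lam_total")
    share_rms = lam_rms / lam_total      # share of threats traced by RMS
    share_uc = 1.0 - share_rms           # share left to usual control (UC)
    success = (share_rms * p_rms * (1.0 - r_rms)
               + share_uc * p_uc * (1.0 - r_uc))   # the "square brackets"
    return 1.0 - success


# Hypothetical inputs: 12 threats per year in total.
full = integrity_risk(12.0, 12.0, p_rms=0.99, r_rms=0.01, p_uc=0.5, r_uc=0.3)
partial = integrity_risk(12.0, 9.0, p_rms=0.99, r_rms=0.01, p_uc=0.5, r_uc=0.3)
print(f"full capture: {full:.4f}, 75% capture: {partial:.4f}")
```

With λ∑ = λRMS the usual-control term vanishes, which matches the full-capture case mentioned in the text; with partial capture the weaker usual control dominates the total risk in this toy setting.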
5.2. Estimating the mean residual time before the next parameter abnormalities
Unfortunately, no universal approach to adequately forecasting future parameter conditions from current data has yet been created; the level of uncertainty is too high. Nevertheless, in practice, subjective expert estimates, regression analysis of collected data, and simulation are often used for each concrete case. The probabilistic models applied in some cases contain many simplifications and frequently do not consider the infrastructure of complex systems, the heterogeneity of threats, or the differences in control and integrity recovery technologies for various elements of these systems [2, 3]. These same aspects, together with the rarity of many random events (with some exceptions), make statistical estimation of the residual time before the next parameter abnormalities ineffective. At the same time, a scientifically proven forecast of the residual time resource is necessary for taking preventive measures that eliminate the causes of abnormality in time. This makes the present and similar research relevant for different industrial areas [11, 12, 13, 14, 15, 16, 17].
Traced parameter conditions are data about the state up to and at the current moment, but the future always matters more. Using current data, responsible staff (mechanics, technologists, engineers, etc.) should know the admissible time for performing the work needed to maintain system operation. Otherwise, without knowing the residual time resource before an abnormality, the necessary work is not carried out; that is, measures to prevent negative events after parameter abnormalities (failures, accidents, damage, and/or lost benefit because of equipment downtime) are not taken. Conversely, when the residual time before an abnormality is known, these events may be avoided, or the system may be maintained accordingly. For a monitored critical system, a probabilistic method is proposed to estimate the mean residual time before the next parameter abnormalities for each element and for the whole system.
Following system engineering principles (e.g., ISO/IEC/IEEE 15288), the complex system is decomposed into compound subsystems and elements with a formal definition of states (see Figure 13).
For every valuable subsystem (element), monitored parameters are chosen, and for each parameter, the ranges of possible condition values are established: “In working limits,” “Out of working range, but inside of norm,” and “Abnormality” (interpreted like light signals: “green,” “yellow,” and “red”; see Figure 14). The condition “Abnormality” characterizes a threat to lose system integrity (on the logical level, the range “Abnormality” may be interpreted analytically as failure, fault, unacceptable risk or quality, etc.).
To avoid crossing the border of “Abnormality,” the residual time available for preventive measures should be predicted from the gathered data on parameter condition fluctuations across the ranges. For this prediction, the following are proposed: (1) a choice of probabilistic models for constructing the PDF of time before the next abnormality for one element (“black box”), (2) development of an algorithm for generating the PDF of time before the next abnormality for a complex system, and (3) formalization of calculation methods for estimating the mean residual time before the next parameter abnormalities for a monitored critical system.
The method makes it possible to estimate the residual time before the next parameter abnormality (i.e., the time before the first next entry into the “red” range), Tresid(1), for a given admissible risk Radm.(Treq) to lose integrity. The estimated Tresid(1) is the solution t0 of the equation:
with respect to the unknown parameter t, i.e., Tresid(1) = t0.
Here, R(Toccur, t, Tbetw, Tdiag, Terr., Treq.) is the risk to lose integrity; it is the complement to 1 of the probability P(Treq) of providing system integrity (“probability of success”), and formulas (1)–(7) are used for the calculations (see Subsection 3.1 of this article). So, for an exponential PDF, formula (1) transforms into the formula:
This formula is used for Eq. (7).
Toccur is the mathematical expectation of the PDF Ωoccur(τ); it is defined by the statistics of parameter transitions from the “green” into the “yellow” range (see Figure 3). The other parameters, Tbetw and Tdiag, in formula (7) are known. The main practical questions are: what should Treq. be, and what is the given admissible risk Radm.(Treq)? To answer, we can use the following properties of the function R(Toccur, t, Tbetw, Tdiag, Terr., Treq.):
If parameter t increases from 0 to ∞ with the other parameters fixed, the function R(…, t, …) decreases monotonically from 1 to 0; i.e., the larger the mean activation time of an occurred danger (threat: from the first entry into the “yellow” range to the first entry into the “red” range), the smaller the risk to lose integrity.
If parameter Treq increases from 0 to ∞ with the other parameters fixed, the function R(…, Treq) increases monotonically from 0 to 1; i.e., for large Treq the risk approaches 1.
This means that a maximal x exists such that, for t = x and Treq. = x, 0 < R(Toccur, x, Tbetw, Tdiag, Terr., x) < 1. That is, the residual time before the next parameter abnormality (i.e., the time before the first next entry into the “red” range) is equal to this x, with a confidence level given by the admissible risk R(Toccur, x, Tbetw, Tdiag, Terr., x).
For example, if Toccur = 100 days, then for Radm. = 0.01 the residual time is x ≈ 2.96 weeks (assuming that integrity recovery problems are resolved every 8 hours).
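The root-finding step described above (solving for the value t0 at which the risk equals the admissible risk, using the stated monotonic decrease of R in t) can be sketched numerically. The bisection routine below is a minimal illustration, not the method implemented in RMS; the function names are hypothetical, and the toy risk curve is a stand-in chosen only because it decreases from 1 to 0. It is not the chapter's formulas (1)–(7).

```python
import math
from typing import Callable


def residual_time(risk_of_t: Callable[[float], float], r_adm: float,
                  t_lo: float = 0.0, t_hi: float = 1e6,
                  tol: float = 1e-9) -> float:
    """Bisection for t0 with risk_of_t(t0) = r_adm.

    Assumes risk_of_t is monotonically decreasing in t, as stated for R(..., t, ...).
    """
    if not (risk_of_t(t_lo) >= r_adm >= risk_of_t(t_hi)):
        raise ValueError("r_adm is not bracketed on [t_lo, t_hi]")
    while t_hi - t_lo > tol:
        mid = 0.5 * (t_lo + t_hi)
        if risk_of_t(mid) > r_adm:
            t_lo = mid      # risk still above the admissible level: move right
        else:
            t_hi = mid      # risk at or below the admissible level: move left
    return 0.5 * (t_lo + t_hi)


def toy_risk(t: float) -> float:
    """Hypothetical stand-in risk curve, decreasing from 1 to 0 (illustration only)."""
    return math.exp(-t / 10.0)


t0 = residual_time(toy_risk, r_adm=0.01)
print(f"t0 = {t0:.3f}")
```

Any risk function with the monotonicity property stated in the text can be passed in place of `toy_risk`, with the remaining parameters (Toccur, Tbetw, Tdiag, Terr., Treq.) fixed beforehand.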
The method is implemented by RMS. Immediately after the “yellow” border is crossed from “green,” the automatic prediction of the mean residual time before the next parameter abnormalities (from the first entry into the “yellow” range to the first entry into the “red” range) is displayed (see Figure 15).
The adequate reaction of responsible staff in real time is transparent to all interested parties.
5.3. About some effects from adequate probabilistic methods and technology applications
Some effects of the proposed probabilistic methods and RMS technologies are estimated at the level of predicting risks to lose object safety (integrity) using the PDF.
Example 5.3.1. According to statistics from a multifunctional safety system (MFSS), latent or obvious threats occur on average once a month, and the average time of threat development (from the first signs of a critical situation up to failure) is about 1 day. A work shift lasts 8 hours. System control is performed once per work shift, and its mean duration is about 10 minutes (recovery of object integrity is also assumed to take 10 minutes). Medium-level and skilled workers (mechanics, technologists, engineers, etc.) are capable of revealing signs of a critical situation after they occur, while workers of the initial level of proficiency are not. Medium-level workers commit errors on average no more than once a month, and skilled workers no more than once a year. How does the qualification level influence the predicted risks to lose object safety over 1 year and over 10 years?
The results of modeling. For workers of the initial level of proficiency, the risks to lose object safety are near 1 (losses of integrity are inevitable). For medium-level workers, the risk to lose object safety is about 0.007 over a year and about 0.067 over 10 years; for skilled workers, the risk is 0.0006 over a year and 0.0058 over 10 years, owing to effective monitoring using RMS capabilities.
Example 5.3.2. We now concentrate on the analysis of errors of skilled workers from the standpoint of object safety. To make the modeling more adequate, in addition to the initial data of Example 5.3.1, we assume that the mean recovery time of lost object integrity is 1 day instead of 10 minutes. What effect can risk prediction reveal?
The calculated PDF fragment (see Figure 16) shows that the risk to lose object safety increases from 0.0006 (over a year) to 0.0119 (over 20 years). The mean time between neighboring losses of object safety, Tmean, calculated from the PDF, equals 493 years. That is, the frequency λ = 1/Tmean of system safety losses is about 0.002 times a year. This is 6000 times less (!) than the primary frequency of occurrence of latent or obvious threats (once a month), and the estimated Tmean is almost 500 times greater than the primary mean time between errors of skilled workers (once a year). Such an effect is reached through the control measures, monitoring, and system recovery undertaken when the signs of threat development are revealed in time. Incidentally, the frequency λ of system safety losses is latent knowledge extracted from the PDF, built in calculated form.
For comparison, an exponential approximation of the PDF with the same frequency λ gives a risk to lose object safety that grows from 0.002 (over a year) to 0.04 (over 20 years). This too is extracted latent knowledge, considering the Taylor expansion R(t, λ) ≈ λ∙t (see Section 2). The exponential values are 3.3–3.4 times higher than those of the adequate PDF. To appreciate the difference, note that for the constructed PDF the admissible risk border of 0.002 is reached in 3 years, not in 1 year as for the exponential PDF. That is, the real duration of effective object operation (i.e., without losses of safety) is three times longer!
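The exponential comparison above can be reproduced with a few lines of Python. The sketch below uses only numbers quoted in the text (Tmean = 493 years and the PDF-based risks 0.0006 and 0.0119); the function and variable names are hypothetical, and the adequate PDF itself is not recomputed here, only the exponential approximation R(t) = 1 − exp(−λt) and its first-order Taylor term λ∙t.

```python
import math


def exponential_risk(lam: float, t: float) -> float:
    """Risk over horizon t (years) for an exponential PDF with rate lam (1/year)."""
    return 1.0 - math.exp(-lam * t)


def time_to_risk(lam: float, r_adm: float) -> float:
    """Horizon (years) at which the exponential risk reaches r_adm."""
    return -math.log(1.0 - r_adm) / lam


T_MEAN = 493.0                                 # years, quoted in the text
lam = 1.0 / T_MEAN                             # about 0.002 per year
reported_pdf_risk = {1: 0.0006, 20: 0.0119}    # PDF-based values quoted in the text

ratios = {}
for years, r_pdf in reported_pdf_risk.items():
    r_exp = exponential_risk(lam, years)
    ratios[years] = r_exp / r_pdf
    print(f"{years:>2} y: exponential {r_exp:.4f}, lambda*t {lam * years:.4f}, "
          f"ratio to reported PDF value {ratios[years]:.2f}")

print(f"exponential risk reaches 0.002 after {time_to_risk(lam, 0.002):.2f} years")
```

Both ratios fall in the 3.3–3.4 range stated in the text, and the exponential curve reaches the 0.002 border after roughly one year.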
Example 5.3.3. The previous example estimated the operation of the object as a “black box” described by the characteristics of skilled workers. In hazardous production, critical operations are carried out by skilled workers interacting with RMS (including reservation and mutual support). Formally, they operate as parallel elements with hot reservation, so considering this interaction increases the adequacy of modeling. Let us estimate the risk to lose object safety for this variant (all input data for each of the two parallel elements are the same as in Example 5.3.2).
The calculated PDF fragment (see Figure 17) shows that the risk to lose object safety increases from 0.0000003 (over a year) to 0.00014 (over 20 years). The mean time between neighboring losses of object safety, Tmean, calculated from the known PDF, equals 663 years. That is, the frequency λ of system safety losses is about 0.0015 times a year. This is 8000 times less (!) than the primary frequency of occurrence of latent or obvious threats (once a month). Owing to reservation, the estimated Tmean is 34.5% longer than Tmean from Example 5.3.2.
For comparison, an exponential approximation of the PDF with the same frequency λ gives a risk to lose object safety that grows from 0.0015 (over a year) to 0.03 (over 20 years). These values are 200–5000 times higher than those of the adequate PDF. The admissible risk border of 0.0015 is reached in 195 years, not in 1.3 years as for the exponential PDF. That is, the real duration of effective object operation (i.e., without losses of safety) is 150 times longer! Such an effect can be reached through the mutual aid (reservation and support) of skilled workers using RMS.
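A rough cross-check of the two-element variant above is possible under a strong simplifying assumption: if the two parallel elements lose integrity independently, the object loses safety only when both do, so the risks multiply. The sketch below applies this product rule to the single-element risks quoted in Example 5.3.2. It is not the chapter's PDF-based calculation, and the function name is hypothetical; it only shows that the product rule lands on the same order of magnitude as the reported values 0.0000003 and 0.00014.

```python
def parallel_risk(element_risks):
    """Risk that ALL independent hot-reserved elements lose integrity together.

    Assumes independence between elements (a simplification).
    """
    risk = 1.0
    for r in element_risks:
        risk *= r
    return risk


# Single-element risks quoted in Example 5.3.2 (skilled workers).
r_1y = parallel_risk([0.0006, 0.0006])     # horizon: 1 year
r_20y = parallel_risk([0.0119, 0.0119])    # horizon: 20 years
print(f"1 year: {r_1y:.2e}, 20 years: {r_20y:.2e}")
```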
Example 5.3.4. We return to the SUEK value chain (see Figure 10). According to system engineering principles (see ISO/IEC/IEEE 15288 and Figure 1), we logically decompose this chain into nine serial components. Components 1 to 6 are united by the MFSS of the mine, component 7 is associated with the washing factory, component 8 with transport, and component 9 with the port (see Figure 18). A specific set of threats exists for every element of this chain. Let us analyze the system of this value chain. The typical systems of the value chain, including MFSS, are:
The control system of ventilation and local airing equipment.
The system of modular decontamination equipment and compressed air control.
The system of air and gas control.
The system of air dust content control.
The system of dynamic phenomena control and forecasting.
The system of fire prevention protection.
The safety system of washing factory.
The safety system for transport.
The safety system of port.
How safe is the analyzed value chain against existing threats, considering the capabilities of remote monitoring systems (RMS) that cover all components of the chain?
Let us assume that workers interacting with RMS participate in each chain process. Their activity is modeled by the models of Section 3, considering the examples above. High adequacy is reached by decomposing the chain system into nine logical subsystems, each of which implements the corresponding typical functions of Systems 1–9. The safety of the whole value chain system is provided if the safety of the first subsystem “AND” the second, …, “AND” the ninth subsystem is provided (see Figure 18). Reservation of elements within every subsystem is explained by RMS capabilities. The input data for every element are the same as in Example 5.3.3.
The calculated PDF fragment (see Figure 19) shows that the risk to lose safety increases from 0.000003 (over a year) to 0.0013 (over 20 years). The mean time between neighboring losses of safety, Tmean, equals 283 years. That is, the frequency λ of system safety losses is about 0.0035 times a year, which is 2.3 times more often than in Example 5.3.3. In comparison with the primary frequency of occurrence of latent or obvious threats (once a month), the frequency λ is 3430 times lower!
For an exponential approximation of the PDF with the same frequency λ, the risk to lose safety grows from 0.0035 (over a year) to 0.07 (over 20 years). These values are 54–1167 times higher than those of the adequate PDF.
The admissible risk border of 0.002 is reached in 24 years, not in 7 months as for the exponential PDF (see Section 2). That is, the real duration of effective operation (i.e., without losses of safety) is 41 times longer!
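The “AND” logic of the nine serial subsystems can be illustrated with the standard series rule for independent components: the chain is safe only if every subsystem is safe, so R = 1 − ∏(1 − Ri). The sketch below is an assumption-laden cross-check, not the chapter's PDF-generation algorithm: it assumes independent, identical subsystems and takes the per-subsystem risks (0.0000003 over a year and 0.00014 over 20 years) from the two-element variant quoted earlier. The function name is hypothetical.

```python
def series_risk(subsystem_risks):
    """Risk that at least one of the serial ("AND") subsystems loses safety.

    Assumes independent subsystems (a simplification).
    """
    p_all_safe = 1.0
    for r in subsystem_risks:
        p_all_safe *= (1.0 - r)
    return 1.0 - p_all_safe


N_SUBSYSTEMS = 9
chain_1y = series_risk([3e-7] * N_SUBSYSTEMS)      # horizon: 1 year
chain_20y = series_risk([1.4e-4] * N_SUBSYSTEMS)   # horizon: 20 years
print(f"1 year: {chain_1y:.2e}, 20 years: {chain_20y:.2e}")
```

Under these assumptions the series rule gives about 0.0000027 over a year and about 0.00126 over 20 years, close to the reported 0.000003 and 0.0013.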
Example 5.3.5. By how much will the risks increase if only medium-level workers are used in the value chain system of Example 5.3.4?
The calculated PDF fragment (see Figure 20) shows that the risk to lose safety increases from 0.0009 (over a year) to 0.25 (over 20 years). The mean time between neighboring losses of safety, Tmean, equals 24 years. That is, the frequency λ of system safety losses is about 0.04 times a year, which is 11.4 times more often than in Example 5.3.4 for skilled workers. In comparison with the primary frequency of occurrence of latent or obvious threats (once a month), the frequency λ is 21 times lower!
For an exponential approximation of the PDF with the same frequency λ, the risk to lose safety grows from 0.04 (over a year) to 0.55 (over 20 years). These values are 2.2–44.4 times higher than those of the adequate PDF. The admissible risk border of 0.002 is reached in 2 years, not in one month as for the exponential PDF. That is, the real duration of effective operation (i.e., without losses of safety) is 24 times longer!
6. Instead of conclusion
The proposed probabilistic methods help the system using “smart systems”:
To predict risks to lose integrity for complex structures over the given prognostic time
To provide a rationale for preventive measures considering admissible risk
To estimate “smart system” operation quality
To predict in real time the mean residual time before the next parameter abnormalities
The algorithm for creating a more adequate PDF of time between losses of system integrity, which considers different threats and the possibilities of control, monitoring, and recovery for every element, improves the accuracy of probability predictions by hundreds to thousands of times (!) in comparison with the exponential approximation.
The proposed approach makes it possible to improve the existing risk control concept, including the creation and refinement of probabilistic models for solving problems, the automatic combination and generation of new probabilistic models, and the formation of a storehouse of risk prediction knowledge holding dozens of variants of solutions to typical industrial risk control problems.
The application of the methods and technologies by the joint-stock company “Siberian Coal Energy Company,” implemented at the level of remote monitoring systems, made it possible to rethink system possibilities for increasing reliability and industrial safety, improve multifunctional safety systems, decrease risks, and provide predictive maintenance and operation efficiency in the company value chain.