Open access peer-reviewed chapter

Perspective Chapter: PRA and Protective System Maintenance

Written By

Ernie Kee and Martin Wortman

Reviewed: 17 January 2023 Published: 17 February 2023

DOI: 10.5772/intechopen.110049

From the Edited Volume

Nuclear Fission - From Fundamentals to Applications

Edited by Pavel Tsvetkov

Abstract

The processes used to manage protective system equipment failures, as they relate to Probabilistic Risk Assessment (PRA) in the commercial nuclear power setting, are reviewed. Efficacy of protection is governed by a maintenance policy that includes system modification, maintenance inter-arrivals as a function of time, and upset inter-arrivals as a function of time; this is the type of maintenance policy used for nuclear power plant protective systems. The observations described in this article include the impact of time-dependent activities associated with maintenance policy as they relate to endogenous and exogenous upset inter-arrival times. Methods for evaluating maintenance policy that rely on combinatorial logic, such as PRA, fault trees, or event trees, may lead to ineffective maintenance policy decision-making for protective system efficacy. Based on the implications of these observations, recommendations are made for maintaining effective protections, with connections to engineering maintenance practice and regulation. The importance of the issues described herein is that the relationship among design, maintenance, and repair policies must be properly understood and taken into account by process owners, operators, and investors, as well as by the regulators who specify and enforce protections in hazardous processes.

Keywords

  • PRA
  • protection
  • protective systems
  • maintenance efficacy
  • nuclear power

1. Introduction

On March 28, 1979 the United States commercial nuclear power program experienced its first major accident; the reactor core in Unit 2 at the Three Mile Island nuclear plant site in Harrisburg, Pennsylvania had overheated and melted down. Even considering this accident, it can be said, based on the safety record of commercial nuclear power in the United States, that the Nuclear Regulatory Commission (NRC) has successfully produced a regulatory structure that effectively manages risk of radioactive releases, especially from commercial nuclear power plants.1 With the benefit of hindsight, this accident would help inform stakeholders to better manage risk from nuclear power plant accidents. Within the nuclear power industry, the Three Mile Island accident motivated development of the predictive modeling methods and risk analytics commonly in use today. Thus, before exploring the details of reactor protection risk analytics, it is important to review key engineering insights gained following the Three Mile Island accident. Only then can we understand how these insights are either accommodated or fail to be accommodated in risk analysis methods such as PRA.

1.1 The benefit of hindsight

As described by Kemeny (see [1], p. 43) and Rogovin (see [2], vol. 1, p. 12), the core melt accident at Three Mile Island started when an operator tried to clear a plugged system that supplied water to the primary reactor heat removal system used in electricity production (see [3], for decay heat process details). The operator became involved when a valve installed for the purpose of bypassing the plugged system was not open ([1], pp. 47–48). Although the actions taken by the operator to clear the plugging were unsuccessful, the heat removal system was supplied with a backup protective system using a completely separate water supply. Unfortunately, the backup system’s valves were inadvertently left shut, rendering it ineffective ([1], p. 47). The NRC requirements imposed through regulation, and the design provided by the plant owners and investors, anticipated such a loss of cooling sequence, so a fourth separate protective system was supplied that would directly inject cooling water into the reactor core to keep it cool. By this point, the reactor system had begun heating up, causing a relief valve to open, as designed, to reduce pressure.2 Unfortunately, the relief valve failed to close when the pressure dropped back to normal, a malfunction that went unrecognized by the operators. However, the fourth method provided to keep the reactor core from overheating started automatically and began cooling the reactor core. Thus far in the accident sequence, the status of the reactor core protections could be summarized as: (a) main water supply—plugged up, (b) bypass protection of main supply—valve shut, (c) third water supply system—valve shut, (d) reactor system relief valve—stuck open, (e) fourth cooling system—working and cooling.

Because the relief valve stayed open too long, the reactor pressure kept dropping and the water around the core began to expand as it boiled. Water from the reactor core was then pushed into a surge volume where the reactor system water level is measured. Although water was actually being continuously lost from the core through the stuck relief valve, the operators were under the impression, based on the surge volume level, that the reactor system was being filled up by the fourth (and final) cooling system, so they shut it off. At this point, the plant entered a state where it would lose the ability to prevent the reactor core from melting and releasing radioactive material. An interesting detail is that the malfunction of the relief valve sticking open was also anticipated, and therefore a shutoff valve was provided; temperature measurements were put in place so the operators could recognize that the relief valve was stuck open and would know when the shutoff valve should be closed. The operators failed to close the valve even though the temperature measurement indicated they should ([1], p. 46).

Although the several protections put in place would seem to make it almost unimaginable that the reactor core could melt, the NRC required a final protective system designed to protect the public in accidents releasing radioactive material from the reactor core. This final protection was a building that could withstand a very high pressure burst and remain effectively airtight. This building is called the “containment building” and it is the final barrier to release of radioactive material in the event of a serious accident.3 Because the containment building was able to contain the radioactive materials until they could be properly managed, the public was not exposed to excessive radioactive material. The following considers what went wrong at Three Mile Island, what went right, and why.

1.1.1 What went wrong?

A complete description of the accident leading to core melt is given by Kemeny, but the main events can be summarized as: (a) the main cooling water supply was plugged up, (b) the backup system to keep the main water supply flowing was shut off, (c) the third backup cooling water supply system’s valves were shut, (d) the reactor pressure relief system (itself a protection against overpressure) caused the reactor system to lose water due to the valve sticking open and, (e) the fourth backup water system was shut off due to factors such as incorrect data interpretation, unobserved data, or misunderstanding of the thermodynamics of isenthalpic expansion.

Actions by the operator working on the main water supply system caused the main flow path to shut off (an “initiating event”, see Sections 1.4, 3.2 and 5.3 and Definition 5.2). The system that plugged up is provided for chemical cleanup of the main water flow (see [1], p. 94). Although the possibility that the chemical cleanup system could clog up was anticipated and protection had been added, the protective system failed to work because its bypass valve was not open (see [1], pp. 47–48; [2], vol. 1). The backup system with its valves shut (the third main cooling system) did not operate for at least 8 min after it was needed; however, this may not have directly made a difference in the core melt outcome (see [1], pp. 47, 91, 94). The reactor system relief valve opened at the right time but failed to close when the pressure had returned to normal; as a consequence, water continued to be released from the reactor system (see [1], pp. 28–29, 90–94). The fourth backup system that could have prevented the melt sequence started automatically and was working well, but it was shut off by the operators (see [1], pp. 28, 89, 91, 93–94).

1.1.2 What went right?

As the accident progressed to melt, radioactivity monitors inside the plant began to show increasing levels of radiation. The radiation levels, plus indications of increasing levels of water inside the reactor containment building, gave the operators indications that an abnormal condition was present and worsening. Other indications, although working properly, were not seen or were interpreted incorrectly by the operators, thereby allowing the sequence to progress to core melt. Had the operators properly interpreted the indications and taken appropriate actions in response, they might have kept the event from progressing. Even though instrument indications were “pointing” them to the possibility that a dangerous event was in progress, the operators consistently failed to interpret critical indications correctly. It can be said that, despite several backup systems and indications, the most important element that went right was that the NRC required a containment building, which acted as the final (in this case the fifth) barrier against exposure of the public to radioactive material.

1.1.3 Why?

The things that went wrong at Three Mile Island were mostly attributable to human errors of commission or errors of omission brought on by equipment breakdowns: (a) error of commission, the failed attempt to unplug the main water line, (b) error of omission, the failure to open the bypass valve, (c) error of omission, leaving the emergency water line valves shut, (d) error of commission, turning off the fourth protective system, (e) the reactor system relief valve stuck open and, (f) the main water supply plugged up. That is, the operators either took action that was wrong, an error of commission, or did not take action when required, an error of omission. These errors can be attributed to causes such as lack of training, lack of engineering insight, carelessness, or simply not following procedures. These kinds of errors are difficult to eliminate even when attempts are made to overcome them. For example, a second check of the required valve position would have been a good idea for the emergency valves; however, this method is not always effective.4

1.2 Probability, reality, and maintenance policy

One may be tempted to look back at the Three Mile Island accident and argue that the probability for the accident was extremely small. Acting on such temptations, that is, applying ex-ante probabilities to ex-post observations, is improper. The accident at Three Mile Island Unit 2 actually happened, rendering any ex-ante probability irrelevant after the fact. Prior to the accident, some experts would have said that the probability was very small, possibly on the order of 1 chance in a million. Of the many scenarios anticipated in experts’ ex-ante probability analyses, the exact scenario that unfolded was not among them, making their analyses irrelevant. Personnel misinterpreting indications, the actions taken due to the particular relief valve failure, and errors of commission were unanticipated as a scenario. The procedure for working on the chemical system could have required the operator to place the second protective system in service before starting the work, as noted by Kemeny on page 47 of his report. In fact, there are effectively limitless scenarios that may play out in the future. Solberg and Njå [4], in their article “Reflections on the ontological status of risk”, state that only one of the many (truly infinite) possible scenarios can play out to produce a current “state of affairs”:

An important observation when we consider change or events is that when an event (a specified state of affairs) is manifested (when it happens) all other logically possible states of affairs are simultaneously excluded from manifestation. For this to be the case it would imply that there exists a whole range of possible future states of affairs, but only one of these would manifest at some point in time (the present).

(Solberg and Njå, 2012)

The discussion above is meant to point out why the accident at Three Mile Island happened even though the NRC and the plant design engineers endeavored to put in place protective systems intended to prevent progression to core melt. Before the accident occurred, a concerned citizen might question the design engineer in a conversation like the following:

Citizen, “What if the main cooling system gets plugged up?”

Engineer, “We thought of that and put in a bypass valve”.

Citizen, “What if the bypass valve doesn’t work?”

Engineer, “We also thought of that possibility and added a separate system to pump water in. We also know that if this happens, the reactor system will heat up causing pressure to increase. But before you ask, we added relief valves for this possibility”

Citizen, “What if the relief valve doesn’t work, will the system over pressurize?”

Engineer, “We thought of that too and added extra safety valves that would prevent overpressure”

Citizen, “But even so, there is no more cooling system left”

Engineer, “We put in yet another separate system to pump water in the reactor system just in case the other systems don’t work”

At this point the citizen might reasonably stop asking questions, satisfied the NRC and the plant engineers had really thought of everything that could go wrong in the future.

It is clear that when a substantial hazard is present in a technological system, many questions need to be asked and answered, and many scenarios considered in detail. At Three Mile Island, the particular hazard was the need to remove decay heat from the reactor system over a relatively long period of time. Even though many questions may be asked and answered, once a technological system is started, we take a leap into the unknown. No one can say for sure what future scenarios may play out. Some may lead to great loss, injury, or even death. The maximum level of harm that could result from various failures in a technological system is almost certainly known by the engineers who design it. On the other hand, unless large datasets are available, the numerical probability that such failures would occur cannot be known. Depending on the value of a technological system to social welfare, those systems valuable enough to be allowed by law are most likely subject to regulation designed to protect citizens’ safety and health.

1.3 Protective systems

Technological systems required by regulation are referred to as “protective systems”. Such protective systems overlay technological systems operated for the purpose of maximizing profit. They differ from production systems in that they do not create products with a utilitarian purpose. Instead, protective systems function to reduce the probability that citizens would need to backstop losses from risks taken in owner and investor activities that exceed the investors’ asset value. No one, not those maximizing profit, not the regulator, not the engineers, and not any independent scientific authority can know the probability or likelihood that a technological system in use will or will not cause harm with exactitude.

Protective systems add to the cost of goods and services. In so doing, they generally force a balance between reduced profit margins and the price citizens are willing to pay for the goods and services produced. In competitive markets, profit maximizers will look for opportunities to reduce costs imposed by protective systems. A subtlety created by the process of cost reduction is that profit maximizers will tend to look at all costs that count against their revenue, including production costs in maintenance and operation. A reasonable question would be “does it matter to citizens if the maintenance and operational costs of the production systems are reduced, provided citizens require the regulator to make sure the protective systems are maintained up to standards?” The subtlety created by reducing costs in technological systems other than protective systems is that the protective systems are then called upon to protect against harms more often than they would be if the production systems were operating smoothly. Regarding the Three Mile Island accident, Kemeny points out that,

Review of equipment history for the 6 months prior to the accident showed that a number of equipment items that figured in the accident had had a poor maintenance history without adequate corrective action. These included . . . the condensate polishers.

(Kemeny, 1979)

Failure of the condensate polishers at Three Mile Island triggered the sequence of events leading up to the core melting.5 This “trigger” is commonly referred to as an “initiating event”: the first event in a series of events that may end with unpleasant consequences. The importance of these initiating events is that the more often they occur, the more often the protective systems will be called upon to operate; and because their failure probability cannot be known, citizens should be concerned if the maintenance and operational costs of the production systems are reduced to the point that more triggers occur.

Whenever a protective system allows an initiating event to progress to harm, it is made obsolete by a revision that addresses the root cause of its failure. The NRC made substantial changes to protective system requirements in the nuclear reactors it regulates following the Three Mile Island accident. In the absence of substantial data sets where harm has occurred, no one can know the probability that a technological system with the potential for harm will, or will not, cause harm over its useful lifetime. This well-known principle is classically described by Cardano and Wilks [5] and more formally by Bernoulli [6].6 As technological systems are operated, the protective systems that overlay them should be revised as necessary in light of new information, to the point that the technological systems are judged to be less harmful than the harms they are designed to overcome. Because protective systems add cost to the production of goods and services produced by technological systems, there is a balance, driven largely by market forces, to be struck among the cost of production, profit margin, and protection.

1.4 Predictive modeling for safety-critical protections

The concept of PRA is introduced here at a very high level. In practice, PRA and other methods used to quantify probabilities and frequencies are fundamentally derived from logical descriptions of a technological system’s response to an “upset”, or initiating event. Figure 1 is an example of such a logical description; it envisions a technological system that “operates”, $Z$, on an initiating event input, $I$, to produce “outputs”, $O$. $Z$ creates at least $2^n$ outputs by assigning probabilities that split $I$ at each of $n$ devices. The splits are conditional such that if $P$ is the probability of failure of a device, one branch of the split is assigned $P$ and the other $\bar{P} = 1 - P$.

Figure 1.

A primitive event sequence logic for a protective system having three devices in parallel (1/3 logic).
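
To make the event-sequence splitting of Figure 1 concrete, the following minimal Python sketch enumerates the $2^n$ branches for a hypothetical 1-out-of-3 protective logic. The device failure probabilities are made-up values, and independence across devices is assumed purely for illustration; nothing here is drawn from the chapter’s data.

```python
from itertools import product

# Hypothetical, made-up failure probabilities for three redundant devices.
# Independence across devices is assumed purely for illustration; the chapter
# argues such assumptions must be justified against the underlying physics.
p_fail = [0.02, 0.03, 0.05]

def branch_probability(outcome):
    """Probability of one branch of the event-sequence split.

    outcome[i] is True if device i fails, False if it works.
    """
    prob = 1.0
    for failed, p in zip(outcome, p_fail):
        prob *= p if failed else (1.0 - p)
    return prob

# Enumerate all 2**n branches produced by splitting the initiating event I.
branches = list(product([False, True], repeat=len(p_fail)))

# 1-out-of-3 logic: protection succeeds if at least one device works,
# so the protective function is lost only on the branch where all devices fail.
p_protection_lost = sum(branch_probability(b) for b in branches if all(b))

print(f"number of branches: {len(branches)}")
print(f"P(all three devices fail | initiating event) = {p_protection_lost:.2e}")
```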

Such probability models, with various levels of sophistication, are used to characterize the efficacy of safety-critical protections.7 Understanding the protective system modeling discussed here necessarily assumes a postgraduate familiarity with stochastic processes, distinguishing it from the treatment PRA practitioners use. Stochastic dynamics of safety-critical protections are defined in a general setting, allowing treatment of efficacy analysis in a top-down manner: starting with general predictive modeling, followed by exploration of the consequences inherent in the simplifying assumptions required of popular risk quantification methods such as PRA, Quantitative Risk Assessment (QRA), and Probabilistic Safety Assessment (PSA).

Models of efficacious protective system design and management cannot be informed by “probability numbers”—such numbers are unavailable as a practical matter. This is to say, the condition “safe enough” is almost never expressible as a numerical probability as argued by Hansson (see [7, 8, 9], for perspectives on safety). Hence, efficacy of protective systems should rightly be explored in terms of insights from the mathematical structure of stochastic processes serving as predictive models. To this end, the following develops models at a level of abstraction sufficient to capture stochastic structures that agree with practical engineering understandings. Most importantly, a top-down exposition leads to straightforward identification of the particularizing assumptions necessary to calibrate reactor protective system stochastic predictive models with observational data. In other words, the engineering physics that must hold to compute valid statistical estimates of risk metrics can be identified. Hansson in ([10], Section 4) sets the stage where statistical models can be calibrated and where they cannot.

Clearly, hazardous technologies and their protections are engineered systems that operate according to laws of physics, and their operational behaviors are governed by ongoing design, maintenance, and management decisions. These systems are carefully monitored over time so that information gained can direct re-designs and maintenance intended to improve productivity and safety. Of course, our uncertainty regarding how protections might perform in the future is, at best, a reflection of our historical understanding of physical behaviors, both technological and environmental. For example, we cannot forecast the future arrival of possible operational problems, the possibility of which we are a priori unaware.8 Thus, any understanding of the efficacy of protections must connect physics with a time-dependent state of knowledge about that physics. In this regard, filtered probability spaces are indispensable when exploring engineered protective system efficacy.

System physics is approached here from an operations perspective to establish the analytical framework within which physical features can be mathematically described as time-dependent functionals. This framework will be sufficiently general to accommodate most hazardous production systems. Analytical modeling of operations is concerned with characterizing a system’s temporal dynamics as the evolution of a system’s “state” over time. Almost always, operations is concerned with predicting, up to probability law, transitions among particular sets of system states given certain observable historical behaviors. Such transition dynamics are captured as stochastic processes.

The theory of stochastic processes is mature and provides a valid framework for understanding operations—derivations or detailed explanations of the central results applied here are found in the widely accessible literature. To avoid repetition, the operation of hazardous systems and their protections is modeled here by a stochastic point process, with the central results cited from the appropriate literature without proof.

2. Construction of accident counting processes for reactor risk analysis

Consider Figure 2, showing how causality is aggregated from the underlying physics of nature, through the limitations of engineering understanding of the physics, up to the operational level where the physics is collapsed to the observations functional or failed or, more succinctly, $\{0, 1\}$. Risk quantification is done on observations whereby probabilities are assigned to events. Such probabilities are assumed to live in a probability space whose events are sets of failure outcomes, $\omega$, taking place at the level of the underlying physics shown in Figure 2.

Figure 2.

Concept of knowledge aggregation from the underlying physics (both known and unknown) up to the operational level of observation.

2.1 Physics-based modeling in a general stochastic setting

Protective systems safeguard against future accidents. The design of protective systems necessarily relies on deterministic physical laws (physics) embedded within a stochastic framework that analytically captures uncertainty about future protection behaviors. This stochastic setting yields predictive models useful for understanding the efficacy of given protection designs. We establish the following:

Definition 2.1 (Object). A collection of devices, subsystems, systems, and environments enclosed within a specified logical control volume.

Definition 2.2 (State). The unique numerical assignment to specific features of an object’s physical condition (typically in SI units).

Definition 2.3 (State Space). The set of all possible object states.

Definition 2.4 (State Variable). A mapping from the domain of an object’s physical condition into its state space.9 Typically, state variables are indexed by time.

Definition 2.5 (State Trajectory). The evolution of an object’s state over a collection of specific time intervals.

Definition 2.6 (Predictive Models). Probability laws with an object’s state variables indexed by time. Predictive models typically include a time t=0 that delineates future from past.

Remark 1 Predictive models do not predict the future. Rather, they frame uncertainty about future physical behaviors in terms of the present state of knowledge.

Objects of modeling interest can be comprised of subordinate objects; that is, an object can be the logical union of other objects. An object is atomic if it is not composed of subordinate objects. It follows that an object’s state space is defined on the union of its subordinate atomic object state spaces. The state of an object typically evolves over time. Hence, predictive modeling requires not only specifying the state variable mappings that assign state values to material and environmental conditions, but also crafting a mathematical characterization of the interaction of an object’s state variables over time. The control volume defining any object connects its features with the laws of physics that govern its evolution, and it is our understanding of physics that allows us to mathematically characterize the time dependent interaction among an object’s state variables. The time dependent behavior of an object’s state is referred to as a state trajectory. The physical laws that govern state trajectories are, of course, specific to the object being modeled. Inasmuch as our interest lies with establishing a mathematical framework within which predictive models for hazardous technologies can be crafted, we need only make reference to specific physics where application dictates.

Remark 2 Valid predictive models must respect two practical engineering constraints: 1) State trajectories are never physically observable beyond the present, and 2) Engineers are not clairvoyants. Valid predictive models are strictly informed by historical information, while allowing for the acquisition of additional information as the object’s future unfolds (see [11], for an interesting perspective). The evolution of information history cannot be omitted from valid predictive models.

2.1.1 Life cycle, state trajectories, and the state process

The terminology life cycle is used when referring to object state trajectories that are defined over the entire time interval $\mathbb{R}_+$, giving a complete history of an object’s state over time. In predictive modeling, an object’s life cycle is rarely, if ever, identifiable with certainty at any finite time, and one must be satisfied with expressing the likelihood that an observed state trajectory belongs to a specified collection of possible life cycles.

Clearly, an object’s life cycle is determined by the physics within its control volume. In the usual manner, let the domain of an object’s state variables be $\Omega$, the collection of possible physical outcomes the object can experience. Further, let $\mathcal{F}$ be a σ-algebra on $\Omega$ such that $(\Omega, \mathcal{F})$ forms a measurable space.

Elements of $\mathcal{F}$ are subsets of possible physical outcomes with a defined probability measure expressing the likelihood that a given subset of measurable outcomes contains an outcome of specific interest. $P$ is a set function $P : \mathcal{F} \to [0, 1]$, and the triple $(\Omega, \mathcal{F}, P)$ forms a standard probability space (see [12], for a formal development).

Now, if $Z_t$ is the object’s state at time $t$, where $Z_t : (\Omega, \mathcal{F}) \to (S, \mathcal{B}(S))$ for $t \in \mathbb{R}_+$, then $Z_t$ is a measurable state variable mapping.10 If $A$ is a measurable subset of $\mathbb{R}_+$ and $t \in A$, then for fixed $\omega \in \Omega$, $Z_t(\omega) : \mathbb{R}_+ \to S$ forms a state trajectory under physical outcome $\omega$. When $A = \mathbb{R}_+$, $Z_t(\omega)$ becomes the object’s life cycle under outcome $\omega$. Now, without loss of generality, require that for all $\omega \in \Omega$ and $t \in \mathbb{R}_+$, $Z_t(\omega)$ be right-continuous with left-hand limits. It then follows that the collection of random variables $\{Z_t\}_{t \ge 0}$ forms a stochastic process with state space $S$. $Z = \{Z_t\}_{t \ge 0}$ is referred to as an object’s state process.
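
As a deliberately simplified illustration of a right-continuous, piecewise-constant state trajectory, the sketch below represents one hypothetical realization $Z_t(\omega)$ as a list of jump times and post-jump states; all times and state labels are invented for illustration.

```python
import bisect

# One hypothetical realization Z_t(omega): jump times (hours) and the state
# taken immediately at each jump. Values are invented for illustration.
jump_times = [0.0, 120.0, 121.5, 500.0]          # càdlàg: state changes at these epochs
states     = ["normal", "distressed", "normal", "retired"]

def Z(t):
    """Evaluate the piecewise-constant, right-continuous trajectory at time t."""
    # bisect_right ensures right-continuity: at a jump time we take the new state.
    idx = bisect.bisect_right(jump_times, t) - 1
    return states[idx]

print(Z(100.0))    # 'normal'
print(Z(120.0))    # 'distressed'  (right-continuous at the jump)
print(Z(119.999))  # 'normal'      (the left-hand limit differs from the value at 120)
```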

2.1.2 Probability law and validation

Predictive modeling requires framing uncertainty about an object’s state through a probability law $\mathcal{L}_Z : \mathcal{B}(S) \to [0, 1]$ on its state process. Here, for each $n$-sequence of two-tuples $(\bar{B}, \bar{t})_n = \{(t_1, B_1), \ldots, (t_n, B_n)\}$ such that $t_i \in \mathbb{R}_+$ and $B_i \in \mathcal{B}(S)$, $i = 1, \ldots, n < \infty$,

$$\mathcal{L}_Z\big((\bar{B}, \bar{t})_n\big) \equiv P\Big(\bigcap_{i=1}^{n} Z_{t_i}^{-1}(B_i)\Big) = P\big(\omega \in \Omega : Z_{t_1}(\omega) \in B_1, \ldots, Z_{t_n}(\omega) \in B_n\big). \tag{1}$$

The measurability of all state variables in $Z$ ensures that its probability law $\mathcal{L}_Z$ defines a probability measure on the measurable space $(S, \mathcal{B}(S))$. Hence, uncertainty associated with any physics-based predictive model is captured on the probability space $(S, \mathcal{B}(S), \mathcal{L}_Z)$.

Definition 2.7 (Predictive Model). For an object with state process $Z$ defined on $(\Omega, \mathcal{F}, P)$, the probability space $(S, \mathcal{B}(S), \mathcal{L}_Z)$ is the object’s predictive model.

Validity of a specific predictive model $(S, \mathcal{B}(S), \mathcal{L}_Z)$ requires that it agree with both the physics underlying $Z$ and historical observations of physical behaviors of the object that $Z$ represents. Without crawling deeply into the weeds of model validation, it is easily reasoned that the degree of difficulty in proving validity is greatly influenced by the cardinality of both $S$ and $\mathcal{B}(S)$. Suffice it to say that large state spaces associated with complicated object trajectories and large information-bearing σ-algebras $\mathcal{B}(S)$ require vast amounts of observational data to prove model validity. In fact, validation is almost always a practical impossibility, since quantifying the probability law $\mathcal{L}_Z$ stands among the most challenging aspects of stochastic modeling. However, careful stochastic analysis provides a direct means of identifying invalid models.

Observation 1 A necessary condition for an object’s state process to be valid is that it must be congruent with well-understood physics. Conversely, in circumstances where $Z$ is not congruent with well-understood physics, it immediately follows that the object’s predictive model $(S, \mathcal{B}(S), \mathcal{L}_Z)$ must be invalid.

Observation 1 reveals the importance of stochastic modeling in a general setting where an object’s physics–based behaviors are mapped into time–dependent state trajectories. While proving validity of a predictive model is rarely an achievable engineering objective, it is often straightforward to recognize invalidity of specific models. The following Section 2.2 points out how popular risk modeling methodologies for protective system operations can easily obscure underlying physics and can result in a failure to recognize model invalidity.

2.2 Stochastic modeling: production, protection and environment

When concern lies with the efficacy of protections, $Z$ is understood to model an object representing a hazardous production system and its protections, with $S$ as the object state space onto which all underlying physics is mapped. Designate three subordinate objects as: a production system, a protective system, and an environment.11 Without loss of generality, assume that these three subordinate objects are mutually exclusive (i.e., their respective control volumes do not intersect); mutual exclusivity does not imply that production, protection, and environment operate independently.

The mutual exclusivity of production, protection, and environment ensures that state space S is formed as the cross product of the subordinate object states. Here,

$$S = S_X \times S_R \times S_Y, \tag{2}$$

where

$S_X$ is the subspace of production system states,

$S_R$ is the subspace of protection system states, and

$S_Y$ is the subspace of environment states.

Take the subspaces identified in Eq. (2) to each be a manifold with boundary. By definition, any element of $S_X$ numerically characterizes a production state. Similarly, elements of $S_R$ quantify protection states, while subsets of $S_Y$ describe an environmental state within which production and protection exist. Now,

$X = \{X_t : t \ge 0\}$ is the production state process, where $X_t : (\Omega, \mathcal{F}) \to (S_X, \mathcal{B}(S_X))$ is the production system state at time $t$;
$R = \{R_t : t \ge 0\}$ is the protection state process, where $R_t : (\Omega, \mathcal{F}) \to (S_R, \mathcal{B}(S_R))$ is the protection system state at time $t$; and
$Y = \{Y_t : t \ge 0\}$ is the environment state process, where $Y_t : (\Omega, \mathcal{F}) \to (S_Y, \mathcal{B}(S_Y))$ is the environment state at time $t$.

And, by construction, $Z = (X, R, Y)$, and for all $t \ge 0$,

$$Z_t = (X_t, R_t, Y_t) : (\Omega, \mathcal{F}) \to (S, \mathcal{B}(S)), \tag{3}$$

where

$$\mathcal{B}(S) = \mathcal{B}(S_X) \otimes \mathcal{B}(S_R) \otimes \mathcal{B}(S_Y). \tag{4}$$

2.2.1 Dependency among production, environment, and protection

By convention, take all trajectories of $X$, $R$, and $Y$ to be càdlàg (right-continuous with left-hand limits). The state space $S$, partitioned with $S_X$, $S_R$, and $S_Y$, allows framing the interaction among production, protection, and environment over time. It is understood that the temporal dynamics of these three stochastic state processes are not mutually independent. Here, with the σ-algebras respectively generated by the processes $X$, $R$, $Y$, and $Z$ written as

$$\mathcal{F}^X \equiv \sigma(X) = \lim_{t \to \infty} \mathcal{F}_t^X = \lim_{t \to \infty} \sigma(X_u : 0 < u \le t), \tag{5}$$
$$\mathcal{F}^R \equiv \sigma(R) = \lim_{t \to \infty} \mathcal{F}_t^R = \lim_{t \to \infty} \sigma(R_u : 0 < u \le t), \tag{6}$$
$$\mathcal{F}^Y \equiv \sigma(Y) = \lim_{t \to \infty} \mathcal{F}_t^Y = \lim_{t \to \infty} \sigma(Y_u : 0 < u \le t), \tag{7}$$
$$\mathcal{F}^Z \equiv \sigma(Z) = \lim_{t \to \infty} \mathcal{F}_t^Z = \lim_{t \to \infty} \sigma(Z_u : 0 < u \le t). \tag{8}$$

There can exist some $B_Z = B_X \cap B_R \cap B_Y \in \mathcal{F}^Z$, where $B_X \in \mathcal{F}^X$, $B_R \in \mathcal{F}^R$, and $B_Y \in \mathcal{F}^Y$, such that

$$P(B_X \cap B_R \cap B_Y) \ne P(B_X)\, P(B_R)\, P(B_Y). \tag{9}$$

Dependence among $X$, $R$, and $Y$ is intuitively clear: the production state $X$ and the protection state $R$ both depend on $Y$, the common environment within which they operate. So, there is some $B \in \mathcal{B}(S)$ such that the likelihood that production and protection are in $B$ at time $t \ge 0$ must, as a matter of physics, depend on the evolution of their common operating environment. For example, the arrival of natural disasters and the wear-out rate of equipment are physics-based environmental influences on both production and protection. Thus,

$$P(X_t \in B \mid Y_u : u \le t) \ne P(X_t \in B), \qquad P(R_t \in B \mid Y_u : u \le t) \ne P(R_t \in B), \tag{10}$$

and, consequently, $X$ and $R$ are not generally independent. However, the future of the environment process $Y$ is typically uninfluenced by the histories of production and protection; we call this the Lack of Anticipation Property (LAP).

Definition 2.8 (Lack of Anticipation Property). For all $t, s \ge 0$ and $B \in \mathcal{B}(S_Y)$,

$$P(Y_{t+s} \in B \mid X_u, R_u : u \le t) = P(Y_{t+s} \in B). \tag{11}$$

The LAP plays a central role in predictive modeling of safety-critical protections, and is represented schematically in Figure 3.

Figure 3.

Schematic of dependence among production, protection, and environment trajectories.
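
The dependence structure sketched in Figure 3 can be illustrated with a small Monte Carlo experiment. In the hypothetical sketch below, a common environment variable (an ambient stress level) modulates both the chance of a production upset and the chance that protection is unavailable, so the two become correlated through the environment even though the sampling of the environment never looks at their past (the LAP). All rates are invented for illustration.

```python
import random

random.seed(1)

def simulate_day():
    """One hypothetical day: the environment drives both production and protection.

    Returns (initiating_event, protection_unavailable) indicators.
    """
    harsh = random.random() < 0.2                  # environment Y: harsh day with prob 0.2
    p_initiating = 0.05 if harsh else 0.01         # production X upset rate depends on Y
    p_unavailable = 0.10 if harsh else 0.02        # protection R outage rate depends on Y
    return (random.random() < p_initiating,
            random.random() < p_unavailable)

n = 200_000
days = [simulate_day() for _ in range(n)]
p_i  = sum(i for i, _ in days) / n                 # P(initiating event)
p_u  = sum(u for _, u in days) / n                 # P(protection unavailable)
p_iu = sum(i and u for i, u in days) / n           # joint probability

print(f"P(I)P(U) = {p_i * p_u:.5f}  vs  P(I and U) = {p_iu:.5f}")
# The joint probability exceeds the product: X and R are dependent through Y,
# even though the future of Y is sampled without reference to past X or R.
```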

2.2.2 Information flow and filtrations

Time and state dependencies among $X$, $R$ and $Y$ are informed by the flow of historical information. Information is characterized in a time-indexed collection of sub-σ-algebras $\{\mathcal{F}_t\}_{t \ge 0}$, often called a history, forming a filtration augmenting the standard probability space on which state variables are defined. Here, the filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge 0}, P)$ is such that the state variables $Z_t = (X_t, R_t, Y_t)$ are $\mathcal{F}_t$-measurable for all $t \ge 0$. This filtration has the standard properties where,

Property 1. $\mathcal{F}_0$ contains all $P$-null sets (completeness).

Property 2. $\lim_{s \downarrow 0} \mathcal{F}_{t+s} \equiv \bigcap_{s > t} \mathcal{F}_s = \mathcal{F}_t$ (right-continuity).

Property 3. $\mathcal{F}_t \subseteq \mathcal{F}_{t+s}$ (monotonicity).

Property 4. $\lim_{t \to \infty} \mathcal{F}_t \equiv \sigma\big(\bigcup_{t \ge 0} \mathcal{F}_t\big) = \mathcal{F}$ (convergence).

The first property is an easily satisfied technical requirement from measure theory. The remaining three properties are normative and intuitively understandable. Property 2 indicates that the acquisition of engineering information is not necessarily smooth, with no information acquired in certain time intervals, while large amounts can be acquired at certain points in time. Property 3 asserts that information gained through discovery is not lost over time. Property 3 and Property 4 acknowledge that one cannot be convinced that all useful modeling information will be revealed prior to the end of lifecycles.

All information (engineering design, maintenance, operations, weather, management, etc.) associated with all possible object life cycles is contained in $\mathcal{F}$. Information available at time $t < \infty$ is limited to the sub-σ-algebra $\mathcal{F}_t$. Note that information represented in $\mathcal{F}_t$ is not necessarily limited to $\mathcal{F}_t^Z$, the σ-algebra generated by $\{Z_u : u \le t\}$ (for example, maintenance schedules are not directly represented in $\mathcal{F}_t^Z$, but they are extremely useful in predictive modeling). Thus, in general, $\mathcal{F}_t^X$, $\mathcal{F}_t^R$, and $\mathcal{F}_t^Y$ are each sub-σ-algebras of $\mathcal{F}_t^Z$, and $\mathcal{F}_t^Z \subseteq \mathcal{F}_t$, for all $t \ge 0$.

Remark 3 Predictive modeling must rely on presently available information. With $t = 0$ taken as the beginning of all lifecycles in $\Omega$, the sub-σ-algebra $\mathcal{F}_t$ contains all modeling information acquirable by time $t \ge 0$. The filtration $\{\mathcal{F}_t\}_{t \ge 0}$ contains all acquirable information flows over all possible complete lifecycles. $\{\mathcal{F}_t\}_{t \ge 0}$ is indispensable in predictive modeling because it enables modelers to represent the current state of knowledge. Omitting filtrations from predictive modeling implies that all possible modeling information is available at time $t = 0$; that is, one assumes that $\mathcal{F}_t = \mathcal{F}$ for all $t \ge 0$. The filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge 0}, P)$ allows modelers to characterize their uncertainty about state variable values as a conditional (on available information) probability. For example, with $t, s \ge 0$, $P(Z_{t+s} \in B \mid \mathcal{F}_t)$ is the likelihood that the state of a reactor and its protections will be in condition $B \in \mathcal{B}(S)$ at time $t + s$, given the information available at time $t$.
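
As a toy illustration of why the filtration matters, the sketch below compares two predictions of protection availability at a future time: one that ignores everything except a long-run figure, and one that also uses a maintenance schedule known at time $t$ (information in $\mathcal{F}_t$ but not in $\mathcal{F}_t^Z$). The schedule, times, and probabilities are invented for illustration.

```python
# Hypothetical prediction of A_{t+s} (protection available) at s = 24 h ahead.
# Numbers are invented for illustration only.
t = 1_000.0
s = 24.0

# Information available at time t beyond the trajectory of Z itself:
# a planned protection outage window appearing in F_t but not in F_t^Z.
scheduled_outages = [(1_010.0, 1_030.0)]        # (start, end) of planned maintenance

p_available_unconditional = 0.97                # invented long-run availability figure

def p_available(t_future):
    """Conditional prediction using the schedule contained in F_t."""
    in_outage = any(start <= t_future < end for start, end in scheduled_outages)
    return 0.0 if in_outage else p_available_unconditional

print(f"ignoring the schedule : P(A = 1) ~ {p_available_unconditional:.2f}")
print(f"using F_t information : P(A = 1) ~ {p_available(t + s):.2f}")
# Omitting the filtration amounts to pretending the schedule (and any other
# acquired information) is unavailable, which distorts the prediction.
```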

2.2.3 Calibrating probability law

In recent years, Uncertainty Quantification (UQ) has emerged as a topic of considerable interest in risk analysis and safety engineering. Generally, as the name suggests, UQ explores means to measure or judge the size of uncertainty. In the context of a nuclear reactor with regulated safeguards, operators and regulators are concerned with the efficacy of protection, and the predictive model $(S, \mathcal{B}(S), \mathcal{L}_Z)$ provides them a functional relationship between life cycle physics and uncertainty. Without exception, any useful UQ metric $m$ maps the probability law into a real-valued $n$-vector; that is, $m : \mathcal{L}_Z \to \mathbb{R}^n$, $n \ge 1$. With $B \subseteq S$ taken as the support of $m$, it is required that $m$ be $\mathcal{B}(B)$-measurable. This mild measurability restriction admits UQ metrics that might not require a complete characterization of the probability law $\mathcal{L}_Z$. Nonetheless, all UQ metrics require that at least some part (if not all) of $\mathcal{L}_Z$ be calibrated. Calibrating $\mathcal{L}_Z$ is generally quite challenging.

Recall from the Kolmogorov Extension Theorem that specifying the probability law $\mathcal{L}_Z$ is equivalent to specifying all finite joint distributions on the state process $Z = \{Z_t\}_{t \ge 0}$ [13]. Of course, there is generally an uncountable number of finite joint distributions on stochastic processes that evolve over continuous time. So, unless the state process $Z$ exhibits very special independence properties (e.g., regeneration, stationarity, ergodicity) in combination with a small support for the UQ metric $m$, the amount of historical data required to statistically calibrate elements of $\mathcal{L}_Z$ is staggering.

Intuitively, crafting UQ metrics on predictive models that are useful for informing the efficacy of reactor protections boils down to:

  1. Deciding how much detail a predictive model must include to support UQ metrics that modelers believe are useful in supporting reactor operation and public policy decisions.

  2. Calibrating the needed elements of $\mathcal{L}_Z$ using historical information.

In practice, when particularizing $\mathcal{L}_Z$ to a specific reactor and safety-critical protective system, choosing a fine granularity on the support of any UQ metric $m : \mathcal{L}_Z \to \mathbb{R}^n$ introduces a model calibration burden for $\mathcal{L}_Z$ that is typically insurmountable. Section 4 reviews a collection of very strong underlying modeling assumptions required to calibrate Core Damage Frequency (CDF), the widely used UQ metric that is integral to the popular PRA methodology. The bottom line, here, is that UQ can never escape the challenges of calibrating $\mathcal{L}_Z$; thus, UQ metrics applied to reactor protective system efficacy analysis should be evaluated with a healthy engineering mistrust.
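
A back-of-the-envelope calculation illustrates why calibrating even a modest portion of the probability law is so burdensome. The sketch below counts the joint-distribution cells implied by a coarse discretization of the state space and a handful of time points; the chosen sizes are arbitrary illustration values, not anything prescribed by the chapter.

```python
# Illustrative only: a coarse discretization of the state space and time axis.
states_per_subspace = 10      # assumed levels for each of S_X, S_R, S_Y
n_subspaces = 3               # production, protection, environment
n_time_points = 6             # a single finite-dimensional distribution

cells_per_time_point = states_per_subspace ** n_subspaces   # |S| after discretization
joint_cells = cells_per_time_point ** n_time_points         # cells in one joint law

print(f"|S| (discretized)             : {cells_per_time_point:,}")
print(f"cells in one 6-point joint law: {joint_cells:.3e}")
# Even before considering the uncountably many possible time-point choices,
# each cell needs enough observed life-cycle data to estimate its probability.
```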

3. Operations modeling

Operations generally refers to the study of stochastic point processes that are embedded within an object’s state process $Z$. It follows that in operations modeling and analysis, time is the only physical variable of interest. An operations point process is typically embedded within $Z$ and generates an increasing sequence of random times $\{T_n\}_{n \in \mathbb{N}}$ identifying the times of occurrence of a particular non-quantitative feature of $Z$ that is identifiable through observable state changes. In principle, the state space $S$ is covered with a partition $G \subset \mathcal{B}(S)$, where $G$ is an at most countable collection of sets. The elements of $G$ correspond to important non-quantitative features on the state space $S$, and the random variable $T_n$ marks the time of the $n$th occurrence of a state transition from one particular element of $G$ to another.

In operations modeling there are practical measurability issues that must be considered when examining transitions among elements of the partition $G$ by the trajectories of the state process $Z$ (which is, of course, adapted to the filtration $\{\mathcal{F}_t\}_{t \ge 0}$ of the filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge 0}, P)$).

  1. In principle, $Z_t^{-1}(G)$ must be $\mathcal{F}$-measurable for all $t \ge 0$.

  2. There is no guarantee that $Z_t^{-1}(G) \in \mathcal{F}_t^Z$.

  3. There is no guarantee that $Z_t^{-1}(G) \in \mathcal{F}_t$ (even when $Z_t^{-1}(G) \in \mathcal{F}$).

If $Z_t^{-1}(G) \notin \mathcal{F}_t$, then for all $t \ge 0$ there can exist important, and as yet undiscovered, operations events that may be observed in the future. The very practical circumstances where $Z_t^{-1}(G)$ is not $\mathcal{F}_0$-measurable are explored in Section 4.

In the remainder of this section, except where otherwise noted, we follow the typical development of operations modeling, in which random quantities are defined on a standard probability space $(\Omega, \mathcal{F}, P)$, free from the information flow dynamics characteristic of filtered probability spaces.

Remark 4 Operations modeling is common in engineering practice. For example, in reliability analysis $G = \{B, B^c\}$ partitions the state space $S$ such that $B$ is the collection of all reliable states. The standard indicator mapping $\mathbf{1}_B : (S, \mathcal{B}(S)) \to (\{0, 1\}, \sigma(\{0, 1\}))$ of physical states into the set {0, 1} defines a unit-less random variable on the probability space $(S, \mathcal{B}(S), \mathcal{L}_Z)$. The object’s availability process $A \equiv \{A_t\}_{t \ge 0}$, where $A_t = \mathbf{1}_B(Z_t)$, has trajectories that proceed only in jumps (up or down) of magnitude one.12 When $T_n$ is taken as the time of the $n$th downward jump of $A$, the sequence $T \equiv \{T_n\}_{n \in \mathbb{N}}$ corresponds to a stochastic point process marking time epochs of object state transition from working to failed. The failure-time process $T$ is clearly subordinate to the state process $Z$. It is the point process $T$ that characterizes object failures from an operations perspective.

An operations point process T is often analyzed through its corresponding counting process Q. Here,

$$Q \equiv \{Q_t\}_{t \ge 0}, \tag{12}$$

where,

$$Q_t = \sum_{n=1}^{\infty} \mathbf{1}_{[0, t]}(T_n) \tag{13}$$

counts the number of operations epochs appearing in the closed time interval $[0, t]$. There is an obvious one-to-one correspondence between the sequence $\{T_n(\omega)\}_{n \in \mathbb{N}}$ and the trajectory $\{Q_t(\omega), t \ge 0\}$ for each $\omega \in \Omega$ – a plot of one directly reveals the other. It easily follows that, for all $n \in \mathbb{N}$ and $t \ge 0$, $\{Q_t \ge n\} = \{T_n \le t\}$, which directly implies the distributional relationships

$$P(Q_t \ge n) = 1 - P(T_n > t) \quad \text{and} \quad E[Q_t] = \sum_{n=1}^{\infty} P(T_n \le t). \tag{14}$$

Analysis of the operations point process $T$ is approached through its corresponding counting process $Q$, which is accessible through the martingale calculus.
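
The correspondence between the point process $T$ and its counting process $Q$ in Eqs. (12)–(14) can be seen in a few lines. The sketch below takes a hypothetical availability trajectory, extracts the downward-jump times of Remark 4, and evaluates $Q_t$ by counting epochs in $[0, t]$; all times are invented for illustration.

```python
# Hypothetical availability trajectory A_t: (time, value) at each jump, càdlàg.
# Values are invented purely for illustration.
availability_jumps = [(0.0, 1), (40.0, 0), (45.0, 1), (300.0, 0), (302.5, 1)]

# T_n: epochs of downward jumps of A (object transitions from working to failed).
failure_times = [
    t for (t, a), (_, prev) in zip(availability_jumps[1:], availability_jumps[:-1])
    if prev == 1 and a == 0
]

def Q(t):
    """Counting process: number of failure epochs in the closed interval [0, t]."""
    return sum(1 for tn in failure_times if tn <= t)

print(failure_times)        # [40.0, 300.0]
print(Q(100.0), Q(301.0))   # 1 2  -- {Q_t >= n} is exactly {T_n <= t}
```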

3.1 Classification of states for reactor operations with protections

Consider now an object representing a nuclear reactor, its regulated protections, and the environment within which the reactor and protections operate. The state process $Z$ is defined, as before, in Section 2.2.2. Operations modeling and analysis begins with partitioning the state space $S$ according to $G = \{G_p, G_p^c\}$, where $G_p \in \mathcal{B}(S)$ is the set of all persistent states and its complement $G_p^c$ is called the set of transient states. The set $G_p$ is closed under all trajectories. That is, for all $s \in G_p$ and $\omega \in \Omega$,

$$\lim_{t \to \infty} \sum_{0 < u \le t} \mathbf{1}_{G_p}(Z_{u^-}(\omega))\, \mathbf{1}_{G_p^c}(Z_u(\omega)) = 0; \tag{15}$$

thus, once entering the set of all persistent states $G_p$, a trajectory $Z_t(\omega)$ cannot depart the set. A state $s \in G_p$ is an absorbing state if for all $\omega \in \Omega$,

$$\lim_{t \to \infty} \sum_{0 < u \le t} \mathbf{1}_{\{s\}}(Z_{u^-}(\omega))\, \mathbf{1}_{\{s\}^c}(Z_u(\omega)) = 0. \tag{16}$$

Clearly, absorbing states are persistent.

As a practical matter, require that all trajectories $Z_t(\omega)$, $\omega \in \Omega$ and $t \ge 0$, of the state process terminate with retirement.13 Retirement occurs as either (1) an inconsequential cessation of production, or (2) the consequence of an accident from which the reactor cannot recover. Once the reactor trajectory enters a retirement state, production terminates forever. All retirement states are taken as absorbing. Thus the absorbing states, designated $R \subset S$, and the transient states, designated $R^c$, partition $S$. The transient states $R^c$ can themselves be partitioned such that $N$ are the normal operating states and $D$ the distressed operating states. Clearly, the sets $\{N, D, R\}$ also partition $S$.

Figure 4.

Partitioned state transition diagram for state process Z.

3.2 The initiating event counting process

Recall that the state variables are the three-tuple $Z_t \equiv (X_t, R_t, Y_t)$. Thus, when a trajectory $X_t(\omega)$, $\omega \in \Omega$ and $t \ge 0$, of the production state process leaves the normal operating states $N_{S_X}$ and enters the distressed states $D_{S_X}$, an initiating event has occurred. Recalling that by convention all state trajectories are right-continuous and thus càdlàg, all initiating events occur at the exact instant of transition from $N_{S_X}$ to $D_{S_X}$. Arrival of an initiating event indicates that the reactor protective system should engage so as to mitigate potential harm that might arise with the reactor operating in the distressed states. In practice, initiating events are often (although not always) observable.

For all $\omega \in \Omega$ and $t \ge 0$, define

$$dQ_t(\omega) \equiv \mathbf{1}_{D_{S_X}}(X_t(\omega))\, \mathbf{1}_{N_{S_X}}(X_{t^-}(\omega)). \tag{17}$$

Clearly, for $t > 0$, $dQ_t : \Omega \to \{0, 1\}$ is a random variable on the measurable space of possible life cycles $(\Omega, \mathcal{F})$. For all $\omega \in \Omega$ and $t \ge 0$, it follows that

$$Q_t(\omega) = \int_0^t dQ_u(\omega) \equiv \sum_{0 < u \le t} \mathbf{1}_{D_{S_X}}(X_u(\omega))\, \mathbf{1}_{N_{S_X}}(X_{u^-}(\omega)) \tag{18}$$

forms the trajectories of the stochastic process $Q = \{Q_t\}_{t \ge 0}$, which inherit the càdlàg property. We call $Q$ the initiating event counting process. Here, as a practical consideration, we will require that $Q_0(\omega) = 0$ for all $\omega \in \Omega$. That is, we assume that reactor production is in a normal operating state at time $t = 0$, the beginning of lifecycle $\omega$.14

It is a straightforward matter to construct the operations point process $T = \{T_n\}_{n \in \mathbb{N}}$ which captures the arrival times of initiating events. $T$ is referred to as the initiating event process, and for each $\omega \in \Omega$ and $n \in \mathbb{N}$,

$$T_n(\omega) \equiv \inf\Big\{ t > 0 : \int_0^t dQ_s(\omega) = n \Big\}. \tag{19}$$

The initiating event counting process $Q$ plays a central role in our treatment of operations modeling and efficacy analysis of safety-critical reactor protections. Understanding the construction of $Q$ provides important insights (to be discussed in subsequent sections) into the relationship between hazard, risk, and the efficacy of protections.
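
The construction in Eqs. (17)–(19) amounts to scanning a production trajectory for crossings from the normal states into the distressed states. The following minimal sketch does exactly that for a hypothetical, coarsely classified production trajectory; the state sequence and times are invented for illustration.

```python
# Hypothetical production trajectory X_t(omega), coarsely classified into the
# partition {N (normal), D (distressed), R (retired)}; values are invented.
trajectory = [
    (0.0,   "N"), (75.0,  "D"), (76.0,  "N"),
    (410.0, "D"), (412.0, "N"), (900.0, "R"),
]

# Initiating events: epochs where the trajectory crosses from N into D (Eq. 17).
initiating_times = [
    t for (t, s), (_, s_prev) in zip(trajectory[1:], trajectory[:-1])
    if s_prev == "N" and s == "D"
]

def Q(t):
    """Initiating event counting process Q_t (Eq. 18) for this single trajectory."""
    return sum(1 for tn in initiating_times if tn <= t)

print(initiating_times)   # [75.0, 410.0]  -- the point process T of Eq. (19)
print(Q(500.0))           # 2
```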

Remark 5 When assuming that $Z_t^{-1}(G) \in \mathcal{F}_0$ and with $Q_t : (\Omega, \mathcal{F}^Z) \to (\mathbb{N}, \mathcal{B}(\mathbb{N}))$, by construction it is ensured that $Q_t$ is $\mathcal{F}_t$-measurable for all $t \ge 0$; hence, the initiating event process $Q$ is adapted to the filtration $\{\mathcal{F}_t\}_{t \ge 0}$ which contains the history generated by the state process $Z$. It is preferable to model operations processes on the filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge 0}, P)$, because its filtration contains all available information ... not simply the natural filtration $\{\mathcal{F}_t^Q\}_{t \ge 0}$ of $Q$. When PRA, QRA and PSA are used for risk analysis, they implicitly and strictly rely on the natural filtration

$$\{\mathcal{F}_t^Q\}_{t \ge 0} \equiv \{\sigma(Q_u : u \le t)\}_{t \ge 0} \subseteq \{\mathcal{F}_t\}_{t \ge 0}, \tag{20}$$

which contains only information about the occurrence times of initiating events, while ignoring all history of the state process Z, maintenance activity, weather, etc.

3.3 The accident counting process

When safety-critical protections function properly, a distressed reactor (i.e., $X_t \in D_{S_X}$) will avoid an accident and return to the set of normal operating states in $N$. Accidents are events influencing states of nature outside the control volume within which a nuclear reactor’s state process $Z$ develops. Accidents are always a consequence of protection failure that causes collateral harm.15 For our purposes, when an initiating event leads to any level of collateral harm, then that event becomes an epoch of an accident.16 Possible time delays are allowed between the arrival of an initiating event and any eventual collateral harm.17 To this end, the distressed states $D \subset S$ are partitioned into those states $C$ that impact physics outside the system control volume, causing collateral harm, and $E = D \setminus C$, those distressed states that do not cause collateral harm. Partitioning $D$ allows an operations characterization of an accident:

Definition 3.1. An epoch of accident occurs when the trajectory $(X_t, R_t)(\omega)$, $\omega \in \Omega$, is such that the production state process $X$ enters $C_{S_X} \subset D_{S_X}$ while the state of the protection process $R$ is in $D_{S_R}$.

Guided by the transition diagram of Figure 5, a modification of Figure 4, showing the partitioning of distressed states D into C and E, the accident counting process and the accident point process can be constructed.

Figure 5.

Partitioned state transition diagram for the state process Z with the distressed states D decomposed into C (catastrophic states) and E (non-catastrophic distressed states).

Remark 6 Note that in the interval following an initiating event indicating that the state process has moved into distress, the system state can possibly transition many times between accident states in C and non-accident distressed states in E before either returning to normal or being retired. In practical modeling scenarios, state transitions across the partition of distressed states might take months (or even years). Thus, the initiating event leading to distress might not escalate to an accident for quite some time. In such situations, knowing that an initiating event has just occurred does not necessarily reveal whether or not the system state is on a trajectory toward accident.

Now, define

$$dQ_t^C(\omega) \equiv Q_t^C(\omega) - Q_{t^-}^C(\omega) = \mathbf{1}_{C_{S_X}}(X_t(\omega))\, \mathbf{1}_{D_{S_R}}(R_t(\omega))\, \mathbf{1}_{N_{S_X}}(X_{t^-}(\omega)). \tag{21}$$

Note that the time $T_m^C(\omega)$ of the $m$th arriving epoch of an accident is given by

$$T_m^C(\omega) \equiv \inf\Big\{ t > 0 : \int_0^t dQ_u^C(\omega) = m \Big\}. \tag{22}$$

When $m$ is the index of the last arriving epoch of an accident before retirement, all subsequent accidents are taken by convention to occur at infinity. Clearly, the random sequence $\{T_m^C\}_{m \in \mathbb{N}}$ is a thinning of the initiating event process $T$. The point process $T^C \equiv \{T_m^C\}_{m \in \mathbb{N}}$ is the accident process.18

With $Q_t^C$ being the number of arriving accidents in the interval $(0, t]$, it follows that for each $\omega \in \Omega$, $t \ge 0$, and $m \in \mathbb{N}$,

$$Q_t^C(\omega) = Q_{T_m^C(\omega)}^C(\omega), \qquad t \in \big[T_m^C(\omega),\, T_{m+1}^C(\omega)\big), \tag{23}$$

and, in practice, with $Q_0^C(\omega) = 0$ for all $\omega \in \Omega$,

$$Q_t^C(\omega) = \int_0^t dQ_u^C(\omega) \equiv \sum_{0 < u \le t} \mathbf{1}_{C_{S_X}}(X_u(\omega))\, \mathbf{1}_{D_{S_R}}(R_u(\omega))\, \mathbf{1}_{N_{S_X}}(X_{u^-}(\omega)). \tag{24}$$

$Q^C \equiv \{Q_t^C\}_{t \ge 0}$ is called the accident counting process. By construction, $Q^C$ is adapted to the filtration of $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge 0}, P)$.

Remark 7 When assuming that $Z_t^{-1}(G) \in \mathcal{F}_0$, the accident counting process $Q^C$ is, clearly, subordinate to the initiating event counting process $Q$. That is, each epoch of arrival in $Q^C$ is also an epoch of arrival in $Q$. It is important to note, however, that while both $Q^C$ and $Q$ are adapted to the filtration $\{\mathcal{F}_t\}_{t \ge 0}$, $Q^C$ is not adapted to the natural filtration $\{\mathcal{F}_t^Q\}_{t \ge 0}$ of the initiating event process $Q$. Thus, the history of initiating events alone contains insufficient information to construct the accident thinning—a physics-based insight often overlooked in popular quantitative risk methodologies.

Finally, as a notational convenience, define for all $\omega \in \Omega$ and $t \ge 0$

$$A_t(\omega) \equiv \mathbf{1}_{N_{S_R}}(R_t(\omega)), \tag{25}$$

where $A_t : (S, \mathcal{B}(S)) \to (\{0, 1\}, \mathcal{B}(\{0, 1\}))$ is a random variable on the filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge 0}, P)$. Clearly,

$$A_t = \begin{cases} 1, & \text{protections are available at time } t, \\ 0, & \text{otherwise.} \end{cases} \tag{26}$$

We call $A = \{A_t\}_{t \ge 0}$ the protection availability process, which inherits càdlàg properties from $R$. Substituting Eq. (17) and Eq. (25) into Eq. (24) gives

$$Q_t^C(\omega) = \int_0^t \big(1 - A_u(\omega)\big)\, dQ_u(\omega) \tag{27}$$

for all $\omega \in \Omega$ and $t \ge 0$. And, it follows directly from Eq. (22) and Eq. (27) that for all $m \in \mathbb{N}$, $\omega \in \Omega$, and $t \ge 0$,

$$T_m^C(\omega) \equiv \inf\Big\{ t > 0 : \int_0^t \big(1 - A_u(\omega)\big)\, dQ_u(\omega) = m \Big\}. \tag{28}$$

The importance of Eq. (27) is that it gives the dynamic relationship between the number of accidents over time in terms of arriving initiating events and the reliability of protections. It should be clear from the construction of Eq. (27) from trajectories of the state process $Z$ that the stream of arriving initiating events and the reliability of protections are not generally independent, since both random phenomena are stochastically dependent on the dynamics of the environment process $Y$. Further, Eq. (22) establishes the relationship between the accident time point process $T^C$ and initiating event arrivals and protection reliability where, again, stochastic dependence between initiating event arrivals and protection reliability cannot be ignored. Eqs. (22) and (27) play a central role in the developments presented in Sections 4 and 5.
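
Eq. (27) says that the accidents are exactly those initiating events that arrive while protection is unavailable. The sketch below evaluates $Q_t^C$ for one hypothetical life cycle by thinning a list of initiating event times with an availability trajectory; all times are invented, and the dependence of both streams on a common environment is deliberately omitted for brevity.

```python
import bisect

# Hypothetical initiating event times T_n (from the process Q) and a càdlàg
# protection availability trajectory A_t; all values are invented.
initiating_times = [75.0, 410.0, 880.0]
availability_jumps = [(0.0, 1), (400.0, 0), (430.0, 1)]   # out of service 400-430

def A(t):
    """Right-continuous availability A_t (Eqs. 25 and 26) at time t."""
    idx = bisect.bisect_right([u for u, _ in availability_jumps], t) - 1
    return availability_jumps[idx][1]

def Q_C(t):
    """Accident counting process Q_t^C (Eq. 27): initiating events thinned by (1 - A)."""
    return sum(1 - A(tn) for tn in initiating_times if tn <= t)

accident_times = [tn for tn in initiating_times if A(tn) == 0]
print(accident_times)   # [410.0] -- the event at 410 arrives while protection is down
print(Q_C(1000.0))      # 1
```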

4. Unknown-unknowns

When designing nuclear reactor protections, it is impossible to foresee and design for every circumstance that might lead to an accident. Such design deficiencies are often called unknown-unknowns.19 Unknown-unknown design deficiencies are routinely discovered, documented, and corrected. It can be shown that uncertainty about the consequences of as yet undiscovered design deficiencies cannot be quantified. Thus, Probability Quantification (PQ) methodologies without exception overlook the influence of unknown-unknowns on initiating events, protection reliability, and ultimately epochs of accidents. Overlooking unknown-unknowns can only bias predictive accident metrics such as CDF optimistically, presenting an unavoidable pitfall for quantitative methodologies including PRA, QRA and PSA that make no use of filtered probability spaces.

Unknown-unknown failure modes in protective systems will be the focus of the analytical developments in the following.20 Clearly, the possibility of undiscovered failure modes in reactor protections is a matter of great concern to both operators and regulators. The NRC has established rigorous reporting and operation protocols focused on newly discovered protection design flaws.

The analytical consequences of unknown-unknowns are revealed only when the system state process $Z$ is defined on the filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge 0}, P)$, where all predictive assertions must be addressed in the context of information currently available in the filtration $\{\mathcal{F}_t\}_{t \ge 0}$. In particular, suppose that in the design of protections there exist protection failure modes that are undiscovered at the time of deployment. These design inadequacies will only be discovered during operations and are then incorporated into the existing body of information. Thus, protection failure modes unknown at time $t = 0$ will enter into the filtration $\{\mathcal{F}_t\}_{t \ge 0}$ at some time $t > 0$ upon discovery.

By construction, as shown in Sections 2 and 3, the protection availability process $A$ is adapted to the filtration $\{\mathcal{F}_t\}_{t \ge 0}$ (i.e., $A_t$ is $\mathcal{F}_t$-measurable). When, for example, examining the likelihood that protections are not in a normal condition at any time $t \ge 0$, it follows that

$$P(A_t = 0) = 1 - E[A_t] = 1 - E\big[E[A_t \mid \mathcal{F}_t]\big], \tag{29}$$

because the random variable $E[A_t \mid \mathcal{F}_t]$ is well-defined.21

In predictive modeling, it is often the case that the condition of protections at some random time, $A_\tau$, where $\tau : \Omega \to \mathbb{R}_+$, is of great interest. Here, care must be exercised because Eq. (29) does not necessarily hold when substituting $A_\tau$ for $A_t$. Let $G_f \subset S$ be the set of all states where protections are unavailable (protections can be either failed or out of service due to maintenance). Now, with $t$ taken as the present time, let $\tau$ be the time of the next arriving initiating event in the initiating event process $T$. It follows that

$$\tau = T_{Q_t + 1} > t, \tag{30}$$

with $Q_t$ being the number of initiating events arriving in the interval $[0, t]$. Now consider two cases: (1) $A_t^{-1}(G_f)$ is $\mathcal{F}_0$-measurable, and (2) $A_t^{-1}(G_f)$ is not $\mathcal{F}_0$-measurable.

4.1 Case 1: $A_t^{-1}(G_f) \in \mathcal{F}_0$

By definition, $G_f \in \mathcal{B}(S)$ is the collection of all states in $S$ where protection is unavailable. Thus, any failure mode for protection must be reflected as a state $s \in G_f \subset S$. When $A_t^{-1}(G_f) \in \mathcal{F}_0$, it follows that every possible failure mode must be known at time $t = 0$, since the pre-image of $G_f$ through the random variable $A_t$ appears in $\mathcal{F}_0$. By definition of the filtration $\{\mathcal{F}_t\}_{t \ge 0}$, $\mathcal{F}_0 \subseteq \mathcal{F}_t$ for all $t > 0$, which implies that $A_t^{-1}(G_f) \in \mathcal{F}_t$ for all $t \ge 0$. It now follows that for any $u > 0$, $A_{t+u}^{-1}(G_f) \in \mathcal{F}_t$ and, since $\tau$ is an $\mathcal{F}_t$ stopping time with $\tau > t$, $A_\tau^{-1}(G_f) \in \mathcal{F}_t$. In other words, when all possible states where protection is unavailable (including protection failure modes) are understood at time $t = 0$, then

$$P(A_\tau = 0) = 1 - E[A_\tau] = 1 - E\big[E[A_\tau \mid \mathcal{F}_t]\big] \tag{31}$$

gives the likelihood that the next arriving initiating event will result in an accident.

4.2 Case 2: $A_t^{-1}(G_f) \notin \mathcal{F}_0$

Suppose now that there exist protection failure modes that are unknown at time $t = 0$. This implies that $A_t^{-1}(G_f) \notin \mathcal{F}_0$. When $A_t^{-1}(G_f)$ is not $\mathcal{F}_0$-measurable, there can be no guarantee that $A_t^{-1}(G_f)$ will be $\mathcal{F}_t$-measurable for any $0 < t < \infty$. That is, for any $u > 0$, we cannot ensure that $A_{t+u}$ is $\mathcal{F}_t$-measurable. And, because $\tau > 0$, we cannot ensure that $A_\tau$ is $\mathcal{F}_t$-measurable even though $\tau$ is an $\mathcal{F}_t$ stopping time. In fact, there will always be some $t > 0$ where $A_\tau$ is not $\mathcal{F}_t$-measurable, which implies that $E[A_\tau \mid \mathcal{F}_t]$ is not well-defined. From the failure of $E[A_\tau \mid \mathcal{F}_t]$ to satisfy the definition of a random variable on $\Omega$, it can be concluded that when there are undiscovered protection failure modes, Eq. (31) cannot hold.

4.3 The practical implications of undiscovered protection failure modes

It is reasonable at this juncture to consider the extent to which protection failure modes not identified in $\mathcal{F}_0$ are problematic. Let $\tau_d$ be the time of first discovery of a heretofore undiscovered protection failure mode. Some observations that can be normatively understood are:

  • The discovery time $\tau_d$ of a specific heretofore undiscovered failure mode is random and measurable with respect to the probability space $(\Omega, \mathcal{F}, P)$.

  • $\tau_d$ is not an $\{\mathcal{F}_t\}_{t\ge 0}$ stopping time. Clearly, $\{\tau_d \le t\} \notin \mathcal{F}_t$.

  • Any protection failure mode will almost surely be found over the lifecycle of protections.

  • It is possible that a failure mode is first discovered through accident postmortem analysis (i.e., the failure mode caused an accident upon its first appearance).

  • For any time $t\ge 0$, a non-clairvoyant cannot rule out the possibility of remaining undiscovered protection failure modes.

  • If undiscovered failure modes are in play, the filtration $\{\mathcal{F}_t\}_{t\ge 0}$ has not yet converged, and all predictive PQ metrics must be formed conditioned on the current information represented in the filtration $\{\mathcal{F}_t\}_{t\ge 0}$ ... which does not include all information represented in $\mathcal{F}$.

  • Nearly all historical nuclear reactor accidents have been attributed to heretofore unanticipated protection failures.

If there is sufficient confidence in a non-clairvoyant belief that all protection failure modes have been discovered, then all information impacting protection design will be found in the tail σ-algebra $\mathcal{T}$, where

$$ \mathcal{T} \triangleq \bigcap_{t\ge 0} \sigma\big(A_u : u \ge t\big). \tag{32} $$

By definition, $\mathcal{T} \subset \mathcal{F}$ and characterizes design information in the remote future of $A$. In the remote future of protections, there can be no undiscovered failure modes.

Typically, PQ methodologies (e.g., PRA, QRA, PSA) implicitly rely on the assumption that events associated with protections are $\mathcal{T}$-measurable, because this assumption guarantees the existence of

$$ A_\infty = \lim_{t\to\infty} A_t \quad\text{and}\quad \bar{A} = \lim_{t\to\infty} \frac{1}{t}\int_0^t A_u(\omega)\, du, \qquad \omega \in \Omega. \tag{33} $$

The expected value $E[A_\infty]$ and its statistical estimator $\bar{A}$ play essential roles in PRA, QRA, and PSA, and they are only practically accessible under the assumption that all available information is in the tail σ-algebra $\mathcal{T}$. The estimator $\bar{A}$, when computed with data collected other than in $\mathcal{T}$, will be optimistically biased relative to the true value of $E[A_\infty]$, a dangerous bias.
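For orientation only, the time-average estimator $\bar{A}$ of Eq. (33) is easy to compute on a simulated availability trajectory. The sketch below is a minimal illustration under an assumed alternating exponential failure/repair model; the parameter values and the model itself are not taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def average_availability(t_end, mtbf=500.0, mttr=20.0):
    """Time-average availability (1/t) * integral_0^t A_u du for one sample path.

    Assumed model (illustrative only): exponential up times with mean mtbf and
    exponential repair times with mean mttr, alternating from an 'up' start.
    """
    t, up_time, up = 0.0, 0.0, True
    while t < t_end:
        dur = min(rng.exponential(mtbf if up else mttr), t_end - t)
        if up:
            up_time += dur
        t += dur
        up = not up
    return up_time / t_end

# Finite-window estimate of the limiting availability; the steady-state value
# for this assumed model is mtbf / (mtbf + mttr) ~= 0.9615.
print("A_bar over 50,000 h:", round(average_availability(50_000.0), 4))
print("steady-state value :", round(500.0 / 520.0, 4))
```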


5. The hierarchy of modeling assumptions supporting PRA

PRA appears, today, in regulatory language, NRC directives and even in federal legislation. PRA methodology relies on very strong modeling assumptions that are rarely (if ever) explicitly qualified in practice. Owing to the wide acceptance and application of PRA in the civilian nuclear industry, it is useful to identify and explain the assumptions underlying this popular risk analysis methodology. These modeling assumptions are best explained hierarchically so as to reveal a sequence of increasingly strong and necessary conditions leading to the computation of CDF ... the central risk metric derived from PRA. These assumptions are difficult to justify in practice and failure to satisfy any of them leads to optimistically biased estimates for CDF.

Begin again with the state process $Z$ characterizing the time-dependent behavior of a reactor's production, protections, and common surrounding environment, defined on the filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t\ge 0}, P)$, where the state variables are mapped onto the measurable state space $(S, \mathcal{B}(S))$. The state space $S$ is partitioned according to Figure 5 in Section 3.

Now identify seven modeling assumptions that must be enforced in order to calibrate CDF using historical data.

Assumption 1 There are no unknown-unknowns.

Assumption 2 There are no absorbing states.

Assumption 3 CDF is well-defined.

Assumption 4 Arriving initiating events see the time average of protection availability.

Assumption 5 Arriving initiating events form a Poisson process.

Assumption 6 The protection availability process is stationary and ergodic.

Assumption 7 Protection unavailability is independent of all initiating events.

Assumptions 1–7 are numbered such that each implicitly enforces all lower-numbered assumptions. Assumptions 6 and 7 are required in order to calibrate CDF with historical data. The consequences of these cumulative assumptions are examined one at a time and in order.

5.1 Cumulative assumption: there are no unknown-unknowns

As discussed in Section 4, assuming that there are no as yet undiscovered states requires that

$$ Z_t^{-1}(B) \in \mathcal{F}_0, \tag{34} $$

for all $B \in \mathcal{B}(S)$ and $t\ge 0$. Recall that, by definition of the filtration $\{\mathcal{F}_t\}_{t\ge 0}$, $\mathcal{F}_t$-measurability is also ensured. This modeling assumption cannot be verified so long as a reactor is not retired. Obviously, with the random time $\tau:\Omega \to [0,\infty)$ taken as the time of discovery of the last heretofore undiscovered state, $\tau$ is not $\mathcal{F}_t$-measurable for any $t<\infty$, which implies that there cannot be an observable condition confirming that there are no more undiscovered states. Clearly, the no unknown-unknowns assumption reflects a (believed) near-clairvoyant understanding of the technologies being modeled.

5.2 Cumulative assumption: there are no absorbing states

The absence of absorbing states implies that either all $s \in S$ are transient, or there exists at least one non-singleton collection of persistent states $B_\alpha \subset S$, $\alpha \in \mathcal{A}$, such that

$$ P\Big(\lim_{t\to\infty} Z_t \in B_\alpha\Big) > 0, \tag{35} $$

and when a state trajectory enters $B_\alpha$ it never exits, visiting each state $s \in B_\alpha$ infinitely often. Further,

$$ P\Big(\lim_{t\to\infty} Z_t \in \bigcup_{\alpha \in \mathcal{A}} B_\alpha\Big) = 1, \tag{36} $$

which implies that each state trajectory must eventually enter a non-singleton collection of persistent states that it visits infinitely often.

The PRA methodology implicitly rejects the possibility that all states in $S$ are transient, and therefore accepts that all state trajectories visit some collection of states infinitely often. Further, PRA accepts that some of the states visited infinitely often are accidents (else there would be no reason to perform PRA). It follows that, accepting the assumption that reactors are never retired (either through an accident or cessation of operations), there will be accidents ad infinitum.

5.3 Cumulative assumption: CDF is well defined

Accepting Assumptions 1 and 2, it is feasible to investigate CDF. Here, as before, $Q_t^C$ counts the number of core damage events a reactor suffers in the interval $[0, t]$. CDF is a frequency and thus is the limiting number of core damage events per unit time. That is,

Definition 5.1 (Core Damage Frequency).

$$ \mathrm{CDF} = \lim_{t\to\infty} \frac{1}{t} Q_t^C \tag{37} $$

whenever convergence to a constant occurs a.s.

In PRA practice, CDF is treated as a numerical constant. However, convergence of Eq. (37) is not guaranteed. Further, even when the limit in Eq. (37) exists, there is no guarantee that it is a constant; convergence to a random variable is a completely plausible circumstance. It must nonetheless be emphasized that CDF is treated as a numerical constant, estimates of which are used to gauge the risk of suffering a core damage event.

The existence of CDF is determined in part by the dynamics of initiating event arrivals. The limiting arrival rate $\lambda$ of initiating events is defined as follows.

Definition 5.2 (Initiating Event Frequency).

$$ \lambda = \lim_{t\to\infty} \frac{1}{t} Q_t \tag{38} $$

whenever convergence occurs a.s. In general, $\lambda$ can be a random variable, and convergence to a constant occurs only when $\lambda = E[\lambda]$.

Recall that $Q^C$ and $Q$ are right-continuous. It follows directly from Definitions 5.1 and 5.2 that, when $\lambda$ is a constant,

$$ \mathrm{CDF} \overset{a.s.}{=} \lim_{t\to\infty} \frac{Q_t^C}{Q_t}\,\frac{Q_t}{t} = p\,\lambda, \tag{39} $$

where the split fraction is often written as

$$ p \triangleq \lim_{t\to\infty} \frac{Q_t^C}{Q_t}. \tag{40} $$

Thus, CDF is the product of the initiating event frequency with the proportion of initiating events that proceed to core damage. Typically, estimating $\lambda$ is straightforward. Estimating $p$ is more challenging. Note that while $p$ takes values in the interval $[0, 1]$, it is defined as the limiting value of a ratio of random variables. However, under certain circumstances $p$ can be interpreted as the probability that an arriving initiating event will proceed to core damage. One such circumstance occurs when initiating events form an ordinary Poisson process. Then, the well-known Poisson Arrivals See Time Averages (PASTA) result applies, and $p$ can be computed as the limiting unavailability of reactor protections (see [14]). Unfortunately, the assumptions needed to justify PASTA defy practical justification. Consequently, estimating $p$ is quite difficult in practice.
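The product form $\mathrm{CDF} = \lambda\,p$ of Eq. (39) and the PASTA interpretation of $p$ can be checked in a toy simulation. In the sketch below (our illustration only), arrivals are Poisson and, by construction, independent of the protection state, so that each arrival finds protections down with the time-average unavailability $q$; none of the rates are taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Illustrative assumptions: Poisson initiating events with rate lam per year,
# and protections that are down a long-run fraction q of the time, with the
# protection state at each arrival independent of the arrival stream (PASTA).
lam, q, horizon = 0.5, 0.02, 200_000.0

arrivals = np.cumsum(rng.exponential(1.0 / lam, size=int(2 * lam * horizon)))
arrivals = arrivals[arrivals <= horizon]

down_at_arrival = rng.random(len(arrivals)) < q   # arrival finds protections down

Q_t = len(arrivals)                  # initiating events in [0, horizon]
QC_t = int(down_at_arrival.sum())    # core damage events in [0, horizon]

lam_hat = Q_t / horizon              # Eq. (38)
p_hat = QC_t / Q_t                   # Eq. (40): split fraction estimate
print("empirical core damage rate Q_C/t:", round(QC_t / horizon, 6))
print("lambda * q (Eq. 39 prediction)  :", round(lam * q, 6))
print("p_hat vs. assumed unavailability:", round(p_hat, 4), "vs", q)
```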

There are a variety of approaches for crafting estimators of p that incorporate the joint histories of initiating event arrivals, protection maintenance activity, environmental conditions, etc. Monte Carlo methods, owing to their adaptability to complex engineering models, have gained acceptance and popularity for estimating p and other statistics. These methods are not, however, a panacea because they require characterizing probability laws on subordinate stochastic processes that must be mapped into the dynamics of protection availability in order to build useful estimators. Characterizing probability laws on stochastic processes is often impractical due to the intensive data support required for all but the most stylized processes.

In order to better appreciate the manner in which CDF jointly depends on the arrival of initiating events and the efficacy of reactor protections, we will appeal to the martingale characterization of stochastic point processes (see [15]). For practical considerations, we require that initiating events occur one at a time a.s.

Since the trajectories of $Q$ are nonnegative, non-decreasing, and proceed in jumps of size one a.s., clearly $E[Q_{t+u} \mid \mathcal{F}_t] \ge Q_t$ for all $t, u \ge 0$. Thus, $Q$ forms a sub-martingale on the filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t\ge 0}, P)$. Appealing to standard results from the martingale calculus (see [16]), it now follows from the Doob-Meyer Decomposition Theorem that

$$ Q_t = M_t + \Lambda_t, \tag{41} $$

where the process $M \equiv \{M_t\}_{t\ge 0}$ forms an $\mathcal{F}_t$-martingale with compensator $\{\Lambda_t\}_{t\ge 0}$. Hence,

$$ E[M_{t+u} \mid \mathcal{F}_t] - M_t = 0 \tag{42} $$

for all $t, u \ge 0$, and $\Lambda_t$ is increasing a.s. and $\mathcal{F}_t$-predictable, with

$$ \Lambda_t = \int_0^t \lambda_u\, du. \tag{43} $$

Here, $\lambda_t$ is well-defined whenever, for all nonnegative, $\mathcal{F}_t$-predictable processes $\{C_t\}_{t\ge 0}$,

$$ E\!\left[\int_0^\infty C_t\, dQ_t\right] = E\!\left[\int_0^\infty C_t\,\lambda_t\, dt\right] \tag{44} $$

is satisfied. $M$ is called the initiating event martingale.

When well-defined, $\lambda_t$ is a unique Radon-Nikodym derivative (on the usual equivalence class), with the stochastic intensity process $\{\lambda_t\}_{t\ge 0}$ adapted to $\{\mathcal{F}_t\}_{t\ge 0}$ and predictable. Informally, $\lambda_t\,dt = E[dQ_t \mid \mathcal{F}_t]$, and $\lambda_t$ can be understood as the propensity for an initiating event to arrive in the next instant of time, given the history of initiating events and reactor protections.
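The defining property of the compensator in Eqs. (41)–(44), namely that $Q_t$ and $\Lambda_t = \int_0^t \lambda_u\,du$ share the same expectation, can be probed numerically. The sketch below uses a two-state modulated intensity chosen purely for illustration (none of the parameters come from the chapter) and compares the sample means of $Q_t$ and $\Lambda_t$ over many simulated paths.

```python
import numpy as np

rng = np.random.default_rng(seed=4)

def one_path(t_end=100.0, lam_lo=0.1, lam_hi=2.0, switch=0.05):
    """Simulate one path of a two-state modulated arrival process (toy model).

    Returns (Q_t, Lambda_t): the number of arrivals in [0, t_end] and the
    integrated intensity int_0^t lambda_u du along the same path.
    """
    t, hi, Q, Lam = 0.0, False, 0, 0.0
    while t < t_end:
        stay = min(rng.exponential(1.0 / switch), t_end - t)
        lam = lam_hi if hi else lam_lo
        Q += rng.poisson(lam * stay)   # arrivals over a constant-intensity stretch
        Lam += lam * stay
        t += stay
        hi = not hi
    return Q, Lam

Q, Lam = np.array([one_path() for _ in range(5_000)]).T
print("mean Q_t      :", round(Q.mean(), 2))   # compensator property: these two
print("mean Lambda_t :", round(Lam.mean(), 2)) # sample means should agree closely
```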

Now, consider the martingale transform $M_t^C$ of protection unavailability $(1 - A_t)$ with respect to $M_t$ of Eq. (41), where

$$ M_t^C \overset{\mathrm{def}}{=} \int_0^t (1-A_u)\, dM_u = \int_0^t (1-A_u)\, dQ_u - \int_0^t (1-A_u)\, d\Lambda_u. $$

Proposition 1 (Core Damage Martingale). $M^C \equiv \{M_t^C\}_{t\ge 0}$ is a martingale whenever the stochastic intensity process of arriving initiating events $\{\lambda_t\}_{t\ge 0}$ exists.

Proof: Since $A_t$ is $\mathcal{F}_t$-predictable and $0 \le A_t(\omega) \le 1$ for all $\omega \in \Omega$ and $t\ge 0$, it follows that $\{M_t^C\}_{t\ge 0}$ is also an $\mathcal{F}_t$-martingale (see [17]). Noting that $Q_t^C = \int_0^t (1-A_u)\, dQ_u$ counts the number of core damage events in the interval $[0, t]$, we have that

$$ M_t^C = Q_t^C - \int_0^t (1-A_u)\, d\Lambda_u \tag{45} $$

and, substituting from Eq. (43) gives

$$ M_t^C = Q_t^C - \Lambda_t^C. \tag{46} $$

We refer to $M^C$ as the Core Damage Martingale, and its compensator is given by

$$ \Lambda_t^C = \int_0^t \lambda_u^C\, du = \int_0^t (1-A_u)\,\lambda_u\, du. \tag{47} $$

Remark 8. Eq. (46) stands as the most general available expression characterizing the relationship among core damage events, the arrival of initiating events, and the efficacy of reactor protections. It is important to keep in mind that, for all $t\ge 0$, $M_t^C$, $Q_t^C$, $\lambda_t^C$, $A_t$, and (in particular) $\lambda_t$ are all random variables. Hence, Eq. (46) is nontrivial and must be examined in the context of stochastic integration.
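In a simulation, the counting integral $Q_t^C = \int_0^t (1-A_u)\, dQ_u$ amounts to counting the initiating events that arrive while protections are unavailable. The sketch below illustrates only that bookkeeping; the arrival times and outage intervals are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=5)

# Illustrative inputs: initiating-event arrival times and the intervals during
# which protections are unavailable (failures or maintenance outages).
arrivals = np.sort(rng.uniform(0.0, 100.0, size=40))          # arrival times T_n
down_intervals = [(12.0, 13.5), (40.0, 41.0), (77.0, 80.0)]   # A_u = 0 on these

def unavailable(u, intervals=down_intervals):
    """Indicator 1 - A_u: True when protections are down at time u."""
    return any(lo <= u < hi for lo, hi in intervals)

# Q_t^C = integral of (1 - A_u) dQ_u: count arrivals landing in down intervals.
QC = sum(unavailable(u) for u in arrivals)
print("initiating events:", len(arrivals), "| core damage events Q_C:", QC)
```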

Consider now the following proposition:

Proposition 2 (Existence of CDF). If

$$ \lim_{t\to\infty} \frac{M_t^C}{t} \overset{a.s.}{=} 0, \tag{48} $$

then $\lambda^C$ exists (and is possibly a random variable) and, for almost all (a.a.) $\omega \in \Omega$,

$$ \lambda^C(\omega) = \lim_{t\to\infty} \frac{Q_t^C(\omega)}{t} = \lim_{t\to\infty} \frac{1}{t}\int_0^t \big(1-A_u(\omega)\big)\,\lambda_u(\omega)\, du, \tag{49} $$

where $0 < \lambda^C(\omega) < \infty$. That is,

$$ \lim_{t\to\infty}\frac{M_t^C}{t} \overset{a.s.}{=} 0 \quad\text{if and only if}\quad \lim_{t\to\infty}\frac{Q_t^C}{t} \overset{a.s.}{=} \lambda^C \ \text{ and }\ \lim_{t\to\infty}\frac{1}{t}\int_0^t (1-A_u)\,\lambda_u\, du \overset{a.s.}{=} \lambda^C. \tag{50} $$

Proof: Proposition 2 is an obvious consequence of Definition 5.1 and Eq. (46).

Remark 9. Clearly, CDF exists only if

$$ \frac{M_t^C}{t} \xrightarrow{a.s.} 0 \quad\text{and}\quad \lambda^C \overset{a.s.}{=} E[\lambda^C] < \infty. \tag{51} $$

Proposition 2 reveals the challenge in estimating CDF. In the absence of observed core damage events, predictive estimates of CDF must be formulated in terms of phenomena that can be observed. To this end, analysts must rely on observations of initiating event arrival times and reactor protection performance (principally in the form of maintenance records and failure data). These observations are, of course, insufficient to capture the joint dynamics of $\{(A_t, \lambda_t)\}_{t\ge 0}$ needed to directly employ the strong-law relationship of Proposition 2, where

$$ \lambda^C \overset{a.s.}{=} \lim_{t\to\infty}\frac{\Lambda_t^C}{t} = \lim_{t\to\infty}\frac{1}{t}\int_0^t (1-A_u)\,\lambda_u\, du. \tag{52} $$

Monte Carlo methods do not escape the difficulty of computing $\lambda^C$. $A$ and $\{\lambda_t\}_{t\ge 0}$ are not mutually independent (even when $Q$ is Poisson with rate $\lambda$), and $\lambda_t$ is not directly observable. Since this dependence requires a Monte Carlo model to rely on an accurate characterization of the probability law on the joint stochastic process $\{(A_t, \lambda_t)\}_{t\ge 0}$, it is clear that the data requirements to support accurate estimation of this probability law are beyond the practical reality of reactor unit operations records.

5.4 Cumulative assumption: arriving initiating events see time averages

Important insights regarding CDF are revealed by exploring the expectation of $M^C$. In particular, we are interested in the consequences of the stochastic dependence between the state of system protections $\{A_t\}_{t\ge 0}$ and the arrival of initiating events $\{Q_t\}_{t\ge 0}$, and consequently $\{\lambda_t\}_{t\ge 0}$. Consider now,

Proposition 3 (Moment Convergence). Suppose that

$$ \lim_{t\to\infty}\frac{M_t^C}{t} \overset{a.s.}{=} 0. \tag{53} $$

Then,

$$ \lim_{t\to\infty} E\!\left[\frac{M_t^C}{t}\right] = 0 \quad\text{if and only if}\quad \lim_{t\to\infty}\frac{E[Q_t^C]}{t} = E[\lambda^C] \ \text{ and }\ \lim_{t\to\infty} E\!\left[\frac{1}{t}\int_0^t (1-A_u)\,\lambda_u\, du\right] = E[\lambda^C]. \tag{54} $$

And, with the additional condition that $\lambda^C = E[\lambda^C] < \infty$,

$$ \mathrm{CDF} = \lim_{t\to\infty}\frac{1}{t}\int_0^t E\big[(1-A_u)\,\lambda_u\big]\, du. \tag{55} $$

Proof: Recall that almost sure convergence, together with appropriate integrability conditions, implies convergence in expectation (see [18]). Hence, Eq. (54) follows directly from Proposition 2. Nonnegativity of the integrand in Eq. (54) allows a routine application of Tonelli's Theorem to exchange the order of expectation and integration to show Eq. (55) (see [19]). Finally, $\lambda^C = E[\lambda^C] < \infty$ implies that

$$ \frac{Q_t^C}{t} \xrightarrow{a.s.} E[\lambda^C] < \infty, \tag{56} $$

thus ensuring the existence of CDF.

Recalling the definition of covariance, it immediately follows that:

Corollary 1. When $\lambda^C = E[\lambda^C] < \infty$, then

$$ \mathrm{CDF} \overset{a.s.}{=} \lim_{t\to\infty}\frac{1}{t}\int_0^t \mathrm{cov}\big(1-A_u,\,\lambda_u\big)\, du + \lim_{t\to\infty}\frac{1}{t}\int_0^t E[1-A_u]\,E[\lambda_u]\, du. \tag{57} $$

Proof: Simply apply the definitions of CDF and covariance.
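At any fixed time $u$, the decomposition in Eq. (57) is just the covariance identity $E[(1-A_u)\lambda_u] = \mathrm{cov}(1-A_u, \lambda_u) + E[1-A_u]\,E[\lambda_u]$. The sketch below uses an invented joint model in which a latent stress variable raises both the intensity and the unavailability (an assumption for illustration only) to show how keeping only the product term, as in Eq. (58), understates the full expectation when the dependence is positive.

```python
import numpy as np

rng = np.random.default_rng(seed=6)

# Illustrative joint model: a latent "stress" variable raises both the
# initiating-event intensity lambda_u and the chance that protections are down.
stress = rng.gamma(shape=2.0, scale=1.0, size=200_000)
lam_u = 0.1 * stress                                       # intensity samples
unavail_u = (rng.random(stress.size) < 0.01 * stress).astype(float)  # 1 - A_u

lhs = np.mean(unavail_u * lam_u)                  # E[(1 - A_u) lambda_u]
cov = np.cov(unavail_u, lam_u)[0, 1]              # cov(1 - A_u, lambda_u)
product = unavail_u.mean() * lam_u.mean()         # term kept by Eq. (58)

print("E[(1-A)lambda]         :", round(lhs, 5))
print("cov + E[1-A] E[lambda] :", round(cov + product, 5))
print("product term alone     :", round(product, 5))  # smaller when cov > 0
```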

Corollary 2. When $\lambda^C = E[\lambda^C] < \infty$,

$$ \mathrm{CDF} \overset{a.s.}{=} \lim_{t\to\infty}\frac{1}{t}\int_0^t E[1-A_u]\,E[\lambda_u]\, du \tag{58} $$

if and only if $E[\lambda_t \mid \mathcal{F}_t] = E[\lambda_t]$ for all $t\ge 0$.

Proof: It need only be shown that $\mathrm{cov}(1-A_t, \lambda_t) = 0$ if and only if $E[\lambda_t \mid \mathcal{F}_t] = E[\lambda_t]$. First, assume that $E[\lambda_t \mid \mathcal{F}_t] = E[\lambda_t]$ for all $t\ge 0$. Note that

$$ E[(1-A_t)\lambda_t] = E\big[E[(1-A_t)\lambda_t \mid \mathcal{F}_t]\big] = E\big[(1-A_t)\,E[\lambda_t \mid \mathcal{F}_t]\big] = E[\lambda_t]\,E\big[E[(1-A_t) \mid \mathcal{F}_t]\big] = E[\lambda_t]\,E[1-A_t]. \tag{59} $$

Thus, it follows from the definition of covariance that $\mathrm{cov}(1-A_t, \lambda_t) = 0$. Conversely, with $\mathrm{cov}(1-A_t, \lambda_t) = 0$ it follows trivially that

$$ E[(1-A_t)\lambda_t] = E[\lambda_t]\,E[1-A_t]. \tag{60} $$

Corollary 3. When $E[\lambda_t \mid \mathcal{F}_t] = E[\lambda_t] = \lambda < \infty$ and $A_t \xrightarrow{a.s.} A_\infty$, then

$$ \mathrm{CDF} \overset{a.s.}{=} \lambda \lim_{t\to\infty}\frac{1}{t}\int_0^t (1-A_u)\, du. \tag{61} $$

Proof: It follows from Eq. (58) that, when $E[\lambda_t \mid \mathcal{F}_t] = E[\lambda_t] = \lambda < \infty$,

$$ \mathrm{CDF} \overset{a.s.}{=} \lambda \lim_{t\to\infty}\frac{1}{t}\int_0^t E[1-A_u]\, du, \tag{62} $$

and when $A_t \xrightarrow{a.s.} A_\infty$,

$$ \lambda \lim_{t\to\infty}\frac{1}{t}\int_0^t (1-A_u)\, du \overset{a.s.}{=} \lambda \lim_{t\to\infty}\frac{1}{t}\int_0^t E[1-A_u]\, du, \tag{63} $$

and Eq. (61) follows.

Remark 10. It is important to appreciate that Corollary 3 does not imply that the state of reactor protections $A$ is independent of initiating event arrivals $Q$. The condition $E[\lambda_t \mid \mathcal{F}_t] = \lambda$ is simply congruent with the lack-of-anticipation property of $A_t$.22 The conditions establishing Corollary 3 allow for the possibility that initiating events can cause a failure of system protections (in addition to the possibility that protections were already failed immediately prior to arrival).

Recall Eq. (38). Perhaps the most important consequence of assuming that arriving initiating events see time averages is:

Corollary 4. When $E[\lambda_t \mid \mathcal{F}_t] = \lambda$, then

$$ p \triangleq \lim_{t\to\infty}\frac{Q_t^C}{Q_t} = E[1-A_\infty]. \tag{64} $$

Proof: Substitute Eq. (61) into Eq. (39).

5.5 Cumulative assumption: initiating events form a Poisson process

Eq. (61) admits, as a special case, the condition that the initiating event counting process $Q$ forms an ordinary Poisson process of rate $\lambda$. By the Watanabe characterization, it is well known that the stochastic intensity of a counting process is a fixed constant if and only if that counting process forms an ordinary Poisson process. Thus, when $Q$ forms an ordinary Poisson process of rate $\lambda$, its intensity is $\lambda_t = \lambda$ for all $t\ge 0$, where $\lambda$ is a positive-valued constant. Returning to Eq. (46), it follows that the constant $\lambda$ can be moved outside the integral in the compensator, giving

$$ \Lambda_t^C = \int_0^t (1-A_u)\,\lambda_u\, du = \lambda \int_0^t (1-A_u)\, du. \tag{65} $$

With the intensity appearing in the compensator of the core damage martingale a.s. constant for all $t\ge 0$, as given by Eq. (65), a special case of Corollary 3 is obtained. The condition $E[\lambda_t \mid \mathcal{F}_t] = \lambda$ given in Corollary 3 is more general (and more difficult to calibrate with historical data) than the Poisson initiating event assumption (see [20]).

Remark 11. The Poisson initiating event assumption does not imply independence between $A$ and $Q$; on the contrary, $A$ and $Q$ remain dependent since there is the possibility that arriving initiating events will disrupt protections and lead to an accident. Wolff gives a full development of Poisson arrivals see time averages (see [14]).

An obvious benefit gained when Q is assumed Poisson is that statistical calibration of λ becomes accessible using historical data. Since all Poisson processes are stationary and ergodic, the parameter λ is easily estimated since

$$ \lambda = \lim_{t\to\infty}\frac{1}{t}\,Q_t(\omega) \tag{66} $$

for a.a. $\omega \in \Omega$.
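Under the Poisson assumption, Eq. (66) justifies the elementary truncated estimate $\hat{\lambda} = Q_t/t$. A minimal sketch, with the true rate and the length of the pooled operating history assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Illustrative assumption: initiating events arrive as a Poisson process with
# true rate 0.3 per year, observed over a 400 reactor-year pooled history.
true_rate, years = 0.3, 400.0
Q_t = rng.poisson(true_rate * years)   # total observed count over the window

lam_hat = Q_t / years                  # Eq. (66), truncated at finite t
print("lambda_hat =", round(lam_hat, 4), "(true rate =", true_rate, ")")
```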

Assuming that $Q$ is Poisson does not change Eq. (61). But all Poisson processes implicitly carry the very strong independent increments requirement, meaning here that the numbers of initiating events appearing in any collection of disjoint time intervals are mutually independent. In practice, the independent increments condition is extremely difficult to justify.

5.6 Cumulative assumption: the protection process is stationary and ergodic

Returning to Corollary 3 with $A_\infty = \lim_{t\to\infty} A_t$, it is clear that for a.a. $\omega \in \Omega$ the split fraction $p$ is given by

$$ p = E[1-A_\infty] = \lim_{t\to\infty}\frac{1}{t}\int_0^t \big(1-A_u(\omega)\big)\, du. \tag{67} $$

Calibrating $p$ with historical data, as is preferred with PRA, is generally inaccessible in the limit. However, assuming that the protection availability process $A$ is both stationary and ergodic guarantees that $A_t$ is measurable with respect to its tail σ-algebra for all $t\ge 0$, and thus

$$ p = E[1-A_\infty] = E[1-A_t] = P(A_t = 0), \tag{68} $$

for any fixed $t>0$. Clearly, the best available estimate of the split fraction $p$ along the (only) historically available $\omega$ is given by

$$ p \approx \frac{1}{t}\int_0^t \big(1-A_u(\omega)\big)\, du. \tag{69} $$

Typically, practitioners rationalize stationarity and ergodicity of $A$ by claiming knowledge of a long history of protection operations augmented with an extensive observed history of initiating events and equipment maintenance activity. Based on that experience, they will designate where to establish time $t=0$ in Eq. (69). However, the approximation in Eq. (69) suffers a bias that is a consequence of well-designed protections: the speed of convergence of Eq. (67) depends on observing initiating events that will cause protections to fail. Obviously, well-designed reactor protections will experience few (if any) such events. This implies that, even under the assumption that $A$ is stationary and ergodic, Eq. (69) will be optimistically biased for all $t<\infty$.
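The bias just described can be reproduced in a toy setting. In the sketch below, every element of the model (shock rate, repair time, the rule that protections fail only when a rare initiating event damages them) is an assumption made for illustration. Short observation windows often contain no damaging events, so the finite-$t$ version of Eq. (69) tends to report a smaller unavailability than the long-run value.

```python
import numpy as np

rng = np.random.default_rng(seed=8)

def unavailability_estimate(window, shock_rate=0.05, repair=2.0):
    """Finite-window version of Eq. (69) along one observed path.

    Toy model: protections fail only when a 'shock' initiating event arrives
    (Poisson, shock_rate per year) and then stay down for `repair` years.
    """
    down = 0.0
    t = rng.exponential(1.0 / shock_rate)        # first shock after time 0
    while t < window:
        down += min(repair, window - t)          # truncate outage at window end
        t += repair + rng.exponential(1.0 / shock_rate)
    return down / window

short = np.mean([unavailability_estimate(10.0) for _ in range(20_000)])
long = np.mean([unavailability_estimate(2_000.0) for _ in range(200)])
print("mean p_hat, 10-year windows  :", round(short, 4))
print("mean p_hat, 2000-year windows:", round(long, 4))
print("long-run unavailability      :", round(2.0 / (2.0 + 1 / 0.05), 4))
```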

5.7 Cumulative assumption: protection unavailability is independent of all initiating events

The optimistic bias of Eq. (69) disappears under the additional assumption that $A$ and $Q$ are independent stochastic processes. Under this circumstance, since both the initiating event process $T$ and the protection availability process $A$ are adapted to the filtration $\{\mathcal{F}_t\}_{t\ge 0}$ as described in Section 2, it follows that

$$ E[A_{T_n}] = E[A_t] \tag{70} $$

for all $t\ge 0$ and $n \in \mathbb{N}$. Independence of $A$ and $Q$ is congruent with the belief that reactor protections are so robust as to never fail due to the impact of an initiating event. Or, equivalently, an initiating event will induce an accident only if it finds protections already out of service upon arrival.
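Eq. (70) is easy to examine numerically when independence is imposed by construction. In the sketch below (all distributions and parameters are illustrative assumptions), an alternating up/down availability path is generated independently of a Poisson arrival stream, and the availability seen at arrival epochs is compared with the time-average availability.

```python
import numpy as np

rng = np.random.default_rng(seed=9)

# Illustrative model: an alternating up/down availability path generated
# independently of Poisson initiating-event arrival times.
horizon, lam, mtbf, mttr = 100_000.0, 0.3, 400.0, 25.0

downs, t, up = [], 0.0, True          # collect (start, end) of down intervals
while t < horizon:
    dur = rng.exponential(mtbf if up else mttr)
    if not up:
        downs.append((t, min(t + dur, horizon)))
    t += dur
    up = not up
down_starts = np.array([d[0] for d in downs])
down_ends = np.array([d[1] for d in downs])

arrivals = np.cumsum(rng.exponential(1.0 / lam, size=int(2 * lam * horizon)))
arrivals = arrivals[arrivals <= horizon]

# An arrival finds protections down if it lands inside a down interval.
idx = np.searchsorted(down_starts, arrivals, side="right") - 1
down_at_arrival = (idx >= 0) & (arrivals < down_ends[np.clip(idx, 0, None)])

time_avg_A = 1.0 - (down_ends - down_starts).sum() / horizon
print("E[A_t]    (time average)      :", round(time_avg_A, 4))
print("E[A_T_n]  (at arrival epochs) :", round(1.0 - down_at_arrival.mean(), 4))
```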

Of course, the model assumption that $A$ and $Q$ are independent stochastic processes is very strong and hardly justifiable in practice. That said, PRA methodology relies directly on the cumulative assumptions leading to Eq. (69). The assignment of the split fraction value $p$ then appeals to ad hoc analysis (e.g., a penetration factor) for which little or no observed historical calibration data can be found.


6. Summary: reasonableness assumptions and the consequences of making them

Numbers such as CDF or Large Early Release Frequency (LERF) are intended to inform engineers, regulators, and other citizens of the quantitative level of risk posed by a commercial nuclear power plant. Risk in this context is measured against the requirement of the Atomic Energy Act of 1954, as amended (AEA), for "adequate protection" of the health and safety of the public.23 The protection contemplated in the AEA is intended to help prevent harm to the public from uncontrolled release of radioactive material. Such harms can come from accident scenarios involving loss of reactivity control and loss of core cooling following reactor shutdown. Protective systems are put in place under regulation to minimize the likelihood of such scenarios.

Sections 3–5 review the counting of consequential outcomes in hazardous processes, such as commercial nuclear power plant technologies, as they relate to "unknown-unknowns" (Section 4) and the assumptions required to obtain a numerical result where data are sparse (Section 5), particularly with respect to CDF. The review shows that PQ applied to robust protective systems, where little or no data are available for calibration and validation, will produce optimistic estimates for the risk of protection breakdowns leading to disastrous consequences. The assumptions progressively introduced in current popular PQ-based methods in order to obtain quantitative levels of risk are: (a) there are no unknown-unknowns, (b) there are no absorbing states, (c) CDF is well-defined, (d) arriving initiating events see the time average of protection availability, (e) arriving initiating events form a Poisson process, (f) the protection availability process is stationary and ergodic, and (g) protection unavailability is independent of all initiating events. The implication of adopting such assumptions is that the number obtained will underestimate the frequency of an accident by an unknowable amount.

Commercial nuclear power, as currently regulated by Western standards, is arguably the safest of all energy technologies currently available (see again [7], on "safe" and "safety"). Regulatory standards, regulatory inspection and enforcement, management oversight, and engineering practice implement available engineering solutions against "unknown unknowns" that include safety margins, defense in depth, root cause analysis, and corrective action to mitigate the consequences of "unknown unknown" events as they actually appear or are imagined to appear. Investors, regulators, design engineers, operators, and the public can best review the efficacy of protection using PQ up to the point of risk quantification. Such fundamental engineering processes as design, testing, maintenance, operation, and design revision are supported by PQ, which holds out hope for categorizing breakdown scenarios, for example, by the level of support they afford. Risk in modern fission reactor technologies is best managed when the knowledge acquired from observations during ongoing operations is applied to protection going forward. This implies a strong organizational commitment to efficacious root cause analysis and corrective action throughout an asset's lifetime.

References

  1. Kemeny JG. The need for change, the legacy of TMI: Report of the President's Commission on the Accident at Three Mile Island, John G. Kemeny, chairman. Washington, D.C.: s.n.: for sale by the Supt. of Docs., U.S. Govt. Print. Off; 1979
  2. Rogovin M. Three Mile Island: A Report to the Commissioners and to the Public. Vol. 1250. Nuclear Regulatory Commission, Special Inquiry Group; 1980
  3. Tobias A. Decay heat. Progress in Nuclear Energy. 1980;5:1-93
  4. Solberg Ø, Njå O. Reflections on the ontological status of risk. Journal of Risk Research. 2012;15(9):1201-1215
  5. Cardano G, Wilks S. The Book on Games of Chance: "Liber de Ludo Aleae", translated by Sydney Henry Gould. New York, NY, USA: Holt, Rinehart & Winston; 1961
  6. Bernoulli J. Ars Conjectandi, Opus Posthumum. Accedit Tractatus de Seriebus Infinitis, et Epistola Gallice scripta de Ludo pilae reticularis. Basileae: Impensis Thurnisiorum, Fratrum – Werke 3, 1713. pp. 107-286
  7. Hansson SO. Safety is an inherently inconsistent concept. Safety Science. 2012;50(7):1522-1527
  8. Doorn N, Hansson S. Should probabilistic design replace safety factors? Philosophy & Technology. 2011;24(2):151-168
  9. Möller N, Hansson SO, Peterson M. Safety is more than the antonym of risk. Journal of Applied Philosophy. 2006;23(4):419-432
  10. Hansson SO. The epistemology of technological risk. Techné. 2005;9(2):68-80
  11. McTaggart JE. The unreality of time. Mind. 1908;17(68):457-474
  12. Kolmogorov AN. Foundations of the Theory of Probability: Second English Edition. New York, NY, USA: Courier Dover Publications; 2018
  13. Øksendal B. Stochastic Differential Equations: An Introduction with Applications. Berlin, Germany: Springer Science & Business Media; 2013
  14. Wolff RW. Poisson arrivals see time averages. Operations Research. 1982;30(2):223-231
  15. Bremaud P. Point Processes and Queues: Martingale Dynamics. New York: Springer-Verlag; 1981
  16. Çınlar E. Probability and Stochastics. New York, NY, USA: Springer; 2011
  17. Rogers LCG, Williams D. Diffusions, Markov Processes and Martingales. 9th ed. Cambridge, UK: Cambridge University Press; 2000
  18. Dudley RM. Real Analysis and Probability. Boca Raton, FL, USA: CRC Press; 2018
  19. Folland GB. Real Analysis: Modern Techniques and Their Applications. Vol. 40. New York, NY, USA: John Wiley & Sons; 1999
  20. Melamed B, Whitt W. On arrivals that see time averages: A martingale approach. Journal of Applied Probability. 1990;27(2):376-384

Notes

  1. The President's Commission report "The Accident at Three Mile Island" states that the level of radioactivity released from the Three Mile Island accident was small; see Kemeny ([1], p. 34).
  2. Even this overpressure protective system was backed up with separate overpressure relief valves in case the pressure had continued to rise.
  3. The General Design Criteria require specific performance characteristics of containments. See 36 FR 3256, Feb. 20, 1971, as amended at 36 FR 12733, July 7, 1971; 41 FR 6258, Feb. 12, 1976; 43 FR 50163, Oct. 27, 1978; 51 FR 12505, Apr. 11, 1986; 52 FR 41294, Oct. 27, 1987; 64 FR 72002, Dec. 23, 1999; 72 FR 49505, Aug. 28, 2007.
  4. See, for example, https://www.nrc.gov/docs/ML1418/ML14188A495.pdf
  5. A description of the condensate polishing system is given in [1]. It contains resin beads that chemically condition the water flowing through them and act like a filter susceptible to plugging up.
  6. The "Law of Large Numbers" was formally proved by Bernoulli.
  7. While there is a vast engineering literature touting non-probabilistic strategies for addressing uncertainty (fuzzy logic, interval arithmetic, Dempster-Shafer, etc.), only probability theory has survived the unrelenting scrutiny of philosophers of mathematics as a viable theory of uncertainty. We embrace probability theory as the preferred logical framework for predictive modeling in engineering practice.
  8. For example, one cannot assign the probability of occurrence of any failure mode that one does not yet know possibly exists ... a point seemingly overlooked by many risk analysts.
  9. There is no guarantee that all object features in the domain of state variables are physically observable.
  10. $\mathcal{B}(S)$ is the standard Borel σ-algebra generated by the object state space $S$.
  11. Recall that a protective system is comprised only of elements that are mandated by regulatory oversight (i.e., its elements would not be present within the enterprise operations control volume in the absence of regulatory requirements).
  12. The object's (limiting) availability is often a metric of interest and is taken as $A_\infty \triangleq \lim_{t\to\infty} E[A_t \mid \mathcal{F}_t^Z]$, when it exists.
  13. In sharp contrast to our analyses, PRA, QRA, and PSA require that the state space $S$ be almost surely (a.s.) composed of only persistent states, none of which are absorbing. The practical consequence of this state classification requirement is explored in Section 5.
  14. Support for this assumption requires the state (normal) to be observable.
  15. A good system for setting the level of consequence is the International Nuclear Event Scale (INES). In this system, we would call events similar to those having INES Level 4 or above an accident. The NRC defines an Extraordinary Nuclear Occurrence (ENO) in 10 CFR 140.83.
  16. Consideration of disastrous events is from a public policy perspective and focuses on collateral economic harm to people and/or the environment outside of the security fence.
  17. Ground water contamination sourced from the Savannah River Site facility continued developing over many years, with an accident being discovered long after the facility first entered into distressed states of operation.
  18. The arrival processes $T$ and $T^C$ play a prominent role in PRA, QRA, and PSA. However, studying their corresponding arrival counting processes, in order to access important results from the martingale calculus, is preferred.
  19. The terminology unknown-unknowns was popularized by former United States Secretary of Defense Donald Rumsfeld.
  20. Unknown-knowns also bias predictive models of initiating events.
  21. $E[A_t \mid \mathcal{F}_t]$ represents the equivalence class of random variables satisfying the definition of conditional expectation on the probability space $(\Omega, \mathcal{F}, P)$.
  22. Recall from Section 2 that the state of system protections $A_t$ at time $t$ does not influence the arrival times of future initiating events.
  23. The adequate protection language appears, for example, in Sections 182 and 189 of the Atomic Energy Act.
