The aim of the chapter is to examine thoroughly the aspects connected with risk management in complex organisations. It starts by considering theoretical models of accident analysis in which the accident is attributable not so much to a single worker’s malpractice as to deficiencies in the whole organisational structure for the development and control of processes. It concludes with the concept of the ‘resilient system’, i.e. dynamic organisations in which the very idea of safety and its applications in the objective context evolve and adapt to changes in the situation by learning from mistakes. With this in mind, the chapter aims to emphasise that actions, practices, procedures and controls, as well as human relationships, communication, managerial and organisational policies, incentives, disincentives and reward and penalty systems, are all determining factors in the reduction of risk and should consequently all be taken into account in the logical process of continuous improvement which is fundamental to any virtuous system.
- risk management
- complex organisations
- work accidents
- accident analysis
- Swiss Cheese Model
- resilient system
Since the mid-nineteenth century, with the advent of so-called ‘mass industrialisation’ and the subsequent birth of capitalist society, most industrialised countries have witnessed the spread of complex political and cultural phenomena. These set entrepreneurs, who were seeking to maximise the profits made possible by newly introduced production technologies, against the working class, which fell victim to an increased cost of living combined with meagre wages and the often inhumane working conditions in which it was forced to operate. This led to the inescapable necessity for workers to join forces to deal with the ‘contractual dictatorship’ of entrepreneurs, drawing the attention of the public authorities and stimulating them to enact laws to protect workers, not only in terms of their wages, but also to safeguard their physical integrity.
This led to the spread in most industrialised countries of the first laws to protect the health and safety of workers. However, for many years, these laws were merely designed to provide compensation to the injured, because they were intended as a tool to ‘repair the damage’ rather than to implement preventive measures aimed at reducing the number and severity of accidents. The latter were in fact mostly considered, according to the culture of the time, as the result of fate, or in other words, as the unavoidable consequence of the advent of mass industrialisation that had led to the spread of increasingly effective but potentially hazardous production technologies.
It would take another few decades (starting from the middle of the last century) for the public authorities to realise the need to implement a preventive system that would aim to reduce the occurrence of injuries rather than to limit their consequences in a mere logic of compensation. The new preventive approach paved the way for the creation of that branch of knowledge related to the study of injury dynamics linked to the spread of increasingly sophisticated production technologies, as well as increasingly complex business management and organisational systems. The aim of these was to find new operational tools of management that could contribute significantly to reducing accidents and thereby ensuring healthier and more efficient working conditions in terms of costs and expected results.
The same preventive approach has undergone a deep conceptual evolution since its birth, shifting from considering injuries as a consequence of the spread of increasingly sophisticated and dangerous production technologies (1950s–1960s) towards a view of injuries as the result of the failure of man, who is unable to operate safely (1970s–1980s), and eventually leading to more modern interpretations of an organisational and cultural kind (1990s–2000s). The latter were developed in more recent years, based on analyses that focused on the interactions between man and the environment in which he operates, and between technology and its practical use. This was done by establishing rules and procedures where man is considered the main actor who conceives, designs, implements and manages the whole organisation, interacting with machines and with others through communication and social and interpersonal relationships.
The 1950s, therefore, saw the birth of the first major series of studies on risk and safety management in the workplace, as a result of the rapid industrial and technological developments of the post-war period. As we have seen, these studies focused mainly on the failure of technology and then on the design and construction of technological artefacts. According to this perspective, preventing injury was to be achieved by improving the reliability of industrial machines and their accessibility, and by reducing the ‘residual risk’, that is, making them ultimately safer. The machine, placed at the centre of the organisational and productive system, guaranteed the success of the corporate mission. In other words, it was seen as having the ability to stand up to the market, to be competitive and to progress. From this viewpoint, where everything revolved around technology, man became less important. The worker was seen as the mere executor of practical actions that were often repetitive and insignificant compared to what the machine was capable of doing. As a result, even with regard to injuries, man was rarely seen as being directly responsible for what happened; responsibility was instead attributed to the machine and its lack of reliability and safety. Accidents, as a matter of fact, were often a result of malfunctioning equipment and technologies, as well as deficiencies in the working methods associated with them. Fires, explosions and failures were frequent and the ‘residual risk’ inherent in each machine was very high. Injuries could be reduced by improving the design and development of new technologies, making them more reliable and by adopting effective safety systems that would make the occurrence of accidents less likely.
Twenty years later, starting with the 1970s, scholars began to look for the root causes of accidents no longer solely in the dangerousness of machinery and work equipment, but rather in the failure (mistakes) of workers. In fact, technological progress had evolved very quickly and the result was that more reliable and safer machinery, equipment and systems had become available. Even the so-called ‘residual risk’, i.e. the risk considered acceptable in relation to the context and the complexity of the machine operated, had dropped considerably, thanks to the introduction of technical standards that required a careful analysis of the potential hazards during the design stage.
Then why were accidents still occurring? The answer according to scholars was to blame man: if an accident happened, it was because man had operated the machine improperly. This shift of perspective was favoured by the idea that only by looking at the individual causes is it possible to come to an understanding of the factors that caused the accident. This approach is based on the logic of individual blame, namely on an ‘accusatory perspective that focuses on the errors and omissions of individuals, with the belief that expert actors wouldn’t make mistakes’. Human nature, therefore, is not seen as inherently fallible. Rather, the idea is that the action of man can be compared to that of machines, and that the same reliability can be demanded from it. If an operator has been properly trained and has sufficient experience, he will not make mistakes!
This claim, as will be seen later, is only partly true, because man by his very nature will never be infallible. He will, however, approach infallibility to the extent that he is put in a position to operate at his best. This depends not only on his level of training and his familiarity with the machine, but also on the production system having been considered as a whole, with machines designed in line with the operational needs of those who operate them. In those years, however, the efforts to deal with injuries (which were seen mainly as a result of human errors) were directed at people ‘in the front line’, pinning the ‘blame’ on someone and then removing the ‘bad apples’. The person approach followed a disciplinary logic that did not involve any intervention at a systemic level, but rather triggered a ‘blame culture’, which as we will see later is not conducive to the identification of errors, and as such, ‘prevents the system to monitor the critical issues, learn from its mistakes and improve as a whole’. The person approach, therefore, focuses on actions, the direct source of the errors, and these same actions are seen as stemming from aberrant mental processes such as forgetfulness, carelessness, negligence or imprudence. In this way, the errors and accidents, but also the so-called ‘near misses’, are read and investigated as the direct result of the characteristics of human nature.
Later, in the late 1980s, a different perspective emerged, referred to as the socio-technical culture. This is an innovative approach that investigates the causes of errors and accidents by analysing the interaction between human and technical factors [7, 8].
The underlying assumption of the new socio-technical perspective is that accidents are caused by shortcomings and flaws in the controls and measures put in place by the organisation to curb risky events [5, 9, 10, 11]. According to this approach, individual actors are frequently the heirs of the system’s flaws. It follows that efforts to remedy those errors should be directed to the organisation as a whole, in order to improve the defences and remove its pitfalls. The socio-technical perspective also helps to spread the culture of safety, promoting organisational learning and improving the organisation through both reactive and proactive methods. Organisational accidents result from the concatenation of several latent factors that contribute to the event and that originate at the different levels of the system [13, 14]. The socio-technical approach, therefore, considers ‘the system’, understood as the manufacturing organisation as a whole, as the real culprit of failure, and within the system man is seen only as one of several elements that have contributed to causing the error. With this perspective, we understand how actions, practices, procedures and controls, but also human relationships, communication, managerial and organisational policies, incentives and disincentives, and reward and penalty systems, are all factors that determine the success or failure of the entire organisation. This is true in terms of both production and the safety of workers, and these factors all need to be considered in the logic of continuous improvement that underlies every virtuous system.
This logic is the backbone of the latest studies, which examine the cognitive and organisational processes that favour the reliability of organisations, reduce errors and improve safety conditions [15, 16]. It also underlies, as we will see later, other studies linked to Resilience Engineering, in which organisational systems are seen as dynamic processes that must continually adapt to achieve their goals, react to changes in the environment and manage unexpected events.
Ultimately, it is impossible not to note that the study of injury dynamics has led, hand in hand with the evolution of socio-technical organisational models, to the growth of the science known as ‘Ergonomics’, into which several fields of study have converged, such as psychology, medicine and engineering.
More specifically, we have seen the evolution of ‘cognitive ergonomics’, which aims to study the interaction between individuals and technology by developing models and tools for predicting human error, reducing mental workload and providing guidelines for designing machines that take into account the limited possibilities of the human cognitive system.
We have ultimately returned to focusing attention on the machines, not only to improve their inherent reliability by eliminating hazards and reducing ‘residual risks’, but also to design them with the person who will use them in mind, so that the user cannot make mistakes. In other words, the machine, operating tool and equipment, but also the organisation as a whole, are designed and manufactured to suit the needs of man, starting from his weaknesses and helping him to prevent mistakes. These ‘intelligent’ designs are, for example, equipment items made with clearly different shapes and colours, depending on how they are to be used; tools that do not allow their operation in potentially hazardous or unsuitable sites; conspicuously diversified bottles designed to contain different substances; machines that do not permit incorrect settings or operation; and robots that aid man in production, helping him to carry out operations with pinpoint precision, well beyond common human potential. In this case the tool, equipment, housing, container and the very machine are shaped to meet the needs of their users as closely as possible, taking into account such weaknesses as distraction, tiredness, stress, limited vision or inattention.
2. Risk management in complex socio-technical systems
A socio-technical system is nothing more than an organisation made up of people (human resources, which are the company’s personnel) and technologies (instrumental and production means). It is the interaction between man and production technology, as well as between man and man, that allows the system to operate, transforming input elements into outputs of potential interest to the market [21, 22].
We could compare the business to a human body, where there is a brain that plans, manages and sorts, in order to pursue the objectives, ideas and ‘mission’. The inputs are given by the brain and through neural connections they reach the various organs that make up the human being, thus allowing the action, namely the achievement of planned goals. This synergy between the decision-making organ and the executive organs allows the human being to act, performing tasks of various kinds, from seemingly simple ones such as eating, walking or talking, to more complex ones such as practising a sport, learning a trade or playing a musical instrument.
In corporate organisations, the brain can be represented by the senior management, or by the person or people who, being at the top of the organisation, define the production strategies, ultimately determining the ‘what’, ‘how’, ‘how much’ and ‘when’ to produce a specific output, which is the result of the business objectives and strategies. From senior management, orders branch out and, through a variety of communication channels (similar to neuronal connections), the inputs reach the departments responsible for the action, i.e. the production units. In this way, through a series of actions involving men, tools, technologies and know-how, the idea takes shape and materialises in the finished product, i.e. in that output that reflects, or should reflect, the goals, ideas and design requirements established by senior management.
A system organised in this way can therefore be more or less complex, and its complexity will have repercussions on the organisational, procedural and management capacities that allow it to operate, including those responsible for the safety and well-being of the people who work there.
Where does the complexity of a socio-technical system of this kind reside? It resides in the number and typology of the elements that form it and in their interactions, starting from the people who are part of it, as well as the technologies involved and the manufacturing system used.
However, there is a special feature that clearly distinguishes a complex socio-technical system from a human being. Man’s complexity is not tied to the probability of making mistakes, as is the case, instead, in the socio-technical system, in which the interdependence between the various component parts is often the cause of the sometimes unexpected failure of its work.
As a matter of fact, it is precisely the intrinsic complexity of the human being, the result of millions of years of evolution, that gives him a better chance to succeed in carrying out complex actions and that ultimately is what sets him apart from animals and makes him a superior being, guaranteeing the conservation and dominance of the species.
Humans make mistakes, of course, but the mistakes they make (which are the fruit, as we will see later, of several factors such as inexperience, distraction, incompetence, difficulty of the action to be taken, etc.) do not stem from the intrinsic complexity of the human being. Playing the piano, for example, is an obviously complex action that requires coordination, motor skills, rhythmic sense, concentration and knowledge of music and the instrument. However, what enables man to play the piano is precisely the fact that he has evolved into a complex organism, in which the parts of the system participate in unison in the execution of the action. Each element moves in harmony with the others, thereby ensuring the same level of concentration, coordination and commitment, and reacting similarly to external stimuli.
The human being is composed of a myriad of different cells and the brain itself is made up of cells grouped into several distinct parts, each of which is responsible for overseeing different functions. However, everything is coordinated properly in order to ensure that the processes responsible for carrying out an action move in unison in the same direction, pursuing the same objectives, eliminating the unpredictability of the consequences of their behaviour and sharing a clear understanding of the ‘what’, ‘how’, ‘how much’ and ‘when’ of a specific action. This complexity ultimately becomes a successful weapon.
The parts of a socio-technical system are also designed to work together to achieve the common goal. This interdependence, however, as mentioned, despite being led by a single vertex consisting of a small group of elements, cannot reach the same levels of synergy, harmony and coordination in sharing a common purpose, intent and goals, as is the case for the elements that form the human body. Therefore, the number of elements making up a socio-technical system increases its complexity and, consequently, the likelihood of failure, including the unpredictability of the consequences of its actions.
In the literature, we find three types of interdependence between the components of a socio-technical system: the ‘generic’, the ‘sequential’ and the ‘reciprocal’. These refer to exchanges of the inputs and outputs of each unit and in a way also define their complexity and the potential organisational interferences. In generic interdependence, a compartment or branch of an organisation typically has full spatial, organisational and decisional independence, and its dependence on the system lies objectively only in the fact that its survival in the market is linked to the survival of the parent company or the partner. In the other two types, the interdependence is more restrictive: in the sequential kind, one system unit’s output becomes an input for another, and in the reciprocal kind, units continuously exchange inputs and outputs in order to add value to the finished product. In this order, the three types of interdependence present increasing difficulties of coordination, as they involve an increasing degree of complexity, constraints and uncertainties.
In this light, it is clear how often the safety level guaranteed to a process cannot be separated from what is established in a different process and how ultimately many of the errors that cause a failure in terms of safety can arise from objectives that are not shared by the different parts that make up the system, or from the difficulty of working in synergy, as is the case, instead, between the parts that form the human body.
The time factor, for example, is one of the elements that most often contributes to the occurrence of an injury. Operating in a hurry on a tight schedule reduces the safety margins, increasing the likelihood of making mistakes. As such, it is essential to guarantee each component of the system the time needed to safely perform its actions under a single supervisor/manager who manages time as a resource fairly, effectively and efficiently. However, in reality, this often does not happen and some processes are carried out in a time that is not sufficient to guarantee that they are accomplished safely.
Leaving aside the case where a worker or group of workers decide independently and without a valid reason to act in a hurry, the reason why this happens is to be found in one of two causes. The first is that the time needed to complete the process was not properly estimated by management from the beginning; the second is that disruptions, due in turn to predictable or totally unexpected events, have altered the schedule for the process.
Ultimately, therefore, in the first case, the resource was incorrectly quantified, while in the second case, an unexpected or unpredictable event occurred during the course of operations that has changed the surrounding conditions.
Looking at the first case, the question is: why did management miscalculate the timing of the process? The answer is often to be sought precisely in the inherent complexity of socio-technical systems, starting from the analysis of the people and processes that influenced the final decision.
Contrary to what happens to the constituent parts of the human body, which, as we have seen, work in perfect synergy with the objective of ensuring the proper functioning of the whole organism, in the case of socio-technical systems the parts involved at times pull in opposite or conflicting directions, without being able to find an effective compromise. Reducing the production time, for example, is a goal of the management to cut down on costs. Therefore, the final decision may have been biased towards the goal of maximising profits, rather than towards acting according to high safety standards, underestimating the consequences that this may involve for the entire system.
Obviously, the greater the number of components in the system acting to pursue different objectives, the greater the likelihood of errors caused by incorrect mediation between the interests at stake. It is not uncommon for the different objectives of the parts of the same system to lead to inaccurate choices regarding the resources made available, which will therefore be to the benefit of one or more of the parts and to the detriment of others, with obvious repercussions on the management, organisation and safety of the latter.
The second reason that may have led to the shortage of time necessary to complete the process safely, instead, stems from the sequential or reciprocal interdependence between processes.
This is the case in which, for example, an unexpected delay in procurement has led to less time being available to complete the processes, i.e. disruptions have occurred that have changed the conditions originally forecast when designing and analysing the work stages.
The complexity of the system as a whole, such as the number of procurement sources required for the completion of the process, the number of variables which in turn have influenced the timing of each of the procurement processes, and the procedure for the exchange of output and input elements between processes, are just some of the factors that may have determined the problem.
As mentioned, sometimes the problem is predictable and therefore the system should be aware of the consequences it entails and should have already planned the corrective actions. At other times, instead, the problem is not predictable or is considered so remote as not to require a detailed plan of the actions envisioned to tackle it. This is especially true of complex systems, in which not everything can be planned in detail, since many of the elements that make up the system involve a high degree of behavioural unpredictability. In this case, the key to success for the organisation lies in being able to react quickly to the unexpected event by enforcing immediate compensatory measures to avoid the error, even if these measures were not originally planned and it is not possible to act according to a predetermined and well-tested pattern.
It is evident, however, that the complexity of the system, influenced not only by the number of elements that form it, but also by the number of variables that it handles, increases the probability of generating unexpected events. For these reasons, it is strategically important that the same system be able to react quickly to unexpected events, applying those compensatory strategies that enable it to prevent the deviation from becoming a source of failure.
This is what, as will be seen later, is identified as the organisation’s resilience. In other words, it is the capacity of complex socio-technical systems not only to plan in detail the processes and procedures that ensure the synergy of purpose and intent necessary to coordinate their various parts, but also to respond effectively to the unexpected event, being flexible and able to adapt to changing operating conditions, in order to find the key to success even when external elements intervene to disrupt the plan.
Sometimes the risk in a complex organisation can arise because its components are unaware of what the other production units have done, in spite of sharing resources such as the workspace or technology, thereby giving rise to harmful interference.
It is well known, in fact, that multiple workplaces operating simultaneously increase the performance of the system as a whole, allowing it to manufacture the product in less time to the benefit of the majority of stakeholders. It is therefore normal that one or more system components will push for streamlining the workspaces and production times, opting for actions that are carried out simultaneously in the same space. However, it is also a known fact that many injuries originate in overlapping work stages carried out in confined spaces, which consequently increase the risk of interference.
This is the case, for example, in large construction sites, where multiple businesses are often working at the same time. They belong to the same socio-technical system responsible for completing the work, but each is focused on its own activity and often unaware of what the other companies on site are doing. What is needed is a general supervision that allows for the best coordination of these activities and that is able to mediate between diversified needs, guaranteeing that each has appropriate resources in order to produce its best. This ensures that the parts involved are driven to act in the common interest, so that the actions of one do not adversely affect the others.
The simultaneous presence of companies with different interests that are competing in the choice of technologies can sometimes generate system failures and increase the risk for the safety of those who work there. A technological resource that is not appropriate to execute a specific process is, indeed, another factor that often contributes to the occurrence of accidents.
Leaving aside the hypothesis of deliberately reckless choices, for which the identification of technological resources is the result of flagrant violations of safety rules, a department could find itself working with equipment that is unsuitable for the work and therefore potentially dangerous, because it was chosen based on considerations arising from conflicting interests within the organisation.
There are several examples of this. Sometimes a technology is inadequate simply because it is outdated. Conflicting interests of the system’s parts, in fact, could lead to the company not replacing old equipment, which, although it meets the minimum safety requirements to ensure compliance with the applicable technical laws and regulations, does not guarantee the same levels of reliability, safety and ergonomics as more modern machines. The reasons for not replacing the fleet of machines are often economic in nature: they may be linked to investment planning, depreciation or the accumulation of capital; the occasional use of the resource may not justify the upgrade to new machinery; or, more simply, the company may not have the necessary funds.
In other cases, which find their source in the mediation of different interests of the system’s parts, a technological resource is pushed to produce to the limit of its operational capacity precisely because of the need to maximise the yield. This brings the process to always work on the borderline between an acceptable residual risk and a high risk, a bit like what happens with Formula 1 race cars, which are always driven to the limit.
At other times, instead, the error is generated by sharing a technological resource and the simultaneous need of the system’s different components to use it. This is the case, for example, of machines or equipment that are used for multiple tasks and that consequently require continuous changes to the settings, programming and configuration, because of the need from time to time to adapt their operation to the needs of those who use them. There is no doubt that this type of choice could prove to be strategic in streamlining the use of resources, meeting the needs of some of the system’s stakeholders. On the other hand, the need for frequent actions to change the settings or operating modes represents another possible source of error. It is characteristic of a complex production system that is constantly called to mediate between different needs, at the risk of increasing the likelihood of a mistake and ultimately its own vulnerability.
The above are only a few examples of how the complexity of the socio-technical system (understood not only as the number and variability of the elements that form it, but also as different concurring objectives, as well as the unpredictability of the actions and reactions of the system’s elements) might increase the instability of the system as a whole, generating those internal deviations that are the source of the error.
Finally, we cannot fail to mention the complex nature of the people belonging to the socio-technical system. This time we consider them not as elements of the system responsible for monitoring parts of it, that is, as those who act in the interests of the process they manage and pursue objectives that are sometimes common and sometimes conflicting, but as people as such, understood as human beings and as workers.
We have mentioned that the complexity of the socio-technical system arises from the complexity of the elements that form it. As is known, there is nothing more complex and unpredictable than human behaviour.
In the comparison between the human body and a corporate system, the people who form a company are likened to the parts of the body, each of them specialised in performing a certain function and in a continuous relationship of exchange with the others. The parts of the human body, however, always act as planned, not introducing random variables when carrying out the planned actions. A system of this kind, therefore, achieves that harmony between the parts that allows it to operate more easily in full synergy.
Every individual, by contrast, is different from another. Everyone has his or her own personality traits that influence behaviours and relationships with others, has different needs and reacts differently to external stimuli. It is true that a person’s actions in practice follow what has been established by the organisation, with the aim of achieving the common goal. It is true that training, a culture of safety, work practices and procedures, as well as orders and controls, incentives and disincentives, besides practice, simulations and exercises, represent elements that help ensure that every human resource of the system acts in accordance with the established rules, in the time and manner required, and in harmony with the surrounding environment. But it is equally true that man is not a robot, nor is his behaviour similar to that of a cell in the body. Man always acts while preserving that intellectual independence, the result of complex biochemical reactions at work within him, which are in turn influenced by past experience, personal convictions, the culture in which he grew up, the society in which he operates and the behaviours of others, and which ultimately make him unique and unpredictable.
It is also true that in the uniqueness and unpredictability of the human being lies his genius, the ability to tackle new problems, to handle the unexpected and to act outside the box to find effective solutions even in the event of a problem or emergency. Man, therefore, is the strength in the management of complex systems, the key asset to rely on in order to develop systems that are able to ensure success, regardless of their degree of internal complexity and of the number of constraints or instability.
3. Seeking the causes of failure: Reason’s model
The analysis of the dynamics at the origin of an injury shows that a single error is rarely sufficient to generate a failure, i.e. to lead to serious consequences for the system. Rather, a failure is the result of a chain of latent errors, which can be attributed directly to the human resources that form the system or to the technologies available to it, but which are often underestimated by those who make them or monitor them, and who do not pay due attention to finding a solution. The theory of errors has evolved hand in hand with the development of complex socio-technical systems, trying to explain, at least ideally, how accidents are not merely the fruit of incomplete or incorrect human activities, but rather the end result of a series of events (latent errors), and accepting that for each accident that occurs there are many near-miss events.
It is on the basis of these principles that in 1990 James Reason proposed a model to explain the dynamics of accidents, adopting a logic which, although a little outdated nowadays, continues to provide a good pattern for understanding the genesis of accidents in complex systems. Reason’s model is also called the Swiss Cheese Model, as it is portrayed as a set of slices of cheese, each of which represents a defensive layer put in place by the organisation to prevent the occurrence of an adverse event. These barriers, however, contain holes, typical of Swiss cheese, which represent the errors: both active errors (slips and lapses), i.e. run-time errors related primarily to man, and latent errors (mistakes), which originate from flaws in design or organisation and, overall, from the systemic management of the organisational reality. The adverse event will occur only if the holes in the various slices of cheese are aligned along a ‘trajectory of opportunity’ or, in other words, if a series of active and latent errors prevents the barriers from being effective, thereby leaving room for the occurrence of the ‘top event’, understood as an accident. This model, although not immune to criticism, is able to represent what happens in most circumstances when an adverse event occurs, linking the responsibility for what happened to a deeper analysis of the entire organisational system. The model groups the barriers (slices of cheese) into four different types that are associated with four different ‘mistakes’. The last barriers are bypassed by the mistakes that directly caused the accident, the so-called ‘Unsafe Acts’: final errors typically made by the operator or caused by the technology malfunctioning at the end of the process.
If we run the chain of events backwards from the end, we find that these errors are often favoured by ‘Preconditions for Unsafe Acts’, i.e. conditions that favoured the unsafe act or the malfunction in general and against which insufficient barriers (the organisational and technical measures needed to prevent their occurrence) were placed. We then have so-called ‘Unsafe Supervisions’, that is, errors made by those who were supposed to supervise the actions of others and guarantee their success, and finally ‘Organisational Influences’, i.e. the systemic deficiencies inherent in the complex organisation.
The application of this model therefore gives us a clearer view of what happened in the event of an accident: rather than stopping at the most recent episodes of the process that led to the failure, it pushes the analyst to search deeper, working backwards to investigate the systemic preconditions inherent in the organisation itself.
In light of the above, I provide below an example of the application of Reason’s model to a case that actually occurred, in which the analysis of the origins of the error allows broader considerations as to the evolution of the events and, consequently, the identification of the remote causes of the failure.
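One way to make the idea of a ‘trajectory of opportunity’ concrete is to treat each slice as a barrier that fails with some probability. The sketch below is only an illustration and is not part of Reason’s model: it assumes, for simplicity, that the four layers fail independently of one another, and every probability in it is a hypothetical value chosen for the example.

```python
from math import prod

# Hypothetical probabilities that each defensive layer has a "hole"
# lined up with a given hazard. These numbers are illustrative only.
LAYER_HOLE_PROBABILITY = {
    "organisational_influences": 0.10,
    "unsafe_supervision": 0.05,
    "preconditions_for_unsafe_acts": 0.20,
    "unsafe_acts": 0.02,
}


def accident_probability(hole_probabilities):
    """Probability that all layers fail together.

    Assumes the layers fail independently, so the joint probability
    is simply the product of the individual ones.
    """
    return prod(hole_probabilities)


p_accident = accident_probability(LAYER_HOLE_PROBABILITY.values())
print(f"Probability that all holes align: {p_accident:.6f}")
```

Under these assumptions the accident probability is the product of the four values and is therefore much smaller than any single one of them. If failures in different layers are correlated, for instance because the same time pressure affects both the person acting and the person supervising, the real figure can be considerably higher.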
The case: a lady is hospitalised due to a neoplasia with ovarian metastasis at a clinic specialising in the treatment of cancer. Following analyses in the Department of Gynaecology-Oncology, the doctors suggest ovarian surgery followed by chemotherapy. The lady accepts and, after the surgery, which is perfectly successful, begins the first cycle of chemotherapy. The trainee specialist issues the prescription for the chemotherapy drug based on instructions from the doctor in charge of the unit, who, in addition to the therapeutic treatment, also prescribes a protective drug to reduce the toxic effects of chemotherapy. The trainee specialist makes a mistake when writing the dose of the protective drug, misplacing a decimal comma and prescribing a dose equal to one tenth of the one indicated by the head doctor. In the ward, the prescription is not checked and the lady starts the therapy. Four days later, she is hospitalised due to chronic kidney failure, reporting deafness and displaying obvious signs of intoxication from the chemotherapy drug. She is immediately put on dialysis and dies 18 months later due to complications with the dialysis. A superficial analysis of the accident would focus solely on the events directly related to the occurrence of the error, i.e. the fact that the trainee specialist wrongly wrote down the dose indicated to him by the head doctor, thereby making the mistake that proved fatal.
Reason’s model applied to this case allows us to identify, by tracing the events backwards, all those barriers in the system that did not work, because they were faulty, absent, removed or purposely bypassed in the chain of events.
The model starts by considering the ‘Unsafe Act’, that is, the last mistake, which in this case was made by the trainee specialist. This error can be caused, as we shall see in detail later, by a number of reasons related to the individual, including lack of training, negligence, incompetence or wilful misconduct. In any case, these reasons are tied to the last action, i.e. the one directly involved in the failure, and do not consider the accident as a whole. The model, as said, looks deeper, considering the ‘Preconditions for Unsafe Acts’, that is, the preconditions of the healthcare socio-technical system that were supposed to limit the occurrence of ‘unsafe acts’ but instead created the conditions that led to the error. In the case we have just examined, the trainee specialist is evidently not yet sufficiently familiar with the correct doses of the drug to notice the error he is committing, and therefore cannot correct it; his actions should have been monitored by someone with more experience. Another precondition that encourages the mistake, typical of the academic world (but of other fields as well), is that the trainee specialist (a student) probably did not feel like asking the head doctor a second time for an explanation of the dose communicated to him. It is typical of students, in fact, to think that asking for an explanation about something that is likely considered ‘common ground’ or ‘obvious’ is a ‘dumb question’, because they see it as tarnishing their reputation and as an embarrassment in the eyes of fellow specialists. This means that, instead of asking the simple question ‘excuse me, did you say grams or milligrams?’, the student prefers to keep silent out of shyness, awe or fear of losing face, thus causing the error to propagate.
The relationship between a student and professor, considered a ‘precondition for unsafe acts’ would broaden the analysis to sociological, psychological and communicative considerations, also taking into account human relationships and psychosocial risks that could affect the evolution of events.
These dynamics of interpersonal relationships and effective communication are well known to air force operators, who since the 1970s have developed and adopted a system called Crew Resource Management (CRM) with the aim of facilitating communication at all levels. The analysis of accidents had shown that in several cases at least one member of the crew had spotted an abnormal event but had been unable to communicate it effectively to the commander. This management system focuses on the prevention of human error and consists of a set of principles and behavioural and attitudinal models that offer everyone the opportunity to examine and improve their behaviour. Specifically, CRM aims to foster a climate or culture where authority may be respectfully questioned. This is a delicate subject for many organisations, especially those with traditional hierarchies, so appropriate communication techniques must be taught to supervisors and their subordinates, so that supervisors understand that the questioning of authority need not be threatening, and subordinates understand the correct way to question orders. Another possible ‘precondition for unsafe acts’ is connected to the environment in which the actions that caused the error were carried out. It is well known that noisy, confused environments with many possible distractions tend to be conducive to errors. In light of this, the Sentara Hospital in Virginia introduced an area near the drug dispensers called the ‘no interruption zone’, in which it is absolutely forbidden to speak. This is meant to protect the nurses responsible for choosing the drug from distractions, and is a solution that significantly reduces errors due to the improper administration of drugs.
Tracing our way back in the analysis of events, we find barriers in the system regarding ‘Supervisions’, which are the tasks involved in monitoring actions. The head doctor should have double-checked the prescription filled out by the trainee specialist before signing it, as he should have been aware that a student is more likely to make a mistake. Double-checking is an effective and widespread tool in complex socio-technical systems because of how efficient it is at avoiding mistakes. It is a standard procedure in the most critical processes, where the consequences of an error may have a strong impact in terms of damage. A doctor or other specialist double-checking the prescription would alone have been enough to put in place an effective barrier to the error, thus avoiding its spread. Even in this case, though, probably due to lack of time, incompetence or excessive confidence in the trainee’s skills, this was not done and the prescription was passed on to the ward as if it were correct. At the level of ‘Unsafe Supervisions’, we also need to underline the lack of controls in the ward. Here too, the ‘barrier to error’, i.e. the safety procedure, would require a doctor in the ward to check the prescribed therapy a second time when defining the health protocol. But again, unfortunately, no one did, leading the nurse to administer the wrong dose of the drug. At the level of ‘Organisational Influences’, that is, with reference to the systemic deficiencies inherent in the organisation, we have to seek, as mentioned, the remote causes that led to the primary conditions for the occurrence of the accident. In this case, it should be noted that the head doctor is the only person authorised by the organisation to issue prescriptions for medical treatment, as he is the person who signs the protocol.
This procedure is aimed at avoiding, from the outset, possible misunderstandings about the quantities or names of prescribed drugs. As we have seen, however, this did not happen. One likely reason is the pressure to save time. Leaving aside considerations specific to our example, in which the doctor may have acted irresponsibly through negligence or because he intentionally decided to break the rules, from a systemic point of view not having enough time to perform one’s duties is, as we have already discussed, a symptom of an inefficient organisation that is often understaffed and forces employees to work taxing shifts with excessive workloads.
A corporate reorganisation that better matches the activities to be performed with the resources to be used, encouraging mobility, dynamism, waste reduction and resource optimisation, and that ultimately promotes ‘continuous improvement’ through the continued implementation of its processes, is often the decisive response in limiting the conditions that are conducive to errors. In a more relaxed environment and with more time, the treatment plan would probably have been filled out correctly or checked by the head doctor, preventing the error. At a systemic level, we also need to consider ‘wrong practice’, the acceptance of inappropriate behaviours as if they were normal, which, as we will see later in the systemic and organisational theories of failure, often takes the name of ‘normalisation of deviance’.
Everyone knows that trainees are not allowed to issue prescriptions, much less to carry out complex tasks such as making diagnoses or prescribing treatments without proper supervision from the head doctor. Yet we unfortunately see this happen in many healthcare facilities around the world, where interns replace doctors as a way to cope with the chronic shortage of staff that plagues many healthcare systems.
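The double-check barrier discussed in this case can be pictured as a simple comparison between the dose that was written and the dose that was intended. The sketch below is purely illustrative and is not clinical guidance: the names `Dose` and `needs_second_review`, the tolerance and all the quantities are assumptions introduced for the example, since the case description gives neither the drug nor the actual dose.

```python
from dataclasses import dataclass

# Conversion factors to milligrams for the two units mentioned in the case.
UNIT_TO_MG = {"mg": 1.0, "g": 1000.0}


@dataclass(frozen=True)
class Dose:
    """A dose with an explicit unit, so 'grams or milligrams?' is never implicit."""
    amount: float
    unit: str

    def in_mg(self) -> float:
        return self.amount * UNIT_TO_MG[self.unit]


def needs_second_review(written: Dose, intended: Dose, tolerance: float = 0.1) -> bool:
    """Flag a prescription whose written dose deviates from the intended one.

    The deviation is relative to the intended dose; the 10% tolerance is a
    hypothetical threshold, not a clinical standard.
    """
    intended_mg = intended.in_mg()
    if intended_mg <= 0:
        raise ValueError("intended dose must be positive")
    deviation = abs(written.in_mg() - intended_mg) / intended_mg
    return deviation > tolerance


# Hypothetical values: a misplaced decimal comma turns 500 mg into 50 mg.
print(needs_second_review(Dose(50.0, "mg"), Dose(500.0, "mg")))
# The same quantity written in a different unit is not flagged.
print(needs_second_review(Dose(0.5, "g"), Dose(500.0, "mg")))
```

The point of the sketch is that a tenfold slip is trivial to detect once a second, independent value exists to compare against; what failed in the case was not the difficulty of the check but the absence of anyone performing it.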
A risk inherent in the blind acceptance of the Swiss Cheese Model, however, is the assumption that the more slices of cheese are placed in sequence, the better the chance of avoiding errors. This assumption leads the organisation to make it a priority to implement as many defences as possible, without considering the repercussions this can have on the reliability of the whole system, such as the emergence of new problems (failure modes arising directly from the barriers or from interactions between them). For example, theories on the reliability of regularly serviced machines and plants regard disassembling the machine or plant, inspecting its components and reassembling them as certainly effective, and therefore as a barrier to error (failure or breakage). Performing such maintenance tasks undoubtedly increases the number of barriers (slices of cheese), as it allows staff to thoroughly inspect the machine or plant for faults or conditions that could lead to a system failure. According to the assumptions inherent in Reason’s model, this practice makes the system more reliable. In reality, however, the result is not always an increase in reliability, because disassembling and inspecting the components also increases the errors that can occur when reassembling them. The new barrier (slice of cheese) must indeed be viewed as a defence, but also as a possible source of holes in other slices of cheese, holes that were not present initially but were caused by the introduction of the new slice. Another limitation of Reason’s model is that it considers errors as resident pathogens in the system, in the same way as we consider viruses in the human body that have not yet manifested their harmfulness: harmful entities that can wake up suddenly and cause a disease in the body.
This assumption does not always reflect what happens in the genesis of error in complex socio-technical systems, because viruses are purely harmful agents, while the actions that the system performs and that can lead to failure are, as stated by Reason himself, the same that sometimes lead to success. The consequence is that we cannot always consider the error as necessarily harmful.
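The observation that an extra slice of cheese can open new holes can be illustrated with a small calculation. The sketch below is an assumption-laden illustration, not a result from the reliability literature cited in the chapter: it supposes that a new inspection catches a fraction of existing faults but introduces its own fault (for example, a reassembly error) with some probability, and that the two events are independent. All numbers are hypothetical.

```python
def failure_probability_with_barrier(baseline, catch_rate, induced_fault):
    """Failure probability after adding a barrier that can itself cause faults.

    baseline:      failure probability without the new barrier
    catch_rate:    fraction of existing faults the new barrier intercepts
    induced_fault: probability that the barrier introduces a new fault
    Assumes the residual fault and the induced fault are independent.
    """
    residual = baseline * (1.0 - catch_rate)
    return 1.0 - (1.0 - residual) * (1.0 - induced_fault)


BASELINE = 0.01  # hypothetical failure probability without the extra inspection

# A careful inspection that rarely damages anything improves reliability.
careful = failure_probability_with_barrier(BASELINE, catch_rate=0.8, induced_fault=0.001)

# An invasive inspection with frequent reassembly errors makes things worse.
invasive = failure_probability_with_barrier(BASELINE, catch_rate=0.8, induced_fault=0.02)

print(f"baseline={BASELINE:.4f} careful={careful:.4f} invasive={invasive:.4f}")
```

With these illustrative values the same catch rate yields either a gain or a loss in reliability, depending only on how often the new barrier itself introduces a fault, which is the point made above about holes created by the new slice.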
4. Man as an essential resource for solving accidents: the rise of resilient systems
As a matter of fact, machines do not make mistakes as long as no unexpected events occur; when they do, machines are often unable to find a viable alternative to the pre-set programme. ‘Man, then, is certainly less reliable than computers in some respects, but if the automatic system goes haywire, who but man can intervene to solve an unexpected problem?’.
The Apollo 13 mission has gone down in history as one of the most difficult missions in the conquest of the Moon, in which a number of malfunctions and unexpected events led engineers and crew members to take corrective actions, often without following a pre-planned procedure and in breach of the mission protocol. The genius of man, in his ability to cope with unforeseen problems even by taking actions that had technically been classified as ‘mistakes’, was the key to the mission’s success, allowing disaster to be averted.
The first dangerous incident occurred early in the flight, as the spacecraft was leaving Earth. One of the five engines of the Saturn V’s second stage stopped working, forcing mission control to decide to run the remaining four engines longer than planned, thus acting contrary to the rules, in order to allow the mission to continue. The third-stage engine, too, was operated beyond its capacity in order to correct the deviation from the trajectory that followed the first problem. Shortly after the successful lunar module docking manoeuvre, 321,860 km from Earth, one of the oxygen tanks of the command and service module (CSM) exploded, causing serious damage. At this point, it was decided to cancel the descent to the Moon and the engineers decided to focus their efforts on the re-entry manoeuvres. The damaged CSM included a command module, which hosted the three crew members and was in fact the spacecraft that would have allowed the astronauts to return to Earth. It also included a service module, needed for the command module to function, which housed, besides the thrusters for re-entry manoeuvres, the technical equipment to sustain life on board and the oxygen tanks. Such a massive malfunction in the service module prevented the astronauts from using the command module for the return to Earth, threatening to condemn them to a slow and unrelenting agony in space. However, because the lunar module (LEM) had fortunately already been docked with the CSM when the plan was still to perform the Moon landing, it was decided to transfer the entire crew into the LEM and to use it to return to Earth. The lunar module, though, had been designed only to land on and take off from the Moon. Indeed, once the astronauts had returned to the CSM, it would have been abandoned in orbit around the Moon, having completed its mission. The problems encountered were clearly not trivial.
Everything had been turned upside down and the crew was now forced to navigate by sight, without relying on proven mission protocols. To begin with, a crew of three people was occupying a module that had only two seats. In addition, the module was about to face a 4-day return journey to Earth, whereas it was intended to land on and take off from the Moon in a total time of 2 days at most. As such, all the service equipment designed to sustain life on board was undersized compared to the needs. The crew, in fact, was forced to assemble new carbon dioxide filters for the LEM in flight and with very few tools available, since those provided were not sufficient for three people and for twice as long as originally expected. The engines of the LEM themselves, which had been designed solely for landing on and taking off from the Moon, did not guarantee success if used for different purposes, such as returning to Earth. These factors sparked long and exhausting discussions between the engineers responsible for monitoring the engines at the base in Houston. Some of them did not agree to use the engines of the lunar module for the return to Earth, believing such a manoeuvre posed a serious risk to the crew’s safety: the engines had been designed specifically for the landing mission, and during the return they would have to be fired several times to correct the trajectory of the LEM, whereas they had been designed to be switched on only once during lift-off from the Moon. Other engineers, considering the serious risk of explosion in attempting to reignite the engines of the service module, given the damage caused by the explosion of the oxygen tank, preferred to take the risk of using the LEM’s engines.
Either way, a decision needed to be made quickly, and in the end the use of the lunar module engines to return to Earth was authorised, although this operation was not contemplated in any of the mission’s safety protocols.
The re-entry operation fortunately was a success and astronauts Jim Lovell, Jack Swigert and Fred Haise managed to return to Earth safe and sound.
This experience is an example of how man’s ability to cope with unexpected situations, even by acting against protocol and therefore committing ‘errors’ with respect to the original plan, can sometimes prove decisive for success. The group of engineers who leaned towards not using the lunar module engines basically reasoned on a ‘rule’ level, i.e. advocating the strict enforcement of rules. The procedures said that these engines were designed for a specific purpose, that there was no guarantee of success and that using them for different purposes was not provided for in the protocols, which means it would have been a mistake. Those who were in favour of using the engines applied a ‘knowledge’-based approach, that is, using human intelligence to cope with the unexpected event, not following written rules, breaking some of them and ultimately inventing new ones.
This experience shows that man’s ability to decide to breach the rules can sometimes be decisive, especially in highly complex socio-technical systems, where a complicated process cannot always be planned in every detail and in full compliance with strict rules and preconceived algorithms, since the variables involved can be endless and can easily give rise to unexpected events.
‘Complex systems are always in dynamic equilibrium. They seek variety because it is only through variety that they evolve […]; sometimes it is positive variety, i.e. ‘good’ information that enables the system to succeed. At other times, it will be negative information (‘error’) and could make the system collapse or change it’. An adaptive system, therefore, also learns from negative information, which is ‘news’ that increases the system’s resilience.
We must not, however, think that to improve the reliability of a system we have to wait for a serious accident to occur. Improvement comes from analysing errors, true, but among these, as we have seen, latent errors are particularly important: the holes in the system that cause minor malfunctions, weak signals that the organisation should tune its sensitivity to recognise and correct, capitalising on the information it gathers from them. Thanks to a safety culture (correct self-analysis and reporting procedures), we can then build on the information gathered from the analysis of the system and consequently correct deviations, sometimes even improving the rules and considering new scenarios. Risk management based on fixed procedures and established practice, in other words a ‘static management’ in which everything is pre-planned, is not able to represent continually evolving systems such as highly complex socio-technical ones. This is where the theory of ‘resilient systems’ stems from.
The term ‘resilience’, in its original meaning at least, belongs to the physics of matter and means ‘the ability of an object to regain its initial shape after undergoing a deformation caused by an impact’. The Help Center of the APA (American Psychological Association) defines resilience as ‘the process of adaptation in the face of adversity, trauma, tragedy and other significant sources of stress’. It is therefore ‘an individual’s ability to overcome unusual circumstances of difficulty due to his behaviour and mental qualities of adaptation’. It means ‘finding a way out’ of difficult experiences, in other words, finding the smartest and fastest way to solve problems. According to computer science, it is ‘the quality of a system that allows it to continue to function properly in the presence of faults in one or more of its constituent elements. It therefore indicates the system’s tolerance to faults, malfunctions and breakage’. Resilience is typical of high-risk organisations, and their adaptive capacity is decisive against threats to the organisation’s integrity and hence in avoiding failure. Some of the main features of a resilient system are the following (revised from the publication by Steven and Sybil Wollin; drawn from Fichera):
Insight: that is, the system’s ability to examine itself, i.e. to gather the information that led to the traumatic event, analyse the problem, know how to read the elements relating to the context and seek alternative solutions.
Independence: the ability to maintain its own identity without isolating itself, that is, autonomy and independence without feeling totally bound by other systems or by the resources they are able to provide.
Interaction and communication: the ability to establish satisfactory relations with its surroundings, maintaining more or less close relationships that, when necessary, help the system pursue its aim.
Proactivity: the system’s ability to acquire and process enough information so that it can read its environment consistently and use it for its own purposes.
Creativity: the ability to create the ‘organisation’, i.e. a new functional order starting from the chaotic and depressive situation following the destabilising event.
Humour and irony: a disposition of the person who is part of the system to break away from the problem by belittling its importance through a ‘language game’ that allows an analysis and positive re-evaluation of the event.
Ethics: refers to the ability to feel part of a macro socio-political system and share its values and dynamics.
To clarify the meaning of the unique aspects of resilient systems, we will contextualise their underlying principles below. First of all, we ought to point out that it would be enough to effectively implement the principle of insight to achieve substantial improvements in the vast majority of cases in terms of the organisation’s reliability at large. The system’s ability to examine itself, to find elements that may have caused or led to the errors, lies in this principle. There are a number of methods to practically implement these analyses, but the starting point is always the systematic collection of information. Unfortunately, organisations are often lacking a valid system for the collection of data that are useful to monitoring their behaviour. Collecting data, in fact, is costly, both in terms of time and resources. It is also a boring job, which is often done by people who normally are engaged in other tasks and consider the needs to fill out forms and records as waste of time that hinders their job. Hence, records are often not kept or are filed only partially and sometimes retrospectively, trying to trace the chain of events and approximating most of the information they contain. A resilient system, instead, needs data and must be able to identify, through a careful analysis of the key elements inherent in the dynamics of the system itself, the strategic information deemed necessary for its implementation, and trying then to systematise and possibly standardise and computerise the data collected. If we were to make a comparison with organisations operating in aviation, it would be like having a black box for each event that occurred. In the event of a system malfunction, therefore, it would be enough to draw information from recordings of the events to get a clearer picture of what happened, and accordingly establishing what should be the improvement actions to be implemented to ensure that the event no longer occurs.
This requires that all the system’s resources are aware of the importance of participating actively in this collection of information, especially by establishing reporting elements that can be easily filed (preferably electronic) and that mention the key data as to what is taking place within the environment in which the resource operates. Ultimately, it is necessary that the system is able to identify the crucial points that require a record of the events, such as the correct filing of a medical record, but also records about data that is helpful in analysing the timing of the process, the results achieved vs. those budgeted and customer complaints or ones filed by other stakeholders, rather than reports on so-called near misses. Implementing the above enables the system to collect valuable data about its operation, but alone is not enough. It is equally important that the system has the resources and skills necessary to effectively analyse the collected records whenever it is necessary to investigate an adverse event or when we want to tap into statistical data that can help improve it. This is linked to the principle of proactivity, another feature typical of resilient systems.
Proactivity means moving ahead of events, predicting what will happen in the future through a careful analysis of what happened in the past. What has happened in the past is represented by a set of records, but it is then necessary that within individual organisations, operating units are established/implemented that will analyse and process the collected data and that are able to draw information from the records that is useful to achieve improvement. In fact, what would be the point of collecting and analysing such a huge amount of information if we then did not act to solve the problems identified? To do this, the organisation needs to have a unit that works as much as possible in contact with the individual departments and not only at a centralised level, in order to be able to thoroughly understand its needs and the singularities associated with the individual processes. It is then necessary that the unit be able to critically analyse the results of monitoring the process indicators, suggest solutions for improvement that are valid and applicable but above all, communicate effectively with those who then will have to take the necessary decisions and authorise the actions proposed. The system must ultimately be able to operate according to the logic and principles of total quality management, where a ‘senior management’ periodically reviews all information collected by individual ‘process owners’ through a set of business process ‘indicators’ appropriately chosen to monitor the system and take proactive actions aimed at achieving ‘continual improvement’ of the organisation.
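The link between systematic record keeping (insight) and the analysis of indicators (proactivity) can be sketched with a minimal data structure. The example below is a hypothetical illustration: the `EventRecord` fields, the category names, the threshold and the sample data are all assumptions made for the sketch and do not describe any real reporting system.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class EventRecord:
    """One entry of the organisation's 'black box' (hypothetical fields)."""
    day: date
    department: str
    kind: str  # e.g. "near_miss", "adverse_event", "complaint"
    description: str


def near_misses_per_adverse_event(records: List[EventRecord]) -> Optional[float]:
    """Simple indicator: reported near misses for each adverse event.

    Returns None when no adverse event has been recorded.
    """
    counts = Counter(r.kind for r in records)
    if counts["adverse_event"] == 0:
        return None
    return counts["near_miss"] / counts["adverse_event"]


def departments_to_review(records: List[EventRecord], threshold: int) -> List[str]:
    """Departments whose near-miss count reaches the (hypothetical) threshold."""
    per_department = Counter(r.department for r in records if r.kind == "near_miss")
    return sorted(d for d, n in per_department.items() if n >= threshold)


# Invented sample data, for illustration only.
RECORDS = [
    EventRecord(date(2024, 1, 8), "oncology ward", "near_miss", "dose queried at bedside"),
    EventRecord(date(2024, 1, 9), "oncology ward", "near_miss", "unsigned prescription"),
    EventRecord(date(2024, 1, 12), "oncology ward", "near_miss", "illegible unit"),
    EventRecord(date(2024, 1, 15), "pharmacy", "near_miss", "look-alike packaging"),
    EventRecord(date(2024, 1, 20), "oncology ward", "adverse_event", "wrong dose given"),
    EventRecord(date(2024, 1, 22), "outpatients", "complaint", "long waiting time"),
]

print(near_misses_per_adverse_event(RECORDS))
print(departments_to_review(RECORDS, threshold=2))
```

The value of such a structure lies less in the code than in the discipline it implies: every event is recorded at the time it happens, in a fixed format, so that a dedicated unit can later turn the records into indicators and proposals for improvement.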
The concept of ‘independence’ of a resilient system is harder to accept, as it apparently conflicts with what is happening in the world today. In fact, we live in an increasingly globalised world where each subject, whether a country, an organisation or a market, is increasingly linked to the existence of others, because it constantly requires interaction and support. While recognising in the clustering effect of globalisation an indisputable advantage when it comes to streamlining resources, it is important that the ‘satellite’ organisation, whichever that may be, preserve its own identity within the macro system in which it operates and, especially, the possibility of managing the resources necessary for its operation as independently as possible. This makes the organisation accountable and more inclined to commit itself to ensuring its own productivity, in essence making it directly responsible for success or failure: it removes the fear that its successes will go unrecognised and, conversely, discourages the idea that the causes of its failures can be imputed to others. These aspects tied to autonomy and independence are the same ones that act at the level of the individual and that greatly influence his behaviour, making him either more participatory or unresponsive towards the organisation. Systems that give incentives and reward performance, when combined with guaranteed freedom of action and thought, stimulate creativity and innovation, which are the basic ingredients of resilient systems. Independence, however, should not prevent the system from ‘interacting and communicating’ with its surroundings, with other units of the same system or with other organisations. Healthcare organisations offer an example of how communication is implemented within a complex organisation to limit the occurrence of errors.
A hospital, in fact, must be able to accommodate its patients and receive their relatives, limiting as much as possible situations of uneasiness and interacting with them with professionalism and competence, but also frankly and politely. In this regard, there are, for example, several hospitals that use different colours in the hallways and wards to help visitors find the correct path to follow. With reference to communication between the medical/nursing staff and patients, conversely, specialised courses on communication techniques are becoming increasingly popular and are especially effective in cases where there is a need to communicate with clarity and professionalism and even with discretion and tact. As for internal communication between members of the same system, it is now a known fact that it is at the core of the ‘best management’ of any organisation. Any system consisting of more than one individual needs to ensure the effective delivery of information between the different individuals that are a part of it. In this respect, an example of the procedures that some organisations have adopted to facilitate the exchange of information is to have the shifts of incoming and outgoing personnel slightly overlap. This favours the chances for colleagues to meet physically in short briefing sessions to exchange information on what happened during the previous shift. Some abnormal events, in fact, might have changed the system’s conditions compared to ordinary ones, and as such involve the need to communicate the changes to a colleague. The architectural design of the working environment should then promote the logic of briefing, providing an enabling environment where workers can organise such meetings.
Creativity is perhaps the true essence of resilient systems. Whenever an adverse event tends to destabilise a system, the latter’s resilience opposes failure by generating the self-defence mechanisms necessary for its survival. This is the case, for example, of emergencies or situations that diverge from those imagined, for which, as we have seen already, the resource does not have the time or ability to apply the protocol as planned. In these cases, creativity, both personal and of the work team, plays its role at the level of ‘knowledge’, leveraging prior knowledge and intelligence to resolve situations not attributable to familiar contexts. It is important, however, that when these circumstances occur, we are able to capitalise on the information resulting from them, treasuring it and exploiting it if necessary to change the rules by which the organisation functions. In the air force, for example, as well as in Formula 1 or motorcycle racing, at the end of each flight, qualifying lap or race, pilots and drivers are used to meeting up with the engineers, team managers or personal coach to review how the events unfolded, analysing them, exchanging opinions and information and ultimately learning about new scenarios that may occur again in the future. A critical analysis of these events allows them to make the right ‘corrections’ to the system, capitalising on the information received and encouraging its growth.
Humour and irony are two characteristics of resilient systems that refer to the individual more than to the system as a whole. As previously mentioned, they are actions meant to enable a break-away from the problem by belittling it through an analysis that can lead to a positive re-evaluation of the accident. Being able to apply humour and irony to a problem is something that pertains to the psychological framework of each of us and refers to our ability to address the problem with a positive attitude, knowing the pitfalls but without being overly intimidated by them. In borderline cases, breaking away from the problem results in what we call ‘keeping a cool head’, that is, the person’s ability to exorcise the problem, preventing it from overpowering him, throwing him into a state of panic and thus hindering his ability to react effectively. The right attitude to deal with situations that involve high emotional stress sometimes stems from innate personal aptitudes, but more often than not these skills can be learned by following the advice of more experienced staff and participating in training courses and simulations. Flight simulators, for instance, are widely used in aviation to teach the trainee pilot how to cope with unexpected situations, sometimes pushing him to the limits of what is solvable and getting him used to situations where it is necessary to maintain proper concentration and a cool head. In the medical field too, emergencies in the emergency room require on-site staff to possess the same skills, both technical and in terms of aptitude. For this reason, realistic simulation techniques have in recent years become more and more popular in the training of doctors and nurses.
For example, sophisticated mannequins, able to emulate what could happen in real emergency situations, are increasingly used to help staff develop the ability to cope, with the required preparation and with all due ‘detachment’, with the emergencies they will likely find themselves handling in their job.
It would also be desirable, although hardly feasible, for coordinators to know how to identify, within each organisation, the resources that are most predisposed to operate in situations involving high emotional stress. This is the case, for example, in the choice of emergency response personnel. Doing so would enhance the possibility of establishing, even within the same shift, a team of people who can work together, each compensating for the other’s idiosyncrasies, as happens in operating units of the armed forces, where everyone’s role is determined by referring not only to the individual’s aspirations, but also to his skills and his ability to take action.
Finally, ethics is the additional property by which a resilient system defines its own values and policies, shared by all its members. Ethics is part of the concept of ‘corporate policy’, which has always been among the cornerstones of the quality management systems that have nowadays become the standard. It recalls the need for the organisation to define, communicate and especially share with all of the system’s members the principles that inspire and guide the business, both internally and within the social context in which it operates. The pivotal point of this principle is contained in the following question: how can we expect rules and procedures to be followed if those who are called to follow them do not agree with the strategies and policies that have generated them?
The principle should be familiar to those who deal with safety and prevention, and its essence can be clearly explained by the following example. If we were called upon to ensure that the staff of our operating unit wear a helmet, fasten their seat belts or proceed according to our instructions, we could act by enforcing corporate safety procedures requiring staff to comply with these requirements. We could also define the penalties applicable to those who fail to perform as required and, ultimately, we could ensure efficient surveillance, checking that everyone behaves as prescribed.
All this would not be as effective in assuring us the expected results as being certain that our unit really agrees with the need to resort to the safeguards we have devised to avoid damaging repercussions for their own safety. This is what we refer to as a ‘safety culture’, i.e. endorsing certain convictions, certain principles, certain values of our own and, therefore, behaving accordingly, in accordance with the procedures given, not because they have been imposed on us, but because we are aware of their actual effectiveness.
5. Human error and the ability of the system to limit it: correlation between man-machine-environment
Man, as it turns out, is the main resource we need to focus on to improve the reliability of the organisation; through him, it evolves and becomes capable of coping with unexpected events, thus reducing the likelihood of making mistakes. It is, therefore, essential that man be put in a position to operate at his best, taking advantage of his strengths and limiting the consequences of his weaknesses. Human error is commonly referred to as a ‘performance that deviates from a prescribed and specified sequence of actions’. Handling human error properly is, after all, a difficult and delicate operation, because the entire effort to make a system safe depends on it.
To understand how the system can limit the occurrence of human error, we must go back to its genesis, i.e. retrace the process that starts with man’s thought and leads to the execution of an action.
In 1983, Jens Rasmussen proposed a classification of human behaviour, illustrating a model that is still used today as a basis for understanding the genesis of human errors and for implementing appropriate actions to limit them.
According to Rasmussen, human behaviour can be broken down into three different types:
- skill-based behaviour,
- rule-based behaviour, and
- knowledge-based behaviour.
‘Skill-based behaviours’ are automatic behaviours that the subject has already acquired and has therefore fully mastered. They include actions that man carries out without having to think or reason about them, or pay particular attention during their execution. For example, brushing our teeth, tying our shoes or pressing the brake of a car are all behaviours acquired and implemented automatically, which do not require special attention or concentration. In this case, an error in their execution, as indicated by J. Reason in his book Human Error (1990), happens simply because of carelessness (slips) or forgetfulness (lapses). The person who performs the action automatically makes a mistake not because he does not know how to respond properly to the situation or cannot remember how to perform the task, but simply because he is momentarily distracted. No adult in full possession of his mental and physical capacities, in fact, needs to make a point of remembering to tie his shoes, let alone follow a particular procedure to remember how to do so. However, we may sometimes leave the house with our shoes untied.
‘Rule-based behaviours’, instead, are behaviours in which the person acts according to a rule, whether it be a procedure, practice or regulation. He therefore focuses on carrying out actions in the way he has been instructed. The error in this case can be caused by a lack of knowledge of the rule (‘mistakes’), which leads to a wrong action. For example, if a plant’s fault indicator lights up but the operator does not remember or recognise its significance, he may not implement the proper corrective action. This error, attributable to man as the mere executor of the rule, is not to be confused, however, with the error that might occur if we were to follow a procedure correctly and nonetheless end up with harmful results. In this case, although man’s behaviour is still rule-based, the outcome is wrong because the procedure itself is wrong. The mistake is therefore attributable not to the person who carried out the action, but to the person(s) who devised the rule.
Finally, ‘knowledge-based behaviours’ concern actions in which the subject is faced with situations completely or partially unknown to him and must make choices that are not predetermined in order to address them, relying on his knowledge of this kind of situation or, in other words, drawing on his cultural and personal experience. This is the case of innovation or of unexpected events, situations in which, as already seen, only man’s ability to make choices that break the mould can be decisive and lead to success.
The consequences of these distinctions are numerous and it is therefore important that the organisation knows how to recognise them, in order to help its members execute actions in harmony with the system that surrounds them. We have repeatedly hinted that knowledge of the ‘rule’ by the worker is an effective weapon in preventing mistakes. The system designs procedures and work practices, planning their goals, implementing rules for their execution, the resources to be used, the timing of implementation and identifying and correcting possible deficiencies over time. Therefore, assuming that the system has generated a correct procedure and assuming that no unforeseen changes have occurred to change its validity or jeopardise its results, a worker properly trained on how to execute it should not make mistakes. However, we know that in reality, this is not always true. There are multiple causes to errors, but one of the most common lies in the worker shifting in attitude from a ‘rule-based’ to a ‘skill-based’ behaviour. As time progresses, especially for those routine actions that the worker carries out automatically and without having to think about them that much, his threshold of attention in tackling them lowers, leaving room for possible distractions that could induce him to make the mistake. When it comes to driving a car, for example, we carry out a series of routine actions in sequence, like closing the doors, putting the transmission into neutral or engaging the clutch, turning the key, or disengaging the parking brake. No one, except for novice drivers, runs over what he has to do or how to do it before carrying out the task, but carries it out automatically. Sometimes, though, we will start the car without disengaging the brake or without closing the door.
How can the system limit these distractions?
The answers are varied and are found in how the human brain works. It is well known, in fact, that the brain cannot process all the information surrounding it with the same attention span, and to avoid being overwhelmed, it tends to ignore the information it believes is unnecessary or of secondary importance. Routine behaviours, precisely because they demand less concentration of the brain, run the risk of receiving less attention, especially if an external stimulus, intended as a distracting event, is introduced and ‘seizes’ the brain’s attention. After all, if we think about it, the human brain needs to feel stimulated. Contrary to what happens, for example, in the CPU of a computer, which can work even for hours at very low utilisation rates, the human brain tends to always work within an optimum range of ‘engagement’. If we overload it with information, it will get tired and will tend to lose track of some of the information, but on the contrary, if we do not engage it enough, it will look for other stimuli, focusing the attention elsewhere and trying to process other information from the surrounding environment. If the other information it finds does not stimulate it enough, the brain will draw from its own memory and emotions, ultimately generating those mental processes that we call thoughts, or more correctly refer to as overthinking, understood as mental processes that stray from the context and that are distracting.
The strategy, therefore, is to help man maintain a high level of attention on what he is doing, trying not to overload him with information that might wear him out, but at the same time preventing the information from becoming ordinary, dull or uninspiring, with the risk that the brain lowers its level of concentration and ultimately leaves room for distractions. Workplaces, machines and infrastructure should therefore be designed so that the resources that use them keep their level of attention above the minimum threshold below which distraction sets in. An example is the design of highways, in which the tendency has long been, whenever possible, to avoid long stretches of straight road. Such long, straight stretches make driving monotonous and are thus likely to distract drivers. Including gentle curves, although not necessary for purely technical reasons, engages the brain more actively, forcing it to concentrate on the road.
Also very common are cases where the worker is forced to act in accordance with ‘rule-based’ behaviours, thus eliminating at the root the risk of the tasks becoming routine and instinctive. Such is the case, for instance, of the growing use of technical checklists like those used in aviation before take-off. The mandatory inspection of all piloting and emergency devices, executed in accordance with a list of points addressing the tasks that need to be performed, undoubtedly lowers the likelihood of errors, which in this case are due not only to possible distractions but especially to forgetfulness. In this regard, however, there is a need to be cautious and to resort to checklists only when they are actually necessary, limiting the list of steps to follow. A disproportionate use of checklists, beyond what is really needed, would lead man to consider them overly long, boring and probably of little use, and therefore not to apply them with due care, thus reducing their positive effects.
In other cases, the working environment or the machine is designed to ‘physically impose’ on the person who works there (or who operates the machine) that he act in compliance with the safety standards designed by the system, preventing him from making mistakes. One example is the safety lever on hydraulic excavators that disables movement of the excavation arm. It is located on the side of the driver’s seat, forcing the worker to raise it physically in order to climb out of the machine. Another is the drive control of some industrial presses, which consists of two controls installed at a distance from each other: the operator must use both hands to operate the press and is thus forced to keep them away from the danger zone.
Sometimes the mistakes man makes due to distractions involve involuntary movements, such as bumping into an object or inadvertently operating a command. In the case of the use of machinery and work equipment, a standard practice to avoid such mistakes is to design controls with ‘double consent’. These controls require that the worker carry out at least two actions in sequence in order to operate the device, thus avoiding involuntary actions.
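The ‘double consent’ logic described above can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the class name, the `arm`/`activate` method names and the time window between the two actions are hypothetical and are not taken from any machinery standard or from the text.

```python
class DoubleConsentControl:
    """Hypothetical control that operates only after two deliberate actions in sequence."""

    def __init__(self, timeout_s: float = 3.0):
        # Assumption: the second action must follow the first within this window.
        self.timeout_s = timeout_s
        self._armed_at = None  # time of the first action, or None if not armed

    def arm(self, now: float) -> None:
        """First action: the worker signals the intention to operate the device."""
        self._armed_at = now

    def activate(self, now: float) -> bool:
        """Second action: operates the device only if it was armed recently enough."""
        armed = (
            self._armed_at is not None
            and 0.0 <= now - self._armed_at <= self.timeout_s
        )
        # Any activation attempt consumes the arming, so a repeated or
        # accidental press cannot operate the device on its own.
        self._armed_at = None
        return armed


control = DoubleConsentControl(timeout_s=3.0)
print(control.activate(now=0.0))  # a single involuntary press is ignored
control.arm(now=1.0)
print(control.activate(now=2.0))  # two actions in sequence operate the device
```

A single accidental contact with the command therefore has no effect; only the deliberate sequence of two actions does.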
The above examples help us understand just how crucial it is that the whole production system, starting with the design of the working environment, as well as the machinery and equipment installed within it, takes into due account the needs of man, in order to help him carry out his actions without leading him into error. The working environment is accordingly moulded around the needs of the individual, promoting (as we have mentioned) his strengths and minimising his weaknesses. Knowing the causes that lead to human error is also crucial when planning the best corrective action. As we have seen, one of the key prerequisites for improving corporate safety is to educate and inform employees on performing the right action; in other words, they should always know ‘what to do’ and ‘how to do it’. But in cases where, for example, an expert nurse performs a blood test without having previously disinfected the patient’s skin, it would be entirely useless to require that she attend another training course. Her behaviour in performing the blood test, in fact, is ‘skill-based’, so that only inattention or forgetfulness can have caused the failure, and certainly not a lack of knowledge of the rule.
Errors due to ‘violations’, instead, i.e. where man purposely carries out a wrong action, breaking the rules, are a whole different subject. Violations can essentially be broken down into ‘routine’, ‘necessary’ and ‘due to wilful misconduct’. Leaving aside the last of these, which are intentional, that is, committed by a person who is fully aware of their outcomes, which are sometimes disastrous, and which involve deliberately acting against the rules, ‘routine’ violations are known both to workers and to supervisors/managers and are normally ‘accepted’ by the system, as they are sometimes deemed useful to optimise or accelerate the execution of an action. The violation is also not viewed as possibly impairing the system’s operation or as really increasing the risk to the worker. A routine violation, in fact, is always linked to its perpetrator (more or less) underestimating its consequences, both in terms of the probability of their occurrence and of the damage. A routine violation is a ‘short cut’, a way to simplify a rule that is seen as unnecessary, because it is considered too complex, redundant or overly cautious to be complied with in full. Removing a protective casing from a machine just to speed up work operations, not using personal protective equipment against falls from a height because a fall is considered unlikely, or not wearing a seatbelt only because it is uncomfortable, are some of the many examples of how man exposes himself to risk by breaking the safety rules.
One of the best strategies applied to avoid these violations is that of ‘setting the example’, especially on the part of employees who have a role of responsibility in the company. Man, in fact, tends to emulate others and modifies his behaviour based on the example set by others, be it positive or negative. In other words, he acts in tune with what is considered the standard practice within the system, often regardless of the behaviour that a rule prescribes. If, for instance, we were to transfer some workers who operate in a virtuous system and diligently follow the rules to a similar system that is not as virtuous and where transgression, instead, is the norm, many of them would after a certain time begin to conform to the new system, breaking the rules. The system should then set the example, encouraging the appropriate behaviour of its resources, also through the use of rewards for proper conduct and not merely by punishing transgressors. In this way, each of the system’s components will receive fair recognition for his work and will be driven to maintain his virtuous behaviour, advancing the entire organisation.
The urge to break a rule not only stems from considering it overly cautious or from not fully agreeing with its purpose, but also often stems from benefits (sometimes even financial) for the company or, as we have already seen, for some stakeholders within the system. However, we need to consider that the ‘normalisation of deviance’, i.e. the process that generates a steady erosion of standard procedures, where minor violations and irregularities are accepted and tolerated, could in the long run prove destructive to the system’s integrity. In the absence of accidents, in fact, these deviations will become ‘normalised’ and part of the routine. This will lead people within the system to slip into danger without being fully aware of it.
Such is the case, for instance, of ‘reverential bows’, which became famous after the tragic events that involved the cruise ship Costa Concordia. On 13 January 2012, helmed by Captain Francesco Schettino, the ship was preparing to perform a flyby along the coast when it hit a reef off the island of Giglio and rolled over on its side, causing the deaths of 32 people. I do not want to make direct references to the case in question, in which the trial ended with the conviction of Captain Schettino, whom the Court judged to be the sole person responsible for the heinous choice to perform a dangerous manoeuvre for selfish reasons alone. However, multiple sources report that this ill-considered action was not the only one of its kind in the history of navigation but, on the contrary, as many witnesses confirm, would seem to be a widespread practice. The reasons that may lead a ship captain to make a ‘bow’ are not exclusive to a particular type of person. For some, the motivation is strictly personal, i.e. the desire to sail near the places where he was born or grew up or where his loved ones live. For others, it is simply the taste for doing something daring, while some people suggest that economic interests spur some ship captains, since the majestic sight of a cruise ship just a few metres from the shore offers considerable publicity. The fact is that one or more parts of the system are driven to make a potentially risky manoeuvre, reassured in their choices by what we have already called the ‘normalisation of deviance’. The management thought that transgressing the rule was possible and that the outcome of their actions was well worth it, entailing only a slight increase in a risk that was wrongly considered minimal and controllable. The story of the Costa Concordia, instead, sadly reminds us that the often positive outcomes of risky behaviours should not lead us to consider them legitimate. The fact that we have often avoided disaster must not lead to an acceptance of the risk.
Approving risky behaviours to pursue personal agendas, in fact, produces countless negative effects: ‘it damages the culture of safety; it stretches the boundaries of the risk; it increases the tolerance of errors that do not cause damage; it increases the level of acceptance of risks in favour of interests tied with efficiency and productivity; it ultimately leads the safety system down a slippery slope in which accidents are increasingly possible’ .
Finally, ‘necessary’ violations, as we have seen, involve a deliberate and motivated departure from a well-devised procedure because the actual conditions make it impossible to follow or lead us, after careful thought, to prefer different solutions. One example is the case of the Apollo mission that we discussed earlier, in which the head of the mission authorised the use of propulsion systems to return to Earth against protocol because, in the given state of emergency, this solution was the only means of salvation. Another emerges if we again consider the case of the Costa Concordia: it is important to note how the manoeuvre made by Captain Schettino immediately after the disaster, which was to ground the ship on a rocky reef just north of the port of Giglio, although not contemplated in the emergency procedures, ended up facilitating rescue operations, avoiding the need to perform them in open sea, with all the consequences that would have involved. The impressive gash on the left side of the ship, in fact, would have caused it to sink, considering that as many as five compartments had filled up with water (while protocol allowed for three at the most). ‘Necessary’ violations, in short, occur whenever an unexpected event takes place, i.e. in those conditions where time, the context and compelling necessity impose a breach of protocol, which in that moment is therefore seen as inadequate, leading us to prefer unplanned actions that are the fruit of the ‘knowledge’ of the person who carries them out. These violations represent a ‘treasure’ that the organisation must capitalise on, whether or not they lead to the mission’s success. Every time that ‘necessary’ violations occur, the organisation must ask itself how it got to that point, what did not work, what it failed to forecast and what did not go according to plan.
Consequently, one or more briefings after the event will allow it to evaluate the pros and cons of the actions that broke the rules and will ensure that the organisation can learn from the situation by improving and implementing its own rules and procedures, and as such, displaying that proactive behaviour that is typical of resilient systems.
6. Error reporting incentives: the importance of communication and information-sharing at all levels
The method currently considered most credible for improving the level of safety in an organisation is to encourage human resources to report errors, as the only way to reduce those latent factors that, if not corrected early on, can lead to serious consequences for the system and cause potential accidents.
This perspective, as it turns out, is based on the idea that accidents are seldom caused by a single event, be it technological or generated by human error, but more often by a chain of events related to past failures inherent in the system, designed and built by man, and therefore in itself just as fallible.
In aviation, it is estimated that for every serious accident or catastrophically damaging event, no less than 30 inconveniences of medium importance occur, and these correspond to no less than 600 minor inconveniences. Therefore, since the catastrophic accident always originates from a broad base of malfunctions and minor inconveniences, if we could provide the system with a tool capable of having a sizeable impact on basic malfunctions, we could drastically reduce the overall number of catastrophic events. It is therefore vital that everyone, at any level and whatever his role within the organisation, be committed to actively flagging any malfunctions or errors discovered in his job, without underestimating the consequences that they can generate in the dynamics of events leading up to the failure.
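The reasoning behind these figures can be made concrete with a short calculation. The sketch below takes the ratio quoted above (1 serious accident, 30 medium inconveniences, 600 minor inconveniences) and assumes, purely for illustration, that the proportions hold strictly; the text itself only gives lower bounds (‘no less than’), so the numbers produced are indicative and not a prediction. The function name and the dictionary are hypothetical.

```python
# Ratio quoted in the text: 1 serious : 30 medium : 600 minor.
RATIO = {"serious": 1, "medium": 30, "minor": 600}


def expected_serious(minor_events: float) -> float:
    """Serious accidents implied by a number of minor inconveniences,
    ASSUMING strict proportionality (an illustrative simplification)."""
    return minor_events * RATIO["serious"] / RATIO["minor"]


before = expected_serious(1200)  # minor inconveniences before reporting and correction
after = expected_serious(600)    # after halving the base of the pyramid
print(before, after)
```

Under this simplifying assumption, halving the number of minor inconveniences halves the expected number of serious accidents, which is the intuition behind acting on the base of malfunctions rather than waiting for the catastrophic event.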
Latent errors, understood as malfunctioning technology, are often overlooked, especially when the error has not produced harmful consequences for the system. There are several reasons why this happens, but they all involve underestimating the possible consequences of the malfunction, both in terms of the probability that it can generate a loss and of the magnitude of that loss. There is also often a generalised ‘laziness’ on the part of the system’s members in reporting what they know, especially if they were not directly involved in what happened, thus preventing the organisation from solving the problem.
Therefore, minor malfunctions due to wear and tear, flawed materials, or lack of experience in controls or repairs, such as a broken mirror, a worn rope, a burnt-out LED or a protective cover that was not reassembled on the machine, are sometimes not given due attention, leaving minor, latent and potentially harmful faults within the system. Man, on top of it all, has the extraordinary, but in this case counterproductive, ability to adapt quickly to changes in the surrounding environment. After a while, malfunctions, faults or errors are no longer perceived as potentially dangerous: we get used to them, we no longer notice them. However, they remain inherent in the system, ready to take part in the chain of events that could lead to a failure.
Experience also shows that human errors in particular are often concealed, either by the person who committed them or by people who are aware of another person’s errors, in order to avoid negative repercussions or sanctions.
Unfortunately, faulty actions, understood as potentially dangerous human mistakes that have harmful repercussions for the system, are often dealt with solely by punishing the action in order to solve the problem. The organisation seeks to single out the culprit, who is considered the only person responsible for what happened because he was the last actor in the process or action that led to the error. The search for the culprit also triggers a natural defensive stance in the person who ‘made the mistake’, who will tend to hide his actions for fear of being punished. In this culture of blame, saying ‘I was wrong’ is tantamount to admitting ‘I am wrong’, that is, declaring oneself unfit to be part of a system in which one sees oneself as a possible cause of failure.
The person guilty of making the mistake consequently contributes little or nothing to resolving the issue, because he feels responsible and vulnerable and fears losing the esteem of his colleagues. He thinks he has betrayed the organisation’s trust, so he will do everything he can to hide his responsibility, hampering the analysis of what happened and consequently the system’s ability to solve the problem. The culture of isolating and punishing culprits, therefore, does not generate the organisational defences that help resolve the failure, but instead encourages defensive behaviours that are often inappropriate.
In this regard, the official report by Edgar Cortright on the causes of the oxygen tank explosion during the Apollo 13 mission, which we discussed earlier, made it clear that the explosion originated from an impact that the tank had suffered two years earlier during certain maintenance tasks. The report specifies that the incident had been duly recorded in the maintenance records, but that it had been judged to be of minor importance. The records, in fact, referred to a ‘slight collision’, playing down the consequences that it might have for the proper functioning of the tank.
It is unclear whether the collision was indeed ‘slight’, or on what basis it was judged as such, but it is certain that it was the origin of the malfunction, having caused damage inside the tank. It is therefore legitimate to ask whether a more alarming report from the maintenance staff, rather than one downplaying the seriousness of the incident, would have put the engineers in charge of the operating unit on greater alert and induced them to inspect more closely the possible damage suffered by the tank. The dynamics of the events suggest that the staff assigned to maintain or transport the tank, feeling somehow responsible for the mistake, downplayed what had happened for fear of negative repercussions from the organisation.
In the previously mentioned case of the Costa Concordia, too, the consequences could have been less serious if communication errors had not been made when handling the emergency. Captain Schettino repeatedly pointed out during the trial that, following the collision with the rock and the subsequent order to shut the watertight compartments, nobody had been able to tell him exactly how many of them were flooded. The investigation later clarified that five compartments had been flooded, while the recorded conversations between the captain and the crew members responsible for monitoring them mentioned only three. The difference, as we have already seen, is not trivial, because with only three compartments flooded, the ship would not have sunk.
The investigations carried out by the magistracy later revealed that one of the crew members, in a wiretapped conversation after the accident, told a friend that he had realised water was leaking from one of the hatches he was supposed to monitor. This important piece of information, for reasons that have not yet been clarified, did not reach the bridge, causing a delay in the implementation of the evacuation procedures.
The dissemination of knowledge at all levels, though, is important not only in an emergency or when a malfunction or unusual circumstance is uncovered; it is also an effective weapon for reducing injuries during the system’s normal operation. This is well known in the air force, where information and flight experiences are exchanged systematically in regular meetings held at the end of a drill. These meetings involve not only the pilots who took part directly in the mission, but also all the other squad members and, at times, even members of other units, and they contribute significantly to the dissemination of knowledge. This allows the system to learn from the mistakes made, fostering the development of a new wealth of knowledge that enables it to improve itself and become more efficient and safer.
Safety management in highly complex socio-technical systems cannot be reduced to mere compliance with corporate procedures and regulations, even assuming they are well structured, or to regular inspections and maintenance of plants, machinery and equipment. It is rather the result of a synergy between all the system’s components, both human and technological, whose interaction, based on shared process objectives and strategies, is an essential prerequisite for its success. Every element that can play a part at any level in creating risk must be taken into account, starting with human resources.
The working environment must therefore be designed to meet the needs of the people who work in it, bearing in mind their weaknesses, carelessness, fatigue and stress, all factors that can lead them to make a mistake. At the same time, they must be put in a position to exploit their strengths, by stimulating their creative and decision-making qualities in improving the system and managing accidents. The latter, moreover, are always present in highly complex systems, which evolve and change their conditions because they are driven towards innovation and progress.
The organisation must therefore be able to recognise and correct them promptly, especially in the case of small anomalies that have not yet shown their potential for damage, capitalising on the information it receives and, where appropriate, improving its rules and considering new scenarios.