Mission critical systems (MCS) are complex nested hierarchies of systems, subsystems and components with defined purpose, characteristics, boundaries and interfaces, working in harmony to deliver vital organisational functionalities. Upgrading MCS performance is inevitable when capability enhancement is required or new technologies emerge. Improving MCS, however, is approached with a degree of reluctance due to their sensitive role in organisations and the potentially disruptive impact of unexpected consequences of change. Innovation in MCS often appears in small steps that affect the entire system due to their highly interdependent structures. Effective management of innovation introduction in complex systems requires systemic/systematic processes that involve process management and collective analysis, scoping, decision-making and R&D, all of which rely on effective information sharing. This approach should run throughout the system and must include all aspects and stakeholders, utilising the skills and knowledge of all involved. This chapter describes the basic concepts and potential approaches that could be utilised to build intelligent systemic/systematic and collaborative environments for MCS innovation. Advances in ICT provide an opportunity to access the wider sphere of knowledge and support the systemic innovation processes. Adopting systemic approaches increases process efficacy, leading to more reliable solutions, shorter development lead times and reduced costs.
- machine learning
Mission critical systems (MCS) are systems whose performance is fundamental to continued operation or even survival of businesses or organisations. Failure of MCS can have catastrophic consequences for the businesses/organisations and their clients. Examples of MCS can be taken from a wide spectrum of systems: from sensitive defence systems, public services such as utilities to those supporting banking infrastructure and financial transactions. Even systems that facilitate smooth operation of many small businesses such as taxi companies are MCS where their integrity affects the livelihood of business owners and employees and is instrumental in providing satisfactory service to their clients.
A significant characteristic of MCS is their reliability and resilience, which are necessary due to their critical role in operational integrity. Other attributes considered in the design of MCS are modularity and redundancy. Hazard/crisis/disaster mitigation and recovery provisions are also common in MCS. Another important factor is cost. By nature, such systems tend to be complicated interrelated structures where time, effort and money have to be spent on validating their compliance, which can introduce heavy burdens during their development and testing.
Advances in technology present new opportunities to upgrade and modernise every system and solution, including MCS. Whilst innovation tends to find its way into every man-made system, the penetration rate of new technology is much lower when it comes to MCS. This is primarily due to the time and effort required for validation and assurance of operational reliability. The common wisdom seems to favour relying on older proven technology rather than taking on new developments, with their associated costs and the risk of unexpected failure that may arise when introducing new systems. Almost all existing MCS solutions utilise computers, both hardware and software, where complete validation of response in all circumstances is extremely difficult if not impossible. This is not just a theoretical concern but one born of experience, with many examples to prove the case.
The Royal Bank of Scotland’s (RBS) systems failure in 2012, which resulted from upgrading the payment processing software, and the more recent failure of Visa card processing in 2018, which resulted from a partial hardware failure in one of the switches in their data centre, are clear examples of why there is a reluctance to upgrade MCS. This problem is not limited to banks and applies to many other areas. The 2017 report that HMS Queen Elizabeth, Britain’s largest ever warship, relied on Windows XP for some of its functionalities highlights the same underlying concerns. These concerns have led to a situation where, nearly two decades after the release of the XP platform and several years after Microsoft retired it, it is still operationally utilised in a highly sensitive defence platform.
The occurrence of such failures, although it may be used by some as a reason to prolong reliance on older technology, is a clear example of why it is necessary to address this issue. The fact that RBS and Visa (and many others like them) found it necessary to introduce new innovative upgrades to their systems is proof that upgrading and the introduction of new technology are inevitable. By the same token, the outcome highlights the potential consequences of getting it wrong. In the case of HMS Queen Elizabeth, cyber security threats and the vulnerabilities of the Microsoft Windows XP system have already raised concerns about the wisdom of its continued utilisation. It is likely that this situation is subject to reviews which could result in its retirement (if this has not already happened).
This clearly demonstrates that it is not the change that is problematic but the approach to change. Decisions about change to any system especially the MCS should not be taken lightly for the reasons highlighted above. This is a management decision that must determine the time and process for introduction of new innovation.
Different organisations have different strategies to deal with this problem whether they are active in the development of MCS or not. These strategies are influenced by two important factors: the criticality of the system and the significance of innovation in the organisation’s prevalent culture.
At the lowest level, when system performance begins to lag, this usually indicates the need for change and that it is time to consider new innovative elements/solutions to maintain the system’s relevance. This often means that change is becoming inevitable and has to be seriously considered. Delays in facing such issues could have serious consequences for the organisation.
Another possible indicator of the need for change is the technology backdrop and the emergence of new technologies. For example, when smartphone manufacturers start releasing 5G mobile technology, it is no longer viable for service providers to drag their heels and rely on satisfactorily performing 4G platforms. Customer demands will eventually make their impact, and customers will vote with their feet if the new solution is not introduced.
Progressive innovating companies often have innovation departments that are actively involved in developing new innovative technologies relevant to their business whilst scanning the horizon for any new development that can be applied to their business. Some even afford their staff free time to pursue innovative ideas that may not even be related to their sphere of work.
It is important to be proactive in the search for and introduction of new innovation in all systems, including MCS applications. This is likely to reduce cost and maintain control well before systems become obsolete.
Once the time for change has been established, capability upgrades and the application of innovation in their realisation must be handled with great care and with consideration of the likely impact, consequences and costs of potential changes. Consideration should be given to all areas, especially technology capability and maturity, and this applies to all changes, from small steps (localised improvements) to overall system capability enhancements through system overhaul.
Organisations’ management philosophy and strategy are usually set according to the vision of their founders. At the same time, this philosophy has a direct relationship with the application area and the market place for the organisation’s products and services.
2. MCS design and development
Design and development of MCS require much more stringent levels of project management than noncritical applications. Key considerations in MCS design, development and innovation are as follows:
Strong process control through effective leadership is a necessity when it comes to development and successful delivery of MCS.
Clear definition of objectives is key in the development of MCS applications. This should define every aspect of the project, from scoping, requirement planning and capability provision to business objectives that include budgeting and delivery schedules.
2.3 System architecture/construction plan
Suitable system architecture that considers modularity in design is particularly important in MCS as it allows small step/localised innovation. This is because most innovations in critical infrastructure are introduced in small steps.
2.4 Availability and redundancy
MCS applications by definition need to be available and, in many cases, need to have 100% uptime. When criticality levels demand, systems must be designed with redundant elements to ensure uninterrupted service availability. Firms are constantly trying to improve the availability of their critical services, with many targeting ‘five nines’ uptime (i.e. 5.26 minutes downtime per year).
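The ‘five nines’ figure quoted above follows from simple arithmetic, which the sketch below reproduces. It also illustrates why redundant elements raise availability, under the simplifying assumptions (not stated in the text) that replicated units fail independently and that one working unit is enough to deliver the service. The function names are illustrative only.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Maximum yearly downtime (in minutes) permitted by an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def parallel_availability(unit_availability: float, replicas: int) -> float:
    """Availability of `replicas` redundant units.

    Assumes independent failures and that a single working unit suffices;
    real systems often violate both assumptions (shared power, shared software).
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - unit_availability) ** replicas


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999, 0.99999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target:.5f} availability -> {minutes:9.2f} minutes of downtime per year")
    # Two independent units that are each 99% available
    print(f"two replicas at 0.99 each -> {parallel_availability(0.99, 2):.4f}")
```

Under these assumptions, duplicating a 99% available unit yields roughly 99.99% availability, which is why redundancy is the usual route to high uptime targets rather than perfecting a single component.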
An MCS application must safeguard its users against failure. Failure may be due to system design/performance, changes in operation, human error, loss of information integrity or malicious intervention. A detailed assessment of all potential pitfalls must be carried out at the earliest possible stage in the design process to ensure resilience. Highly resilient systems are usually designed without any single points of failure (SPOFs). In MCSs with no SPOFs, the failure of a module, system component or site will not halt the entire operational function. Achieving such levels of resiliency often requires a relatively large investment of time and effort in the design phase of the project.
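One way to make the SPOF idea concrete is to model the system as a directed dependency graph and test, for each component in turn, whether the service path survives that component's removal. The following is a minimal brute-force sketch; the topology and component names are hypothetical and chosen purely for illustration.

```python
from collections import deque


def reachable(graph, start, goal, removed=None):
    """Breadth-first search: can `goal` be reached from `start` while `removed` is down?"""
    if start == removed or goal == removed:
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, start, goal):
    """Components whose individual failure cuts every path from `start` to `goal`."""
    components = set(graph) | {n for targets in graph.values() for n in targets}
    components -= {start, goal}
    return sorted(c for c in components if not reachable(graph, start, goal, removed=c))


# Hypothetical topology: duplicated load balancers and app servers, one database.
topology = {
    "client": ["lb_a", "lb_b"],
    "lb_a": ["app_1", "app_2"],
    "lb_b": ["app_1", "app_2"],
    "app_1": ["database"],
    "app_2": ["database"],
    "database": ["storage"],
}

print(single_points_of_failure(topology, "client", "storage"))  # ['database']
```

The check is deliberately exhaustive rather than efficient; for large graphs an articulation-point algorithm would be the more usual choice. It also covers only structural connectivity, not shared causes such as a common power supply or software version.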
2.6 Disaster mitigation/recovery planning
Despite all the hard work put into the design and development of MCS, system failure can occur on rare occasions. Thorough investigation of risks and structured planning provide a clear view of potential risks and their mitigation. Consideration and implementation of disaster response, either automatic or manual, through clear procedures for dealing with unexpected circumstances is a key requirement. Such considerations must be catered for during the design stage.
2.7 Transition state
Another factor that must be carefully considered and managed, especially when upgrading systems through innovation, is management of the implementation process and its likely impact on system availability and performance. This may be the main barrier that affects some of the more frequently utilised approaches to system improvements and upgrades.
With this brief introduction, it is not difficult to conclude that intelligent systemic/systematic innovation in MCS is essentially a management problem with technical dimensions. This process consists of two key strategic elements: a specific process for the introduction of innovation and change, including identification of the right components for change, and a mechanism for choosing the right time for its introduction. Such strategies are often based on balancing clients’/market needs, demands, expectations and/or trends with the technology horizon on the one hand and commercial priorities for the business on the other. No doubt, the same level of scrutiny required in the design and development of MCS is also applicable to its upgrade and the initiation of new innovation.
3. Intelligent systemic/systematic innovation in MCSs
At this stage it is necessary to mention that although this review is intended as a general guide for field practitioners, the increased frequency and widespread application of computers in modern MCS solutions has skewed its focus. Reliance on computer hardware and software in critical management and control has shifted the focus onto MCS solutions that rely on computers as a key element of their design and composition. This covers almost all contemporary MCS that control and manage present-day critical application areas.
Furthermore, the proposed process is not suggested as a replacement for the current knowledge, expertise and practice in design of the mission critical systems but an extra supplement to be utilised by field practitioners to support early introduction of new technology innovation in the existing MCS applications.
It was demonstrated that when considering the introduction of changes and upgrades, especially concerning MCS solutions, what is in question is not the inevitability of system enhancement/innovation but its timing and the approach to it. The answers to questions such as ‘When is it time to heed demands for improved services?’, ‘Until when will the existing arrangements remain viable?’ or ‘When does the current system no longer serve its intended purpose?’ are at the heart of the decision-making process about the timing of the introduction of innovation and must therefore be answered. Furthermore, even when the need to change is established, there remains another question as to how this improvement should best be conducted.
One of the most relevant tools that can answer such questions and help achieve the objectives of MCS designers is systems theory and its branches, systems thinking and systems engineering. Systemic and systematic approaches to the development of innovative solutions have been utilised in many areas, providing structured paths for the creation of new solutions, especially when the objectives relate to large and complex multidisciplinary projects. A systematic approach demands a disciplined process and introduces organised development roadmaps. It primarily focuses attention on key objectives and considers their delivery through an assured path. A systemic approach, however, guides the process through detailed and exhaustive strategies that ensure all eventualities and circumstances are covered, to enhance confidence in delivering reliable performance and operational resilience. Whilst most if not all developments follow the systematic path, all MCS should adopt the systemic approach due to their sensitive nature.
The thesis followed in the presentation of the arguments of this chapter is to address the above two critical questions and propose new approaches that could be utilised in innovation and functionality enhancement in MCS.
The systems approach to new development is well established and covers every aspect of projects, ranging from prospecting, scoping, planning, design, testing, evaluation, etc. What has not been sufficiently considered in the relevant literature is the establishment of a mechanism to signal the potential opportunity or time for change based on market (demand) and technology trends. Decisions about the direction/choice of emerging technologies for implementation, the target system elements and the appropriate timing are amongst the important questions whose answers could put an end to the overreliance on old technologies that deprives users of MCS of reliable and up-to-date service.
Another management dilemma is the approach required to embed innovation within the systematic and systemic MCS design process. Creating a framework that carves a separate track for systemic innovation as part of the design process should create a vehicle for delivering much needed progress.
The hypothesis and the proposed solution explained in this chapter concern the development of new systemic methods that can support more regular development and adoption of system enhancements, leading to continued performance of MCS solutions in line with new technology advances. The proposed approach has two key aspects: first, a collaborative development environment built on systemic innovation principles and, second, the deployment of artificial intelligence (AI) in the process and the creation of intelligent agents that can support users’ and developers’ objectives.
4. Artificial intelligence: a systemic support tool
In view of the proposed inclusion of AI in this approach as a supporting tool, it is necessary to highlight a few points for clarity.
Whilst the application of AI and its capabilities is a proven reality, AI is somewhat controversial. A recent report broadcast by the BBC of a machine capable of accurately predicting the decisions of the European Court of Human Rights 79% of the time leaves little doubt about the potential capabilities of intelligent machines.
The capability of AI to exploit data generated by social networks for political manipulation has already been established and roundly condemned for its potential for abuse. Whilst no one yet suggests that judges in the European court should be replaced by computers anytime soon, the ability of AI and machine learning to support quick decision-making in times of crisis is well established. The possibility of analysing data and literature to locate hard-to-find information, and the potential of intelligent systems to analyse multiple scenarios and predict the likely outcomes together with the degree of confidence in the predicted results, are capabilities that can be taken advantage of as part of MCS development and its life cycle.
4.1 AI in MCSs
MCSs by nature are complex systems with multiple internal and external dependencies. Different parts of the system (both primaries and secondaries) are often designed and developed by multiple external vendors. In general, the focus of MCS designers is not on cost but often on preserving life, nature or the business [2, 3]. Rigorous recovery requirements are imposed on the system as future existence may be at stake in case of delayed or incomplete recovery.
Geographically dispersed teams often contribute to the MCS project. Documentation and user manuals can be in multiple languages and styles. Many legacy MCSs often lack proper documentation and disaster recovery plans. They might use obsolete product/software with limited or no third-party support and maintenance. Managing all these complexities in any noncritical system is proven to be challenging to say the least. The challenge, however, would be even greater when dealing with multiple MCSs.
In MCS, agility in response and service uptime are the key constraints. The system has a very concise and clear set of requirements. The system should always act in a deterministic fashion based on the requirements and nothing more. The problem occurs when stochastic bottlenecks disturb the normal operation of the system. This introduction of chaos into the orderly operation of the system requires immediate attention and response. As in open-heart surgery, one is not able to shut down the entire system (the patient’s heart in this case) in order to fix a problem. Instead, the system needs to be maintained, fixed or replaced with minimum downtime (~5 minutes a year in 99.999% SLAs) or sometimes without any downtime at all.
During the past 60 years, many frameworks [6, 7, 8], procedures [4, 9] and systems were created to support designing and managing the complexities of MCSs. These efforts have had major effects on improving three aspects of MCSs: reliability, resiliency and recovery. In the following sections, the authors explore three aspects of MCSs that can benefit from AI and machine learning algorithms and that have previously gained less attention in the literature.
4.2 AI for MCSs rapid adaptation to risk and immediate response
Assume that an MCS, SystemX, is responsible for orchestrating a fleet of autonomous delivery vehicles and road infrastructure. A major failure occurred in the system on Monday morning at around 02:00 AM. The monitoring systems failed to alert the shift staff in the control centre. Calls were made to the customer facing team with reports of autonomous vehicle failures. The customer facing team aggregated the data and, once a certain threshold was met, escalated the call to the technical team. The technical team gathered extra information by checking the logs and sending out a field engineer to the geographical locations where failures had been reported. The team managed to revive the system by 9:30 AM on Monday morning. By 10:30 AM, an official press release was published on the company’s website and social network platforms with minimal information about the actual root causes of the problem, simply because such information was not available at the time. Social media, however, erupted in online outrage with scandalous reports. This resulted in a decline in the company’s share price when the stock markets opened later on that afternoon. There was also evidence of damage to the company’s reputation/brand. The company managed to deal with the technical problem and reinstate the services based on its well-structured disaster recovery plans. The technical team addressed the issue in the most efficient and effective manner. After a few days, it emerged that the failure was due to a planned software upgrade of a noncritical component of the MCS. The senior managers dealt with the public side of the issue, making sure that end-users’ expectations were managed properly and any potential consequences were mitigated. Two important questions come to mind:
(1) What are the unexpected consequences of the failure?
(2) Could the company have prevented the nontechnical consequences of the failure or at least responded to them more appropriately to reduce the overall damage?
The current disaster recovery plans are mostly designed to deal with the problem at hand in the shortest amount of time. This makes sense, as during a disaster the highest priority should be given to saving lives/nature, minimising the damage and restoring operation completely. However, service interruptions often come with a series of expected and unexpected consequences. The real consequences of an event are hard to predict as there are many socioeconomic factors involved. They will often manifest in the form of loss of customers, reputation, share value or general trust in the brand/services. In the worst-case scenarios, the failure may be detrimental to public safety for years to come (e.g. the BP Gulf of Mexico oil spill in 2010). Rebuilding trust in the service/system and rectifying secondary issues, although possible, is a costly exercise and can potentially take months or years. AI and machine learning can be utilised in such scenarios to minimise the consequences.
Most well-designed MCSs come with an extensive set of monitoring and alert systems. They are designed to gather data from a series of sources such as physical sensors, software/application activities, public resources and user feedback. The data is gathered, aggregated and presented to the technical teams, who then act on it. This is useful to make sure that the disaster/failure is captured and fixed as soon as possible. It is sufficient to deal with the problem at hand and also extends to rectifying any expected chain of consequences (i.e. taking care of external connected services, compensation and recovery, etc.). It is evident, however, that the process is very much reactive.
The real value of MCS-related data can be unleashed using machine learning algorithms [13, 14]. Prediction and anomaly detection algorithms can run silently in the background, going through millions of lines of sensor data. They can also go through the public information/census on the MCS of interest. The algorithms are capable of predicting how markets or the public would react to a specific event/disaster related to the target MCS. They can also outline the potential unexpected consequences of a certain event by looking at historical data. Such algorithms will be able to provide timely recommendations on what needs to happen next in the very early and crucial period of an incident, also known as the golden minutes. During this time all efforts are focused on resolving the problem at hand. Going back to our earlier scenario, SystemX, an intelligent system would be able to conduct the following tasks while the technical team are busy fixing the problem:
Predict the time that it would take for an event to trend on social media and publish proactive notifications.
Predict the changes in the stock market value of the company so precautions can be made to minimise damage.
Predict the public reaction based on similar types of failures in the past.
Predict the potential chain of events based on previous evidence so they can be prevented earlier.
This intelligent tool would be an extension to the existing processes to enable rapid response to failures and early mitigation of the future risks.
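As a rough illustration of the kind of background anomaly detection described above, the sketch below flags sensor readings that deviate sharply from their recent history using a rolling z-score. This is a deliberately simple statistical stand-in, not one of the machine learning methods cited in the text; the window size, threshold and sample data are assumptions made for the example.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_anomalies(readings, window=20, threshold=3.0):
    """Return (index, value) pairs for readings that differ from the mean of the
    preceding `window` readings by more than `threshold` standard deviations."""
    if window < 2:
        raise ValueError("window must be at least 2 to estimate a standard deviation")
    history = deque(maxlen=window)  # keeps only the most recent `window` readings
    anomalies = []
    for i, value in enumerate(readings):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append((i, value))
        history.append(value)
    return anomalies


# Hypothetical sensor trace: a stable signal with one abrupt spike at index 30.
trace = [50.0 + (i % 5) * 0.1 for i in range(40)]
trace[30] = 75.0

print(rolling_zscore_anomalies(trace))  # [(30, 75.0)]
```

A detector of this kind could feed the early-warning tasks listed above, for instance by triggering a proactive notification as soon as an anomaly is confirmed rather than after customer calls cross an escalation threshold.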
4.3 AI, MCSs and critical regions (CR)
Certain components/areas in an MCS are categorised as critical regions. A CR is the beating heart of the MCS. As with SPOFs, its failure is highly likely to result in a major disruption of the whole system. A single MCS may comprise multiple CRs. The CRs are often identified and documented during the design phase. They are closely monitored at all times by sensors or human/software examination. They are maintained carefully and replaced on a regular basis. As the system evolves, it becomes harder to identify or track new CRs. Every system goes through extensions, replacements and overhauls during its life cycle. During such processes new or undetected CRs may emerge. Machine learning algorithms can analyse historical logs to identify minor failures in the system and investigate the overall impact of each failure on the entire system. The AI-enabled tool can eventually recommend new CRs in the system that might have gone unnoticed in the past. Referring to our SystemX example, the incident may have been preventable if the component being upgraded had already been identified and flagged as a CR.
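To indicate how historical logs might surface candidate CRs, the sketch below scores each component by the number of other components that failed shortly after it did. It is a simple co-occurrence heuristic under assumed inputs (a list of timestamped failure events and a fixed time window), not a learned model, and co-occurrence does not establish causation; the component names are hypothetical.

```python
from collections import defaultdict


def rank_critical_region_candidates(events, window_seconds=300):
    """Score components by how many *other* components failed within
    `window_seconds` after one of their own failures.

    `events` is an iterable of (timestamp_seconds, component) tuples.
    Returns (component, score) pairs, highest score first.
    """
    ordered = sorted(events)
    affected = defaultdict(set)
    for i, (t0, source) in enumerate(ordered):
        affected[source]  # make sure every component is listed, even with score 0
        for t1, other in ordered[i + 1:]:
            if t1 - t0 > window_seconds:
                break  # events are sorted, so later ones are outside the window too
            if other != source:
                affected[source].add(other)
    return sorted(((c, len(a)) for c, a in affected.items()),
                  key=lambda item: (-item[1], item[0]))


# Hypothetical failure log: (seconds since start of log, component name).
failure_log = [
    (0, "config_service"), (60, "routing"), (120, "vehicle_gateway"),
    (5000, "billing"),
    (9000, "config_service"), (9100, "telemetry"),
]

for component, score in rank_critical_region_candidates(failure_log):
    print(f"{component:16s} followed by failures in {score} other component(s)")
```

In this made-up log, `config_service` ranks first because its failures are repeatedly followed by failures elsewhere, which would make it a candidate for review as a CR even if it had been classified as noncritical at design time.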
4.4 AI, MCS and lessons learned
Once failures are dealt with and resolved, teams often document what, where and when things went wrong, the underlying causes and the lessons learned. Learning from failure is the key factor in making sure that similar issues will not occur in the future. Many industries work with the policy of transparency and no-blame culture to make sure that entities share their failures so that others can learn from their experiences. One example is the aviation industry in which airlines are obliged to report failures and incidents to prevent them from happening again.
Extracting knowledge from previous incidents is a convoluted process and often only scratches the surface of the issues. In our SystemX scenario, the incident may have been preventable if previous upgrade-related issues had been flagged and described in fine detail to the technical teams. Online communities and forums are overflowing with members’ descriptions of various problems, issues or system failures and with mitigating advice based on member expertise and experience. It is hard to find an issue which has not been experienced by someone else in another related or unrelated field. It is, however, an impossible job for a human to aggregate the available public information before conducting a task. Machine learning algorithms such as deep learning can help. Deep learning is a machine learning technique that does what comes naturally to humans: learn by example. Deep learning algorithms can find relationships between independent pieces of information and detect patterns of interest. They can go through private and public incident reports to reveal valuable information which is hidden from unsuspecting human eyes. In some cases, they can even exceed expert-level performance. Let us revisit our SystemX example: the intelligent tool could have been consulted prior to the software upgrade to identify whether any failures had occurred while performing a structurally similar task sometime in the past, albeit in a different industrial domain, highlighting the underlying causes and consequences of their occurrence. This would surely have been of value to the management and technical teams in charge of planning modifications or upgrades.
What worries scholars and the public is the prospect of machines making decisions that are usually taken by humans and that require the application of morality and ethical standards. It is this aspect that creates ethical dilemmas and moral conundrums, to the extent that leading philosophers and thinkers, no less than the late Stephen Hawking, have raised concerns and recommended caution.
Intelligent algorithms do not need to replace humans but can in fact go hand in hand with them to extend human capabilities. This is particularly valuable in the case of MCS. AI-enabled tools do not need to take control, but they can utilise the available data to present the bigger picture to decision-makers. This allows humans to focus their efforts on what matters in MCSs: reliability, resilience and recovery. Prevention and dealing with consequences can be delegated to the AI-enabled tools.
What is suggested in this context is not placing machines at the centre of the decision-making process and replacing humans but using them to provide decision support networks that inform system designers. What is covered in the course of this chapter is a new approach to disrupt the development of MCS and to harness knowledge, competence and capabilities in augmenting the performance of MCS and assisting in their continued development and modernisation. What has also been acknowledged and recommended is mindfulness about the ethical and moral standards that should be applied in the decision-making process during the design, development and operational phases whilst considering the exploitation of such technologies.
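The idea of consulting past incident reports before an upgrade can be illustrated with a very small retrieval sketch. Note that it uses TF-IDF weighting with cosine similarity, a much simpler technique than the deep learning discussed above, purely to show the shape of such a lookup; the reports and the query are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Build one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # smoothed inverse document frequency, so every weight stays positive
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def most_similar_reports(query, reports, top_k=3):
    """Return up to `top_k` (score, report) pairs, most similar to `query` first."""
    vectors = tfidf_vectors(reports + [query])
    query_vec = vectors[-1]
    scored = [(cosine(query_vec, vec), report) for vec, report in zip(vectors, reports)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Invented incident summaries from unrelated domains.
reports = [
    "Planned software upgrade of a logging module caused gateway timeouts",
    "Power supply failure in data centre switch interrupted card payments",
    "Operator entered wrong valve setting during routine maintenance",
]
query = "planned upgrade of noncritical software component"

for score, report in most_similar_reports(query, reports):
    print(f"{score:.2f}  {report}")
```

A keyword-based lookup of this kind only matches shared vocabulary; finding structurally similar incidents described in different terms, as envisaged in the text, is where learned representations would be expected to add value.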