Microprocessor Architectures for Safety-Critical Autonomous Systems
Future autonomous transport systems, such as driverless cars and small unmanned aircraft (or ‘drones’), must evolve to adopt ‘fail-operational’ functionality in order to achieve full autonomy whilst remaining safe.
Existing systems with low levels of autonomy, which employ man-in-the-loop control, are instead permitted to ‘fail-silent’, requiring conditional human intervention to mitigate hazards. In the case of cars, this still requires a human in the driving seat. For drones and unmanned aircraft, it requires that the remote pilot maintains eyes on the aircraft. For this reason, operation of drones Beyond Visual Line of Sight (BVLOS) is currently illegal in the UK.
Current legislation regarding the operation of autonomous airborne and surface vehicles is rightly prohibitive, but this situation will surely change as the technology, standards and legal safety cases mature to service a clear market need. The commercial potential for autonomous systems is huge and will change our society. It seems likely that this will happen very soon too; the UK Civil Aviation Authority (CAA) has already declared its ambition to enable the safe integration of Unmanned Aircraft Systems (UAS) into the UK’s total aviation system. This means unmanned aerial vehicles flying autonomously, in controlled airspace and clearly beyond the visual line of sight of their operators.
The need for compact, affordable yet robust fail-operational digital control systems is clear. This article explores the implications of this from a system architecture and electronic hardware perspective, with a specific focus on assuring functional safety.
Safety Standards and requirements
One of the foundational Standards for functional safety in electronic systems is IEC 61508. This is a pan-industry set of documents from which other Safety Standards have been derived, such as ISO 26262 for the automotive industry, and IEC 61511 used by the process industry sector. The benchmark for design assurance in the aircraft industry is DO-254 for electronic hardware, along with its embedded software counterpart – DO-178C. Again, these documents form the basis from which aircraft equipment manufacturers have developed their own design processes.
The standards usually don’t tell you what to design, or give away many clues about how to implement the system from a hardware perspective. Once you have a design, however, they do a good job of explaining how to demonstrate that it is fit for purpose, in a process known as Design Assurance. Those experienced in designing safety systems often fall back on legacy best practice from their respective industries, and many businesses prepare their own design rules for internal consumption. But if you are new to safety system design, it can be difficult to know where to start.
The opening gambit of these standards is always that the system designer needs to know what system-level hazards he or she is trying to mitigate. A formal Hazard Analysis is conducted, which records the expected system hazards together with their severity and likelihood of occurrence. In the autonomous car example, a top-level hazard could be ‘Collision with other road vehicles’. The system designer must then build a conceptual model of the system in order to identify systematic failures which might lead to this hazard, and then assign qualitative and quantitative safety requirements to functions, and ultimately to physical components of the system. The requirements are derived from the severity and likelihood of occurrence. This is a lengthy exercise, but once it is complete, conclusions can be drawn about the type of system architecture required to reduce the risk of the hazard occurring to acceptable levels. To put the term ‘acceptable levels’ into perspective, the civil aviation industry considers a “maximum average probability of catastrophic failure per flight hour of 1 × 10⁻⁹” as being ‘acceptable’ and ‘extremely improbable’.
IEC 61508 manages safety requirements through a tiered set of Safety Integrity Levels (or SILs). There are four ratings, ranging from SIL 1 to the most stringent, SIL 4. SILs can be attached to system functions and to physical components and assemblies, such as circuits, digital logic and microprocessors. Each SIL rating attracts requirements which identify qualities that must be present in the implementation, such as physical segregation of safety-critical functions and signals, in addition to quantitative figures for system failure rates. To achieve SIL 4, a system must also have a probability of dangerous failure of less than 1 × 10⁻⁸ per hour. These figures are progressively relaxed for lower SIL ratings, as shown below:
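The quantitative side of the SIL ratings can be sketched as a simple lookup. The following minimal Python example assumes the IEC 61508 bands for high-demand or continuous-mode operation (probability of dangerous failure per hour); the function name and the use of 0 for ‘no SIL supported’ are illustrative choices, not terms from the standard. It only covers the failure-rate target, which is one of several requirements attached to each SIL.

```python
# Upper limits on the probability of dangerous failure per hour (PFH),
# assuming the IEC 61508 bands for high-demand / continuous-mode functions.
# Each entry is (SIL, exclusive upper limit in failures per hour).
SIL_PFH_LIMITS = [
    (4, 1e-8),
    (3, 1e-7),
    (2, 1e-6),
    (1, 1e-5),
]


def highest_sil_supported(pfh):
    """Return the highest SIL whose failure-rate target is met by `pfh`.

    Returns 0 when the rate is too high for any SIL. Illustrative helper:
    a real SIL claim also depends on qualitative and architectural
    requirements that are not modelled here.
    """
    for sil, upper_limit in SIL_PFH_LIMITS:
        if pfh < upper_limit:
            return sil
    return 0


if __name__ == "__main__":
    for rate in (5e-9, 5e-8, 0.6e-6, 2e-5):
        print(rate, "->", highest_sil_supported(rate))
```

A failure rate of 0.6 × 10⁻⁶ per hour, for example, falls inside the SIL 2 band in this sketch.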
The process of assigning SILs to functions is an inexact science, but it basically boils down to the severity of consequences and frequency of occurrence of the failure event, as discussed. For illustrative purposes, assuming the frequency of occurrence is low, catastrophic or major system failures which could lead to one or more fatalities are likely to attract a SIL 4 rating. Hence, digital control systems for autonomous vehicles and autonomous BVLOS drones fall squarely into this category.
With these considerations in mind, it soon becomes apparent why most commercially available drones are unsuitable for BVLOS operations, and why the police and CAA can prosecute those flying them over urban areas. The overwhelming majority of commercially available drones employ a single centralised microprocessor to implement the guidance, navigation and flight control functions. This represents a single point of failure in the system. With reference to MIL-HDBK-217F, which offers guidance regarding electronic component reliability, the calculated probability of failure per flight hour of an industrial 32-bit microprocessor operating in an uncontrolled, airborne uninhabited environment evaluates to 0.6 × 10⁻⁶ which is, at best, SIL 2. Further work is therefore needed to raise the flight control system hardware to SIL 4 status before a safety case can be constructed to support legal operation of drones over population centres, fully autonomous flight, and operation beyond visual line of sight. Similar arguments exist for autonomous road vehicles.
Fail-safe vs. Fail-operational
Another important consideration is what the system is required to do when a failure does occur. Generally, safety systems are required to either ‘fail-safe’ or ‘fail-operational’. Upon failure, a fail-safe system enters a known safe state by either deactivating or degrading its functionality. Fail-safe systems are employed, for example, in factory environments for disabling hazardous machinery on detection of a fault condition. Fail-safe systems have an immediate safe state. Fail-operational systems, on the other hand, continue to perform their required function even in the event of one or more failures, depending on their required level of fault tolerance. Autonomous vehicle control systems must continue to operate in the event of one or more system failures, and are therefore required to have fail-operational functionality.
System architectures which provide fail-operational functionality do so through redundancy. Safety-critical system elements are duplicated so that if one channel fails, the system continues to operate by switching to a reversionary state through a dynamic process of voting, or by falling back on pre-defined reversionary system configurations. Some of the processor architectures for doing this are explored below.
1oo1 Simplex Architecture
The simplest and cheapest form of microprocessor architecture is the ‘1oo1’ (One-out-of-one) or simplex architecture. This comprises a single microprocessor. No fault tolerance or failure mode protection is provided by this system. During failure conditions, the outputs of the microprocessor are unpredictable, so the system can fail into either a safe or a dangerous state. If high levels of safety integrity are required, it is therefore unsuitable for fail-safe systems and, because it has no tolerance of faults, it is also unsuitable for fail-operational systems. It should be noted that whilst this type of architecture is intolerant of faults, the probability of system failure could still be very low if the constituent elements of the system are themselves sufficiently reliable. This may be the case for very simple analogue or mechanical systems, but for digital microprocessors, where the physical device complexity is high, 1oo1 architectures are not usually employed for safety-critical systems. Incidentally, this type of processor architecture is employed in most commercial drone autopilots today, and is a clear shortcoming of these products.
1oo2 Duplex Architecture
The 1oo2 (One-out-of-two) duplex architecture was developed to improve the fault tolerance of simplex 1oo1 systems. However, because the processors are not guaranteed to fail silent, this type of architecture is not suited to fail-operational systems: if either of the two processors fails, the combiner has no way of knowing which microprocessor is generating the correct output. It can, however, be used for fail-safe systems, because the healthy processor can always disable the output via the combiner.
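The fail-safe behaviour of a 1oo2 combiner can be sketched in a few lines of Python. This is a toy model under stated assumptions: `ChannelOutput`, `SAFE_STATE` and `combiner_1oo2` are hypothetical names, each channel is assumed to raise a disable request when it detects a problem, and commands are compared for exact equality.

```python
from dataclasses import dataclass

# Hypothetical marker meaning "output disabled / de-energised".
SAFE_STATE = None


@dataclass
class ChannelOutput:
    command: float          # value the channel wants to drive
    disable_request: bool   # True when the channel wants the output shut off


def combiner_1oo2(a: ChannelOutput, b: ChannelOutput):
    """Fail-safe 1oo2 combiner (illustrative).

    Either channel alone can force the safe state. A disagreement between
    the two commands also forces the safe state, because the combiner
    cannot tell which channel is correct.
    """
    if a.disable_request or b.disable_request:
        return SAFE_STATE
    if a.command != b.command:
        return SAFE_STATE
    return a.command


if __name__ == "__main__":
    print(combiner_1oo2(ChannelOutput(1.0, False), ChannelOutput(1.0, False)))
    print(combiner_1oo2(ChannelOutput(1.0, False), ChannelOutput(7.5, False)))
```

Note that the only response to a fault in this sketch is to shut the output off, which is why the arrangement suits fail-safe but not fail-operational use.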
1oo1D Architecture
‘1oo1D’ stands for one-out-of-one-with-diagnostics. The idea here is to create a simplex architecture which is assured to fail silent during fault conditions. This is achieved via dedicated internal diagnostics. If the device detects an internal error, it sets an error pin, which can be used by an external circuit to send the system into a predetermined safe state. So, whilst this architecture does not solve the fault tolerance and availability issue, it can be set up to fail in a predictable and safe way, and is therefore appropriate for use in fail-safe systems. It is theoretically possible to achieve SIL 3 with 1oo1D systems for fail-safe applications. The diagnostic function is achieved by having two processing cores inside the same integrated circuit: a primary core and a checker core. Both execute the same code in lock-step. Dedicated hardware inside the device continually compares the registers in both cores for agreement. If the two cores disagree, the compare module sets the error pin on the device. This event can often also be used to trigger internal interrupts, or to reset the device.
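The lock-step idea can be modelled in software, although in a real device the comparison is done by dedicated hardware. In this minimal Python sketch, `LockstepChannel` and `control_law` are hypothetical names, the two ‘cores’ are simply two callables, and the error flag is assumed to latch until the device is reset.

```python
def control_law(x):
    """Hypothetical stand-in for the code executed by both cores."""
    return 2 * x


class LockstepChannel:
    """Toy model of a 1oo1D channel with a primary and a checker core."""

    def __init__(self, primary, checker=None):
        self.primary = primary
        # By default the checker runs exactly the same code as the primary.
        self.checker = checker if checker is not None else primary
        self.error_pin = False  # latched once a mismatch is seen

    def step(self, inputs):
        """Run one cycle; return the output, or None when failed silent."""
        if self.error_pin:
            return None
        primary_result = self.primary(inputs)
        checker_result = self.checker(inputs)
        if primary_result != checker_result:
            self.error_pin = True  # compare error: channel goes silent
            return None
        return primary_result


if __name__ == "__main__":
    healthy = LockstepChannel(control_law)
    print(healthy.step(3), healthy.error_pin)
    # Simulate a fault by giving the checker a corrupted computation.
    faulty = LockstepChannel(control_law, checker=lambda x: 2 * x + 1)
    print(faulty.step(3), faulty.error_pin)
```

The channel never emits a value that the two cores disagree on, which is the fail-silent property that the larger architectures rely upon.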
2oo3 Triplex (or TMR) Architecture
2oo3 (Two-out-of-three) systems, also often referred to as Triple Modular Redundant (TMR) systems, employ three equivalent and active processing systems, each running the same algorithms. The outputs of the three processors are fed into a majority voting system, which decides the ‘correct’ output by checking for agreement between at least two processors. If any one of the three processors fails, its output is automatically rejected by the majority voting system. Each of the three independent channels is simplex and can fail in isolation, into either a safe or a dangerous state. This approach is commonly used in aircraft flight control systems. Its major drawback is that it is physically complex, and costly from a hardware perspective. Whilst this approach has been proven to be highly effective for fail-operational systems, it is unattractive for use in smaller autonomous systems due to its physical size and complexity. More cost-effective fail-operational systems can instead be built using the 2oo2D or ‘dual-dual’ architecture.
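Majority voting across three channels can be sketched as follows. This is a simplified illustration: `vote_2oo3` is a hypothetical name and outputs are compared for exact equality, whereas a practical voter for analogue quantities would normally compare within a tolerance.

```python
def vote_2oo3(a, b, c):
    """Return (value, ok) for three channel outputs.

    `value` is the output agreed by at least two channels. `ok` is False
    when no two channels agree, in which case `value` is None.
    """
    if a == b or a == c:
        return a, True
    if b == c:
        return b, True
    return None, False


if __name__ == "__main__":
    print(vote_2oo3(10, 10, 10))  # all channels healthy
    print(vote_2oo3(10, 99, 10))  # channel B faulty, outvoted
    print(vote_2oo3(1, 2, 3))     # no majority
```

A single faulty channel is outvoted without the voter needing to know which channel failed, which is what lets the system carry on operating.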
Dual-Dual Architecture (2oo2D)
Dual-dual, or two-out-of-two-with-diagnostics, employs two independent, fail-silent processing channels. Each channel is typically implemented using a 1oo1D architecture. This architecture is used extensively in aircraft FADEC (Full Authority Digital Engine Control) systems. It is also becoming popular in the automotive industry. The key property of this architecture is that each of the two channels fails silent in the event of a fault condition. When the ‘compare error’ signal is issued by the failed channel, control is switched to the healthy channel by the combiner. In normal operation, both channels are active but only one is ever in control of the output. It is important to maintain independence between the two channels. This type of architecture is also well suited to distributed processing systems, where processing nodes are connected by a communication network or bus. Each of the two channels could exist as physically distributed 1oo1D nodes on the same network. Collaboratively, they behave as a 2oo2D system. Such an arrangement would also satisfy the requirement for SIL 4 systems to employ more than one processing IC for the core logic solving function.
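The channel switch-over performed by the combiner can be sketched as below. The function name, the "A"/"B" channel labels and the `(None, None)` result for a double failure are assumptions made for illustration; the sketch also assumes each channel reports its own ‘compare error’ flag, as a 1oo1D channel would.

```python
def combiner_2oo2d(in_control, out_a, err_a, out_b, err_b):
    """Select which fail-silent channel drives the output (illustrative).

    `in_control` is "A" or "B", the channel currently in control.
    `err_a` / `err_b` are the channels' compare-error flags.
    Returns (channel_now_in_control, output). If both channels have
    failed silent, returns (None, None) and the system has lost function.
    """
    outputs = {"A": out_a, "B": out_b}
    healthy = {"A": not err_a, "B": not err_b}
    standby = "B" if in_control == "A" else "A"

    if healthy[in_control]:
        return in_control, outputs[in_control]
    if healthy[standby]:
        # Controlling channel has gone silent: hand over to the standby.
        return standby, outputs[standby]
    return None, None


if __name__ == "__main__":
    print(combiner_2oo2d("A", 10, False, 11, False))  # A stays in control
    print(combiner_2oo2d("A", 10, True, 11, False))   # switch to B
```

Unlike the 1oo2 case, the combiner here can trust the error flags to identify the failed channel, so a single fault leads to a hand-over rather than a shutdown.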
Common causes of faults in microprocessor systems
The causes of faults in microprocessor systems fall into one of two categories: systematic (or deterministic) faults, and random faults. The most common failure modes are tabulated below:
The need for diversity
Redundancy addresses random faults, but if all redundant channels employ an identical implementation, redundancy alone does not address systematic faults very well. Duplication can leave residual common-mode faults which can affect all redundant channels, possibly simultaneously, hence negating the benefits of a redundant architecture, and ultimately compromising the integrity of the safety system.
The fix for common-mode faults is physical, temporal, functional or technological diversity between redundant channels. By building redundant channels in different ways, common-mode faults can be reduced. For example, two redundant channels could employ different types of microprocessor produced by different manufacturers, with code written by different engineers (although this approach does start to become expensive). Better yet, technological diversity could be employed by using a classical microprocessor for ‘Channel A’ and an FPGA solution for ‘Channel B’. If identical hardware is unavoidable, physical diversity can be employed by locating the channels in different regions of the assembly, preferably in different orientations. This means that each channel experiences a different physical and electromagnetic environment, either of which might be a cause of faults. Temporal diversity can also be employed by ensuring that identical code is not executed at precisely the same time across channels and, for dual-channel communication buses, that identical messages are not sent on each redundant channel at the same time. Again, this mitigates issues arising from environmental events which could corrupt data. Functional diversity means reaching a common goal by means of different processes or information sources. For example, if a dual-channel system requires a periodic altitude fix, ‘Channel A’ could determine altitude from a barometric sensor, whilst ‘Channel B’ could extract the required data from a diverse source, such as a GPS signal or a radar altimeter.
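The altitude example lends itself to a short sketch of functional diversity. Here two channels obtain altitude from different sources and a cross-check flags any disagreement. The function name and the 30 m default tolerance are hypothetical values chosen purely for illustration, not figures from any standard.

```python
def altitudes_agree(baro_alt_m, gps_alt_m, tolerance_m=30.0):
    """Cross-check two functionally diverse altitude estimates.

    `baro_alt_m` comes from Channel A (barometric sensor) and `gps_alt_m`
    from Channel B (GPS). Returns True when they agree within
    `tolerance_m` metres. The tolerance is an illustrative assumption.
    """
    return abs(baro_alt_m - gps_alt_m) <= tolerance_m


if __name__ == "__main__":
    print(altitudes_agree(120.0, 125.0))  # small difference
    print(altitudes_agree(120.0, 200.0))  # large difference: investigate
```

Because the two estimates come from unrelated physical principles, a fault that corrupts one is unlikely to corrupt the other in the same way, so a disagreement is a useful indicator that one source has failed.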
When designing multi-channel safety systems, it is always important to consider how you can employ different forms of diversification between redundant channels, whilst still achieving the desired outcome.
Microprocessor architecture selection
The following table can be used as a guideline for microprocessor architecture selection for a given SIL requirement. This should be considered guidance for a starting position only, and selection may need to be reconsidered as the design assurance activities progress. It should also be remembered that the functional safety requirements exist at a system level, and that any safety chain is only as good as its weakest link. Even if a very high integrity processing system is implemented, it is only effective if all other parts of the functional chain, such as sensors, actuators and interface circuitry are of equally high safety integrity. This usually means redundancy and diversity in the sensor and effector components too.
There are several interesting microcontroller families on the market today which respond to the need for high SIL and fail-operational requirements. Two notable families of devices are the ‘Hercules’ safety microcontrollers from Texas Instruments, and the ‘AURIX’ safety microcontroller family produced by Infineon.
The Hercules implements a 1oo1D lock-step architecture and is certifiable to SIL 3. Fail-operational systems could theoretically be built which are certifiable to SIL 4 using a pair of Hercules chips, or fail-safe systems to SIL 3 using a single chip solution.
The AURIX family is more sophisticated and incorporates up to three 1oo1D cores in a single chip, in what Infineon calls their Tri-Core architecture. Theoretically it is possible to build 2oo3 or 2oo2D solutions with a single AURIX chip, although these could only reach SIL 3, due to the need for multiple ICs in a SIL 4 compliant application per IEC 61508. However, unlike the Hercules, SIL 3 fail-safe and fail-operational systems could be built using a single AURIX integrated circuit. As with the Hercules, SIL 4 applications would require a multi-chip solution.