Design for Safety
Software design must enforce safety constraints. Reviewers should be able to trace from requirements to code and vice versa. In addition to the specific safety constraints developed for the system being designed, the design should incorporate basic safety design principles.
Hazard elimination is the least expensive, and most effective, method of handling system hazards. If addressed early in the system design process, hazards can often be eliminated at almost no cost whatsoever.
Substitution may be applied to eliminate hazards in several ways. Safe or safer materials can be used in place of hazardous ones. Using a simple hardware device is often safer than using a computer to enforce safety constraints. There is no technological imperative that says we must use computers to control dangerous devices. Introducing new technology introduces unknowns and even unknown-unknowns.
Simplification may also eliminate hazards. Simple software designs should be testable. The number of states of the software should be limited. Determinism should be preferred over nondeterminism. Multitasking designs are much more complicated, so single tasking should be used instead. And polling should be used instead of interrupts, wherever possible. Software designs should also be easily understood and readable.
The interactions between software components should be limited and straightforward. Reducing and simplifying interfaces will eliminate errors and make designs more testable. Individual components should include only the minimum feature set and capability required by the system. It is easy to add functions to software, but hard to practice restraint. Unnecessary or undocumented features add complexity. Constructing a simple design requires discipline, creativity, restraint, and time. The structural decomposition should match the functional decomposition so that it is easy to map chunks of the program to their intended purpose.
A tightly coupled system is one that is highly interdependent. Each part is linked to many other parts. Failure or unplanned behavior in one can rapidly affect the status of others. If processes are time-dependent and cannot wait, there is little slack in the system for unusual circumstances. Likewise, a tightly coupled system often has invariant sequences and only one way to reach the program's goal. System accidents are often caused by unplanned interactions. Coupling creates an increased number of interfaces and potential interactions. Unless carefully controlled, computers tend to increase the coupling in a system. There are several principles of decoupling that can be applied to software designs. Software can be modularized, so that functionality is divided into discrete units. Firewalls (not in the security sense) can be used to prevent communication between parts of the system that should not interact. Read-only or restricted write memories can prevent coupling by controlling who can affect certain data values. Lastly, decoupling software can eliminate the hazardous effects of common hardware failures.
Elimination of human errors requires reducing the opportunity for human error by design. In general, humans are good at reacting to their own mistakes. If a system makes the results of an error clear, the operator may be able to correct it. There are many ways to increase the safety of human-computer interaction. Make sure that the status of components is clear to the operator at all times. Design software to be error tolerant for the inevitable mistakes in operator entry of commands.
It is also desirable to use a programming language that is not only simple itself, but encourages the production of simple and understandable programs. Some programming languages have been found to be particularly error prone.
Lastly, to reduce hazardous conditions, software should only contain code that is absolutely necessary to achieve the required functionality. This has significant implications for COTS (Commercial Off The Shelf software), which must be designed with a more general marketplace environment in mind. The extra code in COTS may lead to hazards and make software analysis more difficult. Another way to reduce hazardous conditions in software is to initialize the hardware memory to a bit pattern that will revert to a safe state if, for any reason, the instructions start being read from random memory.
The design of a turbine generator makes a good example.
The safety requirements are:
The functioning of the system is divided (decoupled) between separate processors. The first processor controls all non-critical functions; loss of this processor cannot endanger the turbine or cause it to shut down. This processor handles the less important governing functions as well as supervisory, coordination, and management functions. The second processor has only a small number of critical functions. These functions can be examined with much greater scrutiny.
Hazards may be reduced by passive safeguards, which maintain safety merely by their presence, or by active safeguards, which require the hazard or condition to be detected and corrected. Passive safeguards cause the system to fail into a safe state, whereas active safeguards must become active and direct the system to safety. Passive systems rely only on physical principles, while active mechanisms depend on less reliable detection and recovery means. However, passive safeguards tend to be more restrictive in terms of design freedom and are not always feasible to implement.
Hazards can be reduced by designing the system for controllability. The system can be made easier to control, both for humans and computers. Try to use incremental control. Perform steps incrementally rather than in one step, and provide feedback to test the validity of assumptions and models upon which decisions are made. Providing feedback also allows taking corrective action before significant damage is done. Feedback may also be provided in terms of intermediate states and partial results. Controllability is also enhanced by lowering time pressures, perhaps by slowing the process rate.
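Incremental control with feedback can be sketched in a few lines. The following is a minimal illustration, not a definitive implementation: the function name `ramp_to`, the `actuate` and `read_back` callbacks, and the step size and tolerance are all hypothetical choices made for the example.

```python
def ramp_to(target, current, actuate, read_back, step=1.0, tolerance=0.5):
    """Move toward target in small steps, checking feedback after each one.

    Returns (position, ok). Stops early with ok=False if the measured
    position disagrees with the commanded one by more than tolerance.
    All names and default values here are illustrative assumptions.
    """
    position = current
    while abs(target - position) > 1e-9:
        # Never command more than one step at a time.
        delta = max(-step, min(step, target - position))
        commanded = position + delta
        actuate(commanded)
        measured = read_back()
        if abs(measured - commanded) > tolerance:
            # The model of the process no longer matches reality:
            # stop before further steps can do damage.
            return measured, False
        position = commanded
    return position, True
```

Because each step is small and checked, a stuck or misbehaving actuator is noticed after one increment rather than after the full commanded change.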
Decision aids can also help to control a system, as can the use of monitoring. It is difficult to make monitors independent, however. Checks require access to information being monitored, but these checks may corrupt that information. Monitoring also depends on assumptions about the structure of the system and about errors that may or may not occur. These assumptions may be incorrect under certain conditions. Common incorrect assumptions may be reflected in both the design of the monitor and the devices being monitored.
In general, the farther down in the hierarchy a check can be made, the better. It means detecting the error closer to the time it occurred and before erroneous data can propagate to other components. It is easier to isolate and diagnose the problem at a lower level. And the lower the level at which the failure is detected, the more likely the system is to be able to fix the erroneous state rather than recover to a safe state.
Writing effective self-checks is very hard, and the number that can be included is usually limited by time and memory. It is best to limit checks to safety-critical states. Use hazard analyses to determine optimal check contents and locations. And be wary: added monitoring and checks can cause failures themselves.
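As a small illustration of a self-check placed at the lowest level and limited to a safety-critical value, consider the sketch below. The sensor scaling, the limits, and the names `read_temperature` and `SafetyCheckError` are assumptions invented for the example; in practice the limits would come from the hazard analysis.

```python
class SafetyCheckError(Exception):
    """Raised when a safety-critical value fails its self-check."""

# Hypothetical limits; real values would come from a hazard analysis.
MIN_PLAUSIBLE_TEMP_C = -50.0
MAX_SAFE_TEMP_C = 150.0

def read_temperature(raw_counts, counts_per_degree=10.0):
    """Convert a raw sensor reading and check it where it enters the system."""
    temp_c = raw_counts / counts_per_degree
    # Check at the point of entry, before the value can propagate
    # to other components.
    if not (MIN_PLAUSIBLE_TEMP_C <= temp_c <= MAX_SAFE_TEMP_C):
        raise SafetyCheckError(f"temperature out of range: {temp_c}")
    return temp_c
```

The check is deliberately simple. A more elaborate check would itself be a source of potential errors.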
In addition to designing for controllability, several types of barriers can help in hazard reduction. Lockouts make access to dangerous states difficult or impossible. For software, that means avoiding EMI, limiting authority, and controlling access to and modification of critical variables. Some techniques can be adapted from security for this.
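A software lockout on a critical variable might look like the following sketch. The class name `CriticalSetpoint` and the key-based scheme are hypothetical. Python cannot truly prevent a determined caller from reaching private attributes, so this only illustrates the idea of limiting authority, not a hardened mechanism.

```python
class CriticalSetpoint:
    """A critical value that any component may read but only one may write.

    Illustrative sketch: the single-key scheme is an assumption made
    for this example.
    """

    def __init__(self, value, low, high):
        self._low, self._high = low, high
        self._value = value
        self._key = object()      # unforgeable token for the one writer
        self._key_issued = False

    def issue_key(self):
        """Hand the write key to exactly one component."""
        if self._key_issued:
            raise PermissionError("write key already issued")
        self._key_issued = True
        return self._key

    @property
    def value(self):
        return self._value

    def write(self, key, new_value):
        if key is not self._key:
            raise PermissionError("caller has no authority to write")
        if not (self._low <= new_value <= self._high):
            raise ValueError("value outside permitted range")
        self._value = new_value
```

Readers use `setpoint.value`; only the component that received the key from `issue_key()` can call `write`, and even it cannot set a value outside the permitted range.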
Inversely, lockins make it difficult or impossible to leave a safe state. This addresses the need to protect the software against environmental conditions, such as operator errors or data arriving in the wrong order or at an unexpected speed. Completeness criteria can help ensure that specified behavior is robust against mistaken environmental conditions.
Interlocks can be used to enforce a sequence of actions or events. For example:
Examples of interlocks include batons, critical sections, and synchronization mechanisms.
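An interlock that enforces a sequence of actions can be sketched as a small state machine. The class `SequenceInterlock` and the step names used in the usage note are hypothetical, chosen only to illustrate the idea.

```python
class SequenceInterlock:
    """Permit a set of actions only in one fixed order (illustrative sketch)."""

    def __init__(self, steps):
        self._steps = list(steps)
        self._next = 0

    def perform(self, step):
        """Record a step. Returns True once the whole sequence is complete.

        Any out-of-order request resets the sequence and is refused.
        """
        if self._next >= len(self._steps) or step != self._steps[self._next]:
            self._next = 0
            raise RuntimeError(f"interlock violated by step {step!r}")
        self._next += 1
        return self._next == len(self._steps)
```

For example, with `SequenceInterlock(["close_shield", "verify_shield", "enable_beam"])`, a request for `"enable_beam"` before the shield steps raises an error and forces the sequence to start over.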
Remember, the more complex the design, the more likely errors will be introduced by the protection facilities themselves.
Detonation of nuclear weaponry makes a good example for hazard reduction. The safety of a nuclear device depends on that device NOT working. Three basic techniques (called "positive measures") are used to prevent unintended detonation:
Nuclear systems feature:
A diagram of the safeguards against accidental nuclear detonation is shown in the figure below.
The device may require unique signals from several different individuals along various communication channels, using different types of signals (energy and information) to ensure a proper intent.
Another means of reducing hazards is failure minimization. Safety factors and safety margins are used to cope with uncertainties in engineering. These uncertainties arise from inaccurate calculations or models, limitations in knowledge, and variation in the strength of a specific material due to differences in composition, manufacturing, assembly, handling, environment, or usage.
There are some ways to minimize problems when they cannot be eliminated. Safety factors and margins are appropriate for continuous and non-action systems. See the figure below.
Redundancy can increase reliability and reduce failures. However, it assumes a model of random wearout. It is not so effective against common-cause or common-mode failures, which may affect all redundant parts equally. Redundancy can also add so much complexity to the system (to coordinate the redundant components) that the complexity itself causes failures. Redundant components are also more likely to operate spuriously, and they may create a false sense of security. This was one of the contributing causes of the Challenger accident. Redundancy has its place, and it can be useful in reducing hardware failures, but what about software?
Claims are made that design redundancy and design diversity can provide the benefits of redundancy to software. The bottom line is that claims that multiple version software will achieve ultra-high reliability levels are not supported by empirical data or theoretical models.
Schemes have been proposed for standby spares and for concurrent use of multiple devices with a voting scheme to resolve differences. Identical designs may be used or intentionally diverse ones. But diversity must be carefully planned to reduce dependencies. These dependencies may be reintroduced in maintenance, testing, and repair. In the end, redundancy is most effective against random failures, not design errors.
Software suffers from design errors, not random failures. Data redundancy allows for detecting errors in data using schemes such as parity bits, checksums, message sequence numbers, and duplicate pointers or other structural information. Algorithmic redundancy involves multiple versions voting on results. Of course, these versions must be guaranteed to meet the same requirements, which calls for acceptance tests that are difficult to write.
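Data redundancy of the kind listed above can be sketched with a checksum and a message sequence number. The frame layout below (4-byte sequence number, payload, 4-byte CRC-32) and the function names are assumptions made for illustration, not a format taken from the source.

```python
import struct
import zlib

def encode_frame(seq, payload):
    """Build a frame: 4-byte sequence number, payload, 4-byte CRC-32."""
    body = struct.pack(">I", seq) + payload
    return body + struct.pack(">I", zlib.crc32(body))

def decode_frame(frame, expected_seq):
    """Return the payload, or raise ValueError if the frame is damaged
    or arrives out of order."""
    if len(frame) < 8:
        raise ValueError("frame too short")
    body = frame[:-4]
    (crc,) = struct.unpack(">I", frame[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("checksum mismatch")
    (seq,) = struct.unpack(">I", body[:4])
    if seq != expected_seq:
        raise ValueError("unexpected sequence number")
    return body[4:]
```

The checksum detects corrupted data, while the sequence number detects lost, repeated, or reordered messages. Neither detects a message that is well formed but wrong, which is the design-error case.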
Multi (or N) version programming assumes that the probability of correlated failures is very low for independently developed software. It assumes that software errors occur at random and are unrelated. Even small probabilities of correlated failures cause a substantial reduction in expected reliability gains. Professor Nancy Leveson and John Knight conducted a series of experiments to examine failure independence in N-version programming, embedded assertions versus N-version programming, and fault tolerance versus fault elimination.
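The sensitivity to correlated failures can be illustrated with a little arithmetic for a 2-out-of-3 voter. The common-cause model below is a deliberately crude assumption made for this sketch (with probability `c` an input is "hard" and defeats all three versions at once); it is not the statistical model used in the experiments.

```python
def majority_failure(p):
    """P(at least 2 of 3 versions fail) if each fails independently with
    probability p."""
    return 3 * p**2 * (1 - p) + p**3

def majority_failure_with_common_cause(p, c):
    """Same voter, but with probability c an input defeats all three
    versions together. Each version still fails with overall probability p.

    Illustrative model only; c is a hypothetical parameter.
    """
    if not (0 <= c <= p <= 1):
        raise ValueError("require 0 <= c <= p <= 1")
    if c == 1:
        return 1.0
    # Failure probability on the remaining inputs, chosen so that the
    # per-version total stays at p.
    p_rest = (p - c) / (1 - c)
    return c + (1 - c) * majority_failure(p_rest)
```

With `p = 1e-3`, the independent model predicts a voter failure probability of roughly three in a million. Letting just one tenth of each version's failures be shared (`c = 1e-4`) raises the voter failure probability to roughly one in ten thousand, so the common-cause term dominates the result.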
The failure independence experiment collected 27 programs written from one requirements specification. Graduate students and seniors from two universities wrote the programs. The evaluation of these programs simulated a production environment, using 1,000,000 input cases. Each of the programs, taken individually, was of high quality. The results of the experiment rejected the independence hypothesis. Analysis of reliability gains must include the effect of dependent errors. Statistically correlated failures result from the nature of the application and the "hard" cases in the input space.
This should make intuitive sense. The unusual corner cases in input that are hard for one designer are likely to be hard for another. For example, imagine a program that takes the coordinates of three points and finds the three angles of the triangle formed by those points. It should seem likely that more designers will have errors in handling the case where all three points lie on one line or are even at the same coordinates. Harder input cases are harder for all designers, so errors are not likely to be randomly distributed around the program.
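The triangle example can be made concrete. The sketch below is one possible implementation, written for illustration only; the function name and the decision to reject degenerate input with an exception are assumptions. Note how much of the code deals with the degenerate cases rather than the ordinary one.

```python
import math

def triangle_angles(a, b, c):
    """Return the interior angles (in radians) at points a, b, and c.

    Raises ValueError if the points coincide or lie on one line.
    """
    if a == b or b == c or a == c:
        raise ValueError("two or more points coincide")
    # Twice the signed area; zero means the points are collinear.
    # An exact comparison is used here for simplicity. With floating-point
    # input, deciding what counts as "nearly collinear" is a further
    # judgment call that different designers may make differently.
    area2 = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    if area2 == 0:
        raise ValueError("points are collinear")

    def angle_at(p, q, r):
        ux, uy = q[0] - p[0], q[1] - p[1]
        vx, vy = r[0] - p[0], r[1] - p[1]
        cross = ux * vy - uy * vx
        dot = ux * vx + uy * vy
        return math.atan2(abs(cross), dot)

    return angle_at(a, b, c), angle_at(b, c, a), angle_at(c, a, b)
```

A designer who forgets either guard gets a program that works on nearly every test input and fails on exactly the inputs that other designers are also likely to mishandle.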
Furthermore, the programs with correlated failures were structurally and algorithmically very different. The conclusion is that correlations are due to the fact that the problem was the same, not due to the tools used or languages used or even algorithms used.
Multi-version programming also suffers from the consistent comparison problem. The consistent comparison problem arises from the use of finite-precision real numbers (rounding errors). Correct versions may arrive at completely different correct outputs and thus be unable to reach a consensus even when none of the components "fail". This may cause failures that would not have occurred with single versions. In general, there is no practical solution to the problem.
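A tiny example shows how the consistent comparison problem arises. The two functions below compute the same mathematical quantity in different but equally reasonable ways; the function names and the threshold are hypothetical.

```python
THRESHOLD = 1.0  # hypothetical decision limit

def version_a(increment, count):
    """Accumulate the total by repeated addition."""
    total = 0.0
    for _ in range(count):
        total += increment
    return total

def version_b(increment, count):
    """Compute the same total with a single multiplication."""
    return increment * count

def over_limit(x):
    return x >= THRESHOLD

# Both versions are "correct", yet for increment=0.1 and count=10 the
# results differ in the last bit and land on opposite sides of the limit.
a = version_a(0.1, 10)
b = version_b(0.1, 10)
disagree = over_limit(a) != over_limit(b)
```

Here `disagree` is true even though neither version contains a fault. Comparing with a tolerance only moves the problem: two versions can still fall on opposite sides of the tolerance boundary.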
Another experiment was performed regarding self-checking software. This experiment used the launch interceptor programs (LIP) from the N-version programming study. Twenty-four graduate students from UCI and UVA were employed to instrument 8 programs (chosen randomly from the subset of the 27 in which errors were found). The students were provided identical training materials. In a first round, students wrote checks based solely on the specification for the software; then the participants were given a program to instrument. The students were allowed to make any number or type of check. The students treated this as a competition among themselves to see who could detect the most errors. The data collected is shown below; more errors were added by the self-checking code than were found by it.
Another hope for multi-version programming was that fault tolerance could replace fault elimination. The hope was that if several versions of a program are running and voting on results, one need not eliminate defects from the software. For any given input sequence, the majority of the versions of the software should still vote for the correct answer. Thus, expensive testing and fault elimination processes can be removed from the organization. Experimentation does not support this hypothesis.
Fault tolerance has been compared to fault elimination, including techniques such as run-time assertions (self-checks), multi-version voting, functional testing augmented with structural testing, code reading by stepwise abstraction, and static data-flow analysis. The problem used in the experiment was a combat simulation problem (from TRW). The programmers employed in the experiment were separate from the teams that detected faults in the software. Eight versions were produced by two-person teams. The number of modules varied from 28 to 75, and the number of lines of code from 1200 to 2400. The experimenters tried to hold the resources constant for each technique.
The results showed that multi-version programming is not a substitute for testing. The resultant system did not tolerate most of the faults detected by fault-elimination techniques. The system was also fairly unreliable in tolerating the faults that it was capable of tolerating. The scaled-back testing done in conjunction with the multi-version project was not able to detect errors that cause coincident failures across multiple versions of the software. The results cast doubt on the effectiveness of multi-version voting as a test oracle. Instrumenting the code to examine internal states was much more effective. Lastly, the intersection of the sets of faults found by each method was relatively small.
In summary, these results don't necessarily mean that N-version programming shouldn't be used, but it is important to have realistic expectations of the benefits to be gained and the costs involved. The costs are very high: more than N times the cost for one version of the software. In practice, there will be a great deal of similarity in the designs produced. If mid-algorithm cross checks are used between versions, then even more similarity of designs will result as each version must produce the same interim values. Because N-version programming depends on design diversity, this means that the safety of the system is dependent on a quality that has been systematically eliminated. There is no way to tell how different two software designs are in their failure behavior. Lastly, requirements flaws are not handled by multiple implementation versions, and requirements specifications are where most safety problems arise.
Recovery techniques are also sometimes applied to software. Recovery comes in two forms. Backward recovery is a process of detecting an error and returning to a known good state. This assumes that the error can be detected before it does any damage, and it assumes that the alternative (the state that the system recovers to) is preferable to the failure state. Forward recovery relies on robust data structures, dynamically altered flow control, and ignoring single-cycle errors. The real problem is detecting erroneous states.
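Backward recovery can be sketched as a checkpoint followed by a conditional rollback. The function `run_with_rollback` and its `is_valid` acceptance check are hypothetical names used for illustration. As the text notes, the hard part in practice is writing a check that actually detects the erroneous state.

```python
import copy

def run_with_rollback(state, operation, is_valid):
    """Apply operation to state; return (state, True) if the result passes
    is_valid, otherwise return (checkpoint, False).

    Illustrative sketch: assumes the state can be deep-copied and that
    the operation has no effects outside of state.
    """
    checkpoint = copy.deepcopy(state)   # known good state
    try:
        operation(state)
        if is_valid(state):
            return state, True
    except Exception:
        # Treat any exception as a detected error and fall through
        # to the rollback below.
        pass
    return checkpoint, False
```

The sketch only works if nothing irreversible happens before the check. An operation that has already moved an actuator cannot be undone by restoring a copy of a data structure.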
The first technique of hazard control is limiting exposure. The system should start out in a safe state and require deliberate change to move to an unsafe state. Critical flags and conditions should be set and checked as close to the code they protect as possible. And critical conditions should never be complementary; for example, the absence of an armed condition should not be used to indicate that the system is unarmed.
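One way to keep critical conditions from being complementary is to give each state its own explicit pattern and to treat everything else as a fault. The bit patterns and function names below are hypothetical, chosen only to illustrate the idea.

```python
# Distinct, explicit patterns for each state (hypothetical values).
# Neither is all zeros or all ones, so a cleared or stuck memory cell
# matches neither state.
ARMED = 0xA5
SAFE = 0x5A

def classify(flag):
    """Map a stored flag to 'armed', 'safe', or 'fault'."""
    if flag == ARMED:
        return "armed"
    if flag == SAFE:
        return "safe"
    return "fault"   # anything unrecognized is an error, not a state

def may_fire(flag):
    """Firing requires positive confirmation of the armed pattern."""
    return classify(flag) == "armed"

def confirmed_safe(flag):
    """Safety also requires positive confirmation, not mere lack of ARMED."""
    return classify(flag) == "safe"
```

With this scheme a corrupted flag such as `0x00` is neither armed nor confirmed safe, so the fault is visible instead of being silently read as one of the two states.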
Isolation and containment are also used to control hazards. Examples of these kinds of controls are physical barriers such as concrete walls. Protection systems and fail-safe design are also ways to control hazards. These depend on the existence of a safe state and the availability of adequate warning time. There may be multiple safe states, depending upon process conditions, so a way to choose between them is necessary. The general rule of thumb is that safe states should be easy to get into and hazardous states should be hard to get into. A good example is a chemical process that may take hours to start but can be stopped nearly instantly if the operator presses a panic button. Watchdog timers are similar; they time software to see if it appears to have gone dead. If so, the watchdog timer signals a problem. The software the watchdog timer observes should not be responsible for setting the timer, however. Sanity checks are also a good form of fail-safe design, as are "I'm alive" signals. Protection systems should provide information about their control actions and status to operators.
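The watchdog idea can be sketched in software, although a real watchdog is normally a separate hardware timer. The class below and its injected clock are assumptions made for illustration. The timeout is fixed by whoever constructs the watchdog, and the monitored software is given only the `kick` method.

```python
class Watchdog:
    """Software sketch of a watchdog timer (illustrative only)."""

    def __init__(self, timeout, now):
        # timeout and clock are set by the supervising component,
        # not by the software being monitored.
        self._timeout = timeout
        self._now = now
        self._last_kick = now()

    def kick(self):
        """'I'm alive' signal, called by the monitored software."""
        self._last_kick = self._now()

    def expired(self):
        """Polled by the supervisor; True if the software has gone quiet."""
        return self._now() - self._last_kick > self._timeout
```

In use, the supervisor would construct the watchdog with a real clock such as `time.monotonic`, pass only `watchdog.kick` to the monitored task, and move the system to a safe state when `expired()` becomes true.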
It is important to consider the social engineering of protection systems as well. Management and operators make changes to procedure and devices once a system is in use. The easier and faster it is to return a system to operational state, the less likely it is that the protection systems will be purposely bypassed or turned off.
It may be necessary to determine a "point of no return" beyond which recovery is not likely or even possible. Beyond that point, the goal is simply to minimize the damage done.
Modification and Maintenance
Many accidents happen when systems are modified and maintained. Systems evolve. Operators and management change procedures. New equipment may be added to existing systems. Repairs and replacements are carried out. Changes that affect the design of the system must be reanalyzed for their impact on the safety of the system. When that reanalysis is carried out, it is essential for the system documentation to be updated with the design rationale that supports the changes. This will help to preserve the system safety sought in the initial system design.
Copyright © 2003 Safeware Engineering Corporation. All rights reserved