Overview of a Software Safety Approach
Engineers should recognize that reducing risk is not an impossible task, even under financial and time constraints. All it takes in many cases is a different perspective on the design problem.
Mike Martin and Roland Schinzinger
System safety is a planned, disciplined, and systematic approach to preventing or reducing accidents throughout the life cycle of a system. The primary concern of system safety is the management of hazards. A hazard is a system state that, together with certain conditions in the environment of the system, will lead to an accident. Hazards are managed through identification, evaluation, elimination, and control. System safety brings a number of tools to this task, including techniques in analysis, design, and management. There are standards devoted to system safety practices, such as MIL-STD-882.
Building in Safety
System safety emphasizes building in safety rather than adding it on to a completed design. Many safety efforts fail because safety is evaluated at the end of the product life cycle. As much as 90% of the decision-making related to safety happens very early in a system's development cycle. Changes are also many times cheaper to make early in development. Assessing safety only after design, and perhaps even prototyping, is complete leads to safety devices being grafted on to finished products at great expense.
An example of this phenomenon is the change in door mechanisms for refrigerators. Refrigerator doors used to have a handle latching mechanism to keep the door closed. Children could become trapped in a refrigerator if the door closed and latched; this was particularly a problem when old refrigerators were disposed of. For many years, refrigerator manufacturers insisted that the cost of increasing the safety of refrigerators would be prohibitive, devastating their profits. In the end, legislation was passed requiring them to solve the problem; the solution is the present-day magnetic strip around the door. This is a much safer design; the door can be opened easily by someone trapped inside. The new design also turned out to be cheaper to produce. The manufacturers did the right thing in changing the design of the product.
Imagine instead any of the industries well known for tacking on safety devices to control the hazards in their systems. Instead of the elegant solution above, a sensor would be used to detect someone trapped inside. If the sensor tripped, a flow of oxygen would be started from a pressurized tank to prevent suffocation of the occupant until someone discovered them. Removing hazards from a design earlier in the process is cheaper and more effective.
The system is considered as a whole, not just as a collection of components. System safety takes a larger view of hazards than just component failures. Most accidents come from the complexity of interactions among system components, not from component failures. Because reliability improvements address component failures, safety must go beyond reliability improvement. System safety emphasizes hazard analysis and designing to eliminate or control hazards. The analyses of system safety tend to be qualitative rather than quantitative.
Hazard analysis and control is a continuous, iterative process throughout system development and use. Hazard identification should begin as early as the conceptual development of the system and continue throughout its life. As soon as design begins, hazard control can begin. During development, it is possible to verify that hazards have, in fact, been controlled by the design measures already imposed. System safety does not end with the delivery of the product; any changes to a system in use must be analyzed for their potential impact on safety. Feedback must be collected from operational use to identify any hazards not detected and controlled in earlier steps. Operational feedback also provides lessons learned for future projects.
There is a clear precedence to resolving hazards:
1. The best alternative is to eliminate the hazard. Often, if done during the requirements specification or design phase, hazard elimination adds no cost. For example, choosing to design a refrigerator with a magnetically held door rather than a latched door removes the hazard that someone will become trapped inside and lowers the manufacturing cost at the same time.
2. There may be some hazards that cannot be eliminated from the system. The goal in this case is to minimize the occurrence of the hazard. One way to do this is to carefully control the conditions under which the system can move from a safe state to a hazardous state. Consider the example of a weapon system. The system is comparatively safe if it is not armed. Arming the weapon puts it into a hazardous state; it may be lethal or severely damaging if triggered. For proper functioning of a weapon, one cannot eliminate the armed state. The second option is therefore to minimize the chances that a weapon could be armed under hazardous circumstances. One way to do this is to send the arming signal as a five-digit code rather than as a high or low voltage on a single wire. That way, a transient spike on one wire is not capable of arming the weapon. Even non-computer-controlled weapons, such as firearms, are safely stored unloaded to minimize the hazard that the armed state presents. While weapons make a convenient example of this technique, they are by no means the only systems in which one can minimize the occurrence of hazards that cannot be eliminated.
3. If steps 1 and 2 cannot be applied, the third option is to try to control the hazard if it occurs. One way to do this is through operator training. If an automated factory floor cannot be designed to eliminate the risk of injury to workers, then the training manual should emphasize that workers must shut down the automation before entering the factory floor. It should be noted that controlling hazards often involves social factors, such as employee training and company practice. Management support is imperative when using this technique. In one case, machinery on an automated factory floor broke down often enough that, to maintain productivity, the workers went in while the system was active. A robot crushed one of these workers, killing him.
4. Lastly, some benefit can come from minimizing the damage if a hazard does lead to an accident. Passive safety measures such as safety barriers and spillover containment vessels are examples of minimizing the damage after a hazard leads to an accident.
It should be noted that these four steps do not have to be taken one at a time. If a hazard cannot be eliminated, but the occurrence can be minimized, it is still a very good idea to try to control the hazard and minimize the damage it may cause if it leads to an accident. Safety is served well by the doctrine of defense-in-depth. The more means that can be used to increase the safety of the system, the better.
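The arming example from the second step can be sketched in code. This is a minimal illustration under stated assumptions, not a real weapon interface: the code value ARM_CODE and the function name is_arm_command are hypothetical and were chosen only for this example.

```python
# Hypothetical arming check: the names and the code value are illustrative only.
ARM_CODE = (4, 7, 1, 9, 2)  # assumed five-digit arming code


def is_arm_command(received):
    """Return True only if the received digits exactly match the arming code."""
    digits = tuple(received)
    # Reject anything that is not exactly five digits long.
    if len(digits) != len(ARM_CODE):
        return False
    # Reject anything that is not a decimal digit.
    if any(not isinstance(d, int) or not 0 <= d <= 9 for d in digits):
        return False
    return digits == ARM_CODE


# A single-wire design would arm on one high reading. Here, one corrupted
# digit (for example, from a transient spike) is not enough to arm.
print(is_arm_command([4, 7, 1, 9, 2]))  # full, correct code
print(is_arm_command([4, 7, 1, 9, 3]))  # one digit corrupted
print(is_arm_command([1]))              # a lone high signal
```

The design choice is that the hazardous transition requires several independent conditions to hold at once, so a single fault cannot cause it.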
The diagram below shows the steps of a safety process.
The first step in the safety process is to perform a preliminary hazard analysis. The output of this analysis is a hazard list. Next, a system hazard analysis is performed using the hazard list. This analysis is used to identify potential causes of the hazards in the hazard list. Note that the system-level hazard analysis is not just a failure analysis. Failure analysis only deals with component failures; system safety must include interactions between correctly functioning components that may lead to hazardous states.
Once the potential causes of hazards have been uncovered, design constraints can be placed on the system, software, and human operators. With these constraints in mind, the system can be designed to eliminate or control hazards. Any hazards that cannot be fully resolved within the system-level design must be traced down to component requirements, such as software requirements. This traceability is very important, as it is the only way to ensure that remaining hazards are eliminated or controlled within the context of individual components. Safety is an emergent property of the system, but it will frequently impose constraints on component design and implementation.
The second reason unresolved hazards must be traced to software components in particular involves the way software requirements are often specified. Most software requirements only specify nominal behavior. In essence, requirements are written to explain what software must and should do. However, safety is a property that impacts software by specifying what software must not do. Many people already have some understanding of this quandary from the field of security. In security, the problem is similar in that software must make guarantees about what it will not do (such as allow an intruder access to a host computer). It is much easier to guarantee that software can do something than to guarantee that it can't. It is also important to understand that what the software must not do is not the inverse of what the software must do. It would be impossible to demand that implementors rule out any behavior that is not explicitly written into the specification. Software designers and implementors must constantly make trade-off decisions about how software will function. Disallowing anything outside of the specification is a demand that's impossible to meet. Therefore, it is important to use the system hazard analysis to derive constraints on the behavior of software. The software component designers and developers can then take these constraints (with traceability maintained from the system-level analyses) and write software that is safe within the context of the system.
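One way to picture this traceability is as a set of links from hazards to the software constraints derived from them. The sketch below is a hypothetical data layout, not a SpecTRM or intent specification format; the identifiers (H1, SC1), the hazard wording, and the function untraced_hazards are all invented for illustration. It reports any hazard that no software constraint traces back to.

```python
from dataclasses import dataclass, field


@dataclass
class Constraint:
    """A 'must not' statement derived from the system hazard analysis."""
    ident: str
    text: str
    traces_to: set = field(default_factory=set)  # hazard identifiers


def untraced_hazards(hazards, constraints):
    """Return the hazard ids that no constraint traces back to, sorted."""
    covered = set()
    for constraint in constraints:
        covered |= constraint.traces_to
    return sorted(set(hazards) - covered)


# Hypothetical hazards and constraints, for illustration only.
hazards = {
    "H1": "Weapon armed by a spurious signal",
    "H2": "Automation running while a worker is on the floor",
}
constraints = [
    Constraint("SC1", "Software must not arm on a partial code", {"H1"}),
]

# H2 has no constraint yet, so it is reported as untraced.
print(untraced_hazards(hazards, constraints))
```

A check like this only shows that a link exists; whether the constraint actually controls the hazard still has to be established by analysis and review.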
After the software requirements are written, the next step in the process is to review and analyze them. Several analyses can be employed:
Safeware Engineering Corporation has compiled a number of techniques drawn from research and industry for analyzing the safety of software requirements.
Implementation must proceed with safety in mind. Software developers should use defensive programming practices. Where possible, it should not be assumed that correct parameters are passed in or that called functions operate correctly. Assertions and run-time checks can be used to sanity check arguments and return values. It is also good practice to separate critical functions from the rest of the code; the critical functions can then be more carefully reviewed and audited independently. Unnecessary functions should be eliminated from the code base to prevent confusion. Lastly, the choice of language does make a difference. Some languages offer safety enhancing language features, such as good exception-handling mechanisms. Popular trends or developer preference may suggest languages well-known to be unsafe and error prone, but it is better to choose carefully.
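The defensive practices described above can be illustrated with a short sketch. The function commanded_valve_position, its parameters, and the numeric ranges are hypothetical and exist only to show argument checks, return-value checks, and a final assertion in one place.

```python
def commanded_valve_position(pressure_kpa, read_limit):
    """Hypothetical critical function: return a valve opening in [0.0, 1.0]."""
    # Run-time check on the argument: do not assume the caller passed sane data.
    # The comparison of the value with itself rejects NaN.
    if not isinstance(pressure_kpa, (int, float)) or pressure_kpa != pressure_kpa:
        raise ValueError("pressure must be a real number")
    if not 0.0 <= pressure_kpa <= 10_000.0:
        raise ValueError("pressure outside the assumed sensor range")

    limit = read_limit()
    # Do not assume the called function behaved correctly either.
    if not isinstance(limit, (int, float)) or limit <= 0:
        raise RuntimeError("limit source returned an implausible value")

    position = min(pressure_kpa / limit, 1.0)
    # Sanity check on the result before it leaves the critical function.
    assert 0.0 <= position <= 1.0
    return position


print(commanded_valve_position(50.0, lambda: 100.0))
```

Python removes assert statements when run with the -O option, so the checks that must always execute are written here as explicit if tests, with the assertion kept only as a development-time sanity check.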
Once constructed, the software needs to be tested in a way that will expose unsafe operation of the software. Software is often tested to verify that it performs the desired function. Very little testing is devoted to off-nominal cases, where the software is deliberately fed bad data or incorrect environmental assumptions. Testing this off-nominal behavior is often essential to testing the safety of the software. Additionally, the system hazards related to the software component should have been traced down to the software requirements. This traceability can also be used to generate test cases that deliberately test the software's adherence to its safety constraints. This late in the enactment of the safety process, testing should really just be a way to confirm that the other efforts have succeeded. If testing does uncover a hazardous defect that could cause an accident, then something has gone very wrong. Some safety engineers go so far as to say that the effort has failed if testing reveals anything that could cause an accident.
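Off-nominal testing driven by traceability might look something like the following sketch. The controller speed_command, the constraint labels SC-1 and SC-2, and the input ranges are assumptions made for the example; the point is that each deliberately bad input is tied to the safety constraint it exercises.

```python
import math


def speed_command(distance_m):
    """Hypothetical controller: command a stop (0.0) on any implausible reading."""
    if isinstance(distance_m, bool) or not isinstance(distance_m, (int, float)):
        return 0.0
    if math.isnan(distance_m) or distance_m < 0 or distance_m > 500:
        return 0.0
    return min(distance_m / 10.0, 5.0)


# Each off-nominal case names the (hypothetical) safety constraint it exercises.
OFF_NOMINAL_CASES = [
    ("SC-1", float("nan")),  # sensor returns not-a-number
    ("SC-1", -3.0),          # physically impossible reading
    ("SC-1", 1e9),           # reading far outside the sensor range
    ("SC-2", None),          # missing data
    ("SC-2", "12"),          # wrong type from an upstream component
]


def run_off_nominal_tests():
    """Return the cases in which the controller failed to command a stop."""
    return [(constraint, value) for constraint, value in OFF_NOMINAL_CASES
            if speed_command(value) != 0.0]


failures = run_off_nominal_tests()
print("constraint violations:", failures)
```

Keeping the constraint identifier next to each case means a failing test points directly back to the hazard analysis that motivated it.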
The last stage of the safety process occurs after the system is in service. Every system evolves, so change requests are inevitable. Many accidents have occurred because changes were not sufficiently evaluated for their safety implications. It is very important that the safety of the system be verified after a change in the requirements. Traceability makes this process much easier, as the requirements can be followed through the design, implementation, and tests as well. Incidents and accidents must also be recorded and analyzed. The trends of incident and accident data provide information about hazards that may have been missed during the earlier steps of the safety program. Periodically, the system should be audited; this provides data for safety engineers to evaluate their environmental assumptions. The audit results may indicate necessary changes to operator training manuals or maintenance procedures.
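The way traceability eases change evaluation can be sketched as a walk over trace links. The link table and all item names below (REQ-4, DES-2, TEST-11, and so on) are hypothetical; a real project would hold these links in its specification or requirements tool.

```python
from collections import deque

# Hypothetical trace links: each item points to the items derived from it.
TRACE_LINKS = {
    "REQ-4": ["DES-2", "DES-7"],
    "DES-2": ["CODE-arm.py", "TEST-11"],
    "DES-7": ["TEST-12"],
    "REQ-9": ["DES-3"],
    "DES-3": ["TEST-20"],
}


def impacted_items(changed, links):
    """Return every item reachable from `changed` through trace links, sorted."""
    seen = set()
    queue = deque([changed])
    while queue:
        item = queue.popleft()
        for child in links.get(item, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Everything downstream of a changed requirement needs to be re-examined.
print(impacted_items("REQ-4", TRACE_LINKS))
```

The result is a review list, not a verdict: each design element, code unit, and test it names still has to be checked for its effect on safety.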
Safety-driven development processes require effort. Some of this effort can be simplified with the proper tools. At Safeware Engineering Corporation, we are developing an integrated set of tools to assist in building complex control systems. We call this toolset SpecTRM (Specification Tools and Requirements Methodology).
SpecTRM assists in building intent specifications. Developed by Professor Nancy Leveson, an intent specification is, itself, a tool for safety-driven system design. The intent specification is a prescribed format for writing system specifications. How information is presented makes a great deal of difference in how successful engineers are at problem-solving. Intent specifications support the cognitive efforts required for system requirements specification and design.
Intent specifications are very readable and reviewable documents. Their format can bridge between disciplines, allowing engineering specialists and domain experts to share information more effectively. The same format supports the cognitive efforts required for human problem-solving. Traceability is an integral part of intent specifications, ensuring that the rationale for decisions is available at every step. Intent specifications support upstream safety efforts by emphasizing requirements, where the majority of decisions impacting safety are made. By integrating safety information into the system specification, safety information is presented in the decision-making environment; this avoids the problem of safety work being done "out of sight, out of mind".
Intent specifications are hierarchical abstractions based on why (design rationale) as well as what and how. Traditionally, specifications are levelized. Every level of a specification describes what the level below does. Each level, in turn, also describes how the level above is accomplished. This refinement abstraction continues down through the specification until very specific design details are fleshed out. These specifications leave out rationale, however. One of the most important questions users of a specification have is why something was done the way it was. Often, systems cannot evolve smoothly because engineers are afraid to make any changes. They cannot remember why something was done and are afraid that changing it may have a disastrous impact on the system. The intent specification preserves information about why decisions were made, easing system review and evolution.
At each stage, design decisions are mapped back to the requirements and constraints they are derived to satisfy. Earlier stages of the process are mapped to later stages. The organization of the specification results in a record of the progression of design rationale from high-level requirements to component requirements and designs.
Each level of the intent specification supports a different type of reasoning about the system. Mappings between levels provide the relational information necessary to reason across hierarchical levels.
We can demonstrate part of an intent specification by excerpting from the intent specification for TCAS. TCAS is a collision avoidance system used on aircraft. If aircraft violate a minimum separation, the pilot is advised to execute an escape maneuver that will move the aircraft to safe separation.
Level 1: System Purpose
To this point, hazards have been a central focus of the description of system safety. The process has supported the goals of identifying and eliminating hazards. It has been emphasized that hazards are not potential failures, however, but states in the system that, combined with the environment, lead to accidents. The hazard list for TCAS is provided as an example below.
Causes of these hazards can be uncovered by techniques such as qualitative fault tree analysis (FTA). An excerpt from the TCAS FTA is shown below. The fault tree is not presented in its traditional box and line format because it was too hard to fit the text into boxes, but the tree structure should still be apparent.
In an intent specification, links are made from the leaf nodes in the fault tree to constraints in the design that control the risk those faults present. For example, the "Uneven terrain" box in the fault tree above links down to an entry in level two of the intent specification. That entry is reproduced; notice that traceability is maintained in both directions because the entry links back up to the fault tree. This traceability information allows changes to be easily evaluated for their impact on safety.
2.19 When below 1700 feet AGL, the CAS logic uses the difference between its own aircraft pressure altitude and radar altitude to determine the approximate elevation of the ground above sea level (see the figure below). It then subtracts the latter value from the pressure altitude value received from the target to determine the approximate altitude of the target above the ground (barometric altitude - radar altitude + 180 feet). If this altitude is less than 180 feet, TCAS considers the target to be on the ground (^ 1.SC4.9). Traffic and resolution advisories are inhibited for any intruder whose tracked altitude is below this estimate. Hysteresis is provided to reduce vacillations in the display of traffic advisories that might result from hilly terrain (^ FTA-320). All RAs are inhibited when own TCAS is within 500 feet of the ground.
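The arithmetic in entry 2.19 can be followed with a small sketch. This is only a reading of the prose above, not TCAS logic: the function and parameter names are hypothetical, the 180 foot and 1700 foot values come from the entry, and the hysteresis and advisory inhibition it mentions are left out.

```python
# Values taken from entry 2.19; everything else here is an illustrative assumption.
ON_GROUND_THRESHOLD_FT = 180.0
MAX_OWN_HEIGHT_AGL_FT = 1700.0


def ground_elevation_ft(own_pressure_alt_ft, own_radar_alt_ft):
    """Approximate ground elevation above sea level beneath own aircraft."""
    return own_pressure_alt_ft - own_radar_alt_ft


def target_on_ground(own_pressure_alt_ft, own_radar_alt_ft, target_pressure_alt_ft):
    """Return True or False, or None when own aircraft is not below 1700 ft AGL."""
    if own_radar_alt_ft >= MAX_OWN_HEIGHT_AGL_FT:
        return None  # entry 2.19 describes this estimate only below 1700 ft AGL
    ground_ft = ground_elevation_ft(own_pressure_alt_ft, own_radar_alt_ft)
    target_height_ft = target_pressure_alt_ft - ground_ft
    return target_height_ft < ON_GROUND_THRESHOLD_FT


# Own aircraft: 3000 ft pressure altitude and 1000 ft radar altitude,
# so the ground is estimated at 2000 ft above sea level.
print(target_on_ground(3000.0, 1000.0, 2100.0))  # target about 100 ft above ground
print(target_on_ground(3000.0, 1000.0, 2500.0))  # target about 500 ft above ground
```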
Below is another example of level two of an intent specification. The design is influenced by the safety constraints from level one and is linked back to them to provide a rationale for the design.
SENSE REVERSALS vReversal-Provides-More-Separationm-301
2.51 In most encounter situations, the resolution advisory sense will be maintained for the duration of an encounter with a threat aircraft (^ SC-7.2). However, under certain circumstances, it may be necessary for that sense to be reversed. For example, a conflict between two TCAS-equipped aircraft will, with very high probability, result in selection of complementary advisory senses between the two aircraft because of the coordination protocol between the two aircraft. However, if coordination communications between the two aircraft are disrupted at a critical time of sense selection, both aircraft may choose their advisories independently (^FTA-1300). This could possibly result in selection of incompatible senses (^FTA-395).
2.51.1 [Information about how incompatibilities are handled.]
Through levels one and two, conceptual design, high-level goals, and safety-related constraints are translated down into system design. Level three of an intent specification continues from levels one and two, including a black box model of the system. The model is based on automata theory, so it is formal enough that the model can be analyzed and executed. However, the language used for the model is understandable enough that it can be used as the specification. It is readable and reviewable by domain experts, and it can be used by software engineers as a specification.
An example from level three of the intent specification for TCAS follows. Notice that it links back up to level two to preserve the intent behind the choices made in the black box model. It also links down to level four where that level is influenced by the model. This particular example is a model of the INTRUDER.STATUS state element. TCAS considers any other aircraft to be an intruder. Some may be harmless, and some may require advisories. This state classifies an intruder. This state element can be in the state Other-Traffic, Proximate-Traffic, Potential-Threat, or Threat.
The table in the model specification describes the transition from Threat to Other-Traffic. The transition is described by an AND/OR table. If the table is true, then the transition takes place. If the table is not true, then the transition does not take place. The table is an OR of the columns, so if any one column is true, the whole table is true. The rows are AND conditions, so a column is true if every row in that column is true. The expressions in the first column are all either true or false. They are compared to the other cells in the table. A true first-column expression matches against a "T" or an "*"; a false first-column expression matches against an "F" or an "*". The "*" represents "don't care."
So, to read the table below, threat will transition to Other-Traffic if
With just a few moments, most people are able to pick up how to read these tables. Domain experts without engineering experience or mathematical training can review the specification and point out problems. Despite the ease of use and readability, these kinds of descriptions of systems can be simulated and analyzed as well.
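The reading rules for AND/OR tables described earlier can be written as a short evaluator, which also hints at how such a table can be executed. This is a minimal sketch and not SpecTRM's implementation; the function name and_or_table_true and the example rows and columns are hypothetical and do not reproduce the TCAS table.

```python
def and_or_table_true(row_values, columns):
    """Evaluate an AND/OR table.

    row_values: list of booleans, one per row expression (the first column).
    columns:    list of columns; each column is a list of "T", "F", or "*"
                cells, one cell per row.
    """
    def cell_matches(value, cell):
        if cell == "*":  # don't care
            return True
        if cell == "T":
            return value is True
        if cell == "F":
            return value is False
        raise ValueError(f"unexpected cell: {cell!r}")

    def column_true(column):
        if len(column) != len(row_values):
            raise ValueError("column height must equal the number of rows")
        # Rows are ANDed: every cell in the column must match its row.
        return all(cell_matches(v, c) for v, c in zip(row_values, column))

    # Columns are ORed: one true column makes the whole table true.
    return any(column_true(column) for column in columns)


# Three hypothetical row expressions that currently evaluate to
# True, False, and True, checked against a two-column table.
rows = [True, False, True]
table = [
    ["T", "T", "*"],  # fails: the second row is False but the cell requires "T"
    ["*", "F", "T"],  # matches every row, so the table is true
]
print(and_or_table_true(rows, table))
```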
At the time this web page was written, models had been or were being built for:
Copyright © 2003 Safeware Engineering Corporation. All rights reserved