Software Hazard Analysis
Subsystem hazard analysis (SSHA) examines subsystems to determine how their behavior could contribute to system hazards. SSHA also determines how to satisfy design constraints on the subsystem design. Lastly, subsystem hazard analysis validates that the subsystem design satisfies safety design constraints and does not introduce previously unidentified hazardous system behavior.
Software hazard analysis is a form of subsystem hazard analysis. It validates that specified software blackbox behavior satisfies system safety design constraints. Software Hazard Analysis checks that specified software behavior satisfies general software system safety design criteria, as well. This analysis must be performed on all software in the system, including COTS.
Like system hazard analysis, software (subsystem) hazard analysis requires a model of the component's behavior. Code is a poor model: it contains too much implementation complexity to serve as a specification of behavior, and examining code also comes too late in the process. By the time the code is written, any changes that must be made will be too costly. And software is too complex for the analysis to be done entirely in one's head.
Formal models are useful, but they need to be easily readable and usable without graduate-level training in discrete math. Only a small subset of errors are detectable by automated tools: the most important ones require human knowledge and expertise. Mathematical proofs can be developed based on formal systems, but these proofs must be understandable and checkable by application experts. Frequently the proofs produced are more complex and error-prone than the systems they describe. The hazard analysis process requires results that can be openly reviewed and discussed.
State Machine Models of Blackbox Requirements
State machines make a good model for describing and analyzing digital systems and software. State machines match intuitive notions of how machines work. Some other specification languages, such as those based on sets, do not. State machines have a mathematical basis, so they can be analyzed; they also have graphical notations that are easily understandable. Previous problems with state explosion in state machine models have been solved by "meta-modeling" languages so complex systems can be handled.
Some analyses can be automated and tools can help human analysts to traverse (search) the model. Our experience is that assisted search and understanding tools are the most helpful in hazard analysis. Completely automated tools have an important but more limited role to play.
An example state machine follows.
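The kind of state machine described above can be sketched in a few lines. The states, events, and transitions below are hypothetical (a simple valve controller invented for illustration), not taken from any real specification.

```python
# A minimal state machine sketch for a hypothetical valve controller.
# The transition table maps (current state, input event) to a next state.

TRANSITIONS = {
    ("closed", "open_cmd"): "open",
    ("open", "close_cmd"): "closed",
    ("open", "overpressure"): "closed",   # safety-motivated transition
}

def step(state, event):
    """Return the next state; pairs not in the table leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

state = "closed"
state = step(state, "open_cmd")      # valve opens
state = step(state, "overpressure")  # safety transition closes it again
```

A human reviewer can walk such a table directly, and an automated tool can search it, which is exactly the division of labor described above.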
Requirements are the source of most operational errors and almost all software contributions to accidents. Much of software hazard analysis should therefore focus on requirements. The problem is dealing with complexity. One step in controlling complexity is to separate external behavior from complexity of internal design to accomplish the behavior.
Abstraction and metamodels can be used to handle the large number of discrete states required to describe software behavior. Continuous math (which works well with large ranges) is not available for help. But new types of state machine modeling languages drastically reduce the number of states and transitions that the modeler needs to describe.
Blackbox specifications provide a blackbox statement of software behavior. Statements are permitted only in terms of outputs and externally observable conditions or events that stimulate or trigger those outputs. A complete trigger specification must include the full set of conditions that may be inferred from the existence of the specified output. Such conditions represent assumptions about the environment in which the program or system is to operate. Thus, the specification is the input to output function computed by the component, i.e., the transfer function. Internal design decisions are not included.
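A blackbox trigger specification of this kind can be illustrated with a small sketch. The output, threshold, and environmental assumption below are all hypothetical; the point is that the output is defined only in terms of externally observable conditions, with no internal design decisions.

```python
# Sketch of a blackbox output trigger. The output is stated purely in
# terms of observable inputs; the sensor health flag is an environmental
# assumption that can be inferred whenever the output is produced.

def issue_alarm(temperature, sensor_ok):
    """Produce 'ALARM' iff the observable trigger condition holds.

    Trigger: the temperature reading exceeds 100 units AND the sensor
    self-test reports healthy (assumed threshold and condition).
    """
    if sensor_ok and temperature > 100:
        return "ALARM"
    return None
```

The function is the transfer function of the component: nothing about how the alarm decision is computed internally appears in the specification.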
Process models define the required blackbox behavior of the software in terms of a state machine model of the process (called the plant in control systems terminology).
Accidents occur when these process models (the actual state of the process, the automation's model of the process, and the human controller's mental model) do not match and incorrect control commands are given (or correct ones are not given). How do these models become inconsistent?
The controller's model of the automation must also be accurate. But we often find that operators do not understand the automation, asking questions such as "What is it doing?" and "Why did it do that?" Operators also may not receive updates to their mental models, or they may disbelieve the feedback the system gives.
Level 3 of an intent specification contains a model. The model is constructed using the SpecTRM-RL modeling language. SpecTRM-RL has goals to be readable and reviewable. The model should minimize the semantic distance between the modeler and the system. The model should also be a minimal model, including only blackbox behavior and not internal design. The SpecTRM-RL modeling language is easy to learn and has unambiguous, simple semantics. Lastly, the language is analyzable, including execution and formal analysis.
SpecTRM-RL combines the utility of requirements specification languages and modeling languages. The language is based on state machines, but the syntax is very readable. The language includes or enforces most of the completeness criteria developed for safe software system development. SpecTRM-RL supports specifying systems in terms of modes: control and supervisory.
In SpecTRM-RL, the process is modeled using state variables. An example of two state variables is shown in the figure below.
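The idea of process state variables can be sketched in a few lines. The variable names below are hypothetical; the key point, which also anticipates the startup completeness criteria discussed later, is that a variable carries no assumed value until an input actually arrives.

```python
# Sketch of process state variables that start in an "Unknown" value
# until the first input arrives. Variable names are invented for
# illustration.

UNKNOWN = "Unknown"

class StateVariable:
    def __init__(self, name):
        self.name = name
        self.value = UNKNOWN   # no assumption about the process at startup

    def update(self, observed_value):
        """Record the latest externally observed value."""
        self.value = observed_value

altitude = StateVariable("AircraftAltitude")
gear = StateVariable("LandingGearPosition")
```

Starting every variable at Unknown forces the specifier to say what the software does before its model of the process has been established.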
In a graphical depiction, the SpecTRM model is written as shown in the figure below. As in control system theory, the block diagram is drawn with the system in the center. Inputs are above (or to the left) and outputs are below (or to the right).
Because of SpecTRM-RL's formal foundation, many analyses can be applied to the model, including model execution, completeness analysis, state machine hazard analysis, software deviation analysis, and human error analysis.
In theory, it may be possible to generate code directly from the requirements.
SpecTRM-RL models are executable, and model execution is animated in the graphical display of the model. The results of the execution could be passed into a graphical visualization, showing the system in operation. Inputs can come from another model or simulator and output can go into another model or simulator.
Completeness analysis is another desirable benefit from SpecTRM-RL models. Most software-related accidents involve software requirements deficiencies. Accidents often result from unhandled and unspecified cases. We have defined a set of criteria to determine whether a requirements specification is complete. These completeness requirements are derived from basic engineering principles. These criteria have been validated (at JPL) and used on industrial projects. Completeness can be defined as the property that requirements are sufficient to distinguish the desired behavior of the program from that of any other undesired program that might be designed.
Requirements Completeness Criteria
The completeness criteria were derived by mapping the parts of a control loop to a state machine. Completeness for each part of the state machine (states, inputs, outputs, transitions) was defined. Basic engineering principles (e.g., feedback) were added as well. Additional criteria have been added based on lessons learned from accidents. We have about 60 criteria in all, including human-computer interaction. There are too many for all of them to be included in this article (although some will be, for demonstration). Most of these criteria are integrated into the SpecTRM-RL language design, so that writing a model in SpecTRM-RL forces the criteria to be addressed. Many can also be checked by simple tools that we are in the process of developing.
The completeness criteria address:
To use startup and state completeness as an example, many accidents involve off-nominal processing modes, including startup and shutdown and handling unexpected inputs. Examples of completeness criteria in this category are:
Criteria for failure states and transitions require that the following be completely specified.
Most accidents occur while in off-nominal processing modes. As a brief example, two accidents from the nuclear industry, Three Mile Island and Chernobyl, illustrate two of the points above. At Three Mile Island, the line printer that printed errors fell hours behind the state of the system, illustrating that communication with the operator during failure modes must be considered. Another point, that off-nominal states and transitions must be considered, comes up in the Chernobyl accident, where the operators were running a test with the reactor at the time the accident began.
Criteria for input and output variable completeness were mentioned above as well. At the blackbox interface to the software, only time and values are observable to the software. So, triggers for outputs and output values must be defined only as constants or as the value of observable events or conditions. The completeness criteria are:
Trigger events have their own completeness criteria. The behavior of the computer should be well defined with respect to assumptions about the behavior of the other parts of the system. A robust system will detect and respond appropriately to violations of these assumptions (unexpected inputs). Therefore, the robustness of software built from a specification will depend on the completeness of the specification of environmental assumptions. These criteria can be stated succinctly:
It is very important to have the assumptions on inputs documented and checked.
To be robust, the events that trigger state changes must satisfy the following:
Together these criteria guarantee handling of inputs that are within range, out of range, and missing.
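The three cases can be made concrete with a small sketch. The sensor range and the representation of a missing input (None, standing in for a timeout) are hypothetical.

```python
# Sketch: robust classification of a single input that may be in range,
# out of range, or missing entirely. The valid range is an assumed
# environmental assumption for a hypothetical sensor.

LOW, HIGH = 0.0, 150.0   # assumed valid sensor range

def classify_input(reading):
    """Place a reading into one of the three cases the criteria require."""
    if reading is None:
        return "missing"        # timeout / no data: needs its own response
    if LOW <= reading <= HIGH:
        return "in_range"
    return "out_of_range"       # violates an environmental assumption
```

A complete specification defines the software's response for all three branches, not just the in-range case.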
The behavior of the requirements should be deterministic. That is, only one possible transition out of a state should be applicable at any given time. For instance, if one transition is taken when X > 0 and another when X < 2, then which one is actually taken when 0 < X < 2 is implementation dependent. This kind of nondeterministic specification is very difficult to evaluate for safety. Because of the tedium of this kind of checking, it is best done by automated tools. (Lastly, note that this type of mathematical completeness, while desirable, is not enough to guarantee any particular properties of the system. "True" is a mathematically complete, consistent, and deterministic specification, but it doesn't do anything.)
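The core of such an automated check is simple interval reasoning. The sketch below, with guards represented as hypothetical open intervals, flags the X > 0 / X < 2 example as nondeterministic because the two guards overlap.

```python
# Sketch of an automated determinism check: two transition guards over
# the same variable are nondeterministic if their intervals overlap.
# Guards are (low, high) open intervals; infinities model unbounded sides.

def overlaps(g1, g2):
    """True if open intervals g1 and g2 share at least one point."""
    lo = max(g1[0], g2[0])
    hi = min(g1[1], g2[1])
    return lo < hi

guard_a = (0.0, float("inf"))    # guard "X > 0"
guard_b = (float("-inf"), 2.0)   # guard "X < 2"
# overlaps(guard_a, guard_b) is True: the spec is nondeterministic on (0, 2)
```

A tool would apply this pairwise to every set of transitions leaving the same state, a tedious but mechanical task well suited to automation.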
Criteria for value and timing assumptions include:
Criteria for the Human-Computer Interface include:
The answers to these questions have more of an impact on safety than it might first appear. For example, consider an air traffic control system where notices regarding controlled aircraft are displayed to the screen. What should happen if a notice might block the display of another aircraft's position? What if a notice is cleared from the display only to be replaced by another notice that looks almost identical (perhaps varying only in the flight number)? The operator might believe they had not cleared the first notice and clear the second as well. Human-computer interface designs that overwhelm the operator with tasks or sensory input may contribute to accidents.
Two examples of environment capacity constraints are:
Put simply, the output channel of the system must be able to keep up with the outputs generated as a result of the inputs to the system. If they cannot, some fallback plan must be in place. Recall, again, the line printer (for printing alarms) at Three Mile Island that fell several hours behind the status of the system.
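One possible fallback policy can be sketched as a bounded output queue. The capacity and the drop-oldest policy are hypothetical choices made for illustration; the point is that the specification states explicitly what happens when the channel cannot keep up.

```python
# Sketch: a bounded alarm queue with an explicit fallback. When the
# output channel is saturated, the OLDEST entry is dropped (and logged)
# so the display reflects the current state rather than falling hours
# behind, as the Three Mile Island printer did.

from collections import deque

CAPACITY = 3   # assumed channel capacity

def enqueue_alarm(queue, alarm, dropped):
    """Append an alarm; on overflow, shed the oldest entry into 'dropped'."""
    if len(queue) >= CAPACITY:
        dropped.append(queue.popleft())
    queue.append(alarm)

queue, dropped = deque(), []
for a in ["a1", "a2", "a3", "a4"]:
    enqueue_alarm(queue, a, dropped)
```

Whether dropping old alarms, summarizing them, or switching display modes is the right fallback is a system safety decision; the requirement is only that some defined fallback exists.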
Data age criteria are important as well:
Latency is the time interval during which receipt of new information cannot change an output even though it arrives prior to the output. This interval cannot be completely eliminated, but it can be influenced by hardware and software design, such as choosing an interrupt-driven architecture versus polling. The acceptable length of latency is determined by the process being controlled. Subtle problems can occur when the latency of data related to human-computer interaction isn't considered.
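A minimal data-age check can be sketched as follows. The maximum acceptable age is a property of the controlled process; the threshold, readings, and command names here are hypothetical.

```python
# Sketch: rejecting stale data before it is allowed to trigger an output.
# MAX_AGE_S is an assumed limit that would come from analysis of the
# controlled process, not from the software itself.

MAX_AGE_S = 2.0

def fresh_enough(sample_time, now):
    """True if the sample is recent enough to base an output on."""
    return (now - sample_time) <= MAX_AGE_S

def command_from(reading, sample_time, now):
    if not fresh_enough(sample_time, now):
        return "REQUEST_NEW_DATA"   # stale: do not act on an old process state
    return "OPEN" if reading > 10 else "HOLD"
```

Making the age bound an explicit part of the trigger condition keeps latency assumptions visible and reviewable instead of buried in the design.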
Feedback criteria are also important. Basic feedback loops, as defined by the process control function, must be included in the requirements along with appropriate checks to detect internal or external failures or errors. For example:
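A minimal sketch of such a feedback check, with hypothetical names: after commanding a valve open, the specification expects position feedback to confirm the command, and absent or contradictory feedback is treated as a detected failure.

```python
# Sketch: checking feedback against a command. Absence of feedback, or
# feedback that disagrees with the command, indicates an internal or
# external failure that the requirements must address.

def check_feedback(commanded, sensed):
    """Compare the commanded state against sensed feedback (None = no data)."""
    if sensed is None:
        return "FAILURE: no feedback received"
    if sensed != commanded:
        return "FAILURE: feedback disagrees with command"
    return "OK"
```

The requirement is not merely that feedback exists, but that the specification defines a response for each way the loop can fail.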
Paths between states are uniquely defined by the sequence of trigger events along the path. Transitions between modes are especially hazardous and susceptible to incomplete specification. There are several criteria related to the paths possible through the state space. Not all of the criteria are listed here; only the major categories are given, with examples under some of them.
In addition to completeness criteria checking, SpecTRM-RL facilitates the use of a number of other human and automated analysis techniques.
State machine hazard analysis starts from a hazardous configuration in the model. This configuration violates a safety constraint. The analysis traces backward until enough information is available to eliminate the hazard from the design.
Software deviation analysis is a new type of software requirements analysis. It's a forward robustness analysis: how will the software operate in an imperfect environment? Software deviation analysis (SDA) determines whether a hazardous software behavior can result from a class of input deviations, such as a measured aircraft velocity being too low (measured or assumed velocity is less than actual). SDA is based on qualitative mathematics. It partitions infinite domains (such as integers) into a small set of intervals. The intervals are used to simplify analysis compared to iterations over the entire state space.
The software deviation analysis procedure is completely automated. The analyst provides: input deviations to check, a SpecTRM-RL specification, and a list of safety-critical outputs. The output produced by the algorithm is a set of scenarios. Each scenario has a set of deviations in software input paths plus paths through the specification sufficient to lead to a deviation in a safety-critical output. The SDA procedure can optionally add further deviations as it executes that would, together with the original deviation, lead to unsafe output. This allows for analysis of multiple, independent deviations or failures.
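The qualitative-interval idea behind SDA can be illustrated with a toy sketch. Instead of iterating over every possible velocity value, the analysis reasons over a small set of qualitative intervals; the hazard condition below (a measured-too-low velocity suppressing a warning) is hypothetical.

```python
# Toy sketch of qualitative reasoning as used in software deviation
# analysis: an infinite numeric domain is collapsed into a few intervals,
# and the analysis asks whether any interval can lead to unsafe output.

def qualitative(deviation):
    """Map a numeric deviation (measured minus actual) onto a small domain."""
    if deviation < 0:
        return "low"       # measured value below actual
    if deviation > 0:
        return "high"      # measured value above actual
    return "nominal"

def warning_suppressed(velocity_deviation):
    """Hypothetical hazard: a measured-too-low velocity suppresses an
    overspeed warning."""
    return qualitative(velocity_deviation) == "low"
```

The real procedure traces such deviations through the SpecTRM-RL specification to safety-critical outputs; the interval abstraction is what keeps that search tractable.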
Human Error Analysis examines the general requirements and design criteria as well as the hazard analysis for specific hazards. Human error analyses include mode confusion analysis. The idea is to look at the interaction between human controllers and the computer.
Executable specifications can serve the same role as prototypes. They are easily changed, and at the end of the process they constitute a specification that can be used in constructing the software. They can be reused, for example across product families, and, because the specification is a formal one, it can be analyzed. Executable specifications can even be used in hardware-in-the-loop or operator-in-the-loop simulations.
Copyright © 2003 Safeware Engineering Corporation. All rights reserved