
Software Hazard Analysis

Subsystem hazard analysis (SSHA) examines subsystems to determine how their

- Normal performance
- Operational degradation
- Functional failure
- Unintended function
- Inadvertent function (proper function but at wrong time or in wrong order)

could contribute to system hazards. SSHA also determines how design constraints can be satisfied in the subsystem design. Lastly, subsystem hazard analysis validates that the subsystem design satisfies the safety design constraints and does not introduce previously unidentified hazardous system behavior.

Software hazard analysis is a form of subsystem hazard analysis. It validates that the specified software blackbox behavior satisfies the system safety design constraints, and it checks that the specified behavior satisfies general software system safety design criteria as well. This analysis must be performed on all software in the system, including COTS.

Like system hazard analysis, software (subsystem) hazard analysis requires a model of the component's behavior. Working directly from code is too hard: there is too much implementation complexity obscuring the specification of behavior. Examination of code also comes too late in the process; if changes must be made after the code is written, the effort is too costly. And software is too complex to analyze entirely in one's head.

Formal models are useful, but they need to be easily readable and usable without graduate-level training in discrete mathematics. Only a small subset of errors is detectable by automated tools; the most important ones require human knowledge and expertise. Mathematical proofs can be developed based on formal systems, but these proofs must be understandable and checkable by application experts. Frequently the proofs produced are more complex and error-prone than the systems they describe. The hazard analysis process requires results that can be openly reviewed and discussed.

State Machine Models of Blackbox Requirements

State machines make a good model for describing and analyzing digital systems and software. State machines match intuitive notions of how machines work. Some other specification languages, such as those based on sets, do not. State machines have a mathematical basis, so they can be analyzed; they also have graphical notations that are easily understandable. Previous problems with state explosion in state machine models have been solved by "meta-modeling" languages so complex systems can be handled.

Some analyses can be automated and tools can help human analysts to traverse (search) the model. Our experience is that assisted search and understanding tools are the most helpful in hazard analysis. Completely automated tools have an important but more limited role to play.

An example state machine follows.
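
The original figure is not reproduced here, so the following minimal sketch in Python (with hypothetical states and events for a simple valve controller, not a SpecTRM-RL model) shows the kind of blackbox state machine meant: a set of states, a set of externally observable input events, and a transition function that can be executed and searched.

    # Minimal sketch of a blackbox state machine (hypothetical example, not SpecTRM-RL).
    # States describe the controlled process; events are externally observable inputs.
    TRANSITIONS = {
        # (current state, input event) -> next state
        ("valve_closed", "level_low"):  "valve_open",
        ("valve_closed", "level_high"): "valve_closed",
        ("valve_open",   "level_high"): "valve_closed",
        ("valve_open",   "level_low"):  "valve_open",
    }

    def step(state, event):
        """Return the next state; an unhandled (state, event) pair is flagged explicitly."""
        try:
            return TRANSITIONS[(state, event)]
        except KeyError:
            # An unhandled combination is exactly the kind of incompleteness the
            # completeness criteria discussed later are designed to catch.
            raise ValueError(f"no transition defined for {state!r} on {event!r}")

    if __name__ == "__main__":
        state = "valve_closed"
        for event in ["level_low", "level_high", "level_high"]:
            state = step(state, event)
            print(event, "->", state)

Because every behavior is an explicit (state, event) entry, the same table supports both execution (animation) and exhaustive search by a tool.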

Requirements are the source of most operational errors and almost all software contributions to accidents. Much of software hazard analysis should therefore focus on requirements. The problem is dealing with complexity. One step in controlling complexity is to separate external behavior from complexity of internal design to accomplish the behavior.

Abstraction and metamodels can be used to handle the large number of discrete states required to describe software behavior. Continuous mathematics, which copes well with large ranges, is not available to help here. But new types of state machine modeling languages drastically reduce the number of states and transitions that the modeler needs to describe.

A blackbox specification states software behavior only in terms of outputs and the externally observable conditions or events that stimulate or trigger those outputs. A complete trigger specification must include the full set of conditions that may be inferred from the existence of the specified output. Such conditions represent assumptions about the environment in which the program or system is to operate. The specification is thus the input-to-output function computed by the component, i.e., its transfer function. Internal design decisions are not included.
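
As a small illustration of this rule (hypothetical names and thresholds, not taken from any particular system), the sketch below specifies an output trigger purely as a function of observable input conditions; no internal design decisions appear, and the freshness requirement documents an environmental assumption.

    # Blackbox trigger specification: the output is a function only of externally
    # observable conditions, i.e., part of the component's transfer function.
    # Hypothetical example.
    def open_drain_command(measured_level_cm: float, seconds_since_reading: float) -> bool:
        """Trigger the 'open drain' output when the observable conditions hold."""
        reading_is_fresh = seconds_since_reading <= 2.0   # environmental assumption
        level_too_high = measured_level_cm >= 150.0       # observable process condition
        return reading_is_fresh and level_too_high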

Process models define the required blackbox behavior of the software in terms of a state machine model of the process (called the plant in control systems terminology).

Accidents occur when the three process models (the actual state of the process, the software's model of it, and the operator's mental model of it) do not match and incorrect control commands are given (or correct ones are not given). How do these models become inconsistent?

- Wrong from the beginning, e.g.
  - uncontrolled disturbances
  - unhandled process states
  - inadvertently commanding system into a hazardous state
  - unhandled or incorrectly handled system component failures
  [Note that these are related to what we called system accidents.]
- Missing or incorrect feedback, so the state is not updated correctly
- Time lags not accounted for
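
To make the mismatch concrete, here is a minimal sketch (hypothetical states and values) of a controller whose control law operates on its internal process model. Because the feedback confirming its last command was lost, the model says the valve is open while the real valve is closed, and the needed command is never given.

    # Sketch of a controller/process model mismatch (hypothetical example).
    actual_valve_state = "closed"   # the real process: the open command never took effect
    model_valve_state = "open"      # the controller's model: it assumed the command succeeded

    def command_for(model_state: str, level_cm: float) -> str:
        # The control law acts on the *model*, not on the real process.
        if model_state == "closed" and level_cm > 150.0:
            return "open_valve"
        return "no_action"

    print(command_for(model_valve_state, level_cm=170.0))
    # Prints "no_action": the correct command is not given because the model says the
    # valve is already open. Missing feedback lets the mismatch persist.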

The human controller's model of the automation must also be accurate. But we often find that operators do not understand the automation. They often have questions like:

- What did it just do?
- Why did it do that?
- What will it do next?
- How did I get us into this state?
- How do I get it to do what I want?
- Why won't it let us do that?
- What caused the failure?
- What can we do so it does not happen again?

Operators also may not receive updates to their mental models, or they may disbelieve the feedback the system gives.

SpecTRM-RL Models

Level 3 of an intent specification contains a model constructed in the SpecTRM-RL modeling language. SpecTRM-RL is designed to be readable and reviewable: the model should minimize the semantic distance between the modeler and the system, and it should be a minimal model, including only blackbox behavior and not internal design. The language is easy to learn and has simple, unambiguous semantics. Lastly, the language is analyzable, supporting both execution and formal analysis.

SpecTRM-RL combines the utility of a requirements specification language and a modeling language. The language is based on state machines, but the syntax is very readable. It includes or enforces most of the completeness criteria developed for safe software system development, and it supports specifying systems in terms of control and supervisory modes.

In SpecTRM-RL, the process is modeled using state variables. An example of two state variables is shown in the figure below.

In its graphical depiction, a SpecTRM-RL model is drawn as shown in the figure below. As in control theory block diagrams, the system being modeled is placed in the center; inputs enter from above or to the left, and outputs leave below and to the right.

Because of SpecTRM-RL's formal foundation, many analyses can be applied to the model, including:

- Model Execution, Animation, and Visualization
- Completeness
- State Machine Hazard Analysis (backwards reachability)
- Software Deviation Analysis
- Human Error Analysis
- Test Coverage Analysis and Test Case Generation

In theory, it may be possible to generate code directly from the requirements.

SpecTRM-RL models are executable, and model execution is animated in the graphical display of the model. The results of the execution could be passed into a graphical visualization, showing the system in operation. Inputs can come from another model or simulator and output can go into another model or simulator.

Completeness analysis is another desirable benefit from SpecTRM-RL models. Most software-related accidents involve software requirements deficiencies. Accidents often result from unhandled and unspecified cases. We have defined a set of criteria to determine whether a requirements specification is complete. These completeness requirements are derived from basic engineering principles. These criteria have been validated (at JPL) and used on industrial projects. Completeness can be defined as the property that requirements are sufficient to distinguish the desired behavior of the program from that of any other undesired program that might be designed.

Requirements Completeness Criteria

The completeness criteria were derived by mapping the parts of a control loop to a state machine. Completeness for each part of the state machine (states, inputs, outputs, transitions) was defined. Basic engineering principles (e.g., feedback) were added as well. Additional criteria have been added based on lessons learned from accidents. We have about 60 criteria in all, including human-computer interaction. There are too many for all of them to be included in this article (although some will be, for demonstration). Most of these criteria are integrated into the SpecTRM-RL language design, so that writing a model in SpecTRM-RL forces the criteria to be addressed. Many can also be checked by simple tools that we are in the process of developing.

The completeness criteria address:

- Startup, shutdown
- Mode transitions
- Inputs and outputs
- Value and timing
- Load and capacity
- Environment capacity
- Failure states and transitions
- Human-computer interaction
- Robustness
- Data age
- Latency
- Feedback
- Reversibility
- Preemption
- Path robustness

To use startup and state completeness as an example, many accidents involve off-nominal processing modes, including startup and shutdown and handling unexpected inputs. Examples of completeness criteria in this category are:

- The internal software model of the process must be updated to reflect the actual process state at initial startup and after temporary shutdown.
  Accidents have been caused by computer controllers starting up with an implicit assumption about the state the process is in. When these computers are taken off-line and brought back on-line in the middle of a process, the discontinuity between the controller's model and the real world can cause an accident.
- The maximum time the computer waits before the first input must be specified.
  It is easy to design computers to be reactive, exerting control only when prompted by input. But if a computer controller receives no input for a long time after startup, that may indicate a problem with the input sources.
- There must be a response specified for the arrival of an input in any state, including indeterminate states.
  If there is no defined response for the arrival of an input, even one that is not expected, the system may behave erratically when it receives an input at the wrong time.
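
A minimal sketch of the first two criteria (hypothetical interface and limits): at startup the controller rebuilds its process model from an actual sensor reading rather than from an assumed default, and the maximum wait for the first input is explicit, with a specified fail-safe response if nothing arrives.

    # Startup completeness sketch (hypothetical sensor interface and limits).
    import time

    MAX_WAIT_FIRST_INPUT_S = 5.0   # the maximum wait for the first input is specified

    def read_with_timeout(read_fn, timeout_s):
        """Poll a sensor, returning None if nothing arrives within timeout_s."""
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            value = read_fn()
            if value is not None:
                return value
            time.sleep(0.05)
        return None

    def start_up(read_level_sensor):
        # Update the internal model from the *actual* process state, not from an
        # implicit assumption about where the process was left.
        level = read_with_timeout(read_level_sensor, MAX_WAIT_FIRST_INPUT_S)
        if level is None:
            # A response is specified even when the expected input never arrives.
            return {"mode": "fail_safe", "level_cm": None}
        return {"mode": "operational", "level_cm": level}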

The failure states and transitions criteria require that the following be completely specified:

- Off-nominal states and transitions
- Performance degradation
- Communication with the operator about fail-safe behavior
- Partial shutdown and restart
- Hysteresis in transitions between off-nominal and nominal

Most accidents occur while in off-nominal processing modes. As a brief example, two accidents in the nuclear industry, Three Mile Island and Chernobyl, illustrate two of the points above. At Three Mile Island, the line printer that printed alarms fell hours behind the state of the system, showing that communication with the operator during failure modes must be considered. The need to consider off-nominal states and transitions arose at Chernobyl, where the operators were running a test on the reactor at the time the accident began.

Criteria for input and output variable completeness were mentioned above as well. At the blackbox interface to the software, only time and values are observable to the software. So, triggers for outputs and output values must be defined only as constants or as the value of observable events or conditions. The completeness criteria are:

- All information from the sensors should be used somewhere in the specification.
  If an input value is never used, why include it in the specification?
- Legal output values that are never produced should be checked for potential specification incompleteness.
  Outputs are initially derived from examining the process to be controlled and determining what outputs are needed to effect that control. If, upon review, it is found that an output is never produced, something is likely amiss.

Trigger events have their own completeness criteria. The behavior of the computer should be well defined with respect to assumptions about the behavior of the other parts of the system. A robust system will detect and respond appropriately to violations of these assumptions (unexpected inputs). Therefore, the robustness of software built from a specification depends on how completely the environmental assumptions are specified. The criterion can be stated succinctly:

- There should be no observable events that leave the program's behavior indeterminate.

It is very important to have the assumptions on inputs documented and checked.

To be robust, the events that trigger state changes must satisfy the following:

  1. Every state must have a behavior (transition) defined for every possible input.

  2. The logical OR of the conditions on every transition out of every state must form a tautology.

  3. Every state must have a software behavior (transition) defined in case there is no input for a given period of time (a timeout).

Together these criteria guarantee handling of inputs that are within range, out of range, and missing.

The behavior of the requirements should be deterministic: only one transition out of a state should be applicable at any given time. For instance, if one transition is taken when X > 0 and another when X < 2, then for a value satisfying both conditions (such as X = 1) the transition actually taken is implementation dependent. This kind of nondeterministic specification is very difficult to evaluate for safety. Because of the tedium of this kind of checking, it is best done by automated tools. (Lastly, note that this type of mathematical completeness, while desirable, is not enough to guarantee any particular properties of the system. "True" is a mathematically complete, consistent, and deterministic specification, but it doesn't do anything.)
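
The tautology and determinism checks lend themselves to automation. The sketch below (hypothetical guards checked over a small integer range) verifies that the guards on the transitions out of a state cover every input (their OR is a tautology over the domain) and never overlap (at most one applies); it flags the X > 0 / X < 2 overlap from the example above.

    # Coverage (tautology) and determinism check over an enumerated input domain.
    # Hypothetical guards; a real tool would reason symbolically rather than enumerate.
    def guard_a(x): return x > 0     # overlaps guard_b at x = 1
    def guard_b(x): return x < 2

    def check_transitions(guards, domain):
        uncovered, ambiguous = [], []
        for x in domain:
            hits = [g.__name__ for g in guards if g(x)]
            if not hits:
                uncovered.append(x)          # OR of the guards is not a tautology
            elif len(hits) > 1:
                ambiguous.append((x, hits))  # nondeterministic: more than one guard applies
        return uncovered, ambiguous

    uncovered, ambiguous = check_transitions([guard_a, guard_b], domain=range(-3, 4))
    print("uncovered inputs:", uncovered)    # []: every x satisfies x > 0 or x < 2
    print("ambiguous inputs:", ambiguous)    # [(1, ['guard_a', 'guard_b'])]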

Criteria for value and timing assumptions include:

- All inputs should be checked and a response specified in the event of an out-of-range or unexpected value.
- All inputs must be fully bounded in time and the proper behavior specified in case the limits are violated.
- Minimum and maximum load assumptions must be specified and a proper behavior specified in case the assumptions are violated.
- A minimum-arrival-rate check should be required for each physically distinct communication path. Software should have the capability to query its environment with respect to inactivity over a given communication path.
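
A hedged sketch of these criteria (hypothetical limits and channel names): each input is range-checked and bounded in time, and each physically distinct path carries a minimum-arrival-rate check.

    # Value and timing assumption checks (hypothetical limits and channel names).
    import time

    LIMITS = {"airspeed_kt": (40.0, 400.0)}      # expected value range per input
    MAX_INPUT_AGE_S = 1.0                        # inputs must be bounded in time
    MIN_ARRIVAL_INTERVAL_S = {"adc_bus": 0.5}    # minimum arrival rate per physical path

    last_arrival = {"adc_bus": time.monotonic()}

    def accept_input(channel, name, value, timestamp):
        """Return (ok, reason); a response must be specified for every failure case."""
        lo, hi = LIMITS[name]
        if not (lo <= value <= hi):
            return False, "out_of_range"         # response specified for bad values
        if time.monotonic() - timestamp > MAX_INPUT_AGE_S:
            return False, "too_old"              # response specified for late inputs
        last_arrival[channel] = time.monotonic()
        return True, "ok"

    def channel_alive(channel):
        """Minimum-arrival-rate check: has this path been silent too long?"""
        return time.monotonic() - last_arrival[channel] <= MIN_ARRIVAL_INTERVAL_S[channel]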

Criteria for the Human-Computer Interface include:

- For every data item displayable to a human, the specification must include:
  - What events cause this item to be displayed?
  - What events cause this item to be updated?
  - What events should cause the display to disappear?
- For queues, the specification must include:
  1. Events to be queued
  2. Type and number of queues to be provided. For example, a routine queue of notifications may be separated from a queue of alerts.
  3. Ordering scheme within the queue, such as priority ordering or time of arrival
  4. Operator notification mechanism for items inserted in the queue
  5. Operator review and disposal commands for queue entries
  6. Queue entry deletion

The answers to these questions have more of an impact on safety than it might first appear. For example, consider an air traffic control system where notices regarding controlled aircraft are displayed to the screen. What should happen if a notice might block the display of another aircraft's position? What if a notice is cleared from the display only to be replaced by another notice that looks almost identical (perhaps varying only in the flight number)? The operator might believe they had not cleared the first notice and clear the second as well. Human-computer interface designs that overwhelm the operator with tasks or sensory input may contribute to accidents.
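
A minimal sketch of the queue criteria (hypothetical event types): routine notifications and alerts go to separate queues, alerts are ordered by priority, the operator is notified on insertion, and entries leave only through explicit operator review and disposal commands.

    # Operator notification queues (hypothetical example).
    import heapq
    from collections import deque

    routine_queue = deque()   # routine notifications, ordered by time of arrival
    alert_heap = []           # alerts, ordered by priority (lower number = more urgent)

    def notify_operator(text):
        print(text)                                    # stand-in for the real display

    def post_routine(message):
        routine_queue.append(message)

    def post_alert(priority, message):
        heapq.heappush(alert_heap, (priority, message))
        notify_operator(f"ALERT queued: {message}")    # operator notified on insertion

    def operator_review_alert():
        """Review command: show the most urgent alert without removing it."""
        return alert_heap[0] if alert_heap else None

    def operator_dispose_alert():
        """Entries are deleted only through an explicit operator disposal action."""
        return heapq.heappop(alert_heap) if alert_heap else None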

Two examples of environment capacity constraints are:

- For the largest interval in which both input and output loads are assumed and specified, the absorption rate of the output environment must equal or exceed the input arrival rate.
- Contingency action must be specified when the output absorption rate will be exceeded.

Put simply, the output channel of the system must be able to keep up with the outputs generated as a result of the inputs to the system. If they cannot, some fallback plan must be in place. Recall, again, the line printer (for printing alarms) at Three Mile Island that fell several hours behind the status of the system.
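
Stated as arithmetic (hypothetical rates), the capacity criterion is a comparison of rates over the interval of interest, plus a specified contingency when the comparison fails:

    # Environment capacity check (hypothetical rates).
    input_arrival_rate = 12.0       # alarm-producing inputs per second over the worst assumed interval
    output_absorption_rate = 10.0   # lines per second the alarm printer can absorb

    if output_absorption_rate >= input_arrival_rate:
        print("capacity criterion satisfied")
    else:
        # A contingency must be specified, e.g., summarize or shed low-priority output
        # rather than silently falling behind as the Three Mile Island printer did.
        print("capacity exceeded: apply the specified contingency action")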

Data age criteria are important as well:

- All inputs used in specifying the output events must be properly limited in the time they can be used.
- Output commands that may not be able to be executed immediately must be limited in the time they are valid.
- Incomplete hazardous action sequences (transactions) should have a finite time specified after which the software should be required to cancel the sequence automatically and inform the operator.
- Revocation of partially completed transactions may require:
  1. Specification of multiple times and conditions under which varying automatic cancellation or postponement actions are taken without operator confirmation.
  2. Specification of operator warnings to be issued in case of such revocation.
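
A hedged sketch of the data age criteria (hypothetical limits): inputs expire after a validity window, output commands carry their own expiry, and an incomplete hazardous transaction is cancelled automatically after a finite time with a warning to the operator.

    # Data age sketch (hypothetical limits).
    import time

    INPUT_VALIDITY_S = 2.0        # inputs may only be used while fresh
    COMMAND_VALIDITY_S = 1.0      # commands that cannot execute immediately expire
    TRANSACTION_TIMEOUT_S = 30.0  # incomplete hazardous sequences are cancelled after this

    def input_usable(input_timestamp):
        return time.monotonic() - input_timestamp <= INPUT_VALIDITY_S

    def command_still_valid(issued_at):
        return time.monotonic() - issued_at <= COMMAND_VALIDITY_S

    def check_transaction(started_at, completed):
        if not completed and time.monotonic() - started_at > TRANSACTION_TIMEOUT_S:
            print("operator warning: hazardous sequence cancelled automatically")
            return "cancelled"
        return "done" if completed else "in_progress"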

Latency is the time interval during which receipt of new information cannot change an output even though it arrives prior to the output. This interval cannot be completely eliminated, but it can be influenced by hardware and software design, such as choosing an interrupt-driven architecture versus polling. The acceptable length of latency is determined by the process being controlled. Subtle problems can occur when the latency of data related to human-computer interaction isn't considered.

Feedback criteria are also important. Basic feedback loops, as defined by the process control function, must be included in the requirements along with appropriate checks to detect internal or external failures or errors. For example:

- There should be some input that the software can use to detect the effect of each output on the process.
- Every output for which a detectable feedback input is expected must have associated with it:
  1. A requirement to handle the normal response.
  2. Requirements to handle a response that is missing, too late, too early, or has an unexpected value.
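
A minimal sketch of the second criterion (hypothetical timing window and values): the feedback for each output command is classified so that the normal case and every abnormal case have a defined handler.

    # Feedback classification sketch (hypothetical window and values).
    def classify_feedback(cmd_time, feedback_time, feedback_value,
                          min_delay=0.1, max_delay=2.0, expected="valve_open"):
        """Classify the feedback received for one output command."""
        if feedback_time is None:
            return "missing"                 # a requirement must cover no response at all
        delay = feedback_time - cmd_time
        if delay < min_delay:
            return "too_early"
        if delay > max_delay:
            return "too_late"
        if feedback_value != expected:
            return "unexpected_value"
        return "normal"

    print(classify_feedback(0.0, 0.5, "valve_open"))   # normal
    print(classify_feedback(0.0, None, None))          # missing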

Paths between states are uniquely defined by the sequence of trigger events along the path. Transitions between modes are especially hazardous and susceptible to incomplete specification. There are several criteria related to the paths possible through the state space. Not all of the criteria are listed here; the major categories are given below, with some examples under several of them.

- Reachability
  - Required states must be reachable from the initial state.
  - Hazardous states must not be reachable.
  - Complete reachability analysis is often impractical, but it may be possible to reduce the search space by focusing on a few properties or using a backward search.
  - Sometimes what is not practical in the general case is practical in specific cases. Thus, while forward search for reachability may lead to state explosion in the general case, some kinds of forward reachability search prove practical.
- Recurrent Behavior
  - Most process control software is cyclic, although it may have some non-cyclic states (mode change, shutdown).
  - Required sequences of events must be specified and limited by transitions in a cycle.
  - An inhibiting state is a state from which an output cannot be generated. There should be no states that inhibit later required outputs.
- Reversibility
- Preemption
- Path Criteria

The path criteria use the notions of soft and hard failure modes:

Soft failure mode: the loss of ability to receive input X could inhibit the production of output Y.

Hard failure mode: the loss of ability to receive input X will inhibit the production of output Y.

- Soft and hard failure modes should be eliminated for all hazard-reducing outputs.
- Hazard-increasing outputs should have both soft and hard failure modes.
- Multiple paths should be provided for state changes that maintain safety.
- Multiple inputs or triggers should be required for paths from safe to hazardous states.
- Constraint Analysis

Requirements Analyses

In addition to completeness criteria checking, SpecTRM-RL facilitates the use of a number of other human and automated analysis techniques.

State machine hazard analysis starts from a hazardous configuration in the model. This configuration violates a safety constraint. The analysis traces backward until enough information is available to eliminate the hazard from the design.
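
As a small sketch of the idea (hypothetical transition relation), a backward reachability search starts at the hazardous configuration and collects every state from which it can be reached, telling the analyst which transitions or guards must be constrained or eliminated.

    # Backward reachability from a hazardous state (hypothetical transition relation).
    TRANSITIONS = {
        ("standby", "arm_cmd"):    "armed",
        ("armed",   "fire_cmd"):   "firing",
        ("armed",   "disarm_cmd"): "standby",
        ("firing",  "done"):       "standby",
    }
    HAZARD = "firing"   # suppose reaching "firing" unintentionally is the hazard of interest

    def backward_reachable(hazard):
        """Return every state from which the hazardous state can be reached."""
        reached = {hazard}
        changed = True
        while changed:
            changed = False
            for (src, _event), dst in TRANSITIONS.items():
                if dst in reached and src not in reached:
                    reached.add(src)
                    changed = True
        return reached - {hazard}

    print(backward_reachable(HAZARD))   # {'armed', 'standby'}: both can lead to the hazard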

Software deviation analysis is a new type of software requirements analysis. It's a forward robustness analysis: how will the software operate in an imperfect environment? Software deviation analysis (SDA) determines whether a hazardous software behavior can result from a class of input deviations, such as a measured aircraft velocity being too low (measured or assumed velocity is less than actual). SDA is based on qualitative mathematics. It partitions infinite domains (such as integers) into a small set of intervals. The intervals are used to simplify analysis compared to iterations over the entire state space.

The software deviation analysis procedure is completely automated. The analyst provides: input deviations to check, a SpecTRM-RL specification, and a list of safety-critical outputs. The output produced by the algorithm is a set of scenarios. Each scenario has a set of deviations in software input paths plus paths through the specification sufficient to lead to a deviation in a safety-critical output. The SDA procedure can optionally add further deviations as it executes that would, together with the original deviation, lead to unsafe output. This allows for analysis of multiple, independent deviations or failures.
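
A hedged sketch of the qualitative-interval idea (hypothetical thresholds, not the SDA algorithm itself): an infinite input domain is partitioned into a few intervals, a deviation ("measured below actual") is applied at the interval level, and the deviated value is propagated through the output trigger to see whether a safety-critical output can deviate.

    # Qualitative-interval sketch of the idea behind software deviation analysis
    # (hypothetical thresholds). Airspeed is partitioned into intervals instead of
    # enumerating every possible value.
    def qualitative(airspeed_kt):
        if airspeed_kt < 120.0:
            return "low"
        if airspeed_kt <= 250.0:
            return "nominal"
        return "high"

    def stall_warning(interval):
        # Safety-critical output trigger, defined on the qualitative intervals.
        return interval == "low"

    # Deviation: the measured airspeed is one interval *below* the actual airspeed.
    DEVIATE_LOW = {"high": "nominal", "nominal": "low", "low": "low"}

    for actual in ["low", "nominal", "high"]:
        measured = DEVIATE_LOW[actual]
        if stall_warning(measured) != stall_warning(actual):
            print(f"scenario: actual={actual}, measured={measured} -> output deviates")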

Human Error Analysis examines the general requirements and design criteria as well as the hazard analysis for specific hazards. Human error analyses include mode confusion analysis. The idea is to look at the interaction between human controllers and the computer.

Executable specifications can also serve as prototypes. They are easily changed, and at the end they are a specification that can be used in constructing the software. They can be reused, for example across product families, and because the specification is formal it can be analyzed. These executable specifications can even be used in hardware-in-the-loop or operator-in-the-loop simulations.
