Safety is not Reliability

Accidents in high-tech systems are changing in nature, and we must change our approaches to safety accordingly. Safety and reliability are often treated as equivalent, but they are different properties.

From an FAA report on ATC software architectures:

"The FAA's en route automation meets the criteria for consideration as a safety-critical system. Therefore, en route automation systems must possess ultra-high reliability."

From a blue ribbon panel report on the V-22 Osprey problems:

"Safety [software]: ...
Recommendation: Improve reliability, then verify by extensive test/fix/test in challenging environments."

Consider the following definitions.

Failure:

Nonperformance or inability of system or component to perform its intended function for a specified time under specified environmental conditions.

Failure is a basic abnormal occurrence, e.g.:

- a burned-out bearing in a pump
- a relay not closing properly when voltage is applied

Fault:

High-order events, e.g.:

- a relay closes at the wrong time due to improper functioning of an upstream component

All failures are faults, but not all faults are failures. Under these definitions, what does it mean for software to fail? Software does not wear out, and it does not suffer from random, statistically describable manufacturing faults. A digital copy made from the original software is exactly the same as the original. Software is a pure artifact of design. Assuming the software is deterministic, it will always perform the same way in the same circumstances. So is it even appropriate to say that software has failed? The errors in software are errors in design, not wear-out or manufacturing defects.

 

The Reliability Approach to Safety

Reliability:
The probability an item will perform its required function in the specified manner over a given time period and under specified or assumed conditions.

Note that most accidents result from errors in specified requirements or function and deviations from assumed conditions. That is, most accidents are not the result of unreliability.

Reliability is concerned primarily with failures and failure rate reduction. Reliability uses methods such as:

- Parallel redundancy
- Standby sparing
- Safety factors and margins
- Derating
- Screening
- Timed replacements

The application of these techniques to safety assumes that accidents are the result of component failure. On the positive side, techniques exist to increase component reliability, and failure rates in hardware are quantifiable. Unfortunately, this approach omits important factors in accidents, and may even decrease safety. Many accidents occur without any component failure. For example, accidents may be caused by equipment operating outside the parameters and time limits upon which reliability analyses are based. Accidents may also be caused by interactions among components that are all operating according to specification.
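The appeal of these techniques for hardware can be seen in a small calculation. The following is a minimal sketch, assuming the textbook case in which component failures are statistically independent; the function name and the numbers are illustrative assumptions, not taken from any particular analysis.

```python
def parallel_reliability(component_reliability: float, copies: int) -> float:
    """Probability that at least one of `copies` independent components works."""
    if not 0.0 <= component_reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one component")
    # The parallel arrangement fails only if every copy fails,
    # and independence lets the failure probabilities be multiplied.
    return 1.0 - (1.0 - component_reliability) ** copies


if __name__ == "__main__":
    # Hypothetical component reliability of 0.9 over the mission time.
    for n in (1, 2, 3):
        print(n, parallel_reliability(0.9, n))
```

The number improves with every added copy, but it only describes the chance that a component fails. It says nothing about accidents in which no component fails at all.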

In summary, highly reliable components are not necessarily safe. A somewhat whimsical example expresses the idea quite simply. The most reliable weapon system ever built might be the most unsafe. A highly reliable weapon will fire whenever the trigger is pulled, but it may well accomplish this goal by firing on many other occasions as well. A very safe weapon would not fire at all, which makes it a very unreliable system.

Standard engineering techniques of preventing failures through redundancy, increasing component reliability, and reuse of designs won't work for software and system accidents. Redundancy simply makes complexity worse; any solution that involves adding complexity will not solve problems that stem from intellectual unmanageability and interactive complexity. The majority of software-related accidents are caused by requirements errors. Even if an accident is caused by a software implementation error, it could not be prevented by redundancy. Software errors are not caused by random wear-out failures, so different copies of the same software will have the same defects.
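The last point can be shown with a small sketch. Everything here is hypothetical: the routine, its deliberate design error, and the voting scheme are invented for illustration and do not describe any real system.

```python
from collections import Counter


def braking_distance_m(speed_m_s: float) -> float:
    """Hypothetical routine with a design error: it should use speed squared."""
    deceleration_m_s2 = 5.0
    return speed_m_s / (2.0 * deceleration_m_s2)  # error: should be speed_m_s ** 2


def majority_vote(outputs):
    """Return the most common output among the redundant channels."""
    value, _count = Counter(outputs).most_common(1)[0]
    return value


def redundant_result(speed_m_s: float, channels: int = 3) -> float:
    # Every channel runs an identical copy of the software,
    # so every channel is wrong in exactly the same way.
    outputs = [braking_distance_m(speed_m_s) for _ in range(channels)]
    return majority_vote(outputs)


if __name__ == "__main__":
    # The physically correct value for 20 m/s would be 20**2 / (2 * 5) = 40 m.
    print(redundant_result(20.0))
```

The voter sees three identical answers and passes the wrong one through with full agreement. Voting can mask an independent random failure in one channel, but not an error that is present in every copy.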

Requirements to increase software reliability or integrity are appearing in many new international standards for software safety. Sometimes software is given reliability numbers (such as 10^-9), particularly when software is a component in a quantitative fault tree analysis. No good justification for this reliability number has been put forth. It is doubtful that software reliability can even be measured. What would it mean? Every copy of the software is identical, so it can't refer to the probability of any kind of wear-out in the software.

Safety involves more than just getting software "correct". For example, consider an altitude switch that reads from three altimeters. When the measured altitude passes below a threshold, some action is taken. Is safety increased by requiring that the action happen if any of the three altimeters reports below threshold, or only if all three report below threshold? The software is "correct" if it meets the requirements, but which way of writing the requirements is safer? The answer is an emergent property: it depends on the system as a whole, which provides the context for making the decision.
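The two candidate requirements can be written down side by side. This is a minimal sketch; the function names, units, and readings are assumptions made for illustration.

```python
from typing import Sequence


def below_threshold_any(altitudes_ft: Sequence[float], threshold_ft: float) -> bool:
    """Requirement A: act if at least one altimeter reports below threshold."""
    return any(a < threshold_ft for a in altitudes_ft)


def below_threshold_all(altitudes_ft: Sequence[float], threshold_ft: float) -> bool:
    """Requirement B: act only if every altimeter reports below threshold."""
    return all(a < threshold_ft for a in altitudes_ft)


if __name__ == "__main__":
    readings_ft = [1950.0, 2100.0, 2080.0]  # hypothetical: one altimeter disagrees
    threshold_ft = 2000.0
    print(below_threshold_any(readings_ft, threshold_ft))
    print(below_threshold_all(readings_ft, threshold_ft))
```

Each function is correct with respect to its own requirement, yet they disagree on the same readings. If acting spuriously is the hazard, requirement B is the safer choice; if failing to act is the hazard, requirement A is. Nothing in the code can settle that question.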

Software component reuse is one of the most common factors in software-related accidents. Software contains assumptions about its environment. Accidents occur when these assumptions are incorrect. The Therac-25, Ariane 5, and U.K. ATC software are all examples of reused software components that caused accidents. In each case, the new operating environment violated some assumption of the software.
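A hidden environmental assumption can be as small as a numeric range. The sketch below is hypothetical and is not a reconstruction of any of the accidents named above; it only shows how a conversion that was safe for the values seen in one system can silently produce nonsense when reused where the values are larger.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unchecked(value: float) -> int:
    """Mimic a 16-bit signed conversion that silently wraps when out of range."""
    raw = int(value) & 0xFFFF
    return raw - 0x10000 if raw >= 0x8000 else raw


def to_int16_checked(value: float) -> int:
    """Same conversion, but the range assumption is stated and enforced."""
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{value} violates the assumed 16-bit range")
    return int(value)


if __name__ == "__main__":
    print(to_int16_unchecked(1200.0))    # within the original operating range
    print(to_int16_unchecked(40000.0))   # new environment: value wraps silently
    try:
        to_int16_checked(40000.0)
    except OverflowError as err:
        print("assumption violated:", err)
```

Making the assumption explicit does not by itself make the reused component safe, since the system still has to decide what to do when the check fails. It does at least turn an undocumented assumption into something a safety analysis can see.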

One implication of this is that commercial off-the-shelf (COTS) software actually makes safety analysis more difficult. COTS components must be treated every bit as carefully in the system safety effort as any other component. This is complicated because COTS products must be designed, for marketability, to be general enough to fit several systems. Furthermore, vendors may not wish to disclose the information necessary to perform system safety analyses.

 


Copyright 2003 Safeware Engineering Corporation. All rights reserved