
The Difference Between Software Safety and Hardware Safety

Software and digital automation introduce a new factor into the engineering of complex systems, one that requires changes to existing system engineering techniques and places new requirements on accident models.

The uniqueness and power of the digital computer over other machines stems from the fact that, for the first time, we have a general-purpose machine.

We no longer need to build a mechanical or analog autopilot from scratch, for example, but simply to write down the "design" of an autopilot in the form of instructions or steps to accomplish the desired goals. These steps are then loaded into the computer, which, while executing the instructions, in effect becomes the special-purpose machine (the autopilot). If changes are needed, the instructions can be changed instead of building a different physical machine from scratch. Software in essence is the design of a machine abstracted from its physical realization.

Machines that previously were physically impossible or impractical to build become feasible, and the design of a machine can be changed quickly without going through an entire retooling and manufacturing process. In essence, the manufacturing phase is eliminated from the lifecycle of these machines: The physical parts of the machine can be reused, leaving only the design and verification phases. The design phase also has changed: Emphasis is placed only on the steps to be achieved without having to worry about how those steps will be realized physically.

These advantages of using computers (along with others specific to particular applications, such as reduced size and weight for airborne or spacecraft systems) have led to an explosive increase in their use, including their introduction into potentially dangerous systems. There are, however, potential disadvantages to using computers, and their use introduces important changes into the traditional engineering process -- changes that are leading to new types of accidents as well as to difficulties in investigating and preventing them.

With computers, the detailed design of a machine is usually created by someone who is not an expert in that machine. The autopilot expert, for example, decides how the autopilot should work, but then provides that information to a software engineer, who is an expert in software design but not in autopilots. It is the software engineer who then creates the detailed design of the autopilot.

The extra communication step between the engineer and the software developer is the source of the most serious problems with software today.

It should not be surprising then that most errors found in operational software can be traced to requirements flaws, particularly incompleteness. Completeness is a quality often associated with requirements but rarely defined. The most appropriate definition in the context of this paper has been proposed by Jaffe: Software requirements specifications are complete if they are sufficient to distinguish the desired behavior of the software from that of any other undesired program that might be designed.

In addition, nearly all the serious accidents in which software has been involved in the past 20 years can be traced to requirements flaws, not coding errors. The software may reflect incomplete or wrong assumptions about the operation of the system components being controlled by the software or about the operation of the computer itself. The problems may also stem from unhandled controlled-system states and environmental conditions. Thus simply trying to get the software "correct" in terms of accurately implementing the requirements will not make it safer in most cases. Basically the problems stem from the software doing what the software engineer thought it should do when that is not what the original design engineer wanted. Integrated product teams and other project management schemes to help with this communication are being used, but the problem has not been solved.
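As a concrete, purely hypothetical illustration of an unhandled controlled-system state, consider the sketch below. The relief valve, the pressure thresholds, and the failed-sensor value are all invented for illustration; the point is that the code faithfully implements the stated requirements, yet because the requirements say nothing about an out-of-range reading, the "correct" software quietly does nothing in a state the original design engineer may well have cared about.

```python
# Hypothetical illustration (not taken from the paper): a controller whose
# code faithfully implements an incomplete requirement. The requirement
# covers only "normal" and "high" tank pressure; it says nothing about a
# failed or out-of-range sensor reading, so the software silently does
# nothing in that case -- exactly as written, yet potentially unsafe.

def command_relief_valve(pressure_reading: float) -> str:
    """Return the valve command for one control cycle."""
    if 0.0 <= pressure_reading < 80.0:      # requirement R1: normal range
        return "CLOSE_RELIEF_VALVE"
    if 80.0 <= pressure_reading <= 120.0:   # requirement R2: high pressure
        return "OPEN_RELIEF_VALVE"
    # No requirement covers a negative or absurdly large reading
    # (e.g., a disconnected sensor returning -999). A "correct"
    # implementation of R1 and R2 leaves this state unhandled.
    return "NO_COMMAND"
```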

Even if the requirements problem were solved, the use of software introduces other important problems into the system safety equation. A second problem is that the "failure modes" of software differ from those of physical devices. Software is simply the design of the machine abstracted from its physical realization. Can software "fail"? What does it mean for a design to fail? Obviously, if the term failure has any meaning whatsoever in this context, it has to be different from that implied by the failure of a physical device. Most software-related accidents stem from the operation of the software, not from its lack of operation, and usually that operation is exactly what the software engineers intended. Thus event models, as well as system design and analysis methods that focus on classic types of failure events, will not apply to software. Confusion about this point is reflected in the many fault trees containing boxes that say "Software fails."

The third problem can be called the curse of flexibility. The computer is so powerful and so useful because it has eliminated many of the physical constraints of previous machines. This is both its blessing and its curse: We no longer have to worry about the physical realization of our designs, but we also no longer have physical laws that limit the complexity of those designs. Physical constraints enforce discipline on the design, construction, and modification of engineered artifacts, and they control the complexity of what we build. With software, the limits of what is possible to accomplish are different from the limits of what can be accomplished successfully and safely -- the limiting factors change from the structural integrity and physical constraints of our materials to limits on our intellectual capabilities. It is possible, and even quite easy, to build software that we cannot understand in terms of being able to determine how it will behave under all conditions. We can construct software (and often do) that goes beyond human intellectual limits. The result has been an increase in system accidents stemming from intellectual unmanageability: interactively complex and tightly coupled designs that allow potentially unsafe interactions to go undetected during development.

One possible solution is to stretch our intellectual limits by using mathematical modeling and analysis. Engineers make extensive use of models to understand and predict the behavior of physical devices. Although computer scientists realized that software could be treated as a mathematical object over 30 years ago, mathematical methods (called formal methods in computer science) have not been widely used on software in industry, although there have been some successes in using them on computer hardware. There are several reasons for this lack of use. The biggest problem is simply the complexity of such models. Software has a very large number of states (a model we created of the function computed by TCAS II, a collision avoidance system required on most commercial aircraft in the U.S., has upwards of 10^40 states). Sheer numbers would not be a problem if the states exhibited adequate regularity to allow reduction in the complexity based on grouping and equivalence classes. Unfortunately, application software does not exhibit the same type of regularity found in digital hardware.
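To give a sense of the scale involved, the short back-of-the-envelope calculation below uses invented numbers (not figures from the TCAS II model) to show how quickly a discrete state space outgrows any exhaustive analysis unless regularity allows states to be grouped into equivalence classes.

```python
# Illustrative arithmetic only: the number of state bits is an assumption,
# not a measurement of any real system.
state_bits = 133                 # assume ~133 boolean state variables
states = 2 ** state_bits
print(f"{states:.2e}")           # about 1.1e+40 -- comparable to 10^40

# Even examining a billion states per second would take vastly longer
# than the age of the universe.
seconds = states / 1e9
print(f"{seconds / 3.15e7:.2e} years")   # roughly 3e+23 years
```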

Physical systems, of course, also have a large number of states (in fact, often an infinite number), but physical continuity allows the use of continuous mathematics, where one equation can describe an infinite number of states. Software lacks that physical continuity, and discrete mathematics must be substituted for continuous mathematics. Formal logic is commonly used to describe the required characteristics of the software behavior. However, specifications written in formal logic may be the same size as or even larger than the code, more difficult to construct than the code, and harder to understand than the code -- they are at least as difficult and error-prone to construct as the software itself. That does not mean they cannot be useful, but they are not going to be a panacea. In addition, the enormous number of states in most software means that only a very small percentage can be tested, and the lack of continuity does not allow sampling and interpolation of behavior between the sampled states. Testing, therefore, is not the solution to the problem either.
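The contrived sketch below illustrates why the lack of continuity defeats testing by sampling. The routine and its inputs are hypothetical; the point is that a single discrete branch can make the behavior at one input entirely unrelated to the behavior at its neighbors, so no interpolation between tested points predicts it.

```python
# A minimal sketch of why sampling and interpolation fail for software.
# This hypothetical routine behaves smoothly at almost every input, but a
# single discrete branch changes the result completely at one exact value.

def scale_thrust(altitude_m: float) -> float:
    if altitude_m == 10_000.0:        # discrete special case buried in the logic
        return 0.0                    # behaves nothing like its neighbors
    return 0.8 * altitude_m / 20_000.0

# Tests at 9,999 m and 10,001 m both pass and appear to "bracket" the flaw;
# interpolating between them says nothing about the behavior at exactly 10,000 m.
assert abs(scale_thrust(9_999.0) - 0.39996) < 1e-4
assert abs(scale_thrust(10_001.0) - 0.40004) < 1e-4
assert scale_thrust(10_000.0) == 0.0
```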

One limitation of the current formal models and tools is that most require a level of expertise in discrete mathematics that is commonly attained only by those with advanced degrees in computer science or applied math. More serious is the problem that the models do not match the way engineers usually think about their designs and are therefore difficult to review and validate. The basic problems will not be solved by providing tools that computer scientists can use to evaluate the required behavior of the special-purpose machine being designed, but by providing tools that are usable by the expert in that machine's design. At the same time, software engineers do need models and tools to analyze the structural design of the software itself. Thus each group requires different tools and models.

One additional complication is simply the number of emergent properties exhibited by software. As in most complex designs, errors are more likely to be found in the interactions among the software components than in the design of the individual components. Emergent properties complicate the creation of effective analysis methods.
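The following hypothetical fragment shows what such an interaction-level error looks like: each component satisfies its own (invented) local specification, and the hazard appears only when the two are composed.

```python
# Hypothetical example of an emergent (interaction) error: each component
# below satisfies its own local specification, yet their combination can
# violate the system-level constraint "tank level must never exceed 100%".

def reported_level(actual_percent: float) -> int:
    """Sensor module spec: report the level in whole percent (rounded down)."""
    return int(actual_percent)

def inlet_valve_command(reported_percent: int) -> str:
    """Control module spec: keep filling while the reported level is below 100."""
    return "OPEN" if reported_percent < 100 else "CLOSE"

# Interaction: at an actual level of 99.9% the sensor correctly reports 99,
# the controller correctly commands OPEN, and the tank overfills -- an error
# visible only in the interaction, not in either component on its own.
print(inlet_valve_command(reported_level(99.9)))   # -> OPEN
```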

The situation is not hopeless, however. We can use software safely in our engineered systems. But to accomplish this goal, engineers will have to enforce the same discipline on the software parts of the system design that nature imposes on the physical parts. In essence, system engineers must identify the constraints necessary to ensure safe system behavior and effectively communicate them to the software engineers who, in turn, must enforce these behavioral constraints in their software.

Safety, like any quality, must be built into the system design. Software represents or is the system design. The most effective way to ensure that a system will operate safely is to build safety in from the start, which means that system operation must not lead to a violation of the constraints on safe operation. System accidents result from interactions among components that lead to a violation of these constraints -- in other words, from a lack of appropriate enforcement of constraints on the interactions. Because software often acts as a controller in complex systems, it embodies or enforces the constraints by controlling the components and their interactions. Software, then, can contribute to an accident by not enforcing the appropriate constraints on behavior or by commanding behavior that violates the constraints.

This accident model (a lack of constraint enforcement) provides a much better description of the role software plays in accidents than a "failure" model. That is, the requirement for software to be safe is not that it never "fails" but that it does not cause or contribute to a violation of any of the system constraints on safe behavior. This observation leads to the suggested approach to handling software in safety-critical systems, i.e., first identify the constraints on safe system behavior and then design the software to enforce those constraints. In the case of a batch chemical reactor accident that occurred in Great Britain, the safety constraint violated involved thermal properties of the reactor, which implied a further constraint on the relative positions of the valves the computer was required to open and close. This constraint could be enforced either in the hardware (perhaps through a physical interlock) or in the software or both. Software engineers have many techniques to build software that enforces these constraints with very high assurance -- a task that is enormously easier than attempting to build perfect software (that never "fails"). A forthcoming book by Nancy Leveson will detail this process.
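The sketch below shows one way such a constraint might be enforced in software. The valve names and the specific rule (cooling water must be flowing whenever the catalyst valve is open) are assumptions made for illustration, not details taken from the incident; the point is only that the software checks the constraint and refuses to issue commands that would violate it.

```python
# A minimal sketch of enforcing an assumed safety constraint in software:
# "the cooling-water valve must be open whenever the catalyst valve is open".
# Valve names and guard logic are illustrative, not details of the accident.

class ReactorValveController:
    def __init__(self) -> None:
        self.cooling_water_open = False
        self.catalyst_open = False

    def open_cooling_water_valve(self) -> None:
        self.cooling_water_open = True

    def open_catalyst_valve(self) -> None:
        # Check the constraint before commanding the plant: refuse any
        # command sequence that would violate it.
        if not self.cooling_water_open:
            raise RuntimeError(
                "Safety constraint violated: catalyst valve may not be "
                "opened unless the cooling-water valve is open")
        self.catalyst_open = True

    def close_cooling_water_valve(self) -> None:
        if self.catalyst_open:
            raise RuntimeError(
                "Safety constraint violated: cooling water may not be shut "
                "off while the catalyst valve is open")
        self.cooling_water_open = False
```

As the paper notes, the same constraint could also, or instead, be enforced by a physical interlock in the hardware.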
