The Difference Between Software Safety and Hardware Safety
Software and digital automation introduce a new factor into
engineering complex systems, one that requires changes to existing
system engineering techniques and places new requirements on
accident models.
The uniqueness and power of the digital computer over other
machines stem from the fact that, for the first time, we have a
general-purpose machine.

We no longer need to build a mechanical or analog autopilot
from scratch, for example, but simply to write down the "design"
of an autopilot in the form of instructions or steps to
accomplish the desired goals. These steps are then loaded into
the computer, which, while executing the instructions, in effect
becomes the special-purpose machine (the autopilot). If
changes are needed, the instructions can be changed instead of
building a different physical machine from scratch. Software in
essence is the design of a machine abstracted from its physical
realization.
Machines that previously were physically impossible or
impractical to build become feasible, and the design of a
machine can be changed quickly without going through an entire
retooling and manufacturing process. In essence, the
manufacturing phase is eliminated from the lifecycle of these
machines: The physical parts of the machine can be reused,
leaving only the design and verification phases. The design
phase also has changed: Emphasis is placed only on the steps to
be achieved without having to worry about how those steps will
be realized physically.
These advantages of using computers (along with others
specific to particular applications, such as reduced size and
weight for airborne or spacecraft systems) have led to an
explosive increase in their use, including their introduction
into potentially dangerous systems. There are, however, some
potential disadvantages of using computers, and their use
introduces important changes into the traditional engineering
process. These changes are leading to new types of accidents as
well as to difficulties in investigating and preventing them.
With computers, the detailed design of a machine is usually
created by someone who is not an expert in that machine. The
autopilot expert, for example, decides how the autopilot should
work and then provides that information to a software engineer,
who is an expert in software design but not in autopilots. It is
the software engineer who then creates the detailed design of the
autopilot. This extra communication step between the system
engineer and the software developer is the source of the most
serious problems with software today.
It should not be surprising then that most errors found in
operational software can be traced to requirements flaws,
particularly incompleteness. Completeness is a quality often
associated with requirements but rarely defined. The most
appropriate definition in the context of this paper has been
proposed by Jaffe: Software requirements specifications are
complete if they are sufficient to distinguish the desired
behavior of the software from that of any other undesired
program that might be designed.
In addition, nearly all the serious accidents in which
software has been involved in the past 20 years can be traced to
requirements flaws, not coding errors. The software may reflect
incomplete or wrong assumptions about the operation of the
system components being controlled by the software or about the
operation of the computer itself. The problems may also stem
from unhandled controlled-system states and environmental
conditions. Thus simply trying to get the software "correct" in
terms of accurately implementing the requirements will not make
it safer in most cases. Basically, the problems stem from the
software doing what the software engineer thought it should do
when that is not what the original design engineer wanted.
Integrated product teams and other project management schemes are
being used to help with this communication, but the problem has
not been solved.
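To make the nature of such a requirements gap concrete, consider a
small sketch in Python. The level sensor, inlet valve, and
threshold below are invented for illustration rather than drawn
from any real system; the point is Jaffe's criterion at work:
because the stated requirement says nothing about a missing sensor
reading, it cannot distinguish between two programs that behave
very differently in exactly that unhandled state.

    # Hypothetical example: two controllers that both satisfy an incomplete
    # requirement ("close the inlet valve when the measured level exceeds HIGH"),
    # yet diverge in a state the requirement never mentions: a missing reading.

    HIGH = 90.0  # illustrative threshold, not from any real specification


    def controller_a(level_reading):
        """Closes the valve only on a valid high reading; otherwise leaves it open."""
        if level_reading is not None and level_reading > HIGH:
            return "CLOSE_INLET"
        return "KEEP_OPEN"


    def controller_b(level_reading):
        """Closes the valve whenever it cannot confirm that the level is safe."""
        if level_reading is None or level_reading > HIGH:
            return "CLOSE_INLET"
        return "KEEP_OPEN"


    # The two programs agree wherever the requirement speaks and disagree
    # exactly where it is silent -- the specification is incomplete in
    # Jaffe's sense.
    print(controller_a(None), controller_b(None))  # KEEP_OPEN CLOSE_INLET

Both controllers implement the requirement as written; whether
either is acceptable depends on system-level knowledge that may
never have reached the software engineer.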
Even if the requirements problem were solved, there are other
important problems that the use of software introduces into the
system safety equation. First, the "failure modes" of software
differ from those of physical devices. Software is simply the design of
the machine abstracted from its physical realization. Can
software "fail"? What does it mean for a design to fail?
Obviously, if the term failure has any meaning whatsoever in
this context, it has to be different from that implied by the
failure of a physical device. Most software-related accidents
stem from the operation of the software, not from its lack
of operation, and usually that operation is exactly what the
software engineers intended. Thus event models as well as system
design and analysis methods that focus on classic types of
failure events will not apply to software. Confusion about this
point is reflected in the many fault trees containing boxes that
say "Software fails."
The second problem can be called the curse of flexibility.
The computer is so powerful and so useful because it has
eliminated many of the physical constraints of previous
machines. This is both its blessing and its curse: We no longer
have to worry about the physical realization of our designs, but
we also no longer have physical laws that limit the complexity
of our designs. Physical constraints enforce discipline on the
design, construction, and modification of our design artifacts.
Physical constraints also control the complexity of what we
build. With software, the limits of what is possible to
accomplish differ from the limits of what can be
accomplished successfully and safely -- the
limiting factors change from the structural integrity and
physical constraints of our materials to limits on our
intellectual capabilities. It is possible and even quite easy to
build software that we cannot understand in terms of being able
to determine how it will behave under all conditions. We can
construct software (and often do) that goes beyond human
intellectual limits. The result has been an increase in system
accidents stemming from intellectual unmanageability related to
interactively complex and tightly coupled designs that allow
potentially unsafe interactions to go undetected during
development.
One possible solution is to stretch our intellectual limits
by using mathematical modeling and analysis. Engineers make
extensive use of models to understand and predict the behavior
of physical devices. Although computer scientists realized that
software could be treated as a mathematical object over 30 years
ago, mathematical methods (called formal methods in
computer science) have not been widely used on software in
industry, although there have been some successes in using them
on computer hardware. There are several reasons for this lack of
use. The biggest problem is simply the complexity of such
models. Software has a very large number of states (a model we
created of the function computed by TCAS II, a collision
avoidance system required on most commercial aircraft in the
U.S., has upwards of 10^40 states). Sheer numbers
would not be a problem if the states exhibited adequate
regularity to allow reduction in the complexity based on
grouping and equivalence classes. Unfortunately, application
software does not exhibit the same type of regularity found in
digital hardware.
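A rough arithmetic sketch, which models neither TCAS II nor any
other real system, shows why sheer numbers defeat brute-force
analysis: a design described by n independent two-valued state
variables has 2^n states, and roughly 133 such variables already
put the count above 10^40.

    # Back-of-the-envelope illustration of state explosion; the variable counts
    # are arbitrary and stand in for no particular system.
    for n in (20, 64, 133):
        print(f"{n:3d} two-valued state variables -> about {2**n:.2e} states")

Analysis becomes tractable only if those states collapse into a
manageable number of equivalence classes, and, as noted above,
application software rarely offers that kind of regularity.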
Physical systems, of course, also have a large number of
states (in fact, often infinite) but physical continuity allows
the use of continuous math where one equation can describe an
infinite number of states. Software lacks that physical
continuity, and discrete math must be substituted in the place
of continuous math. Formal logic is commonly used to describe
the required characteristics of the software behavior. However,
specifications written in formal logic may be the same size as the
code or even larger, and they can be at least as difficult and
error-prone to construct and to understand as the software
itself. That doesn't mean they cannot be useful, but
they are not going to be a panacea for the problems. In
addition, the enormous number of states in most software means
that only a very small percentage can be tested and the lack of
continuity does not allow for sampling and interpolation of
behavior between the sampled states. Testing, therefore, is not
the solution to the problem either.
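To illustrate what a formal behavioral specification and its
exhaustive check involve, the following sketch states a safety
property as a simple predicate and verifies it over the reachable
states of a deliberately tiny, invented two-valve model. Nothing
here comes from a real design, and the exercise works only because
the model has four states, which is precisely the luxury that real
software denies us.

    # Illustrative only: a toy two-valve model with a safety property checked
    # by brute-force reachability. The model, its transitions, and the property
    # are invented for this sketch.

    def transitions(state):
        """Hypothetical controller moves: toggle one valve at a time."""
        a, b = state
        yield ("closed" if a == "open" else "open", b)
        yield (a, "closed" if b == "open" else "open")


    def safe(state):
        """Safety constraint as a predicate: the two valves are never both open."""
        return not (state[0] == "open" and state[1] == "open")


    # Enumerate every state reachable from the initial configuration and check
    # the predicate in each one -- feasible only because the state space is tiny.
    reachable, frontier = set(), [("closed", "closed")]
    while frontier:
        s = frontier.pop()
        if s in reachable:
            continue
        reachable.add(s)
        frontier.extend(transitions(s))

    print(all(safe(s) for s in reachable))  # False: toggling freely can open both

Even in this toy, the property can be judged only against an
explicit model of all reachable behavior; scaling that idea to
industrial software is where the size, readability, and coverage
problems described above take over.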
One limitation in the current formal models and tools is that
most require a level of expertise in discrete math that is
commonly attained only by those with advanced degrees in
computer science or applied math. More serious is the problem
that the models do not match the way engineers usually think
about their designs and therefore are difficult to review and
validate. The basic problems will not be solved by providing
tools that computer scientists can use to evaluate the required
behavior of the special-purpose machine being designed, but by
providing tools that are usable by the experts in that machine's
design. At the same time, software engineers do need models and
tools to analyze the structural design of the software itself.
Thus, for each group, different tools and models are required.
One additional complication is simply the number of emergent
properties exhibited by software. As in most complex designs,
errors are more likely to be found in the interactions among the
software components than in the individual components themselves.
Emergent properties complicate the creation of
effective analysis methods.
The situation is not hopeless, however. We can use
software safely in our engineered systems. But to accomplish
this goal, engineers will have to enforce the same discipline on
the software parts of the system design that nature imposes on
the physical parts. In essence, system engineers must identify
the constraints necessary to ensure safe system behavior and
effectively communicate them to the software engineers who, in
turn, must enforce these behavioral constraints in their
software.
Safety, like any quality, must be built into the system
design. Software represents or is the system design.
The most effective way to ensure that a system will operate
safely is to build safety in from the start, which means that
system operation must not lead to a violation of the constraints
on safe operation. System accidents result from interactions
among components that lead to a violation of these constraints
-- in other words, from a lack of appropriate enforcement of
constraints on the interactions. Because software often acts as
a controller in complex systems, it embodies or enforces the
constraints by controlling the components and their
interactions. Software, then, can contribute to an accident by
not enforcing the appropriate constraints on behavior or by
commanding behavior that violates the constraints.
This accident model (a lack of constraint enforcement)
provides a much better description of the role software plays in
accidents than a "failure" model. That is, the requirement
for software to be safe is not that it never "fails" but that it
does not cause or contribute to a violation of any of the system
constraints on safe behavior. This observation leads to the
suggested approach to handling software in safety-critical
systems, i.e., first identify the constraints on safe system
behavior and then design the software to enforce those
constraints. In the case of a batch chemical reactor accident
that occurred in Great Britain, the safety constraint violated
involved thermal properties of the reactor, which implied a
further constraint on the relative positions of the valves that
the computer was required to open and close. This constraint could be
enforced either in the hardware (perhaps through a physical
interlock) or in the software or both. Software engineers have
many techniques to build software that enforces these
constraints with very high assurance -- a task that is
enormously easier than attempting to build perfect software
(that never "fails"). A forthcoming book by Nancy Leveson will
detail this process.
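To illustrate the suggested approach, the sketch below shows a
software interlock of the kind such a constraint might require.
The valve names and the specific rule, that catalyst may flow only
while cooling water flows, are placeholders invented for the
example rather than a description of the actual reactor or of
Leveson's treatment; the point is only that the controller refuses
any command whose resulting state would violate the stated
constraint.

    # Illustrative sketch, not the actual reactor logic: the valve names and
    # the constraint (catalyst valve may be open only while the cooling-water
    # valve is open) are placeholders standing in for the real thermal constraint.

    class InterlockViolation(Exception):
        """Raised when a commanded action would violate a safety constraint."""


    class ValveController:
        def __init__(self):
            self.cooling_water_open = False
            self.catalyst_open = False

        def _check_constraint(self, cooling_water_open, catalyst_open):
            # Safety constraint enforced in software: never feed catalyst
            # without cooling water flowing.
            if catalyst_open and not cooling_water_open:
                raise InterlockViolation("catalyst valve open without cooling water")

        def command(self, *, cooling_water_open=None, catalyst_open=None):
            """Apply a command only if the resulting state satisfies the constraint."""
            new_cooling = self.cooling_water_open if cooling_water_open is None else cooling_water_open
            new_catalyst = self.catalyst_open if catalyst_open is None else catalyst_open
            self._check_constraint(new_cooling, new_catalyst)
            self.cooling_water_open, self.catalyst_open = new_cooling, new_catalyst


    controller = ValveController()
    controller.command(cooling_water_open=True)
    controller.command(catalyst_open=True)            # allowed: cooling water is on
    try:
        controller.command(cooling_water_open=False)  # rejected: would violate constraint
    except InterlockViolation as exc:
        print("blocked:", exc)

Placing the check on every command, rather than scattering it
through the control logic, is one way to provide high assurance
that the constraint is enforced, a far more modest goal than
building software that never "fails".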