
  Accident Causes

My company has had a safety program for 150 years. The program was instituted as a result of a French law requiring an explosives manufacturer to live on the premises with his family.

- Crawford Greenewalt (former president of DuPont)

Most accidents are not the result of unknown scientific principles, but rather of a failure to apply well-known standard engineering practices.

 - Trevor Kletz

Causality

Accident causes are often oversimplified:

The vessel Baltic Star, registered in Panama, ran aground at full speed on the shore of an island in Stockholm waters on account of thick fog. One of the boilers had broken down, the steering system reacted only slowly, the compass was maladjusted, the captain had gone down into the ship to telephone, the lookout man on the prow took a coffee break, and the pilot had given an erroneous order in English to the sailor who was tending the rudder. The latter was hard of hearing and understood only Greek.
- Le Monde

Accidents are often followed by an investigation aimed at determining a single cause. Accidents are blamed on operator error, mechanical failure, or some other single factor. As the account above shows, there are often multiple contributing causes in an interrelated web. Even that more detailed explanation ignores larger economic and organizational factors. Why are schedules in the shipping industry tight enough that the ship was at full speed in heavy fog? Why were the maladjusted compass and broken boiler not fixed? These and other questions may uncover further contributory causes.

The causes that one really wants to uncover are the root causes. These are the factors that, if changed, could prevent many other incidents and accidents from occurring. One common root cause is a flaw in the safety culture of the organization. Safety culture is the general attitude and approach to safety reflected by those who participate in an industry or organization, including management, workers, and government regulators.

Safety Culture

Safety cultures are vulnerable to overconfidence and complacency. Safety is a difficult property to measure because success shows up only as an absence of accidents and incidents. The longer a successful safety program has been in effect, the less important or relevant it seems, precisely because of its own past success. Once overconfidence and complacency set in, risks are discounted as being less likely than they are. There is an overreliance on redundancy, and unrealistic risk assessments are performed. Events with a low probability of occurrence but high consequences tend to be dismissed as things that could not happen at all. Similarly, complacent safety efforts act as though risk somehow decreases over time: if a system has worked ten times, it seems somehow less likely to have an accident on the eleventh use. Software-related risks are also underestimated. Perhaps the worst consequence is that warning signs are ignored. Incidents are dismissed in the belief that everything is under control and an accident could not happen. This complacency and overconfidence creates an environment conducive to an accident.

The following example demonstrates the consequences of an unrealistic risk assessment.

Design:
The system design included a relief valve opened by an operator to protect against overpressurization. A secondary valve was installed as a backup in case the primary valve failed. The operator needed to know whether the primary valve had opened so that the secondary valve could be activated if it had not.
Events:
The open position indicator light and open indicator light both illuminated. However, the primary valve was not open, and the system exploded.
Causal Factors:
Post-accident examination discovered the indicator light circuit was wired to indicate presence of power at the valve, but it did not indicate valve position. Thus, the indicator showed only that the activation button had been pushed, not that the valve had opened. An extensive quantitative safety analysis of this design had assumed a low probability of simultaneous failure for the two relief valves, but ignored the possibility of design error in the electrical wiring; the probability of design error was not quantifiable. No safety evaluation of the electrical wiring was made; instead confidence was established on the basis of the low probability of coincident failure of the two relief valves.
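A rough numeric sketch makes the gap in this analysis concrete. The failure probabilities below are assumed purely for illustration; they are not taken from the actual analysis described above.

# Illustrative sketch only: the probabilities below are assumed for this example,
# not taken from the actual safety analysis described above.

p_primary_fails = 1e-3    # assumed per-demand failure probability of the primary relief valve
p_secondary_fails = 1e-3  # assumed per-demand failure probability of the secondary relief valve

# What the quantitative analysis modeled: both valves failing independently on the same demand.
p_modeled = p_primary_fails * p_secondary_fails        # 1e-6 -- reassuringly small

# What the analysis left out: a single wiring design error defeats the backup,
# because the operator never learns that the primary valve is still closed.
p_design_error = 1e-2     # assumed; design-error probabilities are hard to quantify at all

print(f"modeled coincident valve failure: {p_modeled:.0e}")
print(f"assumed wiring design error:      {p_design_error:.0e}")
# The unmodeled common-cause factor dominates the modeled one by four orders of magnitude.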

Risk is a function of the likelihood of an event occurring and the severity of the consequences. It is impossible to measure risk accurately. Instead, risk assessment techniques are used. The accuracy of such assessments is controversial.
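As a minimal illustration of treating risk as a combination of likelihood and severity, the sketch below uses assumed numbers and units; real assessments define both scales in domain-specific terms.

# Minimal sketch: risk as likelihood combined with severity.
# The numbers and units are assumed for illustration only.

def risk(likelihood_per_year: float, severity: float) -> float:
    """Expected loss per year, with severity in arbitrary cost units."""
    return likelihood_per_year * severity

# A frequent minor hazard and a rare catastrophic one can score the same,
# which is one reason low-probability, high-consequence events get neglected.
frequent_minor = risk(likelihood_per_year=10.0, severity=1_000)
rare_catastrophic = risk(likelihood_per_year=0.001, severity=10_000_000)
print(frequent_minor, rare_catastrophic)   # both 10000.0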

To avoid the paralysis resulting from waiting for definitive data, we assume we have greater knowledge than scientists actually possess and make decisions based on those assumptions.
- William Ruckelshaus

It is not possible to measure the probability of very rare events directly. For example, to estimate the failure rate of nuclear power plants, one does not build a power plant, "run it for ten thousand years very quickly", and then tally up the resulting data. Instead, analysts use models of the interaction of events that can lead to an accident.
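A fault-tree-style calculation is one common form such a model takes. The sketch below is illustrative only, with assumed basic-event probabilities and the usual (and often questionable) assumption that the events are independent.

# Illustrative fault-tree-style sketch with assumed basic-event probabilities.
# The independence assumption built into these gates is itself a modeling choice.

def and_gate(*probs: float) -> float:
    """Probability that all inputs fail (assuming independence)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def or_gate(*probs: float) -> float:
    """Probability that at least one input fails (assuming independence)."""
    none_fail = 1.0
    for p in probs:
        none_fail *= (1.0 - p)
    return 1.0 - none_fail

# Assumed basic events for a hypothetical overpressure accident:
p_overpressure_demand = 1e-2      # chance of an overpressure event per year
p_sensor_fails = 1e-3             # pressure sensor fails to detect it
p_operator_misses_alarm = 1e-2    # operator misses or misreads the alarm
p_relief_valve_fails = 1e-3       # relief valve fails to open on demand

p_detection_fails = or_gate(p_sensor_fails, p_operator_misses_alarm)
p_protection_fails = or_gate(p_detection_fails, p_relief_valve_fails)
p_accident = and_gate(p_overpressure_demand, p_protection_fails)

print(f"modeled accident probability per year: {p_accident:.1e}")
# Factors that cannot be assigned numbers -- design errors, maintenance practices,
# safety culture -- simply never appear in a model like this.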

Risk modeling has several limitations. In practice, the models can only include events that can be measured. Most causal factors involved in major accidents are not measurable. When focusing on risk models, immeasurable factors tend to be ignored, forgotten, or given risk numbers with no basis. For software components, risk may not even be measurable; how does one measure the quality of design?

Risk assessment data can be like the captured spy; if you torture it long enough, it will tell you anything you want to know.
- William Ruckelshaus, "Risk in a Free Society"

Another potential flaw in a safety culture is a low priority assigned to safety. If organizational support is not provided to the system safety effort, it cannot succeed. Safety may also be compromised by flawed resolution of conflicting goals.

Ineffective Organizational Structure

Even if the safety culture is inclined to support the safety engineering process, an ineffective organizational structure may hamper system safety efforts.

- Diffusion of responsibility and authority can leave the burden of ensuring safety on individuals who do not have the authority to carry out their responsibilities. This is often a problem if safety personnel are accorded a low-level status.
- Safety efforts are also hampered if safety personnel lack independence. When safety personnel report to the same authority that is responsible for budget and schedule considerations, there is some likelihood that schedule or budgetary pressures will override safety.
- Limited communication channels and poor information flow can prevent safety personnel from interacting with system designers to ensure system safety.

Ineffective Technical Activities

- Superficial safety efforts are not sufficient to ensure system safety.
- Ineffective risk control can result in unsafe systems. This includes:
  - Failing to eliminate basic design flaws
  - Basing safeguards on false assumptions
  - Allowing uncontrolled complexity in the system design
  - Using risk control devices to reduce safety margins. Often, new safety technologies are used as a reason to operate processes closer to their limits: reactions are run in larger batches, machinery is run faster, and so on. Reducing safety margins can lead to accidents.
- Failure to properly evaluate system changes often leads to accidents. Most systems are modified after their manufacture and installation. These modifications require a new safety analysis to ensure that the changes do not make the system unsafe. Failure to examine the safety implications of jury-rigging systems has caused a number of accidents.
- Information deficiencies can also lead to accidents.

Operators

Operator error is frequently cited as an accident cause. Often, it is cited as the sole accident cause. However, data about operator effects on accident rates may be biased and incomplete. Positive actions by operators are rarely recorded. For example, when a plane crashes, the pilot is often blamed for the accident. However, there are numerous instances of pilots averting potential accidents; these pilots are simply regarded as doing their jobs.

Blame may be based on the premise that operators can overcome every emergency, but this is a myth born of wishful thinking. During normal operation, process automation can control the system. The operators have to intervene at the limits of the system's operating ability. The assumptions used to build the system and predict its behavior may break down at these extremes.

Further, hindsight is always 20/20. It is easy to describe in detail what the operators should have done after an analysis of the accident reveals the system state at the time. Often, the operators of a system have limited access to the system state and are forced to draw reasonable, though potentially flawed, conclusions.

Separating operator error from design error may not be possible. The operator is forced to work with the system interface provided by the system designers. If the designers were not careful, the system's displays may not carry critical information, or the controls available to the operator may be insufficient to bring the system from an unsafe state back into a safe state. Distinguishing between an operator's failure to act appropriately and a designer's failure to provide the operator with the feedback and control needed to act appropriately is difficult, and perhaps impossible.


An A-320 accident while landing at Warsaw was blamed on the pilots for landing too fast. Was it that simple?

- The pilots were told to expect wind shear. In response, they landed faster than normal to gain extra stability and lift.
- The meteorological information was out of date; there was no wind shear by the time the pilots landed.
- The Polish government's meteorologist was supposedly in the bathroom at the time of the landing. (We have not confirmed this.)
- There was a thin film of water on the runway that had not been cleared.
- The wheels aquaplaned, skimming the surface without gaining enough rotary speed to tell the computer braking systems that the aircraft was landing.
- The computers refused to allow the pilots to use the aircraft's braking systems, so the plane did not brake until too late. (A simplified sketch of this interlock logic appears after the list.)
- The accident still would not have been catastrophic had there not been a high bank built at the end of the runway.
- The aircraft crashed into the bank and broke up.
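The braking refusal is easier to see with a simplified sketch of a landing interlock of the kind described above. The signal names and threshold here are assumed for illustration; this is not the actual A-320 logic.

# Simplified, illustrative landing interlock. Signal names and the spin-up
# threshold are assumed; this is not the actual A-320 logic.

WHEEL_SPINUP_THRESHOLD_KTS = 72.0   # assumed value

def braking_permitted(left_gear_compressed: bool,
                      right_gear_compressed: bool,
                      wheel_speed_kts: float) -> bool:
    """The automation concludes the aircraft has landed only if the landing gear
    is compressed or the wheels have spun up; aquaplaning defeats both cues."""
    weight_on_wheels = left_gear_compressed and right_gear_compressed
    wheels_spun_up = wheel_speed_kts >= WHEEL_SPINUP_THRESHOLD_KTS
    return weight_on_wheels or wheels_spun_up

# Aquaplaning on a wet runway: little strut compression, wheels barely turning,
# so the interlock withholds braking even though the pilots are commanding it.
print(braking_permitted(left_gear_compressed=False,
                        right_gear_compressed=True,
                        wheel_speed_kts=20.0))   # -> False

The point is not the particular thresholds but that the designers' assumptions about what "landed" means became a hard limit on the pilots' authority.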

Blaming the pilots turns attention away from several questions:

- Why were the pilots given out-of-date weather information?
- Why did the design of the computer-based braking system cause it to ignore the pilots' commands?
- Why were the pilots not able to apply the braking system manually, bypassing the computer? (Who should have final authority, the pilots or the automation?)
- Why was the plane permitted to land with water on the runway?
- Why was the decision made to build a bank at the end of the runway?

Automation does not eliminate human error, nor does it remove humans from systems. Automation simply moves humans to different functions. Humans take on the roles of design, programming, high-level supervision, high-level decision-making, and maintenance. Decision-making is more difficult at these higher and further removed levels because of system complexity and reliance on indirect information.

Automated systems on aircraft have eliminated some types of human error and created new ones. Although automation is often added to systems with the stated goal of reducing the required human skill level, the skill and knowledge required frequently go up. Recall that operators intervene at the limits of the system's operation. Adding automation may merely force the operator to understand not only the controlled process but also the automation controlling it. Finding the correct partnership and allocation of tasks between the human operator and the automation is difficult. Who should have the final authority?

Computers do not produce new sorts of errors. They merely provide new and easier opportunities for making the old errors.
- Trevor Kletz, "Wise After the Event"

Some have advocated simply removing humans from the loop altogether. If human operators are so often blamed for accidents, removing the human element should solve the problem. In many cases, the technology exists to replace the operator with yet more automation. However, automation simply moves the task of dealing with unusual circumstances back from the operator to the designer of the automation. Not all conditions, or the correct ways to deal with them, are foreseeable, and even those that can be predicted are programmed by error-prone human beings.

Many of the same limitations of human operators are characteristic of designers:

- Difficulty in assessing the probabilities of rare events.
- Bias against considering side effects.
- Tendency to overlook contingencies.
- Limited capacity to comprehend complex relationships.
- Propensity to control complexity by concentrating only on a few aspects of the system.

In reality, having a human operator provides a number of advantages. Human operators are adaptable and flexible. Operators adapt both their goals and the means to achieve them. They use problem solving and creativity to cope with unusual and unforeseen circumstances, and human operators can exercise judgment. Humans are unsurpassed at recognizing patterns, making associative leaps, and operating in ill-structured, ambiguous environments. Human error is the inevitable side effect of this flexibility and adaptability. It must be recognized that the very qualities that lead to operator error are those that make operators valuable.

Designers continue to design automation with the assumption that human operators will stay out of the way, not try to understand the system, and not try solutions outside the prescriptions of the training manual. Human operators are inquisitive. They form mental models of what a system is doing, work with the system to validate or invalidate those models, and revise them. In many cases, the operator's mental model may be closer to the actual functioning of the system than the designer's is. Designers deal in averages and ideals, and systems change during construction as well as design. After installation, through operation, maintenance, and evolution, the actual functioning of the system may diverge from the designer's model.


Copyright © 2003 - 2016 Safeware Engineering Corporation. All rights reserved