Accident causes are often oversimplified:
The vessel Baltic Star, registered in Panama, ran aground at full speed on the shore of an island in Stockholm waters on account of thick fog. One of the boilers had broken down, the steering system reacted only slowly, the compass was maladjusted, the captain had gone down into the ship to telephone, the lookout man on the prow took a coffee break, and the pilot had given an erroneous order in English to the sailor who was tending the rudder. The latter was hard of hearing and understood only Greek. (Le Monde)
Often accidents are followed by an investigation to determine the single, sole cause of the accident. Accidents are blamed on operator error, mechanical failure, or some other single cause. As the example above shows, there are often multiple contributing causes in an interrelated web. Even in the more detailed explanation above, larger economic and organizational factors were ignored. Why are schedules in the shipping industry tight enough that the ship was at full speed in heavy fog? Why were the maladjusted compass and broken boiler not fixed? These and other questions may uncover further contributory causes.
The causes that one really wants to uncover are the root causes. These are the factors that, if changed, could prevent many other incidents and accidents from occurring. One common root cause is a flaw in the safety culture of the organization. Safety culture is the general attitude and approach to safety reflected by those who participate in an industry or organization, including management, workers, and government regulators.
Safety cultures are vulnerable to overconfidence and complacency. Safety is a difficult property to measure, because success shows up as a lack of accidents and incidents. The longer a safety program has been in effect, the less important or relevant it seems, precisely because of its own past success. Once overconfidence and complacency set in, risks are discounted as less likely than they are. Redundancy is over-relied upon, and unrealistic risk assessments are performed. Events with a low probability of occurrence but high consequences tend to be dismissed as things that could not happen at all. Similarly, complacent safety efforts act as though risk somehow decreases over time: if a system has worked ten times, it seems somehow less likely that the system will have an accident on the eleventh use. Software-related risks are also underestimated. Perhaps the worst consequence is that warning signs are ignored. Incidents are dismissed in the belief that everything is under control and an accident could not happen. This complacency and overconfidence creates an environment conducive to an accident.
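The "worked ten times" fallacy can be made concrete with a small calculation. This sketch assumes independent demands on the system and an invented per-use failure probability; under those assumptions, ten past successes do not lower the chance of failure on the next use:

```python
# Hypothetical numbers, assuming each use of the system is an
# independent trial with the same per-use failure probability.
p_fail = 0.01  # assumed probability of failure on any single use

# Probability of surviving ten uses in a row.
p_ten_ok = (1 - p_fail) ** 10

# Under independence, the chance of failure on the eleventh use is
# unchanged by the ten successes that preceded it.
p_eleventh = p_fail

print(f"P(10 successes)        = {p_ten_ok:.4f}")   # about 0.9044
print(f"P(failure on 11th use) = {p_eleventh:.4f}")  # still 0.0100
```

The feeling that risk shrinks with each success is the inverse of the gambler's fallacy; under the independence assumption, the per-use risk is constant.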
The following example demonstrates the consequences of an unrealistic risk assessment.
Risk is a function of both the likelihood of an event occurring and the severity of its consequences. Risk cannot be measured directly with any accuracy; instead, risk assessment techniques are used to estimate it, and the accuracy of such assessments is controversial.
To avoid the paralysis resulting from waiting for definitive data, we assume we have greater knowledge than scientists actually possess and make decisions based on those assumptions. (William Ruckelshaus)
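The definition of risk above is often reduced to a single number, the product of likelihood and severity. The sketch below uses invented events and scales to show one limitation of that reduction: a frequent nuisance and a rare catastrophe can receive the same score.

```python
# A minimal sketch of risk as likelihood times severity. The events,
# probabilities, and severity scale are illustrative assumptions,
# not measured data.
def risk(likelihood, severity):
    """Risk score as the product of likelihood and severity."""
    return likelihood * severity

events = {
    "frequent minor fault":      risk(likelihood=0.1, severity=1),
    "rare catastrophic failure": risk(likelihood=0.0001, severity=1000),
}

for name, score in events.items():
    print(f"{name}: {score}")
# Both events score roughly 0.1. A bare number hides the difference
# between a nuisance and a catastrophe, which is one reason
# low-probability, high-consequence events get discounted.
```

In practice many standards use a qualitative risk matrix rather than a bare product, precisely so that high-severity cells cannot be averaged away by low likelihood.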
It is not possible to measure the probability of very rare events directly. For example, to estimate the failure rate of nuclear power plants, one does not build a power plant, "run it for ten thousand years very quickly", and then tally up the resulting data. Instead, analysts use models of the interaction of events that can lead to an accident.
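One common form of such a model is a fault tree, in which assumed probabilities for individual events are combined through AND and OR gates; the text above does not name a specific technique, so this is offered only as an illustration, with invented numbers:

```python
# A toy fault-tree calculation of the kind described above: instead of
# observing rare failures directly, analysts combine assumed
# probabilities for individual events. All numbers are invented.

def p_and(*probs):
    """AND gate: all independent events must occur."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def p_or(*probs):
    """OR gate: at least one of the independent events occurs."""
    none_occur = 1.0
    for p in probs:
        none_occur *= (1 - p)
    return 1 - none_occur

pump_fails   = 1e-3  # assumed per-year probability
backup_fails = 1e-2
alarm_fails  = 5e-2

# The modelled accident requires the pump, its backup, AND the alarm
# to fail together.
p_accident = p_and(pump_fails, backup_fails, alarm_fails)
print(f"Modelled accident probability: {p_accident:.1e}")  # 5.0e-07
```

The output is only as trustworthy as the input probabilities and the independence assumptions behind the gates, which is exactly the limitation the next paragraph describes: factors left out of the model contribute nothing to the answer.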
Risk modeling has several limitations. In practice, the models can only include events that can be measured. Most causal factors involved in major accidents are not measurable. When focusing on risk models, immeasurable factors tend to be ignored, forgotten, or given risk numbers with no basis. For software components, risk may not even be measurable; how does one measure the quality of design?
Risk assessment data can be like the captured spy; if you torture it long enough, it will tell you anything you want to know. (William Ruckelshaus)
Another potential flaw in a safety culture is a low priority assigned to safety. If organizational support is not provided to the system safety effort, it cannot succeed. Safety may also be compromised by flawed resolution of conflicting goals.
Ineffective Organizational Structure
Even if the safety culture is inclined to support the safety engineering process, ineffective organizational structure may hamper system safety efforts.
Ineffective Technical Activities
Operator error is frequently cited as an accident cause, and often as the sole cause. However, data about operator effects on accident rates may be biased and incomplete. Positive actions by operators are rarely recorded. For example, when a plane crashes, the pilot is often blamed for the accident, yet there are numerous instances of pilots averting potential accidents; those pilots are simply regarded as doing their jobs.
Blame may be based on the premise that operators can overcome every emergency, but this is a myth born of wishful thinking. During normal operation, process automation can control the system; the operators have to intervene at the limits of the system's operating ability. The assumptions used to build the system and predict its behavior may break down at these extremes.
Further, hindsight is always 20/20. It is easy to describe in detail what the operators should have done after an analysis of the accidents reveals the system state at the time of the accident. Often, the operators of a system have limited access to the system state and are forced to draw reasonable, though potentially flawed, conclusions.
Separating operator error from design error may not be possible. The operator is forced to work with the system interface provided by the system designers. If the designers were not careful, the displays may omit critical information, or the controls available to the operator may be insufficient to bring the system from an unsafe state back into a safe one. Distinguishing between an operator's failure to act appropriately and a designer's failure to provide the feedback and control needed to act appropriately is difficult and perhaps impossible.
The figures above show drawings of actual system layouts and human-machine interfaces.
An A-320 accident while landing at Warsaw was blamed on the pilots for landing too fast. Was it that simple?
Blaming the pilots turns attention away from several deeper questions.
Automation does not eliminate human error, nor does it remove humans from systems; it simply moves humans to different functions. Humans take on the roles of design, programming, high-level supervision, high-level decision-making, and maintenance. Decision-making is more difficult at these higher, further-removed levels because of system complexity and reliance on indirect information.
Automated systems on aircraft have eliminated some types of human error and created new ones. While automation is often added to systems with the stated goal of reducing the required human skill level, the skill and knowledge actually required may go up. Recall that operators intervene at the limits of the system's operation: adding automation may merely force the operator to understand not only the controlled process but also the automation controlling it. Finding the correct partnership and allocation of tasks between the human operator and the automation is difficult. Who should have the final authority?
Computers do not produce new sorts of errors. They merely provide new and easier opportunities for making the old errors. (Trevor Kletz, "Wise After the Event")
Some have advocated simply removing humans from the loop altogether. If human operators are so often blamed for accidents, removing the human element should solve the problem, and in many cases the technology exists to replace the operator with yet more automation. However, automation simply moves the task of dealing with unusual circumstances from the operator back to the designer of the automation. Not all conditions, nor the correct ways to deal with them, are foreseeable, and even those that can be predicted are programmed by error-prone human beings.
Many of the same limitations of human operators are characteristic of designers as well.
In reality, having a human operator provides a number of advantages. Human operators are adaptable and flexible. Operators adapt both their goals and the means to achieve them. They use problem solving and creativity to cope with unusual and unforeseen circumstances, and human operators can exercise judgment. Humans are unsurpassed at recognizing patterns, making associative leaps, and operating in ill-structured, ambiguous environments. Human error is the inevitable side effect of this flexibility and adaptability. It must be recognized that the very qualities that lead to operator error are those that make operators valuable.
Designers continue to design automation with the assumption that human operators will stay out of the way, not try to understand the system, and not try solutions outside the prescriptions of the training manual. But human operators are inquisitive. They form mental models of what a system is doing, work with the system to validate or invalidate those models, and revise them. In many cases, the operator's mental model may be closer to the actual functioning of the system than the designer's is. Designers deal in averages and ideals, while systems change during construction as well as design. After installation, operation, maintenance, and evolution, the actual functioning of the system may differ from the designer's model.
Copyright © 2003 Safeware Engineering Corporation. All rights reserved