Engineering software for safety-critical systems is a demanding task, and in all respects it should be. The Therac-25 incidents are only one example of how the development of safety-critical software can go awry. Other examples of safety-critical systems include the software that runs a nuclear power plant, the fly-by-wire system in an avionics deployment, and the electronic parking brake in an automobile. The failure of such systems can spell disaster for human life. As such, additional measures must be taken in the development of software whose failure can cause damage to, or even loss of, human life.
As our text tells us, safety-critical software must undergo a development and testing process far more rigorous and time-consuming than the processes used for other types of software. The system must be carefully coded, inspected, documented, tested, verified, and analyzed. A product safety engineer must be assigned to the system, a hazard log implemented, and risk analysis performed as core development processes. (Reynolds 276) The software development process should be carried out not by a single software engineer but by a properly organized team whose members can audit and, if needed, correct one another's work. (Reynolds 276) Engineers should also not place too much trust and confidence in safety-critical software; doing so was one of the factors that led to the failures of the Therac-25, in which sole responsibility for safety was delegated to the software. Standard hardware safeguards that had been included in the Therac-6 and Therac-20 machines were removed from the Therac-25 and replaced by software monitoring that, under certain circumstances, failed. (Reynolds 291) The design of a safety-critical system should prevent catastrophe from arising out of a single error in the software. These simple safety concepts should be standards that must be adhered to in the development of safety-critical software systems. By stating clear standards under which a safety-critical system must be developed, we can limit the harm to human life caused by faulty system design.
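The principle that no single software error should be able to cause catastrophe can be illustrated with a small sketch. The Python below is a hypothetical example, not AECL's actual design; the function names and dose limit are assumptions for illustration. A dose command is delivered only when an independent software check and a (simulated) hardware interlock both agree, so one faulty layer cannot by itself produce an overdose.

```python
# Hypothetical sketch of defense in depth for a radiation-dose command.
# Names and limits are illustrative assumptions, not the Therac-25 design.

MAX_SAFE_DOSE_CGY = 200  # assumed software limit, in centigray

def software_check(dose_cgy: float) -> bool:
    """Software validation of the requested dose."""
    return 0 < dose_cgy <= MAX_SAFE_DOSE_CGY

def hardware_interlock(dose_cgy: float) -> bool:
    """Stand-in for a physical interlock that trips above a hard limit.
    In a real machine this would be electromechanical, not code."""
    return dose_cgy <= MAX_SAFE_DOSE_CGY

def deliver_dose(dose_cgy: float) -> str:
    # Both independent layers must agree before the beam fires; a single
    # faulty layer therefore cannot cause harm on its own.
    if software_check(dose_cgy) and hardware_interlock(dose_cgy):
        return "DELIVER"
    return "ABORT"
```

The design choice worth noting is redundancy across *independent* mechanisms: removing the hardware layer, as was done in the Therac-25, collapses two defenses into one.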
The damage that occurred through the use of the Therac-25 was, unfortunately, not limited to a single incident. The incidents happened over a period of 19 months. (Reynolds 291) During this time there were warnings, investigations, legal actions, and attempted resolutions; regardless, more overdose incidents occurred with the Therac machines. The flaws in the development of the Therac-25 were not limited to the system software, however. Beyond the flaws in the software itself, one might conclude that there were flaws in how the system was tested and reviewed, as well as flaws in the business management procedures and the individuals involved in the system's development. Judging from what we've learned from the Therac incidents, one critical area that any business tasked with developing safety-critical systems should address is process design and management. From a process standpoint, too much confidence was placed in the new Therac-25 software. This confidence, as our text tells us, was based on the success and FDA acceptance of the software used in the preceding Therac models, which became the foundation for the code that controlled the new Therac-25. (Reynolds 291) The thought was that if it had worked before, it should still work. In this case reliability, which our text defines as the measure of a system's rate of failure that would render it unusable over its expected lifetime, was mistaken for safety, to the detriment of human life. (Reynolds 278) With respect to safety-critical systems, therefore, the reuse of old software should be considered a dangerous practice. Aside from overconfidence, the development of the Therac-25 also suffered from complacency on the part of AECL. When issues did arise with the system, AECL failed to properly identify the root cause of the problem, which resulted in further incidents. (Reynolds 291)
During my time as a retail Loss Prevention Manager, I learned that the first step in any strategic action plan aimed at preventing future losses or harm must be the identification of the root causes of those losses. Likewise, had AECL paid closer attention to the software engineering process by establishing and executing proper quality assurance procedures, it might not have taken a physicist, acting as a third party, to identify the defect in the machine and its software after four human fatalities and a total of six incidents. Through quality assurance, AECL could have recognized that the software was overly complex and simplified the design by not making the software solely responsible for safety monitoring. Furthermore, the user interface should have been designed so that the user of the system could recognize when safety issues were occurring. (Reynolds 279) There should have been extensive Failure Mode and Effects Analysis, code analysis, and testing performed on the Therac-25 system, along with audit trails reflecting those activities. (Reynolds 280)
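The hazard-log and Failure Mode and Effects Analysis activities mentioned above can be sketched minimally. The Python below is a hypothetical illustration of the common FMEA risk-priority calculation, not a reproduction of any real tool or of AECL's records; the failure modes and 1-10 rating scales are assumptions. Each failure mode receives severity, occurrence, and detection ratings, and their product ranks which hazards to address first.

```python
from dataclasses import dataclass

# Hypothetical FMEA-style hazard log entry; the 1-10 scales are an
# assumption for illustration, following common FMEA practice.
@dataclass
class FailureMode:
    description: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (easily detected) .. 10 (nearly undetectable)

    def risk_priority_number(self) -> int:
        # Standard RPN formula: severity * occurrence * detection.
        return self.severity * self.occurrence * self.detection

# Illustrative entries only, not actual Therac-25 findings.
hazard_log = [
    FailureMode("Race condition corrupts mode setting", 10, 3, 9),
    FailureMode("Display shows stale dose value", 7, 4, 6),
]
# Address the highest-RPN hazards first.
hazard_log.sort(key=FailureMode.risk_priority_number, reverse=True)
```

A log of this kind also serves as the audit trail the text calls for: each entry records what was analyzed and why a given hazard was prioritized.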
The incidents arising from the development and use of the Therac-25 machine show clearly how complacency, along with poorly defined testing and development procedures, can lead to fatal results in safety-critical systems. Such results can be prevented, however, through proper action and the adoption of essential quality management standards, such as those in the ISO 9001 family. (Reynolds 279) The development of safety-critical systems absolutely must involve these higher standards. Testing must be more rigorous and must be conducted in a manner consistent with the fact that people's lives may be at risk. Special attention must be paid to careful software coding and analysis. Safety engineers must be involved, and documentation must be thorough, so that root causes can be identified when something does go wrong. Finally, when an incident does occur, complacency cannot be an option in a system with safety-critical implications. Had AECL acted accordingly, the damage caused by its development of the Therac-25 machine might not have been so grave.
Reynolds, G. W. (2015). Ethics in information technology (5th ed.). Boston, MA: Cengage Learning.