Software Safety: A Survey of Concepts and Techniques
I wrote this literature-review-style survey after a research dig into software safety, to get a sense of current research and practice for safety-critical software systems. This paper does not do justice to the significant amount of material that is out there, but I think it at least skims the surface for someone who is looking for a place to start reading and learning more. At the same time, I think the structure of this paper (each section shorter than the one before it) mirrors the literature that is out there. There are a lot of folks writing about how important safety is, somewhat fewer writing about safety requirements, fewer still on designing software for safety, and not much at all on testing safety-critical systems. I originally wrote this paper in July 2007 and am now publishing it on the web for the software community to read and refer to as needed.
Abstract
Although software is increasingly becoming a part of safety-critical systems that control aircraft, automobiles, heavy machinery, and medical devices, few software engineers are aware of the issues relating to safety-critical systems and the techniques that are currently available, along with their limitations. To effectively design a safety-critical software system, one must consider safety throughout the software development lifecycle, and not as an afterthought. Safety requirements must be explicitly specified, an active approach must be used to address safety in design and implementation, and the safety constraints of the software must be verified and tested properly before the software is deployed in a safety-critical situation. Instead of being an afterthought to software engineering, software safety is an extension of the software development methods with which software engineers are already familiar, with added emphasis on specific techniques to address safety-specific concerns.
Software Safety: A Survey of Concepts and Techniques
When software professionals think of software safety, the infamous Therac-25 incident comes to mind. The Therac-25 was a medical device designed to administer radiation therapy. However, flaws in the Therac-25 software resulted in some patients being seriously harmed by radiation overdoses and burns. Leveson and Turner’s (1993) case study is required reading in many undergraduate computer science programs, but beyond knowing that software safety is an issue, few software engineers know how to make software safer. Creating safer software involves using methods and techniques during all phases of the software development life cycle. From determining safety-related requirements to developing test cases for safety conditions, safety must be built into the approach to software development for the resulting software to be considered safe.
Although emphasis is placed on making software safer, safety must be considered at the system level. Moray and Huey (1988) define the layers of a complex socio-technical system as:
- the technical/engineering system,
- the workers that operate the system,
- the organizational/management infrastructure that governs the workers, and
- the environmental context in which the organization operates.
Vicente (1999) states, “There is now a great deal of evidence to show that organizational factors play a very important role in system performance and safety.” This implies that the design of the technical system, including the software, should take factors external to itself into consideration when addressing safety issues. Since people use the software system, human factors need to be considered for safety to anticipate the potential actions a user might take in the software if certain scenarios arise. In addition, users may follow guidelines from their management and the software may need to address those guidelines to maintain safe conditions in the overall system.
Software safety is important in situations that have the potential to cause harm to humans, but situations that pose hazards are equally important. It is natural to think of a safety-critical system as one in which a software failure results in bodily harm or death, particularly since the Therac-25 is the most well-known case study on software safety. Knight (2002) defines safety-critical systems as “those systems whose failure could result in loss of life, significant property damage, or damage to the environment.” However, many accidents are not caused by a single error, but rather by multiple errors that compound to create an unsafe condition in a system. Wickens and Hollands (2000) classify errors in terms of active failures and latent conditions. An active failure occurs when a specific action causes an error, while a latent condition is a scenario where an action introduces or contributes an error condition to the system, but the error itself does not occur until later, when other conditions fall into place at the right time. Similar types of latent errors occur in software: the software may continue to function “normally” but an error may surface later or in another form, either within the software, or outside the software within the realm of the socio-technical system. Leveson (1995) notes that safety-critical systems are rarely defined in terms of the potential catastrophic consequence of an error, but rather in terms of the hazard conditions that can be introduced by software, which in turn may contribute to a safety-critical condition in the system.
Safety Requirements
In creating software for a safety-critical system, the first step is to consider and determine the safety requirements. In its software safety standard, NASA (1997) stresses the importance of requirements by stating, “The successful development of safety requirements for the software requirements specification is essential to developing safe software and allows for safety to be built into the software early in the life cycle while it is relatively inexpensive.” In the requirements phase of a project, a hazard analysis (U.S. Department of Defense, 1984) should be performed to identify a list of potential hazards in the system, assess the hazards, and then translate the hazards into software requirements, specifically in the form of design constraints. For example, in a software-controlled building security system, the doors not being able to open during a fire might be a hazard. The resulting requirement (or constraint) generated from that hazard might be that the doors can only be locked when no fire alarm in the building has been pulled. In addition to performing a hazard analysis, an investigation of the conditions that cause a hazard to occur (through fault tree analysis, for example) may help to elicit additional safety-related requirements.
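To make the door-locking constraint above concrete, here is a minimal sketch in Python of how such a requirement might appear as an explicit guard in code. The class and method names (DoorController, SafetyConstraintViolation, raise_fire_alarm) are hypothetical and not taken from any cited source. The choice to release an already-engaged lock when the alarm is raised is also an assumption, added so the constraint holds in every state rather than only at the moment of locking.

```python
from dataclasses import dataclass


class SafetyConstraintViolation(Exception):
    """Raised when a command would violate a safety-related design constraint."""


@dataclass
class DoorController:
    # Hypothetical model of one software-controlled door.
    fire_alarm_active: bool = False
    locked: bool = False

    def lock(self) -> None:
        # Constraint derived from the hazard: never lock during a fire alarm.
        if self.fire_alarm_active:
            raise SafetyConstraintViolation("cannot lock: fire alarm is active")
        self.locked = True

    def raise_fire_alarm(self) -> None:
        # Assumption: raising the alarm also releases a lock that is already
        # engaged, so "locked while alarm active" is never a reachable state.
        self.fire_alarm_active = True
        self.locked = False

    def clear_fire_alarm(self) -> None:
        self.fire_alarm_active = False
```

Writing the constraint as a guard that raises an exception, rather than silently ignoring the request, keeps the violation visible to the caller and to anyone reviewing logs.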
Most software engineers are familiar with using natural language (English) to describe requirements in a specification document, but more formal mechanisms can better express safety-related requirements. Formal mechanisms allow for better analysis and reasoning about the modes and states the software can be in, as well as the values that the software may encounter. A finite state machine is an example of a formal mechanism that describes a safety feature of a system or how the system reacts to hazard conditions. In the building security example, a finite state machine can describe the possible transitions between a normal operating state and an emergency (fire, evacuation) operating state, along with the resulting effects. That particular finite state machine can then be examined to see if any transitions were not considered or if any of the resulting effects can pose a hazard. Another means of stating safety requirements formally is through formal specification languages, such that formal analysis techniques can later be used to verify that the safety properties are satisfied (Lutz, 2000). Formal specification languages express safety requirements in mathematical terms so that the interpretation is definite, especially compared to natural language, which can be interpreted differently in varying contexts. Leveson (2004) developed a CASE tool, SpecTRM, that helps software developers state system and software requirements more formally in a combination of graphics and text, and that allows users to run automated routines to check the models for inconsistencies and errors that may lead to hazard conditions.
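As an illustration of the finite state machine described above, the following Python sketch writes the building security example as an explicit transition table. The state names, event names, and effects are all assumptions made for this example; a real specification would define its own. It is not how SpecTRM or any formal specification language represents a model.

```python
from enum import Enum, auto


class State(Enum):
    NORMAL = auto()
    EMERGENCY = auto()


class Event(Enum):
    FIRE_ALARM_PULLED = auto()
    ALARM_RESET = auto()
    LOCK_REQUESTED = auto()


# (current state, event) -> (next state, effect on the doors)
TRANSITIONS = {
    (State.NORMAL, Event.FIRE_ALARM_PULLED): (State.EMERGENCY, "unlock all doors"),
    (State.NORMAL, Event.ALARM_RESET): (State.NORMAL, "no effect"),
    (State.NORMAL, Event.LOCK_REQUESTED): (State.NORMAL, "lock door"),
    (State.EMERGENCY, Event.FIRE_ALARM_PULLED): (State.EMERGENCY, "no effect"),
    (State.EMERGENCY, Event.ALARM_RESET): (State.NORMAL, "no effect"),
    (State.EMERGENCY, Event.LOCK_REQUESTED): (State.EMERGENCY, "reject request"),
}


def step(state: State, event: Event):
    """Return (next_state, effect) for one event in the given state."""
    return TRANSITIONS[(state, event)]
```

Because every transition and its effect sits in one table, a reviewer can read down the rows and ask of each one whether the effect could pose a hazard, which is the kind of examination the paragraph above describes.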
Once the safety requirements have been analyzed and stated in the software requirements specification (SRS), it is necessary to validate the requirements for completeness. The IEEE (1998) lists the following as one of the criteria for completeness: “Definition of the responses of the software to all realizable classes of input data in all realizable classes of situations.” In addition to the broad statement, Leveson (1995) lists 60 specific criteria to examine for requirements completeness. A sample of these criteria includes: startup and shutdown, mode transitions, inputs and outputs, value and timing, load and capacity, failure states and transitions, and latency. Just as most software errors trace their source back to requirements, most software safety issues trace back to non-existent or insufficient safety requirements. Therefore, effective validation of the requirements specification, including the safety-related requirements, will result in fewer errors and hazards being introduced into the system and propagated into subsequent phases in the software development lifecycle.
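One narrow slice of the completeness criterion quoted above can be checked mechanically: whether a response is specified for every combination of state and input. The Python sketch below uses a hypothetical specification table for the building security example with one entry deliberately left out. It covers only this single criterion and is not a substitute for the full list of criteria in Leveson (1995).

```python
from itertools import product

STATES = ["normal", "emergency"]
INPUTS = ["fire_alarm_pulled", "alarm_reset", "lock_requested"]

# Hypothetical specification with one response deliberately omitted.
SPECIFIED_RESPONSES = {
    ("normal", "fire_alarm_pulled"): "enter emergency, unlock doors",
    ("normal", "alarm_reset"): "ignore",
    ("normal", "lock_requested"): "lock door",
    ("emergency", "fire_alarm_pulled"): "ignore",
    ("emergency", "alarm_reset"): "enter normal",
}


def missing_responses(states, inputs, responses):
    """List every (state, input) pair that has no specified response."""
    return [pair for pair in product(states, inputs) if pair not in responses]


gaps = missing_responses(STATES, INPUTS, SPECIFIED_RESPONSES)
print(gaps)  # [('emergency', 'lock_requested')]
```

The gap it reports, a lock request arriving during an emergency, is exactly the case the door-locking hazard is about, which shows how an omission in the requirements can hide a hazard.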
Safety in Design and Implementation
The software requirements, including the safety requirements among them, are the basis for conceptualizing the software design and producing the software implementation. That is, every requirement should be traceable to an element in the design and to lines of code in the implementation. Several design elements exist to implement safety requirements, including inhibits, traps, interlocks, and assertions that control, reduce, or eliminate hazard conditions. The underlying strategy behind these design elements is to actively check for failure or hazard conditions and handle them appropriately before they propagate to another part of the software. Various levels of checking can be designed into the software to stop the propagation of errors. Leveson (2002) details a hierarchy of software checking. The levels include:
- Hardware checks, where the software actively detects hardware failures and instruction errors.
- Code-level checks, where the software actively checks for invalid values, ranges, and states.
- Audit checks, where a separate process checks data and timing in the main process.
- Supervisory checks, where a separate system (human or computer) checks the main system. Lutz (2000) refers to this as runtime monitoring.
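Two of the levels above can be sketched briefly in Python. The first function is a code-level check that rejects invalid sensor values before they propagate. The class is a simplified stand-in for an audit check on timing. The temperature limits, names, and heartbeat scheme are assumptions made for illustration. In a real system the auditor would run in a separate process, as the list describes, whereas here it is an ordinary object with time passed in explicitly so the example stays deterministic.

```python
class InvalidReading(Exception):
    """Raised by a code-level check when a value is outside its valid range."""


# Hypothetical limits for a temperature sensor, in degrees Celsius.
MIN_TEMP_C = -40.0
MAX_TEMP_C = 125.0


def checked_temperature(raw_value: float) -> float:
    # Code-level check: reject NaN (the only value not equal to itself) and
    # out-of-range values at the point of entry.
    if raw_value != raw_value or not (MIN_TEMP_C <= raw_value <= MAX_TEMP_C):
        raise InvalidReading(f"temperature out of range: {raw_value!r}")
    return raw_value


class HeartbeatAuditor:
    """Simplified audit check that watches the timing of the main loop."""

    def __init__(self, max_gap_s: float) -> None:
        self.max_gap_s = max_gap_s
        self.last_beat_s = None

    def beat(self, now_s: float) -> None:
        # Called by the main process each time it completes a cycle.
        self.last_beat_s = now_s

    def is_healthy(self, now_s: float) -> bool:
        # Unhealthy if no heartbeat was ever seen or the last one is too old.
        if self.last_beat_s is None:
            return False
        return (now_s - self.last_beat_s) <= self.max_gap_s
```

When is_healthy returns False, the surrounding system would typically move to a safe state. What that state is depends on the hazard analysis and is outside the scope of this sketch.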
In addition to the safety-related design elements, the design of the non-safety-critical portions of the program needs to be considered as it may in turn affect safety. Parnas, van Schouwen, and Kwan (1990) state that in safety-critical systems, “the system must be structured in accordance with information hiding to make it easier to understand, review, and repair”—a hint that designing for safety starts with good general design practices. Integrating these design elements and strategies, one can explicitly design and code features to not only prevent hazards, but to detect and manage them as well, thus providing an end result of a safer software system.
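As a small illustration of the information hiding that Parnas, van Schouwen, and Kwan recommend, the Python sketch below keeps hardware details behind a narrow interface so that the safety-relevant module stays short enough to review in full. All names here (LockActuator, RecordingActuator, EvacuationModule) are hypothetical, and the recording actuator is a stand-in for a real hardware driver.

```python
from typing import Protocol


class LockActuator(Protocol):
    """The only view of the door hardware that the safety logic may use."""

    def engage(self) -> None: ...

    def release(self) -> None: ...


class RecordingActuator:
    """Stand-in actuator; a real driver would hide wiring and bus details here."""

    def __init__(self) -> None:
        self.commands = []

    def engage(self) -> None:
        self.commands.append("engage")

    def release(self) -> None:
        self.commands.append("release")


class EvacuationModule:
    """Small module that hides one decision: what happens on a fire alarm."""

    def __init__(self, actuators) -> None:
        self._actuators = list(actuators)

    def on_fire_alarm(self) -> None:
        # Release every door, regardless of how each actuator is implemented.
        for actuator in self._actuators:
            actuator.release()
```

With this structure, a change to an actuator driver does not require re-reviewing the evacuation logic, and the evacuation logic can be exercised in tests without any hardware attached.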
Testing for Safety
Even if the software is designed with a safety strategy in mind and implemented with safety-related design elements, hazard conditions in the resulting software may still exist because of errors in the design and implementation. For this reason, verifying the code for correctness and validating the implementation against the safety-related requirements are important steps in the process of developing safer software. Two primary strategies are static and dynamic analysis: static analysis involves using tools or people to analyze or inspect the code, while dynamic analysis involves running tests and examining the results to ensure that none of the safety requirements (constraints) are violated. Parnas, van Schouwen, and Kwan (1990) suggest that effective testing of safety-critical systems requires a basis of mathematical documentation to test against; for example, the finite state machines from the requirements can be used to define the test cases. Although many testing strategies exist, many of which improve the effectiveness of checking safety requirements, a maxim of testing still holds true: testing cannot prove the absence of errors in a system; it can only show their presence. This implies that even if safety is taken into consideration throughout the entire software development lifecycle, there is no such thing as safe software; there is only safer software.
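A minimal sketch of dynamic analysis driven by a state-based model follows, in Python. It runs every event sequence up to a chosen length against an implementation and checks a safety constraint after each step. The DoorSystem class, the event names, and the constraint are hypothetical and continue the building security example. This is one simple way to derive tests from a model, not the method prescribed by any of the cited authors.

```python
from itertools import product

EVENTS = ["fire_alarm_pulled", "alarm_reset", "lock_requested"]


class DoorSystem:
    """Hypothetical implementation under test."""

    def __init__(self) -> None:
        self.alarm = False
        self.locked = False

    def handle(self, event: str) -> None:
        if event == "fire_alarm_pulled":
            self.alarm = True
            self.locked = False
        elif event == "alarm_reset":
            self.alarm = False
        elif event == "lock_requested" and not self.alarm:
            self.locked = True


def constraint_holds(system) -> bool:
    # Safety constraint: the doors are never locked while the alarm is active.
    return not (system.alarm and system.locked)


def find_violation(system_factory, max_length: int):
    """Return the first event sequence that violates the constraint, or None."""
    for length in range(1, max_length + 1):
        for sequence in product(EVENTS, repeat=length):
            system = system_factory()
            for event in sequence:
                system.handle(event)
                if not constraint_holds(system):
                    return sequence
    return None


print(find_violation(DoorSystem, 4))  # None
```

A result of None means only that no violation was found among the sequences tried (120 of them for a maximum length of 4). Consistent with the maxim above, it does not show that no violating sequence exists.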
Conclusions
In analyzing the currently available techniques for developing safety-critical software systems, one notices that the principles of designing safer software are not independent of the fundamental principles of software engineering. Software safety is not a separate field and does not involve a different approach to development; rather, it is rooted in good software engineering. For instance, good software engineering principles call for investing sufficient time and effort in requirements analysis because that is where most errors in a system originate. Likewise, most safety issues in software originate from insufficient requirements and from missing safety constraints that need to be designed into the software. Software safety involves extending software engineering practices by supplementing them across the entire software development life cycle with analysis techniques, design strategies, and verification methods that emphasize safety by preventing hazards.
References
Dunn, W. R. (2003). Designing Safety-Critical Computer Systems. Computer, 36(11), 40-46.
Herrmann, D. S. (1999). Software safety and reliability: techniques, approaches, and standards of key industrial sectors. Washington, DC: IEEE Computer Society.
Heimdahl, M. (2007). Safety and Software Intensive Systems: Challenges Old and New. Proceedings of the Conference on The Future of Software Engineering, 137-152.
IEEE. (1998). “IEEE Std. 830: IEEE Recommended Practice for Software Requirements Specifications.” Software Engineering Standards Collection. New York, NY: The Institute of Electrical and Electronics Engineers.
Knight, J. C. (2002). Safety-critical systems: challenges and directions. Proceedings of the 24th International Conference on Software Engineering, 547-550.
Leveson, N. G. (2002). System Safety Engineering: Back to the Future. Unpublished manuscript, Massachusetts Institute of Technology. Retrieved on June 24, 2007, from http://sunnyday.mit.edu/book2.pdf.
Leveson, N. G. (2004). A Systems-Theoretic Approach to Safety in Software-Intensive Systems. IEEE Transactions on Dependable and Secure Computing, 1(1), 66-86.
Leveson, N. G. (1995). Safeware: System Safety and Computers. Boston, MA: Addison-Wesley Professional.
Leveson, N. G., & Turner, C. S. (1993). An Investigation of the Therac-25 Accidents. Computer, 26(7), 18-41.
Lutz, R. R. (2000). Software engineering for safety: a roadmap. Proceedings of the Conference on The Future of Software Engineering, 213-216.
Moray, N., & Huey, B. (1988). Human factors research and nuclear safety. Washington, DC: National Research Council, National Academy of Sciences.
McDermid, J. A., & Pumfrey, D. J. (2001). Software Safety: Why is there no Consensus? Proceedings of the 19th International System Safety Conference.
NASA. (1997). NASA-STD-8719.13A Software Safety Standard. Washington, DC. Retrieved on July 15, 2007, from http://satc.gsfc.nasa.gov/assure/nss8719_13.html
Parnas, D. L., van Schouwen, A. J., & Kwan, S. P. (1990). Evaluation of safety-critical software. Communications of the ACM, 33(6).
Storey, N. R. (1996). Safety Critical Computer Systems. Boston, MA: Addison-Wesley Longman.
U.S. Department of Defense. (1984). MIL-STD-882B System Safety Program Requirements (AMSC Number F3329 FSC SAFT). Washington, DC. Retrieved on June 24, 2007, from http://sunnyday.mit.edu/safety-club/882b.htm.
Vicente, K. J. (1999). Cognitive Work Analysis: Toward Safe, Productive, and Healthy Computer-Based Work. Mahwah, NJ: Lawrence Erlbaum.
Wickens, C. D., & Hollands, J. G. (2000). Engineering Psychology and Human Performance, 3rd ed. Upper Saddle River, NJ: Prentice Hall.














