Reducing Errors in Software: From Formal Methods to Fault Tolerance

Abstract

Buggy software is software that still contains errors. This essay explores the ways in which software developers have designed systems and processes to reduce the number of errors. Errors in software affect not only the developers of the software but also its users. The essay also discusses the ways in which buggy software has affected society, both in recent years and in earlier decades.

Introduction

As the use of software has proliferated in recent years, the need for correct software has become increasingly critical to software developers. Software is said to be correct when it meets the specification that was originally set for it. Some of the systems built today are safety-critical, so errors in their software can be life-threatening. For these reasons, tools and techniques are now available that reduce the number of errors and flaws in software and bring its behaviour as close as possible to the specification. Over the years there have been many high-profile software failures caused by mistakes in the code, many of which could arguably have been avoided using techniques such as formal methods, good software practices and fault tolerance.

Section 1: High Profile Software Failures

Software that contains errors can have serious financial implications, but testing it requires additional time, which also comes at a cost. Developers must decide whether to pay for the consequences of error-ridden software or to take the extra time required to apply the tools and techniques that reduce these errors. Developers may choose to use these approaches because a bug can be costly, and because seeing a system that took a long time to build crash is, at the very least, frustrating and disheartening. A lack of testing affects not only the developers but also the general public. In 2015, customer transactions for over 600,000 Royal Bank of Scotland accounts failed to go through. Among the many customers affected was Jack Admans, who was unable to pay for a train journey to a job interview because he was relying on a transfer of money from the Royal Bank of Scotland (Farrell & Fishwick, 2015).

Section 1.1: Mars Climate Orbiter

A software failure on a larger scale was that of the Mars Climate Orbiter in 1999. The spacecraft, built by teams of software developers, engineers and scientists at NASA and its contractors to study the climate of Mars and to communicate with the Mars Polar Lander, spent around nine months travelling to the planet. After all of this effort and time, the orbiter approached Mars far lower than planned, entered the upper atmosphere and was destroyed. NASA engineer Richard Cook said of the incident: “It was pretty clear that morning, within half-an-hour, that the spacecraft had more or less hit the top of the atmosphere and burned up.” The cause was a mismatch in the units used to control the spacecraft (Grossman, 2010). The navigation software assumed that thruster data was expressed in metric units (newton-seconds) when it had in fact been supplied in imperial units (pound-force seconds). The mismatch arose because information passed between a team in Colorado and another in California was misinterpreted (Hardin, Isbell & Underwood, 2007). The failure showed how schedule and budget pressures can lead to rash decisions about testing (Reilly, Sausar & Shenhar, 2008).
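A unit mismatch of this kind can be made harder to commit by attaching the unit to the value rather than passing bare numbers between components. The following is a minimal sketch of that idea, not a reconstruction of any NASA software: the `Impulse` type and the function names are hypothetical, and impulse in newton-seconds and pound-force seconds is used only as the illustrative quantity.

```python
from dataclasses import dataclass

# 1 pound-force second expressed in newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse stored internally in newton-seconds (hypothetical type)."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # Convert once, at the boundary, so the rest of the program sees one unit.
        return cls(value * LBF_S_TO_N_S)


def total_impulse(firings: list) -> Impulse:
    """Sum thruster firings; bare numbers are rejected rather than guessed at."""
    for firing in firings:
        if not isinstance(firing, Impulse):
            raise TypeError("expected Impulse, got a bare number with no unit")
    return Impulse(sum(f.newton_seconds for f in firings))
```

With this design, a value produced in imperial units has to pass through `from_pound_force_seconds` before it can be combined with anything else, and a raw float handed to `total_impulse` fails immediately instead of silently skewing the result.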

Section 1.2: Goto Fail

In February 2014, Apple, one of the largest technology companies, released updates for several versions of its operating systems to fix a flaw in their SSL/TLS code (Wheeler, 2016). The flaw came down to one small mistake with major consequences. In the published source code, the line “goto fail” appeared twice, one directly after the other. The first was governed by an if statement but the second was not, so the second “goto fail” was always executed whatever the outcome of the preceding check. The remaining verification step was therefore skipped, and invalid signatures were accepted by the program (Wheeler, 2016). The problem lay not in the design of the software but in a simple editing error by the programmer. It is thought that the programmer was caught out by copying and pasting, which left two “goto fail” lines in the code (Chen, Lazar, Wang & Zeldovich, 2014). It did not help that both lines were indented to the same level, which made the second look as though it belonged to the if statement. This small error put many Apple customers at risk of man-in-the-middle attacks. In a man-in-the-middle attack, an attacker positioned between a client and a trusted server on the Internet impersonates the server and can then intercept all subsequent communication (Bland, 2014). This seemingly minor issue could have been caught by unit tests or by peer code review, which would have put the code in front of another programmer who might have spotted the additional line.
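The kind of unit test that would expose a skipped verification step is a negative test: hand the routine something invalid and require that it is rejected. The sketch below illustrates this in Python under stated assumptions. `verify_signature` is a hypothetical stand-in built on the standard `hmac` module, not Apple's code or API, and the key and message are made up.

```python
import hashlib
import hmac


def verify_signature(key: bytes, message: bytes, signature: bytes) -> bool:
    """Hypothetical verification routine: True only for a valid signature."""
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)


def test_accepts_valid_signature():
    key, message = b"secret-key", b"hello"
    good = hmac.new(key, message, hashlib.sha256).digest()
    assert verify_signature(key, message, good)


def test_rejects_invalid_signature():
    # The negative case: a routine that skipped its final check would
    # wrongly return True here, and this test would fail.
    key, message = b"secret-key", b"hello"
    assert not verify_signature(key, message, b"\x00" * 32)
```

Tests that only exercise valid inputs cannot detect a check that has become unreachable, which is why the second test matters more than the first.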

Section 2: Formal Methods

Formal methods are mathematical techniques for the specification and verification of software systems. There are many such methods, and the examples given here are not exhaustive. One example is the state machine. A state machine is a model that is in exactly one of a set of states at any time and whose behaviour depends on its current state (Kicklighter, 2005). A real-life example is an ATM: when the user inserts their card, the machine changes state, carries out the function appropriate to that state, and so on.
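The ATM example can be sketched as a small transition table. This is a minimal illustration with made-up state and event names, not a model of any real ATM; its point is that every permitted transition is listed explicitly and anything else is rejected.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CARD_INSERTED = auto()
    PIN_VERIFIED = auto()


# Transition table: (current state, event) -> next state.
TRANSITIONS = {
    (State.IDLE, "insert_card"): State.CARD_INSERTED,
    (State.CARD_INSERTED, "correct_pin"): State.PIN_VERIFIED,
    (State.CARD_INSERTED, "eject_card"): State.IDLE,
    (State.PIN_VERIFIED, "withdraw"): State.PIN_VERIFIED,
    (State.PIN_VERIFIED, "eject_card"): State.IDLE,
}


def step(state: State, event: str) -> State:
    """Return the next state, or raise if the event is not allowed here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not allowed in state {state.name}"
        ) from None


def run(events, start: State = State.IDLE) -> State:
    """Feed a sequence of events through the machine and return the final state."""
    state = start
    for event in events:
        state = step(state, event)
    return state
```

Because a withdrawal is only listed for `PIN_VERIFIED`, an attempt to withdraw before the PIN has been checked raises an error instead of being processed.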

Another example of a formal method is Hoare logic, a set of rules for reasoning about the correctness of programs. Its central notion is the ‘Hoare triple’, which consists of a precondition, an operation and a post-condition. The precondition is a statement that must hold before the operation runs, and that the client is responsible for satisfying. The post-condition is what is true after the operation has completed. For the system to continue to run in the presence of errors, the precondition should allow for the possibility of errors and the post-condition should state how they have been dealt with (Aldrich, n.d.).
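A Hoare triple is conventionally written {P} C {Q}, where P is the precondition, C the operation and Q the post-condition. The sketch below approximates one with runtime assertions for integer division by repeated subtraction. The function name is hypothetical, and assertions only check the inputs that are actually run, whereas Hoare logic proves the triple for all inputs.

```python
def divide(x: int, y: int) -> tuple:
    """Return (quotient, remainder) of x divided by y."""
    # Precondition P: the caller must supply x >= 0 and y > 0.
    assert x >= 0 and y > 0, "precondition violated"

    # Operation C: repeated subtraction.
    q, r = 0, x
    while r >= y:
        # Loop invariant: x == q * y + r and r >= 0.
        q, r = q + 1, r - y

    # Post-condition Q: the result really is the quotient and remainder.
    assert x == q * y + r and 0 <= r < y, "post-condition violated"
    return q, r
```

The loop invariant noted in the comment is what a proof in Hoare logic would use to connect the precondition to the post-condition: it holds before the loop, is preserved by each iteration, and together with the exit condition `r < y` implies Q.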

Model checking is a technique that checks a system for errors and is considered an effective way to debug software (Baier & Katoen, 2008). It works not on the actual system but on a model of the system that is produced before the system is developed. The approach answers many questions that a developer may ask, such as “Does the system match the requirements of the user? Are the requirements easy to understand and carry out?” (Palshikar, 2004). A model checker outputs ‘yes’ if the model meets the requirements; if it does not, it provides a counterexample that shows why the model fails to meet the specification. Using the counterexample, the developers are able to pick out the error and alter the system (Palshikar, 2004).
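The yes-or-counterexample behaviour can be illustrated with a very small explicit-state checker. This is a sketch under simplifying assumptions, not how production model checkers are built: it explores every reachable state of a toy two-process model breadth-first and checks a single invariant. The model, the state names and the `guarded` flag are all invented for the example.

```python
from collections import deque


def check_invariant(initial, successors, invariant):
    """Return None if the invariant holds in every reachable state,
    otherwise the shortest path of states leading to a violation."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


def make_successors(guarded):
    """Toy model: two processes cycling idle -> trying -> critical -> idle."""
    def successors(state):
        for i in (0, 1):
            me, other = state[i], state[1 - i]
            if me == "idle":
                nxt = "trying"
            elif me == "trying":
                if guarded and other == "critical":
                    continue  # must wait for the other process
                nxt = "critical"
            else:
                nxt = "idle"
            new_state = list(state)
            new_state[i] = nxt
            yield tuple(new_state)
    return successors


def mutual_exclusion(state):
    """Requirement: the two processes are never both in the critical section."""
    return state != ("critical", "critical")
```

Checking the unguarded model with `check_invariant(("idle", "idle"), make_successors(guarded=False), mutual_exclusion)` returns a sequence of states ending in both processes being critical, which is the counterexample; the guarded model returns `None`, the equivalent of ‘yes’.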

Section 3: Good Software Practices

Test Driven Development (TDD) is a technique used to reduce errors in software. TDD follows a cycle that is often likened to traffic lights: red means the code has failed a test, green means the code has passed the test, and refactor means the code can now be altered and improved (Hammond & Umphress, 2012). The tests are written before the code, by the developer who will implement it. Because the code must pass all of the tests, the developers can have greater confidence that it behaves as intended (Maximilien, Vouk & Williams, 2003).
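The cycle can be sketched with Python's standard `unittest` module. The leap-year function is a hypothetical example chosen for brevity. In a real TDD session the test class would be written first and would fail (red) until the function body was added (green), after which the body could be tidied with the tests as a safety net (refactor).

```python
import unittest


def is_leap_year(year: int) -> bool:
    # Green step: the simplest implementation that passes the tests below.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


class LeapYearTest(unittest.TestCase):
    # Red step: in TDD these tests exist before is_leap_year is written.
    def test_year_divisible_by_four_is_leap(self):
        self.assertTrue(is_leap_year(2024))

    def test_ordinary_year_is_not_leap(self):
        self.assertFalse(is_leap_year(2023))

    def test_century_is_not_leap(self):
        self.assertFalse(is_leap_year(1900))

    def test_year_divisible_by_400_is_leap(self):
        self.assertTrue(is_leap_year(2000))
```

Saved in a file, the tests can be run with `python -m unittest`, which reports each failing case by name.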

Peer code review introduces another programmer who reviews code that has already been written, so that errors may be identified that would otherwise have been missed. Once a programmer has produced code, they submit it to a review system where other developers can go through it, suggest changes and improvements, and pick out errors. This cycle of developing and reviewing code can be repeated as many times as necessary until the developer is confident that the code meets the specification (Heinzl, Kude, Schmidt & Spohrer, 2013).

Section 4: Fault Tolerance

Software is fault tolerant when it is able to accept errors and deal with them in an appropriate manner. To be fault tolerant, a piece of software must have measures in place to ensure that when an error arises it can be handled without the whole system failing. One way to write fault-tolerant programs is to use Erlang.

Erlang is a programming language designed for building fault-tolerant software. Its approach is to accept that parts of a system will crash, and to contain those faults so that the rest of the software can continue to run. Erlang is well suited to writing fault-tolerant software because it has built-in features that support fault tolerance and concurrency (Erlang, n.d.). It is a popular language that has been used in systems such as WhatsApp and CouchDB, and within Amazon's cloud services.
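The idea of letting a component crash and restarting it, rather than letting the fault bring down the whole system, can be sketched in a few lines. The example is written in Python for illustration only and does not use or reproduce Erlang's own supervision facilities; `supervise` and `FlakyTask` are hypothetical names.

```python
def supervise(task, max_restarts=3):
    """Run task(); if it raises, restart it up to max_restarts times.

    Returns (result, crashes), where crashes lists the exceptions seen.
    """
    crashes = []
    for _ in range(max_restarts + 1):
        try:
            return task(), crashes
        except Exception as exc:
            # Record the fault and try again instead of propagating it.
            crashes.append(exc)
    raise RuntimeError(f"gave up after {max_restarts} restarts") from crashes[-1]


class FlakyTask:
    """Simulated worker that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.remaining = failures

    def __call__(self):
        if self.remaining > 0:
            self.remaining -= 1
            raise ConnectionError("simulated fault")
        return "ok"
```

A call such as `supervise(FlakyTask(2))` absorbs the two simulated faults and still returns a result, while a task that keeps failing is eventually reported rather than retried forever.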

Netflix’s Chaos Monkey is another example of fault tolerance in practice. Chaos Monkey was developed by Netflix engineers to test how resilient the service was and whether it was able to recover from errors. It picks a server at random and terminates it (McAllister, 2012). Although this may not at first seem the smartest thing to do, it forces failures to happen early, so that developers have to deal with them and fix the underlying weaknesses. The developers can thereby prepare for the worst possible scenario, so that if a major failure were to occur the whole system would not crash (Haughn, 2012).

Conclusion

It is practically impossible to guarantee that any non-trivial piece of software is error-free. Testing every possible input would take far more time than developers have and, even with that time, some possibilities would still be missed. This is why formal methods, good software practices and fault-tolerant systems have been developed. If software developers adopt these methods, they will reduce the number of errors and increase their confidence in the system. There is no doubt that the need for these methods will grow as the development of critical computer systems increases.
