To err is human, and nowhere is this more apparent than in software engineering. Good software is the product of thousands, perhaps even millions, of human hours of work. As any piece of software grows larger, though, it becomes more prone to crashing or manifesting errors and bugs. While some of these are minor inconveniences, others can cause catastrophic damage, be it physical, financial, or both. Three such incidents resulting from software failure are the British Airways Terminal 5 failure, the 2014 Bitcoin hack, and the 1982 Soviet pipeline explosion.
Heather Stevens and her partner, Neil, were due for takeoff from Heathrow Airport's Terminal 5 in London, but they were forced to wait for three hours only to find out that the plane was still standing at the airport, waiting for passengers' luggage to be loaded ("Technical Glitches Hit T5 Opening"). The nightmare at T5 was caused by glitches in the computing software of the terminal's baggage handling system. Terminal 5 was meant to contain the biggest baggage handling system in Europe, and one of the biggest in the world. British Airways (BA) and the British Airports Authority (BAA) had spent upwards of £4.3 billion over the course of six years to build the entire terminal, including the baggage system (Krigsman). Designed by IBM and the Dutch company Vanderlande, the project was tremendous: the system featured state-of-the-art technology to read, screen, and sort baggage, to transport all bags to their designated locations, and to fast-track early bags. It involved nearly 400,000 man-hours of work (Krigsman), along with "[running] 163 IT systems, 546 interfaces, more than 9,000 connected devices, 2,100 PCs and 'enough cable to lay to Istanbul and back'" (Chapman). It was by no means a small project, and its sheer scale is one reason it was prone to failure. Another reason was the failure to test the system properly, an important step in any software development cycle. One baggage worker noted, "They have been doing tests on the belt system for the last few weeks and knew it wasn't going right. The computer cannot cope with the number of bags going through" ("Technical…"). The baggage system was prone to failure because it was not tested rigorously enough to deal with the volume of passenger luggage at one of the busiest airports in the world. Thus, when it came to a "real-life" situation, the system failed miserably, causing inconvenient delays and millions in lost revenue: British Airways alone lost £50 million in a single day.
The main takeaway is that as any piece of software grows larger, both in lines of code and in the number of interfaces it must handle, the risk of error increases dramatically. Thus, any software released at a global (or commercial) scale needs to be thoroughly tested against realistic, real-life scenarios, to confirm that the system can handle them gracefully and to catch potential errors before users ever face them.
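To make that lesson concrete, below is a minimal load-test sketch in Python. It is not the real T5 system: the BaggageSorter class, its throughput ceiling, and the simulated arrival numbers are all hypothetical, invented only to show how replaying a realistic peak, rather than an average load, can expose a system that "cannot cope with the number of bags" before opening day instead of after it.

```python
import random


class BaggageSorter:
    """Toy model: sorts at most MAX_BAGS_PER_MINUTE bags each minute."""

    MAX_BAGS_PER_MINUTE = 200  # hypothetical capacity, not a real T5 figure

    def __init__(self) -> None:
        self.backlog = 0

    def process_minute(self, arriving_bags: int) -> int:
        """Add this minute's arrivals, sort what capacity allows, return the backlog."""
        self.backlog += arriving_bags
        self.backlog -= min(self.backlog, self.MAX_BAGS_PER_MINUTE)
        return self.backlog


def simulate_morning_peak(minutes: int = 180) -> int:
    """Replay a bursty three-hour departure wave rather than a gentle average load."""
    random.seed(42)  # reproducible simulated arrivals
    sorter = BaggageSorter()
    backlog = 0
    for _ in range(minutes):
        arrivals = random.randint(150, 260)  # bursts above the sorter's capacity
        backlog = sorter.process_minute(arrivals)
    return backlog


if __name__ == "__main__":
    leftover = simulate_morning_peak()
    print(f"Unsorted bags after the peak: {leftover}")
    if leftover > 0:
        # With these made-up numbers the backlog never clears, which is exactly
        # the kind of result a pre-launch load test should surface and fix.
        print("Pre-launch check failed: the system cannot keep up with peak arrivals")
```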
As financially catastrophic as the T5 incident was, it is dwarfed by the 2014 Mt. Gox incident. Mt. Gox was a bitcoin exchange based in Japan and one of the largest in the world. In February 2014, hackers managed to steal about 740,000 bitcoin, worth around $460 million at the time (McMillan) and around $2.6 billion today. Mt. Gox had somewhat mysteriously paused all online withdrawal requests, citing: "A bug in the Bitcoin software makes it possible for someone to use the Bitcoin network to alter transaction details to make it seem like a sending of Bitcoins to a Bitcoin wallet did not occur when in fact it did occur" (Pollock). A few weeks later, the owner, Mark Karpelés, stepped down, and the company's website went offline the following day (Pollock). A leaked document later revealed that hackers had "raided that Mt. Gox exchange and stole 744,408 bitcoins belonging to Mt. Gox customers, as well as an additional 100,000 bitcoins belonging to the company, resulting in the exchange being declared to be insolvent" ("The History of the Mt Gox Hack"). How did this happen? Although the facts remain unclear, it would seem that the Mt. Gox private key, which was used to access the exchange's online wallet and was unencrypted prior to 2011, was stolen "via a copied wallet.dat file, either by hacking or perhaps through an insider" ("The History…"). With a copy of the key, the hackers were able to acquire bitcoin without the knowledge of Mt. Gox, because their withdrawals looked like ordinary transactions made by the exchange itself. Surprisingly, this was not even the first incident involving the firm: in 2011, Mt. Gox suffered its first hack, in which attackers stole about 2,000 bitcoin (Pollock).
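The "bug in the Bitcoin software" quoted above is generally identified as transaction malleability, and a small conceptual sketch helps show why it matters. The Python below is not real Bitcoin code: the dictionary fields, the txid function, and mutate_signature_encoding are simplified stand-ins assumed for illustration. The point is only that when a transaction's ID is a hash over the whole transaction, signature included, a third party can re-encode the signature to produce an equally valid copy with a different ID, and software that tracks only the original ID concludes the payment "did not occur".

```python
import hashlib
import json


def txid(transaction: dict) -> str:
    """Transaction ID = hash of the entire serialized transaction, signature included."""
    serialized = json.dumps(transaction, sort_keys=True).encode()
    return hashlib.sha256(serialized).hexdigest()


def mutate_signature_encoding(transaction: dict) -> dict:
    """Stand-in for re-encoding a signature without invalidating it."""
    altered = dict(transaction)
    altered["signature"] = "00" + transaction["signature"]  # harmless-looking padding
    return altered


original = {
    "from": "exchange_hot_wallet",
    "to": "customer_wallet",
    "amount_btc": 5.0,
    "signature": "3045ab...",  # placeholder, not a real signature
}

malleated = mutate_signature_encoding(original)

print("ID the exchange recorded:   ", txid(original))
print("ID that actually confirmed: ", txid(malleated))

# The exchange looks up the ID it recorded, never finds it confirmed,
# assumes the withdrawal failed, and can be tricked into sending the
# coins again even though they were already received.
```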
There are two important lessons to be learned from the Mt. Gox debacle. The first involves the private key that was left unencrypted before 2011 and then copied. This was one of the biggest reasons the hack(s) ever happened, and the owner of the exchange (before Karpelés, who bought it in 2011) had either lied about securing the key or simply failed to encrypt it. It should be the responsibility, both legal and ethical, of the administrator of any financial institution to secure access to the funds of its customers and of the company itself, and the owners of Mt. Gox clearly failed to do so. The second lesson is that no one person should hold the majority of oversight over an institution, because that concentration is itself a vulnerability. It took the compromise of just one private key, belonging to the owner, for this catastrophe to occur. In my opinion, at least, it would have been much harder for a hacker to reach the keys if 1) the institution had multiple owners or board members, and 2) the approval of all board members were required to access the keys, whether that permission is physical or digital, as sketched below. This way, no single act of hacking can ruin such an institution, the board members remain accountable to one another for securing the key, and no one member can steal it.
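As a rough illustration of that second lesson, here is a minimal sketch of board-controlled access to a wallet key. The BoardControlledKey class and its method names are hypothetical, not anything Mt. Gox actually used; real exchanges today tend to rely on multi-signature wallets or threshold signatures, but the principle is the same: no single compromised person is enough to release the key.

```python
class BoardControlledKey:
    """Hypothetical vault that releases a wallet key only with unanimous board approval."""

    def __init__(self, encrypted_key: bytes, board_members: list[str]) -> None:
        self._encrypted_key = encrypted_key   # stored encrypted, never in plaintext
        self._board = set(board_members)
        self._approvals: set[str] = set()

    def approve(self, member: str) -> None:
        """Record one board member's approval (a stand-in for physical or digital sign-off)."""
        if member not in self._board:
            raise PermissionError(f"{member} is not on the board")
        self._approvals.add(member)

    def release_key(self) -> bytes:
        """Hand over the key only when every board member has approved."""
        missing = self._board - self._approvals
        if missing:
            raise PermissionError(f"still awaiting approval from: {sorted(missing)}")
        self._approvals.clear()  # approvals are single-use
        return self._encrypted_key


# An attacker who compromises only one board member cannot obtain the key.
vault = BoardControlledKey(b"<encrypted wallet key>", ["alice", "bob", "carol"])
vault.approve("alice")
try:
    vault.release_key()
except PermissionError as err:
    print("Access denied:", err)
```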
An incident that could have caused mass casualties because of software is the Soviet pipeline incident. Though not technically an accident, its outcome was not fully foreseen by the CIA, who were behind it. In 1982, the CIA learned from an insider KGB source that the Soviets planned to steal the control-system software for their pipeline (Washington). So the CIA planted booby-trapped software at the source from which it was expected to be stolen. Once activated, the software was designed to increase the pressure inside the pipeline until the system went haywire and could not be stopped. Sure enough, this is exactly what happened, and it resulted in the biggest non-nuclear explosion ever created, visible from space (Washington). There were no casualties, however. Given that this was not technically the result of an error, but could still have resulted in deaths, the main takeaway from this incident is that any sensitive operation that could put people's lives at risk must be handled carefully. A safer way to handle the problem might have been to plant broken software that simply failed to work, or that merely shut the plant down, rather than software capable of causing mass casualties.
While good software is praised by all its users, from professionals to people who simply use a free mobile app, good software is hard to write and is seldom written without mistakes. Some mistakes cause mere inconvenience, such as the T5 incident; others cause catastrophic monetary loss, like the Mt. Gox incident; and others still can potentially cost lives, as the Soviet pipeline incident of 1982 nearly did.