Case Studies of Failures in Globally Distributed Systems
Study of Failures in Gmail’s Globally Distributed Systems – 2008-09
Malhar Chaudhari (Author)
Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA, USA
Abstract—From August 2008 to September 2009, Gmail suffered multiple outages that affected varying subsets of an estimated 26 million users worldwide. Gmail, still in its beta phase, faced outages caused by failures ranging from incorrect software-upgrade execution to maintaining data locality. This paper studies the outages Gmail faced in its early years of public use, making it a comprehensive guide to fault analysis, recovery, and fault-tolerant design for cloud-based startups.
I. Introduction
This paper discusses some of the key outages that provide insight into fault analysis and design. The first incident occurred on August 7, 2008, at 2 PM Pacific Time, leading to a denial of service that left some users with a 502 Bad Gateway error. The root cause, as analyzed by Google, was a temporary outage in the Contacts system used by Gmail, which prevented Gmail from opening properly. This incident illustrates how interdependence between applications can lead to a massive outage across applications that, independently, might never have failed.
The second incident occurred on February 24, 2009, at 1:30 AM PST and lasted about two and a half hours, affecting users in Europe and Asia. It arose during a routine maintenance event in one of Google’s European data centers, and the disruption was caused by an upgrade designed to keep data geographically close to its owner. This incident shows how geographically distributed systems improve performance while facing a higher vulnerability to downtime.
The third outage occurred on September 1, 2009, at 12:30 PM PST and lasted 100 minutes. It was caused by load underestimation during a routine maintenance event. Ironically, the servers whose overload led to the outage were designed to improve service availability in the first place. This incident is a classic cascading failure of overloaded request routers, and the recovery procedures carried out by Gmail’s engineering team provide insight into adding capacity and redistributing traffic across it.
II. System Architecture
A. High Level Architecture
Gmail runs as a Software-as-a-Service application on the generic backend architecture that powers all Google services. This structure allows Google to build unique designs into its backend architecture, featuring high reliability, consistency, and availability with immediate scaling across multiple services. Google also opens this backend architecture as a platform-as-a-service through APIs, which act as a protocol for interfacing between different applications.
Google is one of the few software firms that treats systems engineering as a design and process-engineering problem rather than just an expense. It customizes not only its software applications but also its operating system and its server machines.
Another important point to note is Google’s commitment, from the start, to building strong reliability components into its architecture. Its high-level architecture design requirements are reliability, scalability, openness, and performance. Google claims 99.9% uptime per month and a benchmark query-execution time of 0.2 s.
Thus, all future references to architecture in this paper refer to Google’s architecture. Note also that the architecture studied here is as of 2008, around the time the incidents under study occurred.
B. Components of the Architecture
Google’s architecture is composed of essentially the following components:
1. Network layer
2. Computing Platforms
3. Distributed System Enablers
4. Applications (Gmail)
The interior network enabling Google’s architecture is based on IPv6. Google also uses a Squid reverse proxy, a web-server accelerator that caches frequent queries, giving a large performance improvement with a 30-40% cache-hit ratio. The external network is separated from Google’s interior network by a firewall.
Within the datacenter, the network starts at the custom-made servers. Each 19-inch rack contains 40 to 80 servers and has its own switch. Servers connect via a 1 Gbit/s Ethernet link to the top-of-rack (ToR) switch. ToR switches are then connected to a cluster switch using multiple one- or ten-gigabit uplinks. The cluster switches themselves are interconnected and form the datacenter interconnect fabric using the dragonfly design.
Fig. 1. Google Architecture (Note: This is a high-level view of the architecture)
Google’s approach to building its infrastructure differs from the industry in one major aspect. It builds custom machines from cheap commodity devices for its data centers and builds software that accounts for the unreliability of the underlying hardware. This approach has multiple advantages:
1. It is extremely cheap to use commodity x86 server computers, which have the best performance per dollar rather than the best overall performance. This also allows the company to invest in more redundant hardware and to replace machines when they fail.
2. Building hardware with custom motherboards, mounted bare onto racks, reduces the power consumed for cooling. Interestingly, every server has a small 12-volt battery that increases energy efficiency.
3. Using a significantly modified Linux OS allows optimizations specific to hardware requirements. Examples include a custom implementation of the dynamic memory allocation ("malloc") package, which gives higher throughput when allocating memory. Google also customized Linux to support its IPv6 network before IPv6 became common on the Internet.
To put the scale of the system into perspective, the combined computing power of servers owned by Google was estimated at around 100 petaflops in 2008. Due to its customized hardware, Google claims that it uses less than 0.01% of the world’s electricity and saves up to 30% of energy per server compared to conventional servers. Servers as of 2009–2010 consisted of custom-made open-top systems containing two processors (each with several cores), a considerable amount of RAM spread over eight DIMM slots housing double-height DIMMs, and at least two SATA hard disk drives connected through a non-standard ATX-sized power supply unit.
Distributed System Enablers:
This is the core of Google’s backend infrastructure that gives it a performance edge while maintaining reliability across its globally distributed system. It is composed of three components:
1. Google File System (GFS): GFS is a distributed, log-based virtual file system spread over multiple machines. This allows Google to scale its system far better than most of the industry. Files are divided into chunks of 64 MB (much larger than in conventional systems), each labeled by a unique 64-bit handle. GFS consists of two types of nodes: one master and several chunkservers. While chunkservers store the actual data, the master stores metadata such as the mapping table of chunkservers and plays a pivotal role in granting processes permission to modify chunks that fall within its mapping. Unlike other file systems, GFS is implemented as a user-space library. Google built its own file system, as opposed to using one off the shelf, to get high reliability, scalability, throughput, and support for large blocks of data. This is especially important for the MapReduce computing model it adopts and allows efficient distribution of data. GFS is optimized for sequential reads and local access, making write operations much slower than in conventional systems. It can be tuned for specific applications, although tuning for one application’s performance can make it ill-suited to a new application coming online. GFS has been reported to reach a peak throughput of about 580 MB/s across 340 nodes, often at the cost of latency.
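As a minimal sketch of the chunk model above (the 64 MB constant is from GFS; the function name is illustrative), a client first translates a byte offset into a chunk index before asking the master which chunkserver holds that chunk:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in GFS

def locate(offset: int) -> tuple[int, int]:
    """Translate a byte offset within a file into (chunk index, offset
    inside that chunk) -- the client-side step performed before asking
    the master which chunkserver holds the chunk."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# A read at byte 200,000,000 lands in the third chunk (index 2):
print(locate(200_000_000))  # (2, 65782272)
```

The master then returns the chunk handle and replica locations for that index; the client caches them and talks to the chunkservers directly, which keeps the master off the data path.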
2. MapReduce Computing Layer: MapReduce is a Google proprietary programming model, and an associated implementation, for processing and generating large datasets with a parallel, distributed algorithm on a cluster. In a MapReduce environment, the input reader splits the input from stable storage into chunks (typically 64 MB) and generates key/value pairs to feed to a map function. The Map function performs some operation on each pair and emits intermediate key/value pairs. Often a partitioner computes, for every key, the index of the reducer that will receive it, so that load balancing is enforced. The Reduce function coalesces the values from the Map function for every unique key, in sorted order, and the result is finally written to stable storage. The data transferred between map and reduce servers is often compressed to achieve high throughput in limited-bandwidth environments, at the cost of higher CPU utilization. Google could process petabytes of data thanks to MapReduce’s model of parallelization.
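The map/shuffle/reduce pipeline described above can be sketched with the classic word-count example (a single-process toy illustrating the model, not Google’s distributed implementation):

```python
from collections import defaultdict
from itertools import chain

def map_fn(_, text):
    # Map: emit an intermediate (word, 1) pair for every word.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: coalesce all intermediate values for one key.
    return word, sum(counts)

def mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group intermediate pairs by key; the framework does this
    # between the map and reduce phases, feeding reducers sorted keys.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(k, v) for k, v in inputs):
        groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

print(mapreduce([(0, "to be or not to be")], map_fn, reduce_fn))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In the real system the groups are partitioned across many reduce workers rather than held in one dictionary, which is what makes the model scale.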
3. Bigtable: Bigtable is Google’s proprietary non-relational database, built to suit its scale of operations. It is a large-scale, fault-tolerant, self-managing system that supports terabytes of in-memory computation and petabytes of disk storage while carrying out millions of read and write operations per second. It does not support SQL and is optimized for applications at lower levels of storage. Machine replacement is extremely easy and fits well with Google’s cheap-commodity-server policy. Bigtable stores all its data in structures called tablets, built from 64 KB blocks of the SSTable data type. There are three types of servers in the Bigtable system. The first is the master server, which assigns tablets to tablet servers and tracks them. The tablet servers process the queries they receive. The third type, the lock server, performs access control and enforces mutual exclusion.
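As an illustrative sketch of the tablet model (the server names and row ranges below are hypothetical), the master’s tablet assignments let a client route a row key to the tablet server whose row range contains it:

```python
import bisect

# Hypothetical tablet map: each tablet covers a contiguous row-key range
# ending at the listed key and is served by one tablet server.
tablet_ends = ["g", "p", "\xff"]           # sorted end keys of each range
tablet_servers = ["ts-1", "ts-2", "ts-3"]  # illustrative server names

def tablet_for(row_key: str) -> str:
    """Route a row key to the tablet server whose range contains it --
    the lookup that the master's tablet-tracking metadata enables."""
    return tablet_servers[bisect.bisect_right(tablet_ends, row_key)]

print(tablet_for("alice"))  # ts-1
print(tablet_for("mail"))   # ts-2
```

Because tablets are contiguous sorted ranges, a binary search over range boundaries suffices; Bigtable clients cache this mapping so most reads never touch the master.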
On top of the system described above, Google runs its various applications, including Gmail. From a software perspective, Gmail integrates well with the Google-specific infrastructure while presenting information to the user through the most accessible technology. Gmail is built using Google App Engine, which itself combines the Google Web Toolkit (a set of libraries for building rich front-end JavaScript tools), Django, Go, and Google Cloud SQL. Its rich interface, supported by high-performing query functionality, made it one of the fastest-adopted web-based email clients.
C. History and Facts about the Architecture
Google’s core architecture has remained as described above since its inception in 1998. Although certain components have undergone major changes, the overall guiding philosophy has remained the same. Because its design took scalability into consideration, Google could massively scale its operations. Also, because of the design decision to opt for highly distributed systems, it is one of the few internet service companies that run distributed data centers across continents. Although it did face some issues, one of which is an incident discussed here, its design enabled this massive distributed architecture, which processes petabytes of data daily.
Due to the novelty of its system, many of its models have been adopted in open-source technologies and have become mainstream models of computing. The most remarkable example is Apache Hadoop, derived from MapReduce. Other examples of a similar type are Apache HBase (a Bigtable equivalent) and Apache ZooKeeper (a Chubby equivalent).
III. Dependability Practices
Google’s architecture can tolerate fail-silent faults. These faults are easily caught by the system, which is built to recover on its own without external intervention. Another aspect of the geographically distributed system is that it provides higher reliability at the cost of maintaining a more complex architecture.
Other failures, which propagate from bugs within applications, are often caught and resolved by the Site Reliability Engineers (SREs), who have built an extensive review framework for any application that is production-ready but not yet deployed to the production environment. Often 20% of applications go back to developers for updates needed to meet SRE benchmarks, which include providing appropriate documentation, an important aspect of any fault-recovery plan.
A. Chubby Lock Service
Google’s philosophy has been to build systems under the assumption that they are bound to fail. Because of its approach of building systems out of commodity hardware, it builds software that can better handle faults in the hardware. One of Google’s best innovations in building a fault-tolerant system is the Chubby lock service, a distributed lock manager that organizes and serializes access to resources in a loosely coupled distributed system. Chubby focuses more on reliability and availability than on high performance. This hurts computation time but provides high availability that no longer requires human intervention during failover. Chubby achieves asynchronous consensus using Paxos, with lease timers to ensure liveness of the system. The rationale for choosing a lock service over a client-side Paxos library is that it reduces the number of servers a reliable client system needs in order to make progress.
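The lease-timer idea can be illustrated with a toy lock (an assumption-laden sketch, not Chubby’s actual API; real Chubby replicates this state across a cell of servers via Paxos): a holder must renew within the lease period, so a crashed client’s lock expires on its own and failover needs no human intervention.

```python
import time

class LeaseLock:
    """Toy lease-based lock in the spirit of Chubby (illustrative only)."""

    def __init__(self, lease_seconds: float):
        self.lease = lease_seconds
        self.holder = None
        self.expires = 0.0

    def acquire(self, client: str, now=None) -> bool:
        # Grant the lock if it is free or the previous lease has lapsed.
        now = time.monotonic() if now is None else now
        if self.holder is None or now >= self.expires:
            self.holder, self.expires = client, now + self.lease
            return True
        return False

    def renew(self, client: str, now=None) -> bool:
        # Only the current holder may extend a still-valid lease.
        now = time.monotonic() if now is None else now
        if self.holder == client and now < self.expires:
            self.expires = now + self.lease
            return True
        return False
```

If the holder stops renewing (a crash, a partition), the lease simply runs out and another client acquires the lock, trading a bounded delay for automatic recovery.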
B. Replication Strategy
Google claims that it implements live, or synchronous, replication for its services, including Gmail. Every mail is copied simultaneously to two data centers. Google maintains an SLA of 99.9% availability, a zero Recovery Point Objective (the amount of data loss that can occur when a system goes down), and a zero Recovery Time Objective (equivalent to instant failover). Google also maintains multiple redundant network routers within its data centers to route traffic through in case of router failure.
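Synchronous replication with zero RPO can be sketched as follows (data-center names are illustrative): a write is acknowledged only after every replica has committed it, so a single-site failure loses no acknowledged mail and any surviving site can serve reads after failover.

```python
class SyncReplicator:
    """Sketch of synchronous (live) replication across data centers."""

    def __init__(self, replicas):
        self.replicas = replicas  # dict of data-center name -> key/value store

    def write(self, key, value) -> bool:
        # Commit at every site before acknowledging: this is what makes
        # the Recovery Point Objective zero.
        for store in self.replicas.values():
            store[key] = value
        return True

    def read(self, key, from_dc=None):
        # After a failover, any surviving replica can serve the read.
        store = self.replicas[from_dc] if from_dc else next(iter(self.replicas.values()))
        return store.get(key)

mail = SyncReplicator({"dc-east": {}, "dc-west": {}})
mail.write("msg-1", "hello")
print(mail.read("msg-1", from_dc="dc-west"))  # hello
```

The trade-off hidden in this sketch is write latency: the acknowledgement waits on the slowest replica, which is the price of a zero RPO.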
C. Threats to Dependability
Although the dependability practices used by Google are among the best by industry standards, the majority of the downtime the company has faced in its Gmail application arose from software updates gone wrong or network routing issues. Learning from its failures in the early years, Google has incrementally updated not just its bugs but its model and design approach; in the rare event that it faces downtime, it is usually due to a software upgrade gone wrong.
The next sections of this paper study the different incidents and analyze their occurrence in the view of the system studied.
IV. Outage in Contacts – 7 August 2008
A. Origin of Incident
The first incident occurred on 7 August 2008 at 2 PM Pacific Time and led to a denial of service, leaving some users with a 502 Bad Gateway error. Although Google did not report the exact number of users affected, it is estimated to have been a significant portion, and the outage affected Google’s online reputation.
The incident occurred due to a temporary outage in Google’s Contacts application, which Gmail accesses heavily. The outage in Gmail arose from a combination of factors, including a network overload that put additional load on the Contacts service. This was coupled with an update to Gmail that inadvertently increased the load on the Contacts application. In addition, the system saw high utilization from users, which caused the series of triggers that ultimately took the system down.
Although the incident report did not mention how the incident was detected internally, Google maintains that its Site Reliability Engineering team has systems that automatically fix themselves in case of failure, or alert the team of unreliable operations even before they manifest as failures. But this incident overloaded those systems, and before they could be fixed, users began receiving 500-series error messages due to timeouts.
B. Evolution and Impact of Incident
The incident lasted about 15 hours, affecting a subset of the more than 500,000 businesses and 10 million active users. A similar issue was detected on 24 September 2009.
Because of the extensive use of email in both business and personal communication, the outage caused an outbreak of complaints on social media, for which Google had to issue a public apology. Google also assured users in its statement that a complete review of the incident was underway and that it would be updating its software and systems to address the causes that led to the incident.
C. Mitigation and Recovery
To contain the incident, Google’s engineering team temporarily stopped all requests to the Contacts application from the Gmail interface and posted an alert warning customers that Contacts could not be displayed using the links within Gmail, and that they could use an alternate domain to access their contact lists.
To resolve the issue, Google added capacity to its Contacts service from its flexible-capacity server farms. Within a few hours of the capacity addition, the remaining Contacts features were restored within Gmail.
Google claimed that no data was lost in this incident, owing to the number of backups it creates, including on tape, which form a strong redundancy system from which data can be recovered.
V. Maintaining Data Locality – 24 February 2009
A. Origin of Incident
The incident began with a routine upgrade in one of Google’s European data centers, which took the system down from 9 AM to 12 PM GMT (1 AM to 4 AM PST) on Tuesday, 24 February 2009, during which Google Apps Gmail users were unable to access their accounts.
The root cause of the incident was identified to be a bug in the upgrade that caused an unexpected service disruption during the routine maintenance event. The software upgrade itself was meant to maintain the locality of user data to make more efficient usage of Google’s computing resources as well as achieve faster system performance for users.
On the day the maintenance task was performed, Google shifted traffic from the data center under maintenance to another instance; this step is by itself invisible to the end user. But the instance to which traffic was routed carried the new locality-maintaining upgrade, and a latent bug in it was triggered. This overloaded the destination data center, which in turn caused multiple downstream overload conditions as user traffic was automatically shifted in response to the failures.
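The overload pattern just described can be captured in a toy cascade model (the names, loads, and capacities below are illustrative, not Google’s figures): draining one data center spreads its traffic over the survivors, and any survivor pushed past capacity fails in turn, shifting even more traffic.

```python
def drain(load, capacity, down):
    """Return the set of data centers that end up failed after `down`
    is drained: failed centers' traffic is spread evenly over the
    survivors, and overloaded survivors fail in turn."""
    failed = {down}
    while True:
        up = [dc for dc in load if dc not in failed]
        if not up:
            return failed  # total outage: nothing left to absorb traffic
        share = sum(load[dc] for dc in failed) / len(up)
        overloaded = {dc for dc in up if load[dc] + share > capacity[dc]}
        if not overloaded:
            return failed  # survivors absorbed the shifted traffic
        failed |= overloaded

load = {"dc-A": 60, "dc-B": 60, "dc-C": 60}  # offered load, arbitrary units
print(drain(load, {"dc-A": 100, "dc-B": 100, "dc-C": 100}, "dc-A"))  # {'dc-A'}
print(drain(load, {"dc-A": 80, "dc-B": 80, "dc-C": 80}, "dc-A"))     # full cascade
```

The two calls differ only in headroom: with ample spare capacity the drain is invisible to users, while with thin headroom the same routine drain takes down every center, which is the shape of this incident.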
B. Evolution and Impact of Incident
Google’s premium customers, which included The Guardian and Salesforce, were guaranteed an SLA of at least 99.9% Gmail uptime per month, which translates to no more than about 45 minutes of downtime a month. If Google breached the SLA, it was liable for a penalty payable to those customers.
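The 99.9% figure translates into a monthly downtime budget as follows; for a 30-day month the budget is about 43 minutes, close to the 45-minute figure commonly quoted for this SLA.

```python
def downtime_budget_minutes(sla: float, days: int = 30) -> float:
    """Minutes of downtime allowed per month under an availability SLA."""
    return days * 24 * 60 * (1 - sla)

# 99.9% availability over a 30-day month:
print(round(downtime_budget_minutes(0.999), 1))  # 43.2
```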
C. Mitigation and Recovery
The mitigation and recovery process was carried out in steps, so the actual outage period varied from user to user. Google’s Site Reliability Engineers added additional capacity and adjusted the system to re-balance the load across the data centers, restoring access to users.
Google reported that it had isolated the bug that caused the outage in the first place and fixed it. They again maintained that no data was lost during the incident.
Google also implemented a CAPTCHA to make sure that any bots accessing the system won’t be able to cause additional load on the system.
Google’s official statement claimed that, due to its global infrastructure and user base, the traditional approach to system maintenance (for example, system upgrades on weekends) was impossible for the company to follow. This prompted it to build resilient, self-healing systems with a team monitoring them 24x7.
Candidly, Google stated on its official blog that such outages affected it equally, as it too used Gmail internally to communicate.
VI. Lessons Learned
A. Architecture Concerns
Google’s architecture is unique in that the company treated its systems less like an expense for running its applications and more like a systems-engineering problem. This led to an architecture built for reliability above all else. An example is the Chubby lock service, which goes beyond its function as a lock to keep functioning in the wake of network partitioning, often at the cost of system performance; that performance is regained via different routes, often at the application layer.
Even under such circumstances, its application faced downtime, which brings us to the core problem: given a system, one must assume that it will fail and design for fault tolerance, rather than attempt to build a fault-free system, which is often only a theoretical concept.
Gmail favors availability over consistency in the wake of a network partition. The belief is that the system will be eventually consistent: messages in Gmail can be in transit at any given point, so enforcing strict consistency would slow the system down. This allows Google to mask certain minor failures that would not be apparent to the end customer, given the size of the results returned or the understanding that message delivery takes time.
B. Preventive Measures
The data-locality incident in our list showed that the system can run through the IMAP/POP request routers. The system’s ability to shift to this configuration would have reduced its recovery time. As for why IMAP is not the default protocol: searching such a large set of messages would be slow over it, and hence Google avoids its use.
Another thing to note is that although there is data locality across the globally shared "Spanner" database, the routers in a specific location do not serve only local requests. If they did, containing the incident to the specific locality where the failure occurred could have been possible.
Google’s effort to build self-healing systems and carry out a strict monitoring scheme showed that with every subsequent outage, the outage time reduced significantly. This highlights one of the best practices in building reliable systems: the ability to self-heal and to raise failure alerts at the earliest opportunity.
C. Dependability Recommendations
Another important design philosophy that Google’s development team follows is a 70 (unit test) : 20 (integration test) : 10 (end-to-end test) philosophy. The approach followed at Google often relies on developers to do more regression testing than a dedicated testing team, and pushes for faster upgrades. Quick, innovative upgrades made Google one of the fastest-growing internet companies in the world, but this philosophy may have hurt it in its early years, when the system was still undergoing major development. An integration test yielding false-positive results would have had a lower probability, and in turn a lesser effect on the final output. Moreover, as pioneers in cloud computing and Software-as-a-Service, with very few industrial precedents to look to, Google might have benefited more from end-to-end testing. Another critical observation in the incidents above is that most were related to a bug in an upgrade. This reflects the fact that with every software upgrade the system’s failure rate increases, giving a fluctuating failure-rate-versus-time curve, as opposed to hardware’s more stable bathtub curve.