Last week's hours-long Amazon outage occurred because of human error.
On February 28, 2017, a glitch at Amazon Web Services caused an outage that lasted for hours and affected many websites and apps. Medium, Business Insider, Slack, Quora, Giphy and other sites experienced problems. Even the site “Is It Down Right Now?”, which, ironically, shows which other sites are down, went down. When the automation service IFTTT went offline, devices like smartphone-controlled light switches stopped working properly, leaving some people unable to do something as simple as turning on the lights at home.
Beyond the frustration and inconvenience, however, the outage was symptomatic of a much deeper problem in how the Internet works.
So what exactly caused the Internet to, well, break? Amazon runs a cloud storage service called the Amazon Simple Storage Service (S3). Tens of thousands of Internet services and sites use S3 for data backups and hosting. Once S3 broke down, it took these dependent services down with it.
S3 had been experiencing problems with its billing service, which an Amazon engineer had been trying to fix. The engineer had taken some servers in S3's billing subsystems offline to find out what the problem was.
One of the commands the engineer entered included a mistyped input, which took more servers offline than intended. Those servers supported two other S3 subsystems, one of which handled metadata and location information. S3 is designed so that some of its subsystems can fail without knocking the entire system down, but that design did not hold this time. Taking too many of the wrong servers offline set off something akin to a domino effect of errors within the system.
Human error wasn't the only factor in the Amazon outage, however. The affected subsystems had to be restarted, and at S3's current scale that restart was slow. “S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected,” Amazon said in a statement.
Amazon fixed the problem later that same day, but repairs took hours. The company apologized for the outage and the problems it caused. It also said it has put safeguards in place against similar human errors in the future.
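The article doesn't say what those safety measures look like. One plausible form is a check that refuses any command that would take too many servers offline at once. The Python below is only a sketch under that assumption: the function name, the minimum-capacity floor and the 10 percent per-command cap are all hypothetical, not Amazon's actual tooling.

```python
class CapacityError(ValueError):
    """Raised when a removal request would breach a safety limit."""


def plan_removal(active, requested, min_required, max_percent=10):
    """Validate a request to take servers offline (hypothetical guard).

    active       -- servers currently in service
    requested    -- how many the operator asked to remove
    min_required -- assumed floor the subsystem needs to stay healthy
    max_percent  -- assumed cap on the share of servers one command may remove
    """
    if requested <= 0:
        raise CapacityError("requested must be a positive number of servers")
    if requested > active:
        raise CapacityError("cannot remove more servers than are active")

    # Refuse anything that would drop the subsystem below its floor.
    if active - requested < min_required:
        raise CapacityError(
            f"removing {requested} would leave {active - requested} servers, "
            f"below the minimum of {min_required}"
        )

    # Refuse large single-step removals, so a typo cannot remove a big
    # share of capacity in one command.
    per_command_cap = active * max_percent // 100
    if requested > per_command_cap:
        raise CapacityError(
            f"removing {requested} exceeds the per-command cap of {per_command_cap}"
        )

    return requested
```

With 100 active servers and a floor of 80, a request to remove 5 passes, while a mistyped 50 is rejected before any server goes offline.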
The outage brings a larger problem to light. At present, much of the Internet relies heavily on a handful of big players such as Amazon and Google, which provide the infrastructure essential to keeping it running. As this outage shows, relying too much on these companies can be risky. If another company experiences a glitch, or if one of its engineers makes a similar mistake, could an outage like this recur?
These companies, however, are the only ones who have the resources to keep the Internet running. For now, let's hope that they've learned a few important lessons from the Amazon outage.