Amazon's huge cloud computing outage tracked to bad keystrokes

By Brian Fung

Washington Post

2 Mar, 2017 08:24 PM

An outage across large parts of the internet has been blamed on a simple employee mistake.

Amazon is back with an apology and an explanation for a high-profile malfunction that caused websites across the Internet to grind to a halt for hours on Wednesday.

The online retail giant, which runs a popular cloud computing platform for sites such as Airbnb, Netflix, Reddit and Quora, is blaming the outage on a simple - and perhaps somewhat amusing - employee mistake.

A team member was doing a bit of maintenance on Amazon Web Services on Tuesday, trying to speed up the billing system, when he or she typed in the wrong input - and inadvertently took more servers offline than the procedure called for, Amazon said in a statement. With a few mistaken keystrokes, the employee wound up knocking out systems that other parts of AWS depend on to work properly.

The cascading failure meant that many websites could no longer make changes to the information stored on Amazon's cloud platform. For everyday users, that meant being unable to load pages, transfer files or take other actions on some of the sites they regularly use.

"In this instance, the tool used allowed too much capacity to be removed too quickly," Amazon said. "We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."

Translation: Employees will no longer be able to unplug whole parts of the Internet by mistake.
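For readers curious what such a safeguard might look like, here is a minimal sketch in Python. It is not Amazon's tool: the names (Subsystem, remove_capacity, max_per_step) and the numbers are hypothetical, chosen only to illustrate the two changes Amazon described, removing capacity more slowly and refusing any request that would drop a subsystem below its minimum.

```python
from dataclasses import dataclass


class CapacityRemovalRefused(Exception):
    """Raised when a request would take a subsystem below its minimum."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical minimum capacity for this subsystem


def remove_capacity(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> int:
    """Take servers offline, returning how many were removed in this step.

    Safeguard 1: refuse outright if the full request would breach the minimum.
    Safeguard 2: remove at most max_per_step servers per call (slow removal).
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")

    if subsystem.active_servers - requested < subsystem.min_required:
        raise CapacityRemovalRefused(
            f"{subsystem.name}: removing {requested} servers would leave fewer "
            f"than the required {subsystem.min_required}"
        )

    step = min(requested, max_per_step)
    subsystem.active_servers -= step
    return step


# A mistyped request is rejected instead of taking the subsystem down.
index = Subsystem(name="index", active_servers=10, min_required=8)
try:
    remove_capacity(index, requested=5)
except CapacityRemovalRefused as err:
    print("refused:", err)
```

In this sketch an operator who asks for five servers when only two can safely go gets an error rather than an outage, and even a valid request is carried out a couple of servers at a time.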

Amazon said it was sorry for the outage's effect on its customers and vowed to learn from the incident. One immediate next step? The company said it will subdivide its servers even more than before "to reduce blast radius and improve recovery," should something like this happen again.