Mystery Solved: Amazon Web Services Says “Overwhelmed Network Devices” Triggered Its Outage

If you’ve been wondering how that major Amazon Web Services (AWS) outage happened and nervously asking, “Could it happen again?” you’re not alone. The December 7th outage knocked out a slew of popular services like Venmo, Tinder, Disney Plus, and even Roomba, and it also put some Amazon deliveries on hold. Amazon experienced its last major outage around this time last year, when a number of sites and apps went down for hours.

Now, AWS has provided an explanation as to what caused the outage that downed parts of its own services, as well as the third-party websites and online platforms that utilize AWS. In a post on the AWS website, the company explains that an automated process caused the outage, which began around 10:30AM ET in the Northern Virginia (US-EAST-1) region.

“An automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network,” Amazon’s report says. “This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks.”

According to the report, this issue even impacted Amazon’s ability to see what exactly was going wrong with the system. It prevented the company’s operations team from using the real-time monitoring system and internal controls that they typically rely on, which explains why the outage took so long to fix. Amazon notes that service didn’t start improving until 4:34PM ET, and the issue was fully resolved at 5:22PM ET.

Since Amazon’s Support Contact Center also runs on the AWS network, customers weren’t able to create support cases for seven hours during the outage. Amazon’s Service Health Dashboard, which the platform uses to provide status updates, was also impacted, resulting in Amazon’s delayed acknowledgment of the issue. The company says that it’s working on a way to improve its response to outages, and plans on releasing a revamped version of the Service Health Dashboard that should help customers receive timely updates if an outage occurs.


Photo Credit: Gil C / Shutterstock.com