The top three most likely reasons for a cloud outage


Cloud service providers appear to disclose the cause of an outage only when it affects a large swath of customers.

Yet our research reveals that the majority of high-impact, high-cost outages are what experts call “micro-outages”: failures confined to one application, one service area, one region, or one availability zone. If your business depends on any of those ‘ones’, you may never get a full readout of what happened or how to mitigate the fallout in future.

Here we’ll outline the three most likely causes of those micro- and macro-outages and the best ways to plan around them.

Outages caused by human error

Trace the cause of almost any cloud outage far enough back and you’ll find some semblance of human error. The most glaring failings come to light when simple duties or tiny details are overlooked by the people charged with completing them.

“In my experience, the most likely source of cloud outages is human error, somebody has released some code with a bug, some configuration file was not updated,” says Roy Illsley, chief analyst of IT Operations at Omdia. 

What’s an example of a human error-induced outage?

In February 2017, AWS experienced a major S3 outage after a technician entered a command incorrectly.

What was the cause?

“An authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” according to an excerpt from the firm’s post-event summary (PES). “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
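In its post-event summary, AWS also said it modified the tool to remove capacity more slowly and added safeguards to prevent capacity from dropping below a subsystem’s minimum required level. A minimal sketch of that kind of guard (the function name, parameters, and numbers are illustrative, not AWS code):

```python
# Hypothetical sketch of a minimum-capacity safeguard: cap any removal
# request so a subsystem never drops below its required floor, even if
# an operator mistypes the count.

def plan_removal(active_servers: int, to_remove: int, min_capacity: int) -> int:
    """Return how many servers may safely be removed without taking the
    subsystem below its minimum required capacity."""
    if to_remove < 0:
        raise ValueError("cannot remove a negative number of servers")
    allowed = max(0, active_servers - min_capacity)
    return min(to_remove, allowed)

# A typo that inflates the count is capped rather than executed blindly:
print(plan_removal(active_servers=100, to_remove=90, min_capacity=80))  # 20
print(plan_removal(active_servers=100, to_remove=5, min_capacity=80))   # 5
```

The point is that the tool, not the operator, enforces the floor: a mistyped input degrades into a partial, safe removal instead of a multi-hour outage.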

What was the impact?

The outage lasted two hours, from 9:37 a.m. to 11:37 a.m. PST on February 28, 2017, and knocked several major sites offline, including Trello, Quora, IFTTT, and Wix. Amazon’s own Alexa system and the Nest application were also affected.

Outages caused by automatically scaled errors

Small errors can become big problems quickly, especially when they form part of an automated series of actions affecting cloud applications. That’s according to Brett Ellis, senior analyst at Forrester, reviewing two recent automation-related outages.

What’s an example of an automatically scaled error-induced outage?

In December 2021, an automated activity to scale the capacity of an AWS service triggered a nine-hour outage affecting popular services such as Disney+ and Netflix. It also knocked out internet-dependent devices such as Roombas, Alexa-powered door locks, and Ring security cameras.

What was the cause?

An automated scaling system didn’t function as expected, causing API requests to fail; customers couldn’t create new API calls or modify existing ones. The failures created a backlog of connection requests that overwhelmed the link between the AWS internal network and its main network for the US-East-1 region, and as the retries piled up, latency between the networks grew even worse. The issue lasted from 7:30 a.m. to 2:20 p.m. PST.

What was the impact?

The impact of the outage was expansive, hitting Amazon’s retail arm during the most important month of the year for that sector. Reports of final exam delays and lost access to online educational materials poured into Twitter, Reddit, and other social media platforms. It didn’t help that the AWS Service Health Dashboard wasn’t live for nearly the entire outage, leaving room for widespread speculation about the cause.

Outages caused by natural disasters (or poor planning for one)

Natural disasters, and the lack of planning for them, have wreaked havoc on cloud network availability in recent years. AWS isn’t the only CSP with a string of outages to its name: Microsoft Azure found that weather-related incidents can expose failings in facility management.

What’s an example of a natural disaster-induced outage?

In August 2023, Microsoft experienced an outage that was initially caused by an electrical storm near the firm’s Sydney data center region. It soon became clear, though, that other factors contributed to what turned out to be a day-long outage in its Sydney availability zone.


What was the cause?

An electrical storm on August 30 caused a power sag that affected cooling units in one of Microsoft’s Sydney data centers. A facility that depended on five chillers to modulate temperature was reduced to just one, because one of its two backup chillers failed to kick in. That forced a shutdown of servers to manage the facility’s thermal load, and Microsoft later admitted that three staffers on site were too few for a facility of that size.

What was the impact?

Oracle Cloud and NetSuite also experienced shutdowns due to the chiller/staffing issue at the Microsoft facility. “… the rush to expand services and geographic availability has impacted the implementation of geographically separated redundant systems, leaving hyperscaler services vulnerable to location-based failures due to weather, earthquake, and fire,” Ellis said.

How can businesses mitigate cloud outages?

For enterprises with core systems in the cloud, it’s imperative to leverage as many of the cloud service provider’s resources as possible. Ellis outlines three steps enterprises can take to identify, track, and report cloud outages:

Identify: Enumerate the nested components of the services procured through cloud service providers so that those components can be tracked.  

Track: Connect service and product owners to relevant alerts. “If you purchase a SaaS healthcare application built on top of Salesforce, you might want to inquire with the vendor about where they host their components so you can add that environment into your Network Operations Center (NOC),” says Ellis. Relate regional monitoring data to an actual app purchased and have an internal service owner assigned, as well as an alerts contact. “You end up with something like a weather report for your apps where you know there is a possibility of a slowdown in your app because one of the component services is degraded or offline,” he adds.  

Report: Communicate with the hyperscaler in question about key service owners, and how to reach out. “As for updates and retrospectives, I think the expectation is that a SaaS vendor or cloud provider should have a page where updates happen on a regular interval, something between 15 minutes and an hour, depending on the service that is offline and how critical it is until the outage is resolved,” says Ellis. An example of an honest, public retrospective is Atlassian’s post-incident review for its April 2022 outage.
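The tracking step Ellis describes can be sketched in code: map each purchased app to the component cloud services it depends on, feed in the latest status for each component, and roll the result up per app. All service names and statuses below are illustrative, not tied to any real provider’s status API.

```python
# Hypothetical sketch of Ellis's "weather report for your apps": each app
# is mapped to the component services it depends on, and its status is the
# worst status among those components.

APP_COMPONENTS = {
    "hr-portal":  ["salesforce-platform", "aws-us-east-1-s3"],
    "storefront": ["aws-us-east-1-s3", "aws-us-east-1-api-gateway"],
}

def app_weather(component_status: dict[str, str]) -> dict[str, str]:
    """Roll component statuses ('ok', 'degraded', 'offline') up to one
    status per app. Components missing from the input are assumed 'ok'."""
    severity = {"ok": 0, "degraded": 1, "offline": 2}
    report = {}
    for app, components in APP_COMPONENTS.items():
        worst = max(components, key=lambda c: severity[component_status.get(c, "ok")])
        report[app] = component_status.get(worst, "ok")
    return report

# One degraded component service flags every app that depends on it:
print(app_weather({"aws-us-east-1-s3": "degraded"}))
# {'hr-portal': 'degraded', 'storefront': 'degraded'}
```

In practice the `component_status` input would come from the providers’ status feeds your NOC already ingests; each flagged app can then be routed to the internal service owner and alerts contact Ellis recommends assigning.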

Points of failure, and therefore failure modes, vary from provider to provider. Be sure to understand the technical risks so you can mitigate them in line with your firm’s needs and available resources; Ellis recommends discussing those risks before contracting with any hosted service, be it SaaS or cloud infrastructure.

Lisa Sparks

Lisa D Sparks is an experienced editor and marketing professional with a background in journalism, content marketing, strategic development, project management, and process automation. She writes about semiconductors, data centers, and digital infrastructure for tech publications and is also the founder and editor of Digital Infrastructure News and Trends (DINT), a weekday newsletter at the intersection of tech, race, and gender.