Amazon Prime Day: A lesson in how not to handle an IT outage
Tips for avoiding and mitigating downtime on what could be your busiest time of the year


Nobody likes it when their website goes down. Not only is it highly embarrassing, for organisations that do business through the web it can be expensive in terms of both revenue and reputation. Yet sites do go down, and even the biggest and best-known names can have problems.
You know that piece of buttered toast that, when dropped, always seems to land butter side down? Unscheduled downtime can be like that: messy, and causing the worst possible disruption at the most inconvenient moment. Unlike scheduled downtime, which usually happens overnight or at times of minimal traffic, unexpected downtime tends to rear its head at your busiest moment.
Yet this is to be expected. After all, sites are most likely to fail when they are stretched the furthest. This is precisely what Amazon faced during its recent Prime Day outage, a period of catastrophic downtime during one of the company's busiest and most lucrative times of the year.
Customers reported issues with links not loading correctly and shopping carts mysteriously emptying. A shopper's nightmare, and Amazon's too.
While it's difficult to pin down exactly how much Amazon lost as a result of the outage, estimates from a number of data analytics firms suggest the figure could be as much as $100 million.
Amazon isn't the first to experience a serious outage, and it won't be the last. Last year British Airways suffered an IT crash affecting 75,000 passengers, for example. In fact, recent figures from the Ponemon Institute suggested that more than half of organisations globally are unprepared for IT outages.
Plan for the worst
Still, it's an ill wind that blows nobody any good. While Amazon takes a look at its internal systems and figures out how to try to prevent a similar thing happening in the future, we can all learn some lessons.
One key piece of learning that applies to every site, no matter how large or small, is to plan for likely scenarios.
One solution may be to periodically put a system under stress to test it for weaknesses. Management consultancy McKinsey published an analysis of Prime Day which discussed the use of SWAT teams that bring together a range of roles, such as merchandisers, product leads, customer service, fulfilment, media managers and IT.
These teams can stress-test systems before they go live to find weak points that need addressing, and can be on hand during the event itself to help troubleshoot issues in real time.
This approach can be scaled down to even the smallest of organisations, which can stress-test new services or new website areas using their own mix of multidisciplinary skills.
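For a small team, a stress test does not need specialist tooling to get started. The following is a minimal sketch in Python, not a description of how Amazon or McKinsey run their tests. It assumes a hypothetical request_fn that performs one request against the service under test and raises an exception on failure; the function name stress_test and its parameters are illustrative only.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def stress_test(request_fn, total_requests, concurrency):
    """Call request_fn total_requests times across `concurrency` threads.

    request_fn is a hypothetical zero-argument callable that performs one
    request and raises an exception if the request fails.
    Returns the error count plus 95th-percentile and worst-case latency.
    """
    if total_requests < 1 or concurrency < 1:
        raise ValueError("total_requests and concurrency must be at least 1")

    def timed_call(_):
        start = time.perf_counter()
        try:
            request_fn()
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    errors = sum(1 for ok, _ in results if not ok)
    # Nearest-rank 95th percentile: the smallest sample that is greater
    # than or equal to 95% of all samples.
    rank = (95 * len(latencies) + 99) // 100
    return {
        "requests": total_requests,
        "errors": errors,
        "p95_seconds": latencies[rank - 1],
        "max_seconds": latencies[-1],
    }
```

In practice request_fn might fetch a product page on a staging copy of the site. Raising concurrency step by step and watching where errors or the 95th-percentile latency start to climb gives a rough idea of where the weak points are.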
Check on cloud provision
When a big new event or service is likely to pull in additional traffic, including many people who have never been to a site before, it's important to ensure there is plenty of bandwidth.
Reports suggest Amazon could have done better in this respect. McKinsey's analysis has noted that some waits for the Amazon site to load were as long as 20 seconds. Internal documents obtained by CNBC also showed that Amazon simply didn't have enough servers on hand to cater for traffic, forcing the company to display a simpler front page and suspend international traffic for a time to reduce workload. This all happened right at the start of Prime Day.
And it gets worse. The documents also revealed that the process for automatically adding servers based on demand, known as 'auto-scaling', may have failed. The result was that Amazon had to add servers manually, which takes time and is far less efficient.
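To make the idea of auto-scaling concrete, here is a minimal sketch of the kind of calculation a scaling rule might perform. It is a generic proportional rule, similar in spirit to the one documented for the Kubernetes Horizontal Pod Autoscaler, and is not taken from Amazon's internal systems. The function name, parameters and thresholds are all assumptions made for illustration.

```python
import math


def desired_servers(current_servers, avg_utilisation, target_utilisation,
                    min_servers, max_servers):
    """Return how many servers a simple proportional rule would ask for.

    avg_utilisation and target_utilisation share the same unit (for example
    percent CPU). All names and limits here are hypothetical.
    """
    if current_servers < 1 or target_utilisation <= 0:
        raise ValueError("need at least one server and a positive target")
    # Scale the fleet in proportion to how far load is from the target.
    raw = math.ceil(current_servers * avg_utilisation / target_utilisation)
    # Clamp to configured limits so a bad reading cannot scale without bound.
    return max(min_servers, min(max_servers, raw))


# Ten servers running at 90% against a 60% target: the rule asks for 15.
print(desired_servers(10, 90, 60, min_servers=2, max_servers=100))
```

The upper limit matters as much as the formula. If the calculated figure hits max_servers, or the request for new servers fails, the system is no longer scaling itself and someone needs to be alerted before capacity has to be added by hand.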
Amazon also had issues with authentication and video playback, as well as a breakdown in communication with warehouses wanting to scan products and ship goods to those customers who were able to place orders.
"One of the most compelling use-cases for cloud computing is the highly variable need: an event or campaign that runs for just a few days, for instance," says Peter Groucutt, MD of disaster recovery firm DataBarracks. "You want to be able to scale up fast enough to service the demand but also scale down quickly to keep costs low. On Prime Day even the biggest public cloud service provider (Amazon Web Services) wasn't able to keep up with the demand of Amazon's Prime Day promotion."
Some may be surprised to learn that Amazon, despite having its own cloud to work from, still relies on Oracle for part of its infrastructure. Although Amazon hasn't disclosed exactly what percentage of its systems is Oracle-based, we know that many of the company's e-commerce systems, built prior to the creation of AWS, run on Oracle.
Amazon has said it plans to migrate the entirety of its systems to AWS within the next two years, which could make things easier to manage during the busiest periods. Despite Oracle's claim that its database technology far outstrips anything provided by AWS, Amazon clearly thinks it can operate independently.
Corporate messages don't fly on social media
Regardless of how prepared you are, outages will happen. However, how you handle such outages can be as important as trying to stop them in the first place.
As is often the case with gigantic international companies, Amazon has remained relatively quiet about the details of the outage, instead focusing on the great success that Prime Day was. In a press notice issued shortly after, it boasted that Prime Day generated more than $1bn in sales.
But there was a public statement made on Twitter, which acknowledged the issue. Unfortunately, it was also very upbeat, the message being that regardless of how frustrated you are by the service, plenty of others have successfully bought their items. Oh, and there's still lots of time left.
Regardless of the size of your company, customers in these situations only care about two things: that you're working on a solution, and that you're working on a way to reimburse them for their wasted time.
Amazon's response naturally drew outrage, with more than two thousand comments from customers who, unappreciative of being told how well things were going otherwise, demanded answers from the retailer.
While Twitter is a great place for this first communication, it should always be accompanied by a prepared holding page on the website acknowledging and apologising for issues, and again explaining that you are working on resolving them. You might also want to have resources on hand to update social feeds and the website information as the situation unfolds.
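A holding page is easiest to deploy if it is written and tested before anything goes wrong. The sketch below shows one possible shape for it, as a tiny Python WSGI application that answers every request with an apology and an HTTP 503 status. The wording of the page, the five-minute Retry-After value and the name holding_page_app are assumptions for illustration, not a record of what any retailer actually serves.

```python
HOLDING_HTML = """<!doctype html>
<html>
  <head><title>We'll be right back</title></head>
  <body>
    <h1>Sorry, something has gone wrong</h1>
    <p>We know the site is not working as it should and we apologise.
       Our team is working on a fix right now.</p>
    <p>Please check our social media feeds for updates.</p>
  </body>
</html>
"""


def holding_page_app(environ, start_response):
    """WSGI app that serves the prepared holding page for every request."""
    body = HOLDING_HTML.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Hint to browsers and crawlers to try again in five minutes.
        ("Retry-After", "300"),
        # Stop caches holding on to the apology once service resumes.
        ("Cache-Control", "no-store"),
    ]
    # 503 signals a temporary outage rather than a missing page.
    start_response("503 Service Unavailable", headers)
    return [body]
```

Any WSGI server can run this, including the one in Python's standard wsgiref.simple_server module for a quick local check. Returning 503 rather than 200 matters because it tells search engines and monitoring tools that the problem is temporary.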
Amazon might truly be described as one of the world's mega-retailers, and as such its reputation may not be dented by the Prime Day outage. For most other organisations offering online services, such an outage may prove near-catastrophic for business, but provided you plan for failure and can communicate effectively with your customers, it needn't be fatal.

Sandra Vogel is a freelance journalist with decades of experience in long-form and explainer content, research papers, case studies, white papers, blogs, books, and hardware reviews. She has contributed to ZDNet, national newspapers and many of the best known technology web sites.
At ITPro, Sandra has contributed articles on artificial intelligence (AI), measures that can be taken to cope with inflation, the telecoms industry, risk management, and C-suite strategies. In the past, Sandra also contributed handset reviews for ITPro and has written for the brand for more than 13 years in total.