How to prepare for cloud outages
Amazon’s recent ‘brownout’ caused consternation across the cloud industry. So how do we assess the risk, and how should we plan for cloud outages?
When Amazon’s Elastic Compute Cloud (EC2) came to a juddering halt this April, the outage was traced to a data rerouting error made in a storage network in the US state of Virginia. The company explained the problem succinctly: “As with any complicated operational issue, this one was caused by several root causes interacting with one another.”
As official post-mortem statements go, that’s arguably a fairly insipid offering; but the bigger problem is that it gives us no clear foundation for deciding what, where and when we should plan and provision for such calamitous events, should they happen again.
A secondary cloud supplier is an expensive option, but what if a business can't afford that level of redundancy to cover its IT stack and network? How then should we plan for the ‘browned out’ cloud?
"Looking at cloud disaster recovery(DR) options; selecting a provider with a secondary cloud, or indeed two providers are options,” said Ian Masters, UK sales director at business continuity company Vision Solutions.
Masters continued, "Another option is to take this job on yourself. Running separate cloud services for DR with different providers means that you as the customer take on more of the management headache of making the DR service work; however, it also means there is greater independence in the set-up, and therefore less opportunity for disruption of services if one of the cloud providers goes down.
"This also deals with one of the bigger challenges of the cloud’s reliance onInternet connectivity; if the connection goes down between you and the first cloud provider, you can carry on using the second connection and provider,” said Masters.
The weight of debate
So the debate is starting to weigh the cost benefits of the cloud against the potential cost of protecting a cloud strategy with a back-up plan. Colin Bannister, VP and CTO of CA Technologies, has gone on record saying that we need to realise outages like Amazon’s will probably happen again, so businesses need to have the right back-up plan in place.
"This could include having a 'mirrored’ data centre in another part of the country on another power grid, connected to multiple network backbone providers, or providing redundant copies and failover machines in the data centre. All of a sudden this sounds a lot like the typical set-up a bank would have; so perhaps there is a lesson there for other businesses,” said Bannister.
Of course, the need to replicate your whole cloud set-up using a second vendor’s services would inevitably raise the cost of cloud computing. But, says CA’s Bannister, we also need to remember that cloud has many advantages beyond cost (elasticity, scalability, ubiquitous access, pay-per-use and so on), so we should approach each deployment on a case-by-case basis.
Let’s go back to basics
OK, so what do we know so far? Cloud environments need multiple access points for the end users of their applications; most cloud services are Internet-based; and a web-based cloud service provider exposes many of these access points online. Herein lies the fragility. Neil Thomas, cloud computing product manager at Cable&Wireless Worldwide, advocates a more WAN-based approach to cloud, which he says removes many of these risks.
"The best way to safeguard against redundancy, disaster recovery, brown outs, cloud outs and other cloud issues is a WAN-based approach combined with strict service level agreements with your cloud provider,” said Thomas. “WAN connectivity for server administration also allows IT professionals to build and test cloud environments easily before access is extended to users,” he added.
The greatest truism to embrace here must surely be that single points of failure are hard to eliminate. Cloud computing can remove some obvious single points of failure, but other failure modes can still lurk hidden away. This is the view of Dr Graham Oakes, an independent systems engineer, project manager and industry commentator.
"Identifying and eliminating every possible single point of failure is hard work and requires expertise. It also requires transparency into how things are configured across the stack – and this isa level of transparency which cloud vendors are often reluctant to give (not least because it restricts their flexibility to reconfigure as demand patterns change). Without full transparency, you have to design with resiliency on the basis of worst case assumptions, i.e. assuming that the vendor’s services will fail at some point. Yet this is the opposite of what most of the vendors’ marketing suggests,” said Oakes.
Oakes goes on to point out that the challenge presented here is huge. Resilient design is complex: it throws up tasks such as database synchronisation across multiple datacentres. To make matters more challenging still, resilient design is expensive: you need to pay for additional components, additional network traffic and so on.
“You also need to pay for expertise to design and operate the systems. You need to cover the costs of regular testing, and so on. For many organisations, it may be very hard to justify all this cost against the loss of revenue from an occasional outage. It’s a perfectly valid business decision to run with some single points of failure. The hardest thing to calculate in this equation is probably the reputational cost of an outage,” said Oakes.
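Oakes’ cost equation can at least be roughed out on the back of an envelope. The figures below are purely illustrative assumptions, but they show the shape of the comparison: expected annual outage loss versus the annual cost of a DR set-up, with the reputational cost left unpriced because, as he says, it is the hardest element to calculate.

```python
# Illustrative assumptions only; substitute your own figures.
outages_per_year = 2
hours_per_outage = 8
revenue_loss_per_hour = 5_000   # GBP, assumed
dr_annual_cost = 60_000         # GBP, assumed

expected_loss = outages_per_year * hours_per_outage * revenue_loss_per_hour
print(f"Expected outage loss: £{expected_loss:,} vs DR cost: £{dr_annual_cost:,}")
# With these figures the £80,000 expected loss exceeds the £60,000 DR
# cost, but reverse any one assumption and running with some single
# points of failure becomes, as Oakes says, a perfectly valid decision.
```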
Life is difficult
If Amazon’s recent debacle teaches us anything, it is surely that life is difficult. It also perhaps teaches us that cloud computing services, for all their outsourced and automated worth, are still centrally controlled by human beings. To err is human, yes; but to err in the cloud is inhuman, or at least it should be.