Why knowing your data is key to disaster recovery
There are 13 types of data crucial to your firm's infrastructure - can you list them all?
What is the most important task carried out by your company's or clients' IT infrastructure? Take a second to think about it. It's a key question regardless of whether you're integrating technologies into your own firm or planning an integration strategy for an ongoing or prospective customer.
Picture all the IT processes going on at any one time (perhaps it'd be best to grab a pen and paper here) and map them out. What feeds into what? Which tasks and databases could you 'turn off' and still be able to run the business at near optimal capacity?
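To make that mapping concrete, a simple dependency graph is one way to record it. The sketch below, using hypothetical system names, walks the graph from a business-critical process to list everything it relies on; any system not reached from a critical process is a candidate for 'turning off'.

```python
# A minimal sketch of mapping workload dependencies. The system names
# here are hypothetical and only illustrate the idea.
from collections import deque

# Each process maps to the systems or databases it depends on directly
dependencies = {
    "order_processing": ["crm_db", "payment_gateway"],
    "payment_gateway": ["auth_service"],
    "reporting": ["crm_db", "data_warehouse"],
    "crm_db": [],
    "auth_service": [],
    "data_warehouse": ["crm_db"],
}

def critical_for(process: str) -> set:
    """Return every system the given process depends on, directly or indirectly."""
    seen, queue = set(), deque(dependencies.get(process, []))
    while queue:
        system = queue.popleft()
        if system not in seen:
            seen.add(system)
            queue.extend(dependencies.get(system, []))
    return seen

# Everything order_processing relies on must be restored first in a disaster
print(critical_for("order_processing"))  # {'crm_db', 'payment_gateway', 'auth_service'}
```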
It's not as simple a question as you might have thought, but it's one that should be asked whenever you integrate or sell any new technology. With IT outages increasingly plaguing organisations across all sectors, it has never been more important to know your systems inside and out, to understand which workloads depend on one another, and to have a mechanism that can get you back online quickly.
Test your systems to death
There is a whole host of data types that play a role in your firm's infrastructure. Generally speaking, there are 13: big data; structured, unstructured and semi-structured data; time-stamped data; machine data; spatiotemporal data; open data; dark data; real-time data; genomics data; operational data; high-dimensional data; unverified outdated data; and translytic data.
The definitions of each kind become increasingly murky, and it's not necessary to know the minutiae of what each byte is made up of. The main point is that it's vital to know which processes the data is responsible for and how readily available it is.
The only way to know all of this is to carry out a full system audit, and to begin with it can be quite an ugly sight. You don't know what you need unless you know what you're missing, which is why it's so important to carry out disaster recovery (DR) testing.
A DR test puts the IT infrastructure in a worst-case scenario and shows administrators how long it would take to recover data, restore business-critical applications and resume normal service. It's a practice that is often neglected, but it's necessary nonetheless. Without DR testing and the subsequent creation of a disaster recovery plan, organisations will likely find that the IT systems they may have spent a lot of money on are ill-equipped to perform as and when they are needed.
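To make the output of such a test actionable, it helps to compare the measured recovery time for each workload against an agreed target. The sketch below shows one way to do that; the workload names, targets and measurements are purely illustrative assumptions.

```python
# A minimal sketch of checking DR test results against recovery time
# objectives (RTOs). All workload names and figures are hypothetical.
rto_minutes = {"payments": 20, "email": 240, "archive_reports": 2880}

# Measured recovery times from the most recent DR test, in minutes
dr_test_results = {"payments": 35, "email": 180, "archive_reports": 1500}

for workload, target in rto_minutes.items():
    actual = dr_test_results.get(workload)
    status = "OK" if actual is not None and actual <= target else "MISSED"
    print(f"{workload}: target {target} min, actual {actual} min -> {status}")
```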
There are different types of DR testing available to you. The simplest, the plan review, is where the administrators responsible for the plan thoroughly scrutinise it to identify any inconsistencies or missing elements. The 'tabletop test', meanwhile, involves stakeholders walking through the plan step by step to determine whether they know what they have to do and to flag any errors.
The most intensive is a full-blown simulation where a variety of disaster situations are played out to see if and how quickly the team can get crucial systems back online and fully operational. All of these methods have their pros and cons, but all work towards the common goal of giving you a full overview of the system.
There's no 'silver bullet' for IT outages
With a full understanding of the system, administrators can then implement mechanisms to get their systems back online, and fast, in the face of a disaster. The best way to do this is by adopting a zero-day approach to recovery architecture. This approach allows organisations to prioritise workloads and minimise downtime without having to worry about lost data.
A zero-day recovery architecture is a service that enables administrators to quickly bring workloads or data back into operation in the event of an outage, without having to worry about whether the workload is still compromised.
An evolution of the 3-2-1 backup rule (three copies of your data, stored on two different media, with one backup kept offsite), zero-day recovery enables an IT department to partner with the cyber team and create a set of policies that define the architecture for how data backups are stored offsite, normally in the cloud. The policy assigns an appropriate storage cost, and therefore recovery time, to each workload according to its strategic value to the business. It could, for example, mean that a particular workload needs to be brought back into the system within 20 minutes, while another workload can wait a couple of days.
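As a rough illustration of how such a policy might be expressed, the sketch below maps hypothetical workloads to recovery tiers. The tier names, time targets and workloads are assumptions for the example, not any vendor's actual schema.

```python
# A minimal sketch of a tiered recovery policy: each tier trades storage
# cost against recovery time, and each workload is assigned a tier
# according to its value to the business. All values are hypothetical.
from dataclasses import dataclass

@dataclass
class RecoveryTier:
    name: str
    max_recovery_minutes: int  # how quickly the workload must be restored
    offsite_copies: int        # backup copies held offsite, e.g. in the cloud

TIERS = {
    "critical": RecoveryTier("critical", max_recovery_minutes=20, offsite_copies=2),
    "standard": RecoveryTier("standard", max_recovery_minutes=24 * 60, offsite_copies=1),
    "archive":  RecoveryTier("archive", max_recovery_minutes=2 * 24 * 60, offsite_copies=1),
}

# Hypothetical mapping of workloads to tiers, agreed between IT and the cyber team
workload_policy = {
    "order_processing": "critical",
    "internal_wiki": "standard",
    "old_project_files": "archive",
}

for workload, tier_name in workload_policy.items():
    tier = TIERS[tier_name]
    print(f"{workload}: restore within {tier.max_recovery_minutes} min, "
          f"{tier.offsite_copies} offsite copies")
```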
Anyone who claims to have a miraculous 'silver bullet' for IT outages that prevents them altogether should be treated with a great deal of suspicion. We can mitigate the effects, but we can't eliminate them altogether. Whether it's a malicious case of ransomware or simply an environmental issue such as a power outage or a loose connection, outages will happen. The most important thing IT administrators can do is to make sure that when those outages do happen, downtime is minimised and recovery is as quick as possible.
Alex Fagioli is CEO of Tectrade