That’s how long the disaster recovery (DR) manual is for a midsized law firm that I recently visited in the US. Now, to many of us, that might not seem so bad. Four hundred pages is a typical novel length, so it’s no big deal. Right?
In the world of DR that is a big deal.
Think about it. Something goes wrong, resulting in an IT outage. It could be a natural disaster, a spilled cup of coffee, or a backhoe that takes out a critical cable. Disasters happen all the time. The problem only starts with the disaster; the recovery can be just as painful, if not more so.
Now imagine you’re in charge of recovering from this IT outage and you need to follow 400 pages of manual, step-by-step instructions to bring your business -- and I really do mean business, given IT is the lifeblood of the modern enterprise -- back online. Imagine the power in your home went out and you had to flip through 400 pages of instructions to restore TV, internet, and phone. I don’t know about you, but I would just opt to live offline for a few days. Unfortunately, your business can’t afford that “luxury.”
Given this anecdote, it’s no surprise that one study found, on average, businesses suffer 2.2 days of IT downtime, costing nearly $400,000. It’s also no surprise that human error and mistakes are the No. 1 contributor, exacerbating outages due to natural disasters with self-inflicted downtime. Still not worried? Then consider this:
In the past two years, more than 50% of businesses experienced an unforeseen interruption, and the vast majority (81%) of these interruptions caused the business to be closed one or more days.
The survival rate is less than 10% for companies without a disaster recovery plan.
So what are the common ways companies avoid or minimize disasters? This SearchDisasterRecovery article does a good job of summarizing three best practices:
Prioritize disaster recovery applications and services. Conduct a business impact assessment, determine tier 0 services (services that must be online for apps to work), and then tier applications based on business priority. Automate how these services are recovered.
Don’t overlook RTOs and RPOs. Understand and map the dependencies among applications, as well as between apps and their data. Make sure systems are recovered in a manner where you can hit your recovery time and recovery point objectives (RTO and RPO).
Keep up with data replication. Ensure critical data is replicated among different failure zones, and choose between synchronous and asynchronous replication according to best practices 1 and 2 above. Coordinate replication efforts at the infrastructure, database, and application layers to make sure you’re not over- or under-replicating.
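To make the first two practices a little more concrete, here is a minimal sketch of turning a dependency map and a set of business tiers into an automated recovery order. The service names, the dependency map, and the tier numbers are all hypothetical, and a real plan would also need to track RTO and RPO targets per service; this only shows the ordering idea.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each service lists what must be online first.
DEPENDS_ON = {
    "dns": [],
    "directory": ["dns"],
    "database": ["dns"],
    "document-mgmt": ["directory", "database"],
    "billing": ["directory", "database"],
}

# Hypothetical business tiers (0 = must be online for anything else to work).
TIER = {
    "dns": 0,
    "directory": 0,
    "database": 1,
    "document-mgmt": 1,
    "billing": 2,
}


def recovery_order(depends_on, tier):
    """Return services in an order that never starts a service before
    its dependencies. Services are released in waves; within each wave
    the more critical (lower-numbered) tier goes first."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    order = []
    while sorter.is_active():
        wave = sorted(sorter.get_ready(), key=lambda s: (tier[s], s))
        for service in wave:
            order.append(service)
            sorter.done(service)
    return order


if __name__ == "__main__":
    for step, service in enumerate(recovery_order(DEPENDS_ON, TIER), start=1):
        print(f"{step}. recover {service} (tier {TIER[service]})")
```

A short, generated runbook like this is a lot easier to follow under pressure than a few hundred pages of manual steps, and the cycle check catches dependency mistakes before an outage does.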
It’s this third practice that I want to discuss further. I came across a great Storage Decisions video from Jon Toigo. He talks about the value of a storage hypervisor (his term for software-defined storage) and data replication. In short: the flexibility of software-defined storage with built-in data replication changes both the capital and operational burden associated with data protection. It eliminates disparate infrastructure -- and the management overhead of such solutions -- and adds the cost savings of running virtualized storage on commodity servers.
But data replication is an evolving capability within the relatively new category of software-defined storage (SDS). As Jon described, modern SDS platforms have replication built in. The good news is that, as a DR planner and IT admin, you’ve never had more choices in terms of how to replicate data. But the bad news is . . . well . . . you’ve never had more choices. The key is to rethink how you replicate data to protect your critical apps and services. Gone are the days when data replication was a painful, static, one-way flow of data from site A to site B.
So how do you use SDS and data replication to your advantage in your DR process? Here are three tips:
Tip #1: Tune your storage replication factor based on criticality. As discussed above, you should align DR techniques with how critical the applications and data are to your business. Tier 0 and tier 1 apps can and should be protected differently than tier 2 and below. Use a three-way replication factor as a minimum for tiers 0-1, but don’t be afraid to go to a four-way (or higher) replication factor if business requirements dictate it.
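As a rough illustration of Tip #1, a tier-to-replication-factor policy can be as simple as a lookup with a floor. This is only a sketch: the function and constant names are made up, and the two-way default for tier 2 and below is my assumption for the example, not a recommendation. Your SDS platform will have its own way of setting this.

```python
# Minimum replication factor for the most critical tiers (three-way, per the tip).
MIN_REPLICAS_BY_TIER = {0: 3, 1: 3}

# Assumption for illustration only: tier 2 and below default to two copies.
DEFAULT_MIN_REPLICAS = 2


def replication_factor(tier, requested=None):
    """Return the replication factor to use for an app in the given tier.

    `requested` lets the business ask for more copies (four-way or higher),
    but never fewer than the minimum for that tier.
    """
    if tier < 0:
        raise ValueError("tier must be 0 or greater")
    minimum = MIN_REPLICAS_BY_TIER.get(tier, DEFAULT_MIN_REPLICAS)
    if requested is None:
        return minimum
    return max(minimum, requested)


if __name__ == "__main__":
    print(replication_factor(0))               # tier 0 with no special request
    print(replication_factor(1, requested=4))  # business asks for four-way
    print(replication_factor(3, requested=1))  # request below the floor is raised
```

The point of the floor is that nobody can quietly under-protect a tier 0 service, while anyone with a business case can still ask for more.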
Tip #2: Use multiple sites to improve availability. Replicating your data to three nodes in a single data center is good. Replicating data to three different nodes in three different data centers is even better. Of course, you need to plan around network connectivity and latency, but a good DR plan will have already factored these in.
Tip #3: Add a cloud DR site. For even better protection, don’t just replicate to your own sites. Add a public cloud as a DR storage site. Run an instance of storage software in AWS, Azure, Google, or the cloud provider of your choice. Coupling multi-site with a public cloud helps mitigate the risk that an issue affecting your on-premises infrastructure, such as a firmware upgrade, will take down all sites.
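Tips #2 and #3 boil down to a placement rule: every replica lands on a different site, and at least one of those sites is a public cloud. Here is a minimal sketch of that rule. The `Site` type, the site names, and the `place_replicas` function are hypothetical and are not any vendor’s API; a real placement engine would also weigh latency, bandwidth, and cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    kind: str  # "on_prem" or "cloud"


def place_replicas(sites, replicas, require_cloud=True):
    """Pick `replicas` distinct sites, one replica per site.

    If require_cloud is True, one slot is reserved for a cloud site so that
    a problem hitting every on-premises site still leaves a copy intact.
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    if replicas > len(sites):
        raise ValueError("not enough sites for one replica per site")

    cloud = [s for s in sites if s.kind == "cloud"]
    on_prem = [s for s in sites if s.kind == "on_prem"]

    chosen = []
    if require_cloud:
        if not cloud:
            raise ValueError("no cloud site available")
        chosen.append(cloud[0])

    # Fill the remaining slots, preferring on-premises sites.
    remaining = [s for s in on_prem + cloud if s not in chosen]
    chosen.extend(remaining[: replicas - len(chosen)])
    return chosen


if __name__ == "__main__":
    sites = [
        Site("dc-east", "on_prem"),
        Site("dc-west", "on_prem"),
        Site("cloud-dr", "cloud"),
    ]
    for site in place_replicas(sites, replicas=3):
        print(f"replica on {site.name} ({site.kind})")
```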