DR Scenarios For You and Your Business: Getting Cloudy With It

In the last post we talked about the more traditional models of architecting a disaster recovery plan, covering icky things like tape, dark sites and split datacenters. If you’d like to catch up you can read it here. Those are all absolutely worthwhile ways to protect your data, but they are slow and limit your organization’s agility in the event of a disaster.

By now we have all heard about the cloud so much that we’ve either gone completely cloud native, dabbled a little, or just completely loathe the word. Another great use for “somebody else’s computer” is powering your disaster recovery plan. By leveraging cloud resources we can effectively get out of the hardware-management business where DR is concerned and have borderline limitless resources if needed. Let’s look at a few ways this can happen.

DRaaS (Disaster Recovery as a Service)

For now this is my personal favorite, though my needs are probably different from yours. In a DRaaS model you still take local backups as you always have, but those backups or replicas are then shipped off to a Managed Service Provider (MSP) aligned with your particular backup software vendor.

I can’t speak to the others from experience, but Cloud Connect providers in the Veeam Backup & Replication ecosystem are simple to consume and use. Essentially, once you buy the amount of space you need from a partner, you add the link and credentials they provide to your backup infrastructure, create a backup copy job with that repository as the target, and let it run. If you are bandwidth constrained, many providers will even let you seed the job by shipping them an external hard drive full of backups, so all you have to transfer over the wire afterwards is your daily changes. Meanwhile all of these backups are encrypted with a key that only your organization knows, so the data is nice and safe sitting elsewhere.
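
To make that last point concrete, here is a minimal Python sketch of the same principle: the backup is encrypted with a key that never leaves your organization before the file is shipped anywhere. Veeam handles this natively inside the product; the file names and key handling below are purely illustrative, and the sketch assumes the third-party cryptography package.

```python
# A minimal sketch of the encryption principle only: the backup is encrypted
# with a key that never leaves your organization before the file is shipped
# offsite. Veeam handles this natively; file names and key handling here are
# purely illustrative. Requires the third-party 'cryptography' package.
from pathlib import Path

from cryptography.fernet import Fernet

KEY_FILE = Path("backup-encryption.key")  # guard this like a password

def load_or_create_key() -> bytes:
    """Load the organization's key, generating one on first run."""
    if KEY_FILE.exists():
        return KEY_FILE.read_bytes()
    key = Fernet.generate_key()
    KEY_FILE.write_bytes(key)
    return key

def encrypt_backup(src: Path, dst: Path) -> None:
    """Encrypt a backup file so only the key holder can read it offsite."""
    # Note: reads the whole file into memory; fine for a sketch, not for
    # multi-terabyte backup files.
    f = Fernet(load_or_create_key())
    dst.write_bytes(f.encrypt(src.read_bytes()))

if __name__ == "__main__":
    encrypt_backup(Path("daily-backup.vbk"), Path("daily-backup.vbk.enc"))
```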

This is really great in that it is effectively infinitely scalable (you only pay for what you use) and you don’t have to own any of the hardware or software licenses to support it. If you do have an event, you have options: you can scramble and try to put something together on your own, or in most cases you can leverage the provider’s compute capabilities to power your organization until your on-site resources are available again. Because these providers have their own IT staff to handle the restore work, you and your team are freed up to get your staff and customers back online.

In my mind the drawbacks to this model are minimal. During a disaster you will definitely pay more than you would running restored systems on your own hardware, but you also would have had to buy and maintain that hardware, which is expensive. Your workers and datacenter systems will also be in different geographical areas, which may drive up bandwidth costs while you get back up and running, but that is still nothing compared to maintaining a second site year-round. Probably the only real drawback is that almost all of these providers require long-term agreements, a year or more, for the backup or replication portion of the service. You also need to be sure, if you choose this route, that the provider has enough compute resources available to absorb you if needed. This can be mitigated by working with your provider to do regular restore testing at the far end; it will cost you a bit more, but to me it is truly worth it.

Backup to Public Cloud

Finally we come to what all the backup vendors seem to be moving towards these days: public cloud backups. In this situation your backups either land on premises first (highly recommended) and are then shipped off to the public cloud provider of your choice, or go straight to the cloud. Did AWS, Azure or GCP just change their storage pricing and another provider is suddenly cheaper? Simply add the new provider and shift the job over, easy peasy. As with all things cloud you are, in theory, infinitely scalable, so you don’t have to worry about onboarding new workloads except for cost, and who cares about cost anyway?
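
Part of why shifting providers is so painless is that most object stores speak the S3 API. Here is a hedged Python sketch using boto3: switching providers can amount to changing the endpoint and credentials. The bucket name, file name, and endpoint URL are hypothetical.

```python
# A hedged sketch of provider portability: most object stores speak the S3
# API, so "shift the job to the new provider" can be as small as a new
# endpoint URL and credentials. Bucket, file, and endpoint are hypothetical.
from typing import Optional

import boto3

def ship_backup(path: str, bucket: str, endpoint_url: Optional[str] = None) -> None:
    """Upload a backup file to any S3-compatible object store."""
    s3 = boto3.client("s3", endpoint_url=endpoint_url)  # None means AWS itself
    s3.upload_file(path, bucket, path)

# AWS today...
ship_backup("daily-backup.vbk", "dr-backups")
# ...a cheaper S3-compatible provider tomorrow: same code, different endpoint.
ship_backup("daily-backup.vbk", "dr-backups",
            endpoint_url="https://storage.example.com")
```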

The upside here is agility. Start to finish, you can probably be set up to consume this model within minutes, and then the only limit to how fast you can be covered is how much bandwidth you make available for shipping backups. If you are doing this to cover an external event, like the failure of your passive site, you can tear it back down afterwards just as fast as you set it up. You also only ever pay for actual consumption, so you know exactly what each additional protected workload will cost; you never pay for “spare space.”

As far as drawbacks go, we are still in the early days of this, so there are a few. While you don’t have to maintain any far-end equipment for backup storage or compute, I’m not convinced this isn’t the most expensive option for traditional virtualized workloads.

Hybrid Archive Approach

One of the biggest challenges to maintaining an on-prem/off-prem backup system is that we all run out of space sometimes. The public cloud lets us consume only what we need, paying for no fluff, while letting someone else manage the performance and availability of that storage. One trend I’m seeing more and more is supplementing your on-premises backup storage with public cloud resources to scale out archives for as long as necessary. There is a tradeoff between locality and performance, but if your most recent backups are on premises or well connected to your production environment, you may never need to touch the backups archived off to object storage, so you don’t really care how fast they are to restore; you’ve checked your policy checkbox and have that “oh no” backup out there.
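
A minimal Python sketch of that tiering idea, assuming a local repository of backup files and an S3 bucket; the paths, bucket name, and 30-day cutoff are all illustrative stand-ins for your own policy.

```python
# A minimal sketch of hybrid tiering, assuming a local repository of backup
# files and an S3 bucket; paths, bucket name, and the 30-day cutoff are all
# illustrative. Anything older than the cutoff moves to object storage.
import time
from pathlib import Path

import boto3

LOCAL_REPO = Path("/backups/local")
BUCKET = "backup-archive"
CUTOFF_DAYS = 30

def archive_old_backups() -> None:
    s3 = boto3.client("s3")
    cutoff = time.time() - CUTOFF_DAYS * 86400
    for backup in LOCAL_REPO.glob("*.vbk"):
        if backup.stat().st_mtime < cutoff:
            # Old enough that a restore is unlikely: archive it and free
            # up the fast local storage for recent restore points.
            s3.upload_file(str(backup), BUCKET, backup.name)
            backup.unlink()

if __name__ == "__main__":
    archive_old_backups()
```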

Once upon a time my employer needed to retain every backup for about 5 years. Each year we had to buy more and more media to hold backups we would never restore from because they were so old, but we had them and we were in compliance. If something like Veeam’s Archive Tier (or the equivalent from other vendors) had existed, I could have said “retain X backups on-prem, but after that shift them to an S3 Infrequent Access bucket.” In the long term this would have saved quite a bit of money and administrative overhead, and when the requirement went away all I would have had to do is delete the bucket and fall back to the normal policy.
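
You can even let the object store do the aging for you. This sketch uses the real S3 lifecycle API via boto3 to transition objects to Infrequent Access after 30 days and expire them at the end of a 5-year retention window; the bucket name and day counts are assumptions standing in for your policy.

```python
# A sketch of letting S3 age the archive for you: a lifecycle rule (a real
# S3 feature) transitions objects to Infrequent Access after 30 days and
# expires them at the end of a 5-year retention window. The bucket name and
# day counts are assumptions standing in for your policy.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="backup-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-backups",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to every object in the bucket
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                "Expiration": {"Days": 5 * 365},  # drop after retention window
            }
        ]
    },
)
```

When the requirement goes away, deleting the rule (or the whole bucket) takes you right back to your normal policy.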

While this is an excellent use of cloud technology, I don’t consider it a replacement for DRaaS or Active/* models. The hoops you have to jump through to restore these backups to a functional VM are still complex and resource intensive. Rather, I see this as an extension of your on-prem backups to absorb short-term scale issues.

Conclusion

If you’ve followed along for both posts, I’ve covered about 5.5 different methods of backing up, replicating and protecting your datacenter. Which one is right for you? It might be one of these, none of these, or a mash-up of two or more, to be honest. The main thing is to know your business’s needs and its regulatory requirements, and then pick the approach, and the vendor, that fits them.

DR Scenarios For You and Your Business Part 1: The Old Guard

It is Disaster Recovery review season again here at This Old Datacenter, and reviewing our plans sparked the idea of outlining some modern strategies for those who are new to the game or looking to modernize. I’m continually amazed by the number of people I talk to who are using modern compute methodologies (on-premises virtualization, partner IaaS, public cloud) but are still running the same backup systems they used in the 2000s.

In this post I’m going to talk about some basic strategies using Veeam Backup & Replication, because that is primarily what I use, but all of these are achievable with any of the current backup vendors, each with its own advantages and disadvantages. The important part is to first understand the different ways of protecting your data and then pick a vendor that fits your needs.

One constant you will see here is the idea of each strategy consisting of two parts: first, a local backup to handle everyday things like a failed VM or a file restore, situations short of all systems being down; second, archiving that backup somewhere outside your primary location and datacenter to deal with a systems-down event or a virus. You will often hear this referred to as the 3-2-1 rule (a toy compliance check follows the list):

  • 3 copies of your data
  • 2 copies on different types of physical media or systems
  • 1 copy (at least) in a different geographical location (offsite)
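
As promised, here is a toy Python check of a backup copy inventory against the rule. The inventory format is invented purely for illustration; real backup software tracks this for you.

```python
# A toy sanity check for the 3-2-1 rule against a simple, invented inventory
# of backup copies. Real products track this for you; this just restates the
# rule as code.
from dataclasses import dataclass

@dataclass
class Copy:
    media: str     # e.g. "disk", "tape", "object-storage"
    location: str  # e.g. "hq", "colo", "aws-us-east-1"

def satisfies_3_2_1(copies: list[Copy], primary_site: str) -> bool:
    return (
        len(copies) >= 3                                     # 3 copies of the data
        and len({c.media for c in copies}) >= 2              # on 2 media types
        and any(c.location != primary_site for c in copies)  # 1 copy offsite
    )

copies = [
    Copy("disk", "hq"),
    Copy("tape", "hq"),
    Copy("object-storage", "aws-us-east-1"),
]
print(satisfies_3_2_1(copies, primary_site="hq"))  # True
```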

On-Premises Backup/Archive to Removable Media

This is essentially an evolution of the traditional backup system. Each night you back up your critical systems to a local resource and then copy that backup to something removable so it can be taken offsite. In the past this was probably a single step: you ran backups to tape and took that tape somewhere the next morning. Today I would hope the backups land on local disk first and are then copied to tape or a USB hard disk, but everybody has their ways.
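
A minimal Python sketch of that second hop, assuming last night’s job has already landed on local disk and a removable USB drive is mounted; both paths are illustrative.

```python
# A minimal sketch of the second hop: the nightly job lands on local disk,
# then the newest restore point is copied to a mounted USB drive that gets
# carried offsite. Both paths are illustrative assumptions.
import shutil
from pathlib import Path

LOCAL_REPO = Path("/backups/local")
USB_MOUNT = Path("/mnt/usb-offsite")

def copy_latest_to_removable() -> None:
    """Copy the newest local restore point to the removable drive."""
    backups = sorted(LOCAL_REPO.glob("*.vbk"), key=lambda p: p.stat().st_mtime)
    if not backups:
        raise SystemExit("no local backups found; investigate before going home")
    latest = backups[-1]
    shutil.copy2(latest, USB_MOUNT / latest.name)  # copy2 preserves timestamps

if __name__ == "__main__":
    copy_latest_to_removable()
```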

This method gets the job done but has a lot of drawbacks. First, it takes human intervention to get your backups offsite. Second, restores may be quick from your primary backup copy, but if you have to go to the second copy, you first have to physically locate the correct data set, and then, especially in the case of tape, it can take some time to get it back to functional. Finally, you own and have to maintain all the hardware involved in the backup system, hardware that effectively isn’t used for anything else.

Active/Passive Disaster Recovery

Historically, the step up from removable media for many organizations is maintaining a set of hardware, or at least a backup target, somewhere else. This could be just a tape library, a NAS, or an old server loaded with disks, sitting in a remote branch or at a co-location facility. Usually you would have some dark hardware there that systems could be restored to if needed. In any case you would still perform backups locally and keep a set on premises for primary restores, then leverage the remote location for a systems-down event.

This method definitely has advantages over the first: you don’t have to dedicate a person’s time to getting backups offsite, and you may have resources available to take over after a massive issue at your datacenter. But it can get very expensive, very fast. All the hardware is owned by you and used exclusively by you, if it is ever used at all. In many cases datacenter hardware is “retired” to this location, and it may or may not have enough horsepower to cover your needs. Others buy for the dark site at the same time as the primary datacenter, effectively doubling the cost of every refresh. Layer on the cost of connectivity, power, and possibly rack space, and you are talking about real money. Further, you are on your own in terms of getting things going if you do have a DR event.

All that being said, this is a true Disaster Recovery model, which differentiates it from the first option. You have everything you need (possibly) if you experience a disaster at your primary site.

Active/Active Disaster Recovery

Does your organization have multiple sites with datacenter capabilities in each? If so, this model might be for you. With Active/Active you design your multisite datacenters with redundant capacity in mind, so that if either location has an event you can run both locations’ workloads in one. Having “hot” resources available at your DR site is attractive because you can make use of not only backup but replication as well, significantly shortening your Recovery Time Objective (RTO), usually with the ability to roll back to production when the event is over.

Think about a case where you have a critical customer-facing application that cannot handle much downtime at all, and you lose connectivity at your primary site. That workload can fairly easily be failed over to the replica in the far-side DC, all the while your replication product (think Veeam Backup & Replication or Zerto) is tracking the changes. When connectivity is restored, you tell the application to fail back, and you are running with changes intact back in your primary datacenter.

So what’s the downside? First off, it requires you to have multiple locations in the first place. Beyond that, each site still has to be able to carry the full load during an event, so your hardware and software licensing costs will most likely go up to support an event that may never happen. Also, supporting replication is a good bit more complex than backup once you include things like re-IP, external DNS, and so on, so you should definitely be testing this early and often, maintaining a living document that outlines the steps needed to fail over and fail back.
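
As one concrete example of those failover steps, here is a Python sketch of repointing public DNS at the DR site once replicas are running there. The Route 53 call is real, but the zone ID, record name, and addresses are hypothetical, and another DNS provider’s API would look different.

```python
# A sketch of one failover step called out above: repointing public DNS at
# the DR site once replicas are running there. The Route 53 call is real,
# but the zone ID, record name, and addresses are hypothetical.
import boto3

def repoint_dns(zone_id: str, record: str, new_ip: str) -> None:
    """UPSERT an A record so traffic follows the workload to the DR site."""
    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Comment": "DR failover: send traffic to the replica site",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record,
                    "Type": "A",
                    "TTL": 60,  # keep the TTL short so failing back is quick too
                    "ResourceRecords": [{"Value": new_ip}],
                },
            }],
        },
    )

# Fail over: app.example.com now resolves to the DR datacenter's address.
repoint_dns("Z3EXAMPLE", "app.example.com", "203.0.113.10")
```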

Conclusion

This post covers what I consider the “old school” models of Disaster Recovery, where your organization owns all the hardware that powers the system. But who wants to own physical things anymore; aren’t we living in the virtual age? In the next post we’ll look at some more “modern” approaches to the same ol’ concepts.