Failback: A key to disaster recovery
A quality business continuity plan has many parts, but let's focus on failback. Failback is the process of moving your IT environment back to primary production after a disaster. It's essential to know your failback options because failover is not a permanent state.
Here are a few issues to consider:
First, your company likely has to adhere to compliance regulations, so take these into account when designing your failover site. PCI DSS, for example, no longer allows a customer to run applications in a disaster recovery environment that lacks the same level of security as the normal infrastructure. You may need to add security measures to the DR site (such as file integrity scanning and security event log management) before you can resume processing cardholder data.
The next issue is the application stack. Many software products, such as Microsoft's SQL Server, have licensing limits that you must address before running production workloads at the failover site for an extended time.
The third issue is capacity. Suppose your IT business continuity plan was designed to operate at 20% of your production capacity. The longer you remain in your failover site, the greater your chance of running up against your resource limit.
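To make the capacity point concrete, here is a minimal sketch in Python of checking which workloads fit at a failover site sized to a fraction of production. The function name, the workload names, and the capacity figures are hypothetical and chosen only for illustration; the 20% fraction is the one from the example above.

```python
from typing import List, Tuple


def plan_dr_capacity(
    production_capacity: float,
    dr_fraction: float,
    workloads: List[Tuple[str, float]],
) -> Tuple[List[str], List[str], float]:
    """Place workloads into the DR site in the order given (highest priority first).

    Returns the workloads that fit, the ones that must wait, and the
    capacity left over. Units are arbitrary (for example, vCPUs).
    """
    dr_capacity = production_capacity * dr_fraction
    used = 0.0
    placed: List[str] = []
    deferred: List[str] = []
    for name, demand in workloads:
        if used + demand <= dr_capacity:
            placed.append(name)
            used += demand
        else:
            deferred.append(name)
    return placed, deferred, dr_capacity - used


# Hypothetical figures: 1,000 units in production, DR site sized at 20%.
workloads = [
    ("payments", 120),
    ("order-db", 60),
    ("reporting", 50),
    ("intranet", 15),
]
placed, deferred, headroom = plan_dr_capacity(1000, 0.20, workloads)
print(placed)    # ['payments', 'order-db', 'intranet']
print(deferred)  # ['reporting']
print(headroom)  # about 5 units left
```

With only a few units of headroom left, any growth in demand during an extended stay at the failover site pushes more workloads onto the deferred list.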
Finally, there are limitations if you leverage a third-party company to assist in your data backup and recovery plan. Most restrict how long you can occupy the space they provide before you either transition your environment back to its original site or convert your contract with the vendor to make the site permanent.
In addition, consider:
- The time needed to acquire replacement hardware
- The method used to fail over to your disaster recovery (DR) site (typically the same process is followed to fail back)
- Costs (hardware, software, facility, failover declaration fees) associated with turning your DR site into a permanent site
- Your recovery time objective, or how long the failback method will take (this can sometimes be more painful than the actual failover event)
Failover site selection and failback capability
Your choice of a failover site will also affect the capabilities of your failback, whether it entails restoring existing infrastructure, buying new infrastructure, or moving to a production cloud.
Colocation:
Using colocation services with failback to existing infrastructure is relatively inexpensive but can be labor- and time-intensive. Restoring large amounts of data (more than five to 10 terabytes) from tape can take days.
This strategy becomes less effective when new infrastructure is required. Capital expenditure costs run high, as do labor costs. Buying and configuring new hardware while simultaneously restoring five to 10 terabytes of data, all within a tight recovery window, is going to be a serious challenge.
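A rough way to sanity-check a restore window is to divide the data volume by the sustained restore throughput. The sketch below does that arithmetic in Python. The function name and the 100 MB/s throughput are assumptions for illustration, not figures for any particular tape system, and the estimate ignores overhead such as tape changes, verification, and hardware setup, so real restores will take longer.

```python
def restore_hours(data_tb: float, throughput_mb_s: float, parallel_streams: int = 1) -> float:
    """Estimate raw transfer time in hours for restoring `data_tb` terabytes.

    Uses decimal units (1 TB = 1,000,000 MB). `throughput_mb_s` is the
    sustained rate of a single restore stream; both inputs are assumptions
    you should replace with measured values from a restore test.
    """
    if throughput_mb_s <= 0 or parallel_streams < 1:
        raise ValueError("throughput must be positive and streams at least 1")
    total_mb = data_tb * 1_000_000
    seconds = total_mb / (throughput_mb_s * parallel_streams)
    return seconds / 3600


# Hypothetical example: 10 TB over one stream at an assumed 100 MB/s.
print(round(restore_hours(10, 100), 1))     # 27.8 hours of pure transfer time
print(round(restore_hours(10, 100, 2), 1))  # 13.9 hours with two parallel streams
```

Even under these optimistic assumptions, a 10-terabyte restore consumes more than a full day before any configuration or testing begins.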
Public cloud:
A public cloud service offers an easy option for storing data during a DR event. With low front-end costs, it's great for small businesses that have limited IT staff or lack in-house data protection services.
But if you're a mid-sized company facing a disaster, prepare yourself. Back-end costs can run very high here. Retrieving data is expensive, given the scope and scale of what you likely have stored there. You will pay for every gigabyte you take out.
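Because retrieval is billed per gigabyte, the cost of pulling your data back out scales directly with how much you have stored. The Python sketch below shows the arithmetic. The function name and the per-gigabyte rate are hypothetical placeholders, not any provider's actual pricing, so substitute the rate from your own contract.

```python
def egress_cost_usd(data_tb: float, price_per_gb_usd: float) -> float:
    """Estimate the data retrieval charge for pulling `data_tb` terabytes out.

    Uses decimal units (1 TB = 1,000 GB) and a flat per-gigabyte rate.
    Real pricing may be tiered or include request fees; this is only
    a first-order estimate.
    """
    if data_tb < 0 or price_per_gb_usd < 0:
        raise ValueError("inputs must be non-negative")
    return data_tb * 1000 * price_per_gb_usd


# Placeholder rate of 0.09 USD per GB, chosen only for illustration.
for size_tb in (10, 50, 200):
    print(size_tb, "TB:", round(egress_cost_usd(size_tb, 0.09), 2), "USD")
```

Running the numbers for your real data volume before a disaster tells you whether failback from a public cloud fits your budget.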
DRaaS:
With a Disaster Recovery as a Service solution, you fail over to a predesignated DR cloud. Failover is smooth because your DRaaS provider helped design the solution and tested it to ensure any issues were worked out.
DRaaS solutions typically leverage continuous data protection technologies that offer low levels of data loss by replicating production data. Mission-critical production can operate throughout the recovery period, as server security configurations and network services are duplicated at the DR site.
The failback options for DRaaS are the same as for colocation and public cloud services: restoring existing infrastructure, buying new infrastructure, or moving to a production cloud. In any of these cases, the labor involved in failback falls on the service provider, saving you time when you need it most.
The failback event itself is simpler, too. It requires only three steps:
- Recover your infrastructure (or keep production in the cloud)
- Reload your hypervisor
- Install one virtual machine
From here, your DRaaS provider can take over the replication and eventual failback into your production environment.
Additional benefits include:
- Minimized downtime, protecting your company from financial losses
- Lessened risk of compliance penalties
- Stronger security against breaches. A company's DR site often holds all of its data, yet the DR servers are frequently neglected and lack critical security patches. With a DRaaS solution, the DR VMs stay in sync, typically in a powered-off state, so the data is not even accessible until the time of declaration
However, not all disaster recovery companies are created equal. Some have the capacity to run your production workloads, while others do not. Ask in advance.
Conclusion
Your failback process is as important as your failover strategy. The issues (how long your methodology will take, the labor involved, the ease of transport, and the costs of returning to or finding a new production site) should not be left to resolve during the recovery period. That's when company resources will be tight and your energy diminished. Address these things now to prevent disaster later. Contact us if you'd like assistance developing your failover strategy and process!