Disaster Recovery 101:

Some time ago, I was talking with one of our customers about the backup and disaster recovery solutions they could implement in their AWS cloud environment. The conversation soon turned to the technical capabilities and benefits of each solution. When I asked what their RPO and RTO requirements were, they couldn't provide an answer, simply because they were not familiar with the terminology or fundamentals of disaster recovery. After I explained the concepts on the whiteboard, I concluded that they were probably not alone and that it would be worth writing an article covering disaster recovery 101.

As the old adage goes, ‘If you fail to plan, you are planning to fail.’ Apply this adage to IT systems and you get disaster recovery (DR). Disaster recovery is all about planning your IT strategy around the fact that hardware has a finite lifespan, natural disasters will happen, applications will have bugs, and humans will make mistakes. This planning consists of a set of contingency policies and procedures: if disaster ‘x’ occurs, we react with a corresponding plan of action to control the damage and reduce IT systems downtime.

For most organizations, a disaster means an abrupt disruption of all or part of business operations, which typically results directly in revenue loss. With this in mind, when you create your disaster recovery strategy, carefully consider the many supporting IT DR solutions, such as backups, infrastructure high availability, and geographic redundancy, and choose the set of solutions that you believe will best complement your organization's needs and protect its revenue streams.

[Figure: disaster recovery cost to recovery]

Common Definitions:

  • Recovery Point Objective (RPO) is the maximum acceptable amount of data loss, measured in time. For example, if the maximum amount of data loss you can tolerate is 15 minutes, your RPO is 15 minutes.
  • Recovery Time Objective (RTO) is the maximum tolerable amount of time needed to bring all critical systems back online. This covers, for example, restoring data from backup or fixing a failure. In most cases this work is carried out by a system administrator, network administrator, storage administrator, etc.
  • Work Recovery Time (WRT) is the maximum tolerable amount of time needed to verify the system and/or data integrity. This could include, for example, checking the databases and logs and making sure the applications or services are running and available. In most cases these tasks are performed by an application administrator, database administrator, etc. When all systems affected by the disaster have been verified and/or recovered, the environment is ready to resume production.
  • Maximum Tolerable Downtime (MTD) is the total amount of time that a business process can be disrupted without causing any unacceptable consequences. This value should be defined by the business management team.
  • Business Impact Analysis (BIA) is a process to determine the impact a disaster would have on an organization. Typically, your RPO, RTO, WRT, and MTD metrics are developed from a BIA report.
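
To make these definitions concrete, here is a minimal Python sketch of how the four metrics might be checked against each other. It rests on two assumptions that are not stated above: that recovery (RTO) followed by verification (WRT) should fit within MTD, which is a common convention, and that the worst-case data loss equals the interval between backups. The class name, function names, and sample values are hypothetical and purely illustrative.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Hypothetical container for the metrics produced by a BIA."""
    rpo: timedelta  # maximum acceptable data loss, measured in time
    rto: timedelta  # maximum time to bring critical systems back online
    wrt: timedelta  # maximum time to verify systems and data integrity
    mtd: timedelta  # maximum total disruption the business can tolerate

    def fits_mtd(self) -> bool:
        # Assumption: recovery plus verification must not exceed MTD.
        return self.rto + self.wrt <= self.mtd


def meets_rpo(backup_interval: timedelta, objectives: RecoveryObjectives) -> bool:
    # Assumption: in the worst case a failure occurs just before the next
    # backup, so up to one full backup interval of data is lost.
    return backup_interval <= objectives.rpo


# Illustrative values only.
objectives = RecoveryObjectives(
    rpo=timedelta(minutes=15),
    rto=timedelta(hours=2),
    wrt=timedelta(hours=1),
    mtd=timedelta(hours=4),
)

print(objectives.fits_mtd())                      # 2h + 1h fits within 4h
print(meets_rpo(timedelta(hours=1), objectives))  # hourly backups vs. 15-minute RPO
```

With these sample values the recovery and verification windows fit inside the MTD, but an hourly backup schedule would not satisfy a 15-minute RPO, which is the kind of gap a BIA is meant to surface.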

Visualizing Disaster Recovery Planning:
[Figure: disaster recovery visualized]

Creating a Disaster Recovery Plan:

Disasters are inevitable, mostly unpredictable, and they vary in type and magnitude. The best strategy is to have a disaster recovery plan in place in advance to minimize the impact of an unexpected disaster. Below is a general guide to the phases and considerations to plan out in advance. These phases are broken out in a little more detail than in NIST SP 800-34, but they are still fundamentally rooted in the NIST SP 800-34 Contingency Planning Guide for Federal Information Systems.

  • Phase 1: Disaster Assessment and Risk Analysis
    The first phase of a disaster recovery plan involves assessing the damage already caused and the further damage that will occur if no recovery plan is applied to mitigate it. The disaster recovery plan must clearly identify the team members who will be responsible for identifying, reporting, and accounting for the damage. The assessment usually includes:

    • Determining the root cause of the disaster
    • Determining the likelihood and extent of further damage
    • Determining the scope of the disaster, including affected systems, users, and business operations
    • Estimating the time available for dealing with the disaster without exceeding the organizational MTD

    Carrying out a detailed risk analysis is another important activity that must be completed during this first phase.
  • Phase 2: Activation and Planning
    This second phase in a disaster recovery plan involves pulling together a team that will actively participate in planning and executing a disaster recovery solution. The role of each team member must be clearly defined. Once the team members are together, they have to begin devising a disaster recovery plan to tackle the situation and restore normalcy. Some important aspects of this planning are:

    • Listing everything that will be restored and assigning a priority to each item
    • Detailing the procedures to be followed
    • Allocating roles to team members
    • Setting up a communication, reporting, and review system
    • Setting up timelines and schedules for activities to be performed
    • Allocating resources and equipment
    • Setting up operating and quality standards
    • Identifying and importing the required data sources
    • Setting up review procedures and review points
    • Documenting the recovery plan
  • Phase 3: Execution of the Disaster Recovery Plan
    In the execution phase, the recovery team finally gets into action and begins executing the recovery activities as per the procedures specified in the plan. At the end of each phase of the recovery, or after execution of the important recovery activities, a review or appraisal must follow to monitor the progress and ensure compliance with the established quality standards.
  • Phase 4: Recovering from the Disaster
    The recovery phase is the period of time in which systems are brought back online, often in a temporary location or configuration. Communication between IT and users occurs extensively during this phase, as people and systems are restored to an operational state. Restoration priorities must be well defined during the planning phase and updated on a regular basis. Plans and procedures for systems recovery are critical at this juncture because they drive what needs to be restored, and in what order, due to application dependencies.
  • Phase 5: Reconstitution and Restoration
    This final phase in a disaster recovery plan follows after the disaster has been completely managed and it is time to restore normalcy. The reconstitution phase is the period of time in which operations are returned to a ‘steady-state’, system data and functionality are verified as normal, and cleanup actions occur. The last component is compiling input from team members on their observations and updating all documentation to reflect the current operating state and lessons learned. Here are some of the activities that form part of the restoration and reconstitution phase:

    • Ensure that there are no remaining aftereffects of the disaster and that no threats remain unaddressed
    • Confirm that all team members have returned to their original roles
    • Perform a post-disaster analysis to validate the root cause, determine the efficiency of DR execution, and extract lessons learned
    • Confirm that the disaster recovery efforts are completely over
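
The planning and recovery phases above both depend on knowing what to restore and in what order, given priorities and application dependencies. The following is a minimal Python sketch of one way to derive such an order, assuming the dependencies can be expressed as a simple mapping. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+); the system names, priority numbers, and the `restore_order` function are hypothetical examples, not part of any particular DR product.

```python
from graphlib import TopologicalSorter


def restore_order(dependencies, priority):
    """Return a restore sequence that respects dependencies.

    dependencies maps each system to the set of systems that must be
    restored before it. priority maps a system to a number, where a
    lower number means "restore sooner"; unlisted systems go last.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    order = []
    while sorter.is_active():
        # Every system returned here has all of its prerequisites restored.
        ready = sorted(
            sorter.get_ready(),
            key=lambda name: (priority.get(name, float("inf")), name),
        )
        for name in ready:
            order.append(name)
            sorter.done(name)
    return order


# Hypothetical environment: each system lists what it depends on.
dependencies = {
    "database": {"storage", "network"},
    "app_server": {"database"},
    "web_frontend": {"app_server"},
    "reporting": {"database"},
}
priority = {
    "network": 1, "storage": 1, "database": 1,
    "app_server": 2, "web_frontend": 2, "reporting": 3,
}

for step, system in enumerate(restore_order(dependencies, priority), start=1):
    print(step, system)
```

In this sketch, priorities only break ties among systems whose prerequisites are already restored; a dependency always wins over a priority. Whether that trade-off is right for a given organization is a decision for the planning phase.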

Looking for More Information on Disaster Recovery?

GoVanguard is a leading managed services provider of next-generation BDR solutions and is strategically partnered with several solution providers, including VMware, Veeam, Microsoft, and more.

  • Backup and Recovery
    Solutions with the ability to restore bare-metal servers, and even an entire office of computers, in real time into a cloud-based virtual environment or on-premises.
  • High Availability
    Deliver application and data availability within the data center and over distance with full infrastructure utilization and zero downtime.
  • Stretched Clustering and Cloud Redundancy
    Stretched infrastructure over distance for new levels of availability and data mobility. Move applications, virtual machines, and data within and between data centers without impacting users.