Service level agreements and disaster recovery

Make sure you manage user expectations when you create a service level agreement for disaster recovery operations.

One difficult aspect of disaster planning is the service level
agreement (SLA) that you negotiate with your users. It will probably be the
first thing you have to tackle, and it will determine everything else that you
include in your plan. SLAs are essentially the promise you make to your end
users about how long a system will remain unavailable during an emergency and
how much data may be lost. SLAs are made up of Recovery Time Objectives (RTOs)
and Recovery Point Objectives (RPOs) and are often highly influenced by
end-user perspectives and prejudices, which makes them very difficult to deal
with on a technical level.

RPO is a measure of the amount of data that can be lost to a
disaster. For example, if you back up to tape once per day, your potential RPO
is one day’s worth of data if the disaster strikes at the worst possible time,
just before the next backup runs. RTO is the measure of how long the systems
can be offline during a disaster. An example of this is the amount of time it
would take to bring the standby systems online with a replication and failover
solution. These two metrics will
allow you to create a measurable SLA that can be presented to the end-user
community, letting them know when their systems will be back online and what
they can expect to see when the process is complete. However, these metrics
alone can’t help you if you don’t know what your end-users expect from the DR
systems to begin with.
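
The arithmetic behind these two metrics can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names SlaTarget, worst_case_rpo, and meets_sla are hypothetical, the four-hour and eight-hour targets and the six-hour recovery estimate are made-up figures, and the only rule taken from the text above is that a periodic backup can lose up to one full backup interval of data.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class SlaTarget:
    """Hypothetical SLA targets negotiated with end users."""
    rpo: timedelta  # maximum tolerable data loss
    rto: timedelta  # maximum tolerable downtime


def worst_case_rpo(backup_interval: timedelta) -> timedelta:
    # If disaster strikes just before the next backup runs, everything
    # written since the previous backup is lost: one full interval.
    return backup_interval


def meets_sla(target: SlaTarget,
              backup_interval: timedelta,
              estimated_recovery: timedelta) -> dict:
    # Compare what the current setup can deliver with what was promised.
    return {
        "rpo_ok": worst_case_rpo(backup_interval) <= target.rpo,
        "rto_ok": estimated_recovery <= target.rto,
    }


# Illustrative numbers only: daily tape backup, six-hour recovery estimate.
target = SlaTarget(rpo=timedelta(hours=4), rto=timedelta(hours=8))
result = meets_sla(
    target,
    backup_interval=timedelta(days=1),
    estimated_recovery=timedelta(hours=6),
)
print(result)  # {'rpo_ok': False, 'rto_ok': True}
```

In this made-up case the daily backup cannot satisfy a four-hour RPO even though the recovery estimate fits within the RTO, which is the kind of gap worth surfacing before the SLA is signed.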

End-user requirements are a double-edged sword. On the one
hand, they can provide you with definite guidelines as you begin to determine
how quickly these systems must be back online. On the other hand, end-users
tend to be unrealistic in their demands for zero data loss and instant
failover. That can be accomplished in only a small subset of cases. The
vast majority of data systems cannot possibly meet these types of failover
“requirements” because of the operating systems they run on, the
structures of their data systems, or the very nature of the tools that would be
required to perform these operations.

Of course, your budget also comes into play in SLA
discussions. The closer you get to zero in RTO and RPO, the higher the cost
of the overall solution. The cost curve is steep: if you go much below the
average allowances in either RTO or RPO, you’re looking at astronomical jumps
in funding requirements. Once they see the budget, end-users often radically
revise their SLA requirements, which opens many more options for DR planning.

SLAs can offer a great way to let your end-users know exactly
what will happen in an emergency and how quickly they can anticipate getting
back online. Involving end-users from the start, educating them about budget
and technology, and making sure they remain informed is vital to the creation
of a valid SLA. DR planning is for the benefit of these end users, and an SLA
can realistically set their expectations and define their roles during the
planning process.
