“When are we done with IT Disaster Recovery?” “How much is good enough?” Chief information officers, and the boards of directors that they report to, want to know when they can stop spending money to fix the gaps. The gap, and in some cases the chasm, lies between their customers’ expectations for recoverability and their ability to reasonably demonstrate that they can recover. Customers expect that the business services that a company provides will be there when they need them. They understand that there may be an outage. The term “glitch” – a defect or malfunction in a machine or plan – has become commonplace today as companies explain how reservations systems, ecommerce sites or mobile apps were unavailable. The trouble starts when those glitches turn into extended outages that are measured in days rather than minutes.
Glitches that Impact Customers
Glitches stem from new applications that are intertwined with the large databases of information that companies are built upon. Business services are created by weaving together disparate systems. A corporation builds productivity by connecting that 25-year-old logistics system with the new distribution system that is “in the cloud” and the transportation system that was provided by that huge third-party company. That business service is interrupted when one of those delicately intertwined systems is unavailable. Those glitches turn into outages when those interruptions cascade into the other systems, ultimately creating an outage that impacts their brand and reputation with customers.
Glitches come in different shapes and forms. It could be a standard business process that somebody didn’t follow. It could be new cloud providers that had a failure in one of their regions. We still find that technology assets can have failures – who would imagine? It doesn’t really matter. What matters is the business service that delivers value to customers is unavailable for an extended amount of time.
Boards of directors have read about and experienced those disruptions. The fiduciary responsibility they have to a company prompts them to ask questions of the CIO – “Could that happen to us?” “Are we prepared?” Those business section headline events have prompted executives to act. That action can be broken down into four high level steps:
- Identify…those business services that are the heartbeat of the company;
- Determine…the risks and potential of a big outage;
- Invest…the time, effort and money to address those risks; and
- Verify…the investment was effective by doing a test.
Performing those four steps can be a significant effort, especially considering that most technology professionals believe they already have no time to do their day-to-day jobs. It can be a daunting task to reflect on the hodgepodge of applications and systems that make up the vital organs of a company and determine if any of them could shut down. Agile thinking and the ability to pivot quickly are the buzzwords of the day; however, those buzzwords don’t lend themselves to the task of breaking down years of interconnected systems, applications and vendors.
Addressing the Glitches
The first step in the process is to identify those business services that deliver value to the customer or, in certain cases, end-users. The term “end-to-end business transaction” has been used to describe a business service. It is typically composed of multiple business processes and supported by several systems. Identifying those heartbeat business services will require effort and analysis, but it does not require that a Business Impact Analysis (BIA) be performed. A BIA can be defined as the process of analyzing operational functions and the effect a disruption might have upon them. This first step is meant to be driven “top down” – by having an executive discussion about which things absolutely must be performed.
- For a retailer, it is getting product to customers and enabling them to pay.
- For a fast food company, it is enabling customers to order and get food.
- For a financial institution, it is the ability to clear and settle transactions.
Once that handful of business services is identified, the company can begin to understand the composition of those services: the business processes, supporting applications and third parties that help complete that end-to-end transaction. You focus on the heartbeat services because the effort to decompose the business services can be time-consuming. Completing this process only for those business services that are vital to the company shows an appreciation for driving results.
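As an illustration only, the decomposition of a business service could be recorded as a simple inventory that maps each heartbeat service to the systems and third parties it depends on. The sketch below is a minimal example under stated assumptions: the service and component names are hypothetical, and a real inventory would come from the executive discussion and follow-up analysis described above.

```python
from collections import defaultdict

# Hypothetical inventory: each heartbeat business service mapped to the
# systems and third parties it depends on. All names are illustrative.
SERVICE_DEPENDENCIES = {
    "order_fulfillment": [
        "logistics_system",
        "cloud_distribution",
        "third_party_transport",
    ],
    "customer_payment": ["payment_gateway", "cloud_distribution"],
}


def services_by_component(dependencies):
    """Invert the inventory: for each component, list the services relying on it."""
    index = defaultdict(list)
    for service, components in dependencies.items():
        for component in components:
            index[component].append(service)
    return dict(index)


def impacted_services(dependencies, failed_component):
    """Return the business services interrupted if one component is unavailable."""
    return sorted(services_by_component(dependencies).get(failed_component, []))
```

With this hypothetical data, asking which services are interrupted when `cloud_distribution` is unavailable returns both services, which is the kind of shared dependency the decomposition is meant to surface.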
The second step is to identify those places where the glitch could occur. There are many methods and approaches to doing this. They differ in formality and rigor. This step should not turn into a hypothetical exercise. It should investigate a handful of questions to determine the risk of potential disruptions. That can include, but is not limited to, questions such as:
- Have there been failures in the past?
- Are there single points of failure?
- Are we solely dependent on a third party or on specific individuals?
- Should there be a redundant system in another location?
- Is this data replicated?
The two most important questions are:
- How long would the system take to recover and restore from a disruption?
- Does that align with the customers’ expectations?
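The two questions above amount to a comparison between an estimated recovery time and the outage duration customers will tolerate. A minimal sketch of that comparison follows. The class name, field names and the idea of expressing both values in hours are assumptions made for illustration, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class RecoveryEstimate:
    service: str
    estimated_recovery_hours: float  # how long restoration is expected to take
    customer_tolerance_hours: float  # how long customers will tolerate an outage

    def gap_hours(self) -> float:
        """Hours by which recovery exceeds customer tolerance (0.0 if aligned)."""
        return max(0.0, self.estimated_recovery_hours - self.customer_tolerance_hours)


def misaligned(estimates):
    """Return services whose recovery exceeds tolerance, largest gap first."""
    gaps = [e for e in estimates if e.gap_hours() > 0]
    return sorted(gaps, key=lambda e: e.gap_hours(), reverse=True)
```

Services returned by `misaligned` are the ones where recoverability falls short of customer expectations, and the size of each gap gives a rough sense of where investment would matter most.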
This is when a CIO can go back to the board of directors and answer the questions, “Could this happen to us?” and “Are we prepared?” This answer should be given succinctly and in business language. If there is a misalignment between customer expectations and the ability to recover, it will require some investment. That leads to the third step in this process.
The third step is where most companies fail. They fail to allocate the capital and operating budget to address the gaps in recoverability. Right behind that is the challenge in assigning the necessary professionals to address the gaps. Implementing these strategies and mitigating the risks in an IT Disaster Recovery (DR) capability requires a combination of technical knowledge, program management and organizational nimbleness that is difficult to find.
A company can’t invest until it has estimated and allocated the necessary capital and operating budget to address these problems. Presenting the business case for investment in simple, illustrative business language is the biggest hurdle in reaching this goal. When presenting to the CIO or board of directors, it is imperative to do the following:
- Be Realistic – don’t try to scare them with fear, uncertainty and doubt.
- Be Conservative – about the objectives that will be achieved given the level of investment.
- Be Concise – disaster recovery is not a topic that board members want to discuss.
Success results in the necessary funding and resources to address the risks and gaps that could cause an extended disruption to those heartbeat business services. That process could take months or years to accomplish. This is typically not something that can be addressed in weeks. That is why it is so important to be conservative in estimates for completion.
What is not described here is the process for implementing the strategies, processes and solutions needed to mitigate those risks. The solutions and approaches will be unique for every company. They will depend on the size of the company, the criticality of those business services and the marketplace in which it operates. The fourth step comes into play after these solutions are implemented.
Stated simply, the fourth step is to verify that the business service can now be restored when customers expect it to. The CIO and the board of directors should now have a reasonable level of confidence that the services that are most critical to their customers are protected against an extended disruption. That is accomplished by performing some type of exercise, test or validation. What form that takes is irrelevant. What is important is that it gives the company a level of confidence.
Once the validation is completed, the organization should close the loop. A final presentation should be given to executives to articulate how that validation was accomplished and the corresponding results. The results can take any form; however, it is important that they not be qualified. The results should correspond to the budget objectives that were described given the level of investment.
Are we done with IT Disaster Recovery?
Performing these four steps does not mean an organization is resilient. It doesn’t mean that the company has implemented an IT Disaster Recovery program. What these four steps have done is communicate to the board of directors the value that resilience and recoverability can bring. Board members don’t see a lot of value in business continuity plans or BIAs. They struggle with the amount of effort that is required to determine recovery objectives for business processes. What board members do understand is addressing customer expectations.
This four-step process measures a company’s ability to meet customer expectations. It encourages a process of describing the heartbeat business services. It results in executives’ ability to understand the risks of a glitch on those business services. Ultimately, it gives them a perspective on the business continuity risks that the company is exposed to.
It gives them the insights to answer the questions:
- Have we invested sufficiently to manage those business continuity risks?
- Have we done what is necessary to give us confidence in our ability to recover from a big glitch?
It is their responsibility to make the appropriate decisions once they have the information.