Migrating to an Always-On Solution

By Damian Walch and Matthew Zielinski | January 1st, 2008

For the past 18 to 24 months, a focused conversation has emerged among CIOs and IT directors regarding the trend of internalizing recovery efforts within their organizations.

As organizations look to streamline operations, implement technology-integrated processes and improve the “value” of internal technology services, traditional third-party shared-service recovery is falling out of favor.

Beyond the simple desire to reduce operating costs and meet regulatory requirements, companies are looking for “always on” or resilient strategies to provide a business advantage in the marketplace. Much of the focus towards resilience has been technology-oriented (i.e. self-healing networks, synchronous data mirroring, load-balanced platforms); however, a number of organizations have also adopted this premise for their internal business operations.

Traditional hot-site companies, such as IBM and SunGard, have historically built the business case for recovery around the shared-service model, deterring organizations from implementing their own capabilities. The drawbacks of that model, however (limited access to recovery resources, restricted testing opportunities and complicated change/configuration management), are encouraging organizations to take control of more of their recovery solution. While the shared-service business case may still be effective in some situations, it is not strong enough to prevent this trend from growing across the industry.

For the traditional hot-site companies, the old adage of “if you can’t beat them, join them” may be appropriate. IBM, SunGard and HP have all started to develop strategies and services to assist companies that are in the throes of analyzing the appropriate availability solutions for their businesses. Many of the hot-site vendors are also seeing an increase in client-owned hardware being run on their floors (i.e. a co-location model) to replace resources that may once have been shared among multiple subscribed customers.

Before moving too far along, we should define some key recovery terms that will serve as common themes throughout the rest of this discussion.

Dedicated Recovery. A dedicated recovery capability is simply resources (e.g. storage, network, or platforms) that are dedicated to a specific organization or business unit. A dedicated recovery strategy can be implemented within a company-owned data center, at a leased co-location facility, or from a recovery vendor’s data center.

High Availability. High availability is a technology solution that is engineered to minimize system downtime (recovery time objective) and potential data loss (recovery point objective) through diverse, dedicated processing platforms and synchronous (or near synchronous) offsite replication of data. Possibly the biggest benefit from a high availability solution is that it protects the data from potential destruction in the event of a disaster.
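
To make these two objectives concrete, the following minimal Python sketch (our illustration, not any vendor’s tooling) measures the RPO and RTO actually achieved during a hypothetical outage; all timestamps are invented for the example.

```python
from datetime import datetime

# Hypothetical outage timeline for a single application.
last_replicated_write = datetime(2008, 1, 1, 9, 58)   # last transaction safely copied offsite
outage_start          = datetime(2008, 1, 1, 10, 0)   # primary site fails
service_restored      = datetime(2008, 1, 1, 10, 45)  # application back online at the recovery site

# Recovery point: how much data (measured in time) was lost.
rpo_achieved = outage_start - last_replicated_write

# Recovery time: how long the application was unavailable.
rto_achieved = service_restored - outage_start

print(f"Data loss window (RPO achieved): {rpo_achieved}")  # 0:02:00
print(f"Downtime (RTO achieved):         {rto_achieved}")  # 0:45:00
```

The closer a design pushes both numbers toward zero, the closer it comes to “always-on”; synchronous replication is what drives the data loss window toward zero.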

As the graphic below illustrates, high availability and dedicated recovery are key components of a viable solution for mitigating both availability-related events and disaster-like events.

Organizations have taken many different paths to build the business case for resilience, especially “always-on” solutions. Later in this article we will walk through a few examples of how organizations have justified their investments in disaster recovery.

Many organizations have partially or fully integrated their core business processes with significant and complicated ERP (enterprise resource planning) systems such as JD Edwards, SAP or PeopleSoft. Resource management, financial management, HR, manufacturing, distribution and more are operating on these packages and downtime can come at a significant cost to organizations. Keep in mind that many organizations no longer fall back on “manual processes” to deal with system downtime. Additionally, many of these large ERP systems are deployed nationally or globally, expanding the impact of outages exponentially.

In today’s business environment, minutes can mean millions for many organizations, and infrastructure (i.e. platforms, networks, storage) is an investment to sustain business advantage. Our ability to react and adapt to outages and disasters needs to evolve in order to meet the business sense of urgency.

In recent months, many hardware vendors have matured their platforms to better support “always-on” solutions: mirroring, replication, virtualization and load-balancing technologies have all improved in capability and performance, while costs have come down. These technologies are becoming especially important as the sheer volume of retained business data increases and acceptable backup windows shrink. Larger organizations are managing petabytes of data, making the backup-and-restore philosophy of yesteryear antiquated.

Once upon a time, re-synchronizing applications was a reasonably manageable task: restore the system or LPAR from tape or disk, bring up the online systems and verify recovery. Many organizations have grown comfortable with such a “contained” recovery solution; however, today things are much different. Hundreds, if not thousands, of systems are now running in most datacenters across many different platforms (Unix, OS/390, Windows, Linux, AIX, Solaris and so on), with many thousands of touch-points between them and an increasing presence of internet-facing and B2B systems. Many organizations ask “What is critical in my datacenter?” without fully understanding the significant number of feeds, batch processes and dependencies between these resources. Deciding which and how many of these resources need to be built into a recovery services contract, and developing appropriate recovery plans, is no longer a simple exercise.
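
To make the dependency problem concrete, the following sketch (purely illustrative; the application names and feeds are invented) walks a dependency map to show how declaring a single application “critical” silently pulls every upstream system and feed into the recovery scope.

```python
# Hypothetical map of application -> systems/feeds it depends on.
dependencies = {
    "order_entry":    ["erp_core", "customer_db"],
    "erp_core":       ["ldap", "message_bus", "warehouse_feed"],
    "customer_db":    ["ldap"],
    "warehouse_feed": ["ftp_gateway"],
    "ldap": [], "message_bus": [], "ftp_gateway": [],
}

def recovery_scope(app, deps):
    """Return every system that must also be recovered for 'app' to run."""
    scope, stack = set(), [app]
    while stack:
        current = stack.pop()
        for upstream in deps.get(current, []):
            if upstream not in scope:
                scope.add(upstream)
                stack.append(upstream)
    return scope

# Declaring only "order_entry" critical actually commits you to six other systems.
print(sorted(recovery_scope("order_entry", dependencies)))
```

In a real environment the map has thousands of nodes and feeds, which is precisely why the scoping exercise is no longer simple.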

Finally, many companies have found themselves with multiple datacenters and infrastructure models due to acquisitions and mergers; therefore, there is a foundation today upon which many organizations can build an internal solution. These resources can be leveraged in order to improve the resilience of the organization through multiple varieties of internal solutions.

Throughout this discussion, the concept of “always-on” has been referred to in multiple instances. It is important to understand that this strategy is comprised of the following:

1. “Always-on” is a strategy that employs solutions that address both OPERATIONAL OUTAGES (short-duration, high-frequency events) and DISASTERS (long-duration/impact, low-frequency events).

2. “Always-on” is a strategy that does not need to be “DECLARED”; rather, the design of the solution monitors and reacts to the event at hand (see the sketch following this list).

3. “Always-on” is a strategy that employs DEDICATED resources that are always provisioned, managed and ready to be used to sustain production operations.
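
The following sketch illustrates point 2 above: a monitoring loop that detects a site failure and redirects traffic on its own, with no disaster declaration in the path. The site URLs, health-check endpoint and routing function are hypothetical stand-ins for whatever load balancer, GSLB or DNS mechanism an organization actually uses.

```python
import time
import urllib.request

# Hypothetical production and recovery endpoints, in order of preference.
SITES = ["https://app.primary.example.com", "https://app.recovery.example.com"]

def healthy(url, timeout=5):
    """Tiny health probe: the site is considered 'up' if its health page answers 200."""
    try:
        return urllib.request.urlopen(f"{url}/health", timeout=timeout).status == 200
    except OSError:
        return False

def route_traffic_to(url):
    """Placeholder: in practice this would update a load balancer, GSLB or DNS record."""
    print(f"Routing production traffic to {url}")

# Monitor loop: reacts to the event at hand; no disaster declaration required.
while True:
    active = next((site for site in SITES if healthy(site)), None)
    if active:
        route_traffic_to(active)
    time.sleep(30)
```

A real implementation would add hysteresis, alerting and failback logic, but the essential property is the same: detection and rerouting happen without a declaration step.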

Keep in mind that “always-on” is typically only one portion of a complete internal recovery strategy. For many organizations this level of availability is implemented only to support the most mission-critical applications. Other, less expensive solutions are typically relied upon within internal recovery to address the various tiers of systems.

The following four approaches are examples of solutions that could be implemented by a customer that wants to leverage internal resources for recovery. None of these solutions are mutually exclusive – two or more of them could be implemented in a hybrid fashion. Each of these solutions is highly dependent on the size, geography, scope and interdependencies within each client environment.

Using Expandable IT Capacities and “Quick Ship”


A company may choose to utilize excess capacity, standby engines on processors or empty storage frames that can be populated at the time of disruption.

In order to employ this solution, the recovery facility needs to be outfitted with the necessary environmentals: power, network and HVAC. Additionally, contracts must be in place to quickly acquire additional recovery assets.

This solution has limitations because you not only have to acquire the capacity but also restore the data and applications. It is most appropriate for less important applications or business processes with recovery time requirements of 72 hours or greater, but clearly this is NOT an “always-on” solution.

Repurposing Test, Development or QA Resources

A company may choose to separate their production applications and data from their QA, test or development environment. This separate environment can be “repurposed” to serve as a production environment at the time of disaster.

In order to employ this solution, the QA/test/development environment should be located in a datacenter at least 150 miles from the production site. Additionally, it is recommended that each site be located on a different power grid and served by diverse network routes.

This solution has limitations because you would have to take down the test and development environment before restoration can begin. It could potentially be used for recovery time requirements of 48 hours or less.

Workload Balancing with Expandable IT Capacities (multi-site exploitation)

A company may choose to split production capacity between two or more locations. In this solution some companies actually route or load-balance traffic between sites and platforms and mirror the attached storage.

In order to employ this solution, the datacenters should be located at least 150 miles from one another and not be subject to any common threats (e.g. a shared flood zone, hurricane path or fault line). Additionally, it is recommended that each site be located on a different power grid and served by diverse network routes.
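
As a small illustration of the separation guideline, the sketch below applies the standard great-circle (haversine) formula to check whether two candidate sites are at least 150 miles apart; the coordinates are examples only.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3959.0

def great_circle_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, via the haversine formula."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# Example coordinates: Chicago, IL and Columbus, OH (roughly 280 miles apart).
distance = great_circle_miles(41.88, -87.63, 39.96, -83.00)
print(f"{distance:.0f} miles apart - separation requirement met: {distance >= 150}")
```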

This solution is not only more expensive than the above solutions, but also is a significant challenge to maintain. Companies need to be cautious as change management is employed across all sites.

Real-Time Solution at Client Remote Facility (disk mirroring, server redundancy and replication)

A company can develop a secondary location just like a hot site – for recovery purposes only. This site may be further supported by additional “bunker” sites that are used to provide remote data storage capabilities.

This has been done by some companies with very stringent recovery requirements. It is the most expensive of the described solutions; however, it provides the highest level of protection. The solution comprises both mirrored storage and dedicated processing capacity, augmented with “onDemand” processing capacity.

The following data protection concepts have emerged on the scene and are being increasingly adopted by organizations to meet internal “always on” recovery goals.

Continuous Data Protection (CDP) is a new type of “virtualized” disk (a simplified journaling sketch follows this list):

  • A “journal” of every write to the disk is maintained
  • Essentially provides a finer degree of granularity than FlashCopy technology
  • RTO advantage: no backup window is required
  • Transactions/journal logs are continuously captured, with changes stored independently of the primary data
  • RPO advantage: enables recovery to any point in time (APIT) in the past
  • APIT recovery is available up to a certain time/date, then degrades to a single point in time under information lifecycle management (ILM) policies
  • RTO and RPO are measured in seconds to minutes
  • Addresses logical data protection problems (e.g. corruption or accidental deletion)
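
As a rough illustration of the journaling idea behind CDP (a toy model, not any vendor’s implementation), the sketch below records every write with a timestamp and replays the journal to reconstruct the volume at an arbitrary point in time.

```python
from datetime import datetime

class JournaledVolume:
    """Toy continuous-data-protection model: every write is journaled with a timestamp."""

    def __init__(self):
        self.journal = []  # list of (timestamp, block_id, data)

    def write(self, block_id, data, when=None):
        self.journal.append((when or datetime.utcnow(), block_id, data))

    def state_at(self, point_in_time):
        """Replay the journal up to 'point_in_time' to rebuild the volume (APIT recovery)."""
        blocks = {}
        for ts, block_id, data in self.journal:
            if ts <= point_in_time:
                blocks[block_id] = data
        return blocks

# Example: recover the volume as it looked just before a corrupting write at 10:05.
vol = JournaledVolume()
vol.write(1, "ledger v1", datetime(2008, 1, 1, 10, 0))
vol.write(1, "corrupted", datetime(2008, 1, 1, 10, 5))
print(vol.state_at(datetime(2008, 1, 1, 10, 4)))   # {1: 'ledger v1'}
```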

Advanced Replication Management

  • Disk replication functions are increasingly complex, requiring software assistance to set up and manage
  • New disk replication functions have been introduced yearly since 2005
  • The requirement for a single virtual volume to support multiple simultaneous relationships has increased (e.g. multi-target, HyperSwap, CDP)
  • The emergence of server virtualization similarly enables LPAR and application “replication” or movement
  • Coordinating the combined replication of disks and LPARs/applications is, and will continue to be, required (a simplified ordering sketch follows this list)
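
The last point can be illustrated with a simple ordering sketch. The function names below are hypothetical placeholders for vendor-specific replication and virtualization calls; the point is that a consistent replication state must be established on the disks before the LPAR or application is brought up elsewhere.

```python
# Hypothetical orchestration of a coordinated disk + LPAR failover.

def quiesce_application(app):
    print(f"Quiescing writes for {app}")

def create_consistency_point(volumes):
    print(f"Marking a consistent replication point across {volumes}")

def activate_lpar_at_recovery_site(lpar):
    print(f"Activating {lpar} against the replicated volumes")

def resume_application(app):
    print(f"Resuming {app} at the recovery site")

def coordinated_failover(app, lpar, volumes):
    """Order matters: the disks must be consistent before the LPAR comes up elsewhere."""
    quiesce_application(app)
    create_consistency_point(volumes)
    activate_lpar_at_recovery_site(lpar)
    resume_application(app)

coordinated_failover("erp_core", "LPAR_PROD01", ["vol_data", "vol_logs"])
```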

While there are many tangible organizational benefits to adopting “always-on” recovery, a number of potential pitfalls and issues need to be recognized in order to fully realize the advantages of the strategy.

1. The organization needs to clearly define which outages should be remediated through a fully mechanized process (e.g. WAN routing metrics) and which need to be handled through a more traditional “decision/declaration-based” process.

2. In an “always-on” solution, change/configuration management needs to be carefully planned and executed across the entire environment; improperly implemented patches or code changes may cause unintended consequences throughout the entire environment.

3. Information Security hardening needs to be fully integrated and managed within both the production and recovery environments.

In Closing

Business, process and technology changes are constantly driving down the tolerance for downtime and data loss, at times putting them at odds with organizations’ existing recovery strategies. Many companies have taken it upon themselves to investigate and implement internal capabilities in order to increase control, reduce costs and shorten the recovery window. “Always-on” solutions are becoming increasingly popular as businesses centralize operations and systems and move to global 24x7x365 operations. Implementing internal recovery, and in particular “always-on” solutions, is not without pitfalls; however, with careful planning and organizational commitment, these capabilities may provide a more resilient operation and a competitive advantage in the marketplace.


About the Authors
Damian Walch, CISA, CISSP, CBCP, MBCI, is a Director with Deloitte & Touche LLP’s Enterprise Risk Services. He has worked with over 120 companies on designing business continuity and availability strategies. He can be reached at [email protected].

Matthew Zielinski is currently a Senior Consultant within Deloitte & Touche, responsible for delivering business continuity planning solutions to clients across all industries. He is a Certified Business Continuity Professional (CBCP), certified by DRI International. He may be reached via email at [email protected].
