Avoiding an IT Crisis: Understanding and Overcoming the Pitfalls of Replication Systems

By Kelly Jones | January 10th, 2006

As data and applications have become critical to an organization’s ability to operate, companies have invested heavily to protect these assets. One technology designed to protect both data and applications is replication. However, despite millions of dollars invested in them, replication solutions often fail in practice, when they are needed most. This paper analyzes the key points of failure and suggests how to avoid them.

Simply stated, replication involves creating a second copy of data as a means of protecting that data. Replication can be leveraged to ensure high availability of mission-critical business applications. This paper focuses on replication applied to the problem of delivering high availability. This scenario is one of the most challenging and complex uses of replication, but when properly implemented, yields high levels of continuity for mission-critical applications.

When used for high availability, replication is performed between two different locations, rather than within the same location. Using multiple locations reduces the risk associated with facility and regional disasters. Because the focus is delivering high availability, replication is performed continuously rather than in batch mode. Continuous replication maintains the most current copy of the data, without the gaps that occur with a batch approach.

Replication is sometimes confused with clustering, which involves using two or more storage devices at the same location to simultaneously receive the same data stream. This provides two identical copies of the data and mitigates the risk of a disk failure. It does not address other issues associated with high availability. This paper does not discuss clustering or compare clustering to replication.

Replication Solution Overview
A company has many options for how to protect data, including tape backups, disk mirroring, electronic vaulting, and data replication. Data replication is a powerful alternative to other techniques. When properly installed and managed, data replication provides more than data protection; it enables application availability.

There are many ways to deploy a replication solution to safeguard against a company’s most pressing protection risks. Solutions can be deployed either locally or remotely, and can replicate synchronously or asynchronously. The multitude of options allows a company to deploy an optimal solution based on its needs, but creates complexity both in deployment and manageability. This complexity is the root of many challenges associated with replication and often a catalyst of failure.

Data Protection Versus High Availability
When evaluating replication options, it is important to understand the specific reasons for the solution. Two primary reasons for implementing a replication solution are data protection and high availability.

Data protection is primarily focused on data integrity, ensuring that the data and all changes to that data are captured in a backup copy. Immediate access and continued use of the data are not top priorities. For example, it is better to ensure 100-percent data accuracy with medical records than it is to ensure 100-percent real-time access. Inaccurate prescription data may result in loss of life, while a delay in refilling a prescription will only result in a patient becoming annoyed.

High availability, on the other hand, is concerned with three attributes: data integrity, immediate access, and continued application use. A financial clearinghouse needs to ensure its data is accurate, as trade commitments are legally binding. Furthermore, trades need to execute in a timely fashion or else the clearinghouse incurs financial risks. Satisfying the additional need for data access and continued application use requires continuity-driven replication solutions to provide failover and failback capabilities.

A failover is an action initiated by IT personnel to move a secondary application environment from a backup role to a primary role. Usually a failover occurs when a problem develops that prevents the primary environment from properly functioning. The secondary environment starts providing end users with the services originally provided by the primary environment. The data that was replicated to the secondary environment is now put to use. Failback is similar to failover, but returns control back to the primary environment once it is restored to a proper working state. These capabilities turn out to be far more complex and challenging than basic data replication.

Further complicating the technical challenges associated with failover and failback is that high availability solutions must hedge against regional disasters and outages. This requires deploying the high availability replication solution in a geographically remote backup environment, ideally more than 20 miles away. Distance introduces bandwidth, latency, and cost constraints, factors that are negligible over a LAN.
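The interaction between bandwidth and data change rate can be made concrete with a rough calculation. The following Python sketch uses purely hypothetical numbers (change rate, link speed, and busy-period length are all assumptions, not measurements from any product) to estimate how a replication backlog grows when the WAN cannot keep up.

```python
# Back-of-the-envelope model of replication queue growth over a WAN.
# All numbers below are illustrative assumptions, not vendor measurements.

change_rate_mbps = 40.0      # average rate at which the primary generates changed data
wan_bandwidth_mbps = 25.0    # bandwidth available to the replication link
window_hours = 8.0           # length of the busy period being modeled

# Data produced versus data the link can drain during the window (in megabits).
produced = change_rate_mbps * window_hours * 3600
drained = wan_bandwidth_mbps * window_hours * 3600

backlog_megabits = max(0.0, produced - drained)
backlog_gigabytes = backlog_megabits / 8 / 1024

print(f"Backlog after {window_hours:.0f} h: {backlog_gigabytes:.1f} GB")
if wan_bandwidth_mbps > change_rate_mbps:
    print("Link can keep up; queue stays near zero.")
else:
    # Time to drain the backlog once the change rate drops off.
    drain_hours = backlog_megabits / (wan_bandwidth_mbps * 3600)
    print(f"Drain time at full bandwidth: {drain_hours:.1f} h")
```

With these assumed figures, an eight-hour busy period leaves roughly 50 GB queued and nearly five hours of drain time, which is why bandwidth planning matters far more over distance than it does on a LAN.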

Using Replication to Provide High Availability
Understanding replication and its prominent place in delivering high availability for mission-critical applications requires a deeper look at the key elements that enable high availability, particularly across a geographically diverse or complex IT environment.

Data Replication – the set of processes that collect and move data from a primary source to a secondary repository. Data replication includes a process to watch a data repository for new and changed data, which it then duplicates and places in a replication queue. The replication queue stores and manages the pending data transactions, which are then replicated to the secondary data repository. A process monitors and verifies that each transaction arrived safely at the destination. The backup data provides a source safe from many issues that can affect the primary data source. On its own, this component provides sufficient protection when the data does not feed applications that require near-continuous uptime.
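The flow just described (watch for changes, queue them, apply them to the secondary, verify arrival) can be illustrated with a minimal in-memory sketch. The class and function names, the dictionary "repositories," and the checksum-based verification below are illustrative assumptions, not any particular product's design.

```python
"""Minimal in-memory sketch of the replication flow described above:
capture changes, queue them, apply them to a secondary copy, and verify.
Names and data structures are illustrative only."""

import hashlib
import queue

replication_queue = queue.Queue()   # pending transactions awaiting transfer
primary = {}                        # stand-in for the primary data repository
secondary = {}                      # stand-in for the secondary data repository


def capture_change(key, value):
    """Apply a change to the primary and enqueue it for replication."""
    primary[key] = value
    checksum = hashlib.sha256(repr(value).encode()).hexdigest()
    replication_queue.put((key, value, checksum))


def replicate_pending():
    """Drain the queue, apply each transaction to the secondary, and verify it."""
    while not replication_queue.empty():
        key, value, checksum = replication_queue.get()
        secondary[key] = value  # in practice this is a network transfer to the remote site
        # Verification step: confirm the transaction arrived intact.
        received = hashlib.sha256(repr(secondary[key]).encode()).hexdigest()
        if received != checksum:
            raise RuntimeError(f"Replication of {key!r} failed verification")


capture_change("order-1001", {"status": "shipped"})
replicate_pending()
print("In sync:", primary == secondary)
```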

Failover – the set of processes that switch a secondary data source from its backup role to a primary role. Often, this is tightly integrated with other processes that redirect users to an alternative set of application servers. Depending on the application, further security processes and control logic are required. Assuming that all systems are synchronized, all software versions and revision levels are consistent, the systems have been tested, and personnel are standing by at both data centers, the failover can proceed smoothly. During a smooth failover process, users have near-continuous access to their critical applications with minimal interruption.
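A failover sequence is easier to reason about when its preconditions are explicit. The sketch below shows one possible shape of such an orchestration: every check, and the promote/redirect callables, are placeholders for real, environment-specific steps, not an actual product's procedure.

```python
"""Illustrative failover sequence: verify the preconditions described above,
then promote the secondary and redirect users. Every check is a placeholder
for a real, environment-specific test."""


def checks_pass(checks):
    """Run each named readiness check and report any failures."""
    failures = [name for name, check in checks.items() if not check()]
    for name in failures:
        print(f"Precondition failed: {name}")
    return not failures


def failover(promote_secondary, redirect_users, checks):
    """Promote the secondary only when every precondition holds."""
    if not checks_pass(checks):
        raise RuntimeError("Failover aborted: secondary environment not ready")
    promote_secondary()     # switch the secondary data source to a primary role
    redirect_users()        # point end users at the alternate application servers
    print("Failover complete")


# Hypothetical checks wired in as simple callables.
failover(
    promote_secondary=lambda: print("Secondary promoted"),
    redirect_users=lambda: print("DNS updated"),
    checks={
        "replication queue drained": lambda: True,
        "software versions match": lambda: True,
        "secondary services healthy": lambda: True,
    },
)
```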

Failback – involves returning operational control back to the original system. In many instances, a company’s primary environment is more robust or better situated to deliver superior end user performance, benefiting from greater bandwidth or a concentration of local users. For these performance or security reasons, a systems administrator will seek to failback to the primary system as soon as possible once the primary system has been restored to a fully functional state. Failback is similar in many respects to failover, but adds steps and complexity to account for resynchronizing the two systems. This involves recovering the primary system to a working state and reestablishing data integrity, as the two data sources have become out of sync since the initial failover.
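The resynchronization step is what distinguishes failback from failover. The sketch below illustrates the idea under simplified assumptions: two dictionaries stand in for the data repositories, and the delta copied back represents the changes that accumulated on the secondary while it served users.

```python
"""Illustrative failback: resynchronize the restored primary with changes that
accumulated on the secondary, then return control. The dictionaries stand in
for real data repositories; names are hypothetical."""

primary = {"order-1001": "shipped"}                         # restored, but stale
secondary = {"order-1001": "shipped", "order-1002": "new"}  # served users during the outage


def resynchronize(stale, current):
    """Copy records that changed on the acting primary back to the restored system."""
    delta = {k: v for k, v in current.items() if stale.get(k) != v}
    stale.update(delta)
    return delta


def failback(redirect_users_to_primary):
    changes = resynchronize(primary, secondary)
    print(f"Resynchronized {len(changes)} changed record(s)")
    if primary != secondary:
        raise RuntimeError("Failback aborted: data sources still out of sync")
    redirect_users_to_primary()   # return operational control to the primary site


failback(lambda: print("Users redirected back to the primary environment"))
```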

Replication Shortcomings
While replication solutions promise risk mitigation, higher availability, and data protection, attempts to use these solutions often result in frustration and wasted time and money, for a number of reasons.

First, installation alone can be a deterrent. Companies frequently have so many issues associated with installation or management of a complex solution that they abort their implementation and “shelve” the replication software.

Once installed, many companies find day-to-day replication manageable. The bigger challenge is encountered in the event of a large-scale systems failure or disaster, when attempts are made to activate the replication solution to maintain application availability and recover operations. Failover processes are difficult to test and don’t always work as expected.

Accounting for all permutations of failover requires extensive process mapping across the entire IT environment that supports the application. Finding the expertise required to perform this mapping is difficult and rarely exists within one group in a company, or even throughout the entire company. Failure to adequately address all possible failover scenarios creates risk and possibly a false sense of security.

Top 5 Causes of Replication Failure
An examination of the top five causes of replication failure provides insight into the challenges a company faces when seeking to successfully implement a replication solution, as well as a useful checklist for benchmarking purposes.


1. Secondary Environment Not Ready for Failover
A company establishing a secondary environment to provide superior protection of its most important applications and data must address many challenges. One of the biggest challenges is maintaining two nearly identical environments. All software, patches, and access levels need to be consistent. Over time, many changes that are made to one system fail to get implemented on the other system. These differences often result in a secondary environment that is out of sync with the primary environment to the point where failover cannot occur.

Another factor impacting failover readiness in the secondary environment is that critical processes do not function properly. Most organizations do not have clear visibility into the readiness of their systems for failover, and are surprised by one of the following problems:

  • Replication is not performing normally
  • The replication queue is too large
  • The secondary environment is not healthy
  • Primary/secondary environment software and configurations are out of sync
  • Dependent systems are not designed for failover

It is critical for a company to take the time necessary to develop processes to control the introduction and distribution of changes and updates to both environments. One undetected change can cause hours- or days-long delays in service resumption. It is equally important to monitor all critical processes that impact the readiness of the secondary environment, as early problem detection ensures system readiness when a situation demands use of the secondary system.
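Monitoring of this kind lends itself to simple automated checks. The sketch below mirrors the problem list above; the thresholds, status fields, and configuration values are assumptions chosen for illustration, not recommendations.

```python
"""Illustrative readiness monitor for the secondary environment. The checks
mirror the problem list above; thresholds and data sources are assumptions."""

MAX_QUEUE_DEPTH = 10_000   # hypothetical alert threshold for pending transactions


def check_readiness(status):
    """Return a list of readiness problems found in a status snapshot."""
    problems = []
    if not status["replication_running"]:
        problems.append("Replication is not performing normally")
    if status["queue_depth"] > MAX_QUEUE_DEPTH:
        problems.append(f"Replication queue too large ({status['queue_depth']} pending)")
    if not status["secondary_healthy"]:
        problems.append("Secondary environment is not healthy")
    if status["primary_config"] != status["secondary_config"]:
        problems.append("Primary/secondary software or configuration out of sync")
    return problems


# Hypothetical snapshot gathered by whatever monitoring the site already runs.
snapshot = {
    "replication_running": True,
    "queue_depth": 23_500,
    "secondary_healthy": True,
    "primary_config": {"mail_server": "SP2", "os_patch": "KB912"},
    "secondary_config": {"mail_server": "SP1", "os_patch": "KB912"},
}

for line in check_readiness(snapshot) or ["Secondary environment ready for failover"]:
    print(line)
```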

2. Manual Error in Failover Process
People are often the weak link in a failover process. Manual errors introduced during a failover sequence can corrupt the entire process. The problems caused by the error must then be fixed, and the entire process restarted.

People are more prone to make mistakes during a crisis. The more complex and step-intensive the failover process, the more likely mistakes will occur and result in failure. As an example, it takes 350 steps to failover a 10-server Microsoft Exchange environment. Any single human error in that sequence will break the failover process. A few examples of mistakes include:

  • Missed process steps
  • Steps executed out of sequence
  • Process initiated before dependent steps fully executed
  • Misjudged state of the primary or secondary environment
  • Typing errors
  • Steps out-of-date for current software versions

Often, these steps must be performed under pressure or over a remote connection. The steps required to remediate errors introduced during a failover are complex, often uncharted, and should be avoided at all cost.
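One way to reduce this class of error is to script the runbook so steps always execute in order and each step validates its own outcome before the next one runs. The step names in the sketch below are hypothetical and drastically shortened; they do not reproduce any real 350-step procedure.

```python
"""Illustrative runbook runner: steps execute in a fixed order, each step
verifies its own outcome, and a failure halts the sequence instead of letting
later steps run against a bad state. Step names are hypothetical."""


def run_runbook(steps):
    """Execute (name, action, verify) steps in order; stop on the first failure."""
    for number, (name, action, verify) in enumerate(steps, start=1):
        print(f"Step {number}: {name}")
        action()
        if not verify():
            raise RuntimeError(f"Step {number} ({name}) failed verification; halting")
    print("Runbook completed")


# Hypothetical, drastically shortened failover runbook.
run_runbook([
    ("Stop replication agent", lambda: print("  agent stopped"), lambda: True),
    ("Mount databases on secondary", lambda: print("  databases mounted"), lambda: True),
    ("Update DNS to secondary site", lambda: print("  DNS updated"), lambda: True),
])
```

Scripted execution also removes the possibility of steps being run out of sequence or started before their dependencies have completed.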

3. Experts Not Available During Crisis
Failover processes are dependent on expert staff who may not be available during a crisis. The failover process touches upon all technical disciplines, from hardware and operating systems, to applications and databases, to networking and security. In a large organization, these disciplines are highly specialized with different personnel responsible for each. If any one person is unavailable, the failover process can break down. There are many reasons people are not available, including:

  • Occupied with other crisis efforts
  • Physically unable to access facilities
  • Can’t be contacted during an emergency
  • No internet access available to control failover

As part of the process to develop and deploy a failover solution, it is important to establish a list of required skills and resources available for each skill. Full contact information and predetermined communications protocols need to be created, continuously updated, and readily available to all team members. These steps will aid recovery efforts and mitigate some personnel risks.

4. Failover Process Unable to Scale
In large organizations, the scale of a failover or recovery effort can become a critical bottleneck. The technical staff is limited, often managing large numbers of systems. A critical application failure or a facility problem can result in dozens or more systems that require failover. Multi-server failover is a serial, manual process. The technical staff comes under tremendous pressure to rapidly restore service. Under these conditions, scaling issues are likely, such as:

  • One administrator can only failover one server at a time
  • An environment of 25 or more servers can take 10 or more hours for one administrator to failover or failback
  • Many mutually dependent systems will not work until the entire environment has failed over
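Where individual server failovers can be safely automated and are independent of one another, running them concurrently removes the one-administrator, one-server-at-a-time bottleneck. The sketch below uses a standard Python thread pool; the per-server failover function is a placeholder, and real environments still need dependency ordering for mutually dependent systems.

```python
"""Illustrative parallel failover: independent server failovers run
concurrently instead of serially. The per-server failover function is a
placeholder for real automation."""

from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def failover_server(name):
    """Placeholder for the failover procedure of a single server."""
    time.sleep(0.1)           # stands in for the real, much longer procedure
    return f"{name}: failed over"


servers = [f"app-{i:02d}" for i in range(1, 26)]   # a hypothetical 25-server environment

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(failover_server, s): s for s in servers}
    for future in as_completed(futures):
        print(future.result())
```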

5. Untested Failover Assumptions Don’t Work
Complex multi-server failover is often too sensitive to fully test. Without testing every permutation of systems and failure causes, it’s impossible to know exactly what will happen during a real crisis. Issues that contribute to failover breakdowns include:

  • Large, complex environments can have many failure types and scenarios
  • Multi-server failover involves constantly changing conditions
  • Server-by-server failover is very different from a holistic failover of the entire environment
  • Different failures result in different behaviors

Some methods available to mitigate risk include incorporating “what-if” scenario planning sessions and “pre-mortems,” a form of role playing that allows a technical staff to identify untested failover scenarios and potential bottlenecks.
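"What-if" sessions benefit from an explicit inventory of scenarios, so the team can see which combinations have never been exercised. The sketch below enumerates failure types against affected components; the categories and the set of already-tested scenarios are examples only.

```python
"""Illustrative scenario matrix for "what-if" planning: enumerate failure
types against components and flag combinations that have never been tested.
The categories and the tested set are examples only."""

from itertools import product

failure_types = ["power loss", "WAN outage", "storage failure", "software fault"]
components = ["database tier", "application tier", "replication link"]

# Hypothetical record of scenarios actually exercised in a drill.
already_tested = {("power loss", "database tier"), ("WAN outage", "replication link")}

untested = [scenario for scenario in product(failure_types, components)
            if scenario not in already_tested]

print(f"{len(untested)} of {len(failure_types) * len(components)} scenarios never tested:")
for failure, component in untested:
    print(f"  - {failure} affecting the {component}")
```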

The Impact of Replication Failure
Understanding the impact of replication failure is essential to making informed business decisions about replication investments. The following provides a summary of the key issues associated with replication failure.

Replication failures have long recovery times. Complex recovery processes take time to implement. Restoring an entire environment requires significant skills. The availability of critical resources and personnel becomes a bottleneck. Often, there are periodic delays during the failover process as key skills are not available. If large amounts of data must be moved as part of a recovery effort, bandwidth constraints can further prolong the recovery time.

Replication failures can cause other problems. Replication failures can have widespread impact. Database corruption can occur, requiring substantial effort and time to restore a database to a functional state. Even when the databases are not corrupted, effort is required to determine how much data was lost during the failure and to recover the missing transactions.

Replication failures have a high cost. Replication is almost exclusively used for critical systems. Data in these systems is important enough to protect and therefore to recover. The time and effort required to recover the data, returning it to a usable state, is substantial. Replication failure forces application downtime, resulting in negative economic consequences: lost revenue, missed opportunities, degraded customer satisfaction, and declines in shareholder value. Any replication failure will prove costly.


To Succeed, Eliminate the Risks
Replication solutions represent a powerful way to protect a company’s most important data and applications. But replication solutions, particularly those that are used to provide failover and failback for high availability environments, are fraught with risks, often resulting in a low success rate for companies attempting to implement these solutions. The key to successfully implementing a replication solution is to understand all the risks and eliminate as many as possible. Planners should start with the risks identified above in the top 5 causes of failure. Through careful evaluation, they can then select the approach that best fits the company.



About the Author: Kelly Jones

Kelly Jones, PhD, is Vice President of Technology at MessageOne, responsible for bringing new industry-leading replication and failover technologies to the market. Dr. Jones joined MessageOne from Evergreen Assurance, a provider of application availability and disaster recovery software, where he served as Senior Vice President, Technology Development and Client Operations. Prior to Evergreen Assurance, Jones founded Panacya, a successful provider of systems management and monitoring software. Jones earned his doctorate from Texas A&M University. Dr. Jones can be reached at [email protected] or at (512) 652-4500.
