Be Like Netflix, not Reddit: SaaS Disaster Response

By Chinmayee Paunikar | February 17, 2021

I don’t know how people got anywhere without Google Maps. I hear they printed out a “MapQuest,” whatever that is. As a kid, I assumed that adults just knew all the directions and that one day I would too.

Boy, was I wrong! Today, I am so heavily dependent on Google Maps that if I don’t have my phone (low battery, poor internet connection), there is a huge possibility that I will keep going around in circles. I would be lost, or worse, I’d be late! It would be a disaster.

To avoid disaster, I have a continuity and recovery plan: ask someone for directions.

Having a plan to respond to a problem allows you to respond quickly and to limit its impact. Generally, the bigger a problem could be, the more important it is to have a plan in place to respond to it. It’s very important for businesses to have a continuity and recovery plan for business-stopping circumstances like natural disasters and cyber attacks.

In the old days of the 1970s, a majority of businesses held paper records, which, although susceptible to fire and theft, didn’t depend on reliable IT infrastructure.

The 1990s saw the development of three-tier architecture, separating data from the application layer and the user interface. This made data maintenance and backup an easier process.

The 2000s saw server virtualization. Virtualized servers could be restored in a matter of hours because businesses no longer needed to rebuild operating systems, servers, and applications separately.

The 2010s saw the rise of cloud computing. The software business has been in flux, and software-as-a-service (SaaS) platforms have made it so that data is replicated from production systems within minutes and can be running as the live systems just as quickly.

The work and prep you need to do for DR is low, but it is not zero.

You may believe that you don’t need to worry about disaster recovery, since you don’t own the servers anymore. Today, your DR plan might be: “AWS/Google/Azure/Other Cloud Provider will take care of it.”

But, if you are only relying on your SaaS provider’s native security tools, you are asking for trouble.

Disaster recovery in a cloud environment may be less about how you deal with a failover and more about how you deal with data corruption and loss.

Your data might be safe, but did you consider human error? What if someone accidentally deletes important files? What if someone overwrites documents? What if they download a virus? What if the cloud servers have an outage?

And these are just some things that could happen accidentally, not even exploring the dark world of malicious attackers.

Regardless of the SaaS vendor’s reliability, a SaaS vendor is still a single point of commercial and perhaps legal control. Sure, SaaS vendors have a recovery solution. But is it fast? Is it easy? What about the cost? Will the recovered data be in a useful format?

High quality SaaS solutions provide robust security measures to protect their customers’ data, but there may be circumstances that allow your SaaS provider to pass the buck to you. If you read your cloud provider’s cybersecurity compliance reports and they state something along the lines of “customers are responsible for developing their own disaster recovery and business continuity plans that address the inability to access or utilize our services,” then you had better have a backup plan in place, because they don’t have to help you.

Now you may be thinking: if AWS, Google, or Microsoft is down, the world will have bigger problems to worry about. True, but you want to do everything you can to avoid revenue losses, retain customer trust, and ensure your business’s survival, right? A cloud platform’s localized loss may not affect all of its customers, but it could affect you. If you want to mitigate risks to your business, have a plan in place for the cloud outage scenario.

Cloud Outages: More Likely than You Think

Many big companies rely on the AWS cloud, and Netflix is one of them. In fact, Netflix was AWS’s most important customer during its rise to the top of the public cloud market. Netflix uses AWS for nearly all its computing and storage needs.

So, when AWS had an outage in 2011, you can only imagine the amount of chaos it caused. A number of popular websites that depended on AWS were taken completely offline for several hours, including social media hubs like Reddit, Quora, and Hootsuite. The AWS issue wasn’t completely resolved even more than 24 hours later, and services were still degraded.

Reddit was down, or at least degraded, for 36 hours. That’s not to say AWS isn’t to blame, but there are a few things that Reddit could have done better. For instance, they were using a single Elastic Block Store (EBS) disk to back up some of their older master databases. They also acknowledged that their code was written with the assumption that there would be one datacenter. In short: a single point of failure. If Reddit had a plan in place, their operations would have been less affected.

Compare Reddit’s outage response with Netflix’s. As stated, Netflix also relies on AWS, but they had a very different outcome. The company saw a higher than usual error rate and higher latency, but the outage did not affect Netflix’s customers’ ability to find and start movies, nor did Netflix see a noticeable increase in customer service calls. In a practical sense, Netflix was almost entirely unaffected by this outage.

Why were some websites affected and others not?

Netflix designed its systems explicitly to deal with these sorts of failures. Netflix built “Chaos Monkey” to simulate failure: a service that kills other services. It works by intentionally disabling computers in the production network to test how the remaining systems respond to the outage. This prepares you and your team for random instance failures. Chaos Monkey helped jumpstart Chaos Engineering as a new engineering practice. In 2011, inspired by the success of the original Chaos Monkey, Netflix added a series of tools known as the Simian Army. The Simian Army is open source and consists of monkeys (tools) ranging from Chaos Monkey and Janitor Monkey to Conformity Monkey.
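You don’t need Netflix’s tooling to start experimenting with the idea. Here is a minimal, hypothetical sketch of a chaos-monkey-style drill, assuming your workloads run on AWS EC2 and that instances opted in to experiments carry a chaos:optin tag (the tag name and region are made up for illustration). It is a toy version of the concept, not Netflix’s actual Chaos Monkey.

```python
import random
import boto3

# Hypothetical chaos-monkey-style drill: terminate one randomly chosen,
# explicitly opted-in EC2 instance and watch how the rest of the system
# copes. Only targets instances tagged chaos:optin = true.

ec2 = boto3.client("ec2", region_name="us-east-1")

def opted_in_instances():
    """Return IDs of running instances tagged for chaos experiments."""
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos:optin", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]

def unleash_the_monkey():
    instances = opted_in_instances()
    if not instances:
        print("No opted-in instances found; nothing to break today.")
        return
    victim = random.choice(instances)
    print(f"Terminating {victim} to see what happens...")
    ec2.terminate_instances(InstanceIds=[victim])

if __name__ == "__main__":
    unleash_the_monkey()
```

Real chaos tooling adds scheduling, opt-outs, and blast-radius controls, but even a crude drill like this will expose hidden single points of failure before a real outage does.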

Netflix has obviously invested a great deal of resources into developing their disaster response plans. Take some lessons from them when developing your own.

Netflix’s Secret for Uptime Success

Data Stored Across Zones

Use a multiple Availability Zone configuration for your most critical database workloads. This ensures that there are multiple redundant hot copies of the data spread across zones. If one zone fails, a failover is automatically performed to the standby database.
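One common way to get this behavior on AWS is an RDS Multi-AZ deployment. Here is a hedged sketch using boto3; the identifier, engine, sizes, and credentials are placeholders, and your own database platform may offer an equivalent option.

```python
import boto3

# Minimal sketch: create a database instance with a synchronous standby
# in a second Availability Zone. If the primary's zone fails, RDS
# promotes the standby automatically. All names/values are placeholders;
# keep real credentials in a secrets manager, not in source code.

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="orders-db",           # hypothetical identifier
    Engine="postgres",
    DBInstanceClass="db.m6g.large",
    AllocatedStorage=100,                       # GiB
    MasterUsername="app_admin",
    MasterUserPassword="change-me-immediately", # placeholder only
    MultiAZ=True,                               # hot standby in another AZ
    BackupRetentionPeriod=7,                    # days of automated backups
)
```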

“N+1” Redundancy

Designing cloud architecture with N+1 redundancy means allocating more capacity than is actually needed at any point in time. This gives you the ability to cope with large spikes in load. It definitely costs more money; however, it makes your systems more resilient to failures.
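The arithmetic is simple. A hypothetical example, assuming a peak of 4,500 requests per second and servers that each handle about 1,000: you need N = 5 to cover the peak, so you run 6.

```python
import math

def n_plus_one(peak_requests_per_sec: float, capacity_per_server: float) -> int:
    """Servers to provision so that losing any one still covers peak load."""
    needed = math.ceil(peak_requests_per_sec / capacity_per_server)  # N
    return needed + 1                                                # N + 1

# Hypothetical numbers: 4,500 req/s at peak, 1,000 req/s per server.
print(n_plus_one(4500, 1000))  # -> 6: one server can fail at peak with no loss
```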

Stateless Services

Stateless services don’t track information from call to call. They are designed so that any service instance can serve any request in a timely fashion. In stateful designs, much of the complexity revolves around maintaining session state, which typically involves replicating it across the cluster so it survives when one of the servers goes offline. With stateless services there is no session state to protect, so if/when a server fails, it isn’t a big deal. The request can be routed to another service instance, and a new node can automatically be spun up to replace it.

Take note that a stateless service may not be right for your application, which could rule this option out.
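When statelessness does fit, the trick is to push whatever state you have into an external store. A minimal sketch, assuming a Redis instance is available for session data (the hostname and key layout are made up): the handler keeps nothing in process memory, so any instance, including a freshly spun-up replacement, can serve the next request.

```python
import json
import redis

# Sketch of a stateless request handler: all session data lives in an
# external store (Redis here), so the handling instance holds no state
# of its own. Host and key naming are assumptions for illustration.

sessions = redis.Redis(host="sessions.internal", port=6379, decode_responses=True)

SESSION_TTL_SECONDS = 30 * 60  # expire idle sessions after 30 minutes

def handle_request(session_id: str, item: str) -> dict:
    """Add an item to the user's cart; any server instance can run this."""
    raw = sessions.get(f"session:{session_id}")
    state = json.loads(raw) if raw else {"cart": []}

    state["cart"].append(item)

    # Write the state back so the *next* request can land on any instance.
    sessions.setex(f"session:{session_id}", SESSION_TTL_SECONDS, json.dumps(state))
    return state
```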

Graceful degradation 

Graceful degradation has been an important consideration in the design and implementation of large communication networks since the Internet was conceived by ARPA. Failure is not an exception in software; it is the rule. That is why graceful degradation is a key concept. Use the design philosophy of graceful degradation to enable systems to continue their intended operation, possibly at a reduced level, when some part of the system fails. The purpose of graceful degradation is to prevent catastrophic failure. If a service stops meeting its SLA, calling services should taper calls to it and resort to a backup behavior. This will prevent failures from cascading.
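One common way to implement that taper-and-fall-back behavior is a circuit breaker. A minimal sketch follows; the thresholds and the fallback are purely illustrative, not any particular vendor’s implementation.

```python
import time

# Minimal circuit-breaker sketch: after too many consecutive failures,
# stop calling the struggling service for a cooldown period and serve a
# degraded fallback instead. Thresholds are illustrative.

class CircuitBreaker:
    def __init__(self, max_failures=5, cooldown_seconds=30):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # set when the breaker trips

    def call(self, primary, fallback):
        # While the breaker is open, skip the primary entirely.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                return fallback()
            self.opened_at = None  # cooldown over; try the primary again
            self.failures = 0

        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()

# Example: personalized recommendations degrade to a static top-10 list.
# breaker = CircuitBreaker()
# movies = breaker.call(fetch_personalized_recs, lambda: TOP_10_OVERALL)
```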

What else can you do?

Start with a Business Impact Analysis – evaluate the potential effects of an interruption to critical business operations as a result of a disaster, accident or emergency. It will give you a clear picture of the criticality of your business operations and will help you identify the dependencies. Then you can use this information to build an effective strategy that addresses only those areas that need to be recovered and the designated time frame in which to recover them.
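The useful output of a BIA is essentially a ranked list: for each business function, how quickly it must come back (its recovery time objective, RTO) and how much data you can afford to lose (its recovery point objective, RPO). A hypothetical sketch of what that output might look like:

```python
# Hypothetical business impact analysis output: each critical function
# with its recovery time objective (RTO), recovery point objective
# (RPO), and upstream dependencies. All values are illustrative.

business_impact = [
    {"function": "Order checkout",   "rto_hours": 1,  "rpo_minutes": 5,    "depends_on": ["payments API", "orders DB"]},
    {"function": "Customer support", "rto_hours": 4,  "rpo_minutes": 60,   "depends_on": ["ticketing SaaS"]},
    {"function": "Analytics",        "rto_hours": 48, "rpo_minutes": 1440, "depends_on": ["data warehouse"]},
]

# Recover the most time-sensitive functions first.
for entry in sorted(business_impact, key=lambda e: e["rto_hours"]):
    print(f'{entry["function"]}: restore within {entry["rto_hours"]}h, '
          f'lose at most {entry["rpo_minutes"]} min of data')
```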

While most enterprises aren’t ready for the randomness of a system like Chaos Monkey (or the Simian Army), there are some takeaways from the concept that apply to traditional infrastructures.

Testing 

Performing tests of individual system outages is highly important. It is better to get burned in a fire drill than in an actual fire, after all. The purpose of testing is to discover flaws in your disaster recovery (DR) plan so they can be resolved before they impact operations. Regular testing is the only way to guarantee that you can restore customer operations quickly following an outage. DR testing should be done at least annually.
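A drill can be as simple as a scripted check: fail over (manually or with your tooling), then verify that the standby answers within your recovery time objective. A hedged sketch, with the endpoint and RTO made up for illustration:

```python
import time
import requests

# Sketch of one automated DR drill step: after triggering a failover,
# confirm the standby endpoint is serving within the recovery time
# objective. The URL and RTO below are placeholders.

STANDBY_URL = "https://standby.example.com/healthz"
RTO_SECONDS = 15 * 60  # illustrative 15-minute recovery time objective

def verify_failover() -> bool:
    deadline = time.monotonic() + RTO_SECONDS
    while time.monotonic() < deadline:
        try:
            if requests.get(STANDBY_URL, timeout=5).status_code == 200:
                print("Standby is healthy; drill passed.")
                return True
        except requests.RequestException:
            pass  # keep polling until the deadline
        time.sleep(10)
    print("Standby never came up within the RTO; drill failed.")
    return False
```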

If you work with an MSP, you need to make sure that they have proof of your daily backups and that the backups are functioning properly.

Automate Zone Failover and Recovery

Determine a priority order for your most critical applications by conducting a risk assessment and a business impact analysis. Once identified, automate the recovery process to ensure minimum data loss. This would entail using tools and services that allow automatic failover to your standby systems. Automation will not only reduce recovery times but will also reduce the need for human intervention and judgment. Additionally, look for other processes that may fail and can be automatically recovered.
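What “automate the recovery process” looks like depends on your stack; one simple pattern is DNS-based failover. Below is a hedged sketch using Amazon Route 53, with the hosted zone ID, hostname, and standby address all placeholders; managed health checks and failover routing policies are usually a better fit than a hand-rolled script, but the sketch shows the moving parts.

```python
import boto3
import requests

# Crude automated-failover sketch: if the primary fails its health check,
# repoint the service's DNS record at the standby. Zone ID, hostname,
# and IP address are placeholders for illustration.

route53 = boto3.client("route53")

PRIMARY_HEALTH_URL = "https://app.example.com/healthz"
HOSTED_ZONE_ID = "Z0000000000000000000"   # placeholder
RECORD_NAME = "app.example.com."
STANDBY_IP = "203.0.113.50"               # standby in another zone/region

def primary_is_healthy() -> bool:
    try:
        return requests.get(PRIMARY_HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False

def fail_over_to_standby():
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "Automated failover to standby",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "A",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": STANDBY_IP}],
                },
            }],
        },
    )

if not primary_is_healthy():
    fail_over_to_standby()
```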

Multiple Application Servers

Deploying applications over multiple servers will keep applications running efficiently and reduce downtime. Using multiple servers lets you configure fault tolerance. More servers also mean improved scalability.
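In practice this usually means putting a load balancer in front of the pool. The toy sketch below shows the core idea in miniature: rotate requests across the servers and skip any that stop responding; the hostnames are, of course, made up.

```python
import itertools
import requests

# Toy illustration of fault-tolerant routing across several application
# servers: round-robin over the pool, skipping servers whose health
# check fails. A real deployment would use a managed load balancer.

SERVERS = [
    "http://app-1.internal:8080",
    "http://app-2.internal:8080",
    "http://app-3.internal:8080",
]
rotation = itertools.cycle(SERVERS)

def healthy(server: str) -> bool:
    try:
        return requests.get(f"{server}/healthz", timeout=2).status_code == 200
    except requests.RequestException:
        return False

def pick_server() -> str:
    """Return the next healthy server, trying each one at most once."""
    for _ in range(len(SERVERS)):
        server = next(rotation)
        if healthy(server):
            return server
    raise RuntimeError("No healthy application servers available")
```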

Multiple Region Support

Don’t put all your eggs in one basket. Spread out, so that you have the ability to completely vacate a region if a large-scale outage occurs.

As an extension of the same logic, using multi-cloud disaster recovery would enable you to replicate your resources to a second cloud provider in another geographic region. So even when the primary environment is compromised, you can keep your services running.
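A small first step toward multi-region (and, by extension, multi-cloud) resilience is simply keeping a copy of your backups somewhere else. A hedged sketch, assuming your backups live in an S3 bucket and should be mirrored to a bucket in a second region; the bucket names are placeholders, and for ongoing replication S3’s built-in cross-region replication is the usual choice.

```python
import boto3

# Sketch: mirror backup objects from a bucket in the primary region to a
# bucket in a second region. Bucket names are placeholders.

SOURCE_BUCKET = "acme-backups-us-east-1"
DEST_BUCKET = "acme-backups-eu-west-1"

source = boto3.client("s3", region_name="us-east-1")
dest = boto3.client("s3", region_name="eu-west-1")

def mirror_backups():
    paginator = source.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET):
        for obj in page.get("Contents", []):
            # Server-side copy into the destination region's bucket.
            dest.copy_object(
                Bucket=DEST_BUCKET,
                Key=obj["Key"],
                CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            )
            print(f'Copied {obj["Key"]} to {DEST_BUCKET}')

if __name__ == "__main__":
    mirror_backups()
```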

Conclusion

Remember Murphy’s Law? “Anything that can go wrong, will go wrong.”

Disasters don’t happen very often, but unfortunately, they usually happen at an inconvenient time, with little or no warning. It is important to put thought into what to do when (not if) a component fails and plan for outlier events.

I don’t want to be stuck somewhere just because Google Maps won’t work, but I can ask someone for directions – a response plan.

Don’t get lost! Your company needs a disaster response plan.

Disaster Recovery in 2021 can be thought of more as a thought exercise that you do to cover all the possible, if not plausible, disasters. DR is a very specific type of emergency response, so it can be the final chapter of your Incident Response plan instead of a whole new book.

This article was republished with permission from Fractional CISO. https://fractionalciso.com/be-like-netflix-not-reddit-saas-disaster-response/

Want to get great cybersecurity content delivered to your inbox? Sign up for their monthly newsletter, Tales from the Click! https://fractionalciso.com/newsletter/



About the Author: Chinmayee Paunikar

Chinmayee Paunikar is a Cybersecurity Analyst at Fractional CISO. She helps companies develop and manage their cybersecurity programs. Chinmayee has helped multiple companies achieve their SOC 2 certification goals. She also performs vulnerability assessments and quantitative risk assessments for organizations. Chinmayee is a Systems Security Certified Practitioner (SSCP) and Cisco Certified Network Associate (CCNA). Chinmayee received a Master of Science degree in Computer Engineering from New York University and a bachelor’s degree in Electronics Engineering from the University of Mumbai. https://www.linkedin.com/in/chinmayee-paunikar/

 

One Comment

  1. Kevin Dineen, February 17, 2021 at 6:42 pm

    Good modern review for an old DR guy like me. Thanks for sharing Chinmayee.
