Eliminate outages, failures, and downtime, and ensure that there are “no single points of failure”. How easy would it be to achieve this?

I was recently in a conversation with a Chief Information Security Officer (CISO) discussing the organization’s IT / Cyber communications, and the conversation moved to a request from one of their customers for the provision of 100% availability.

At this point I found myself recalling that 100% availability (high availability) should aim to achieve the ultimate goal in technology, “no outages, no failures, and no downtime” across what should be the entire service architecture and infrastructure.

For this to be delivered, there is a need to eliminate outages, failures, and downtime, ensuring that there are no single points of failure. Well, you need to ask yourself how easy it would be to achieve this.

For complex operations, availability has come ever closer to 100% as the capability of technology has improved. Today it is possible to achieve four nines (99.99%) and five nines (99.999%) of uptime, mainly for specific types of datacentres and identified system components; however, these do not form the whole. Even so, four nines of uptime can still result in a whopping 52.6 minutes of potential annual downtime, and though five nines reduces this figure to roughly 5.3 minutes, a lot can still go wrong in that time. Again, we might be able to accept this for a particular element; however, would this be sufficient for your end-to-end service technology, or perhaps just for critical systems and supporting infrastructure?
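The downtime figures quoted above follow from simple arithmetic, and a short sketch makes it easy to test other targets. The function name and the 365.25-day average year are assumptions made for illustration only; a real service level agreement may define its measurement period differently.

```python
# Illustrative sketch: annual downtime permitted by an availability target.
# Assumption: an average year of 365.25 days; the names here are hypothetical.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999, 100.0):
        minutes = annual_downtime_minutes(target)
        print(f"{target}% availability -> {minutes:.1f} minutes per year")
```

Under these assumptions four nines comes out at about 52.6 minutes a year and five nines at about 5.3 minutes, while only a 100% target leaves no allowance at all.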

It has been said by Information Technology (IT) suppliers that 100% availability “removes the burden of administration and maintenance from your IT team” as there are “no outages to recover from, ever”, being always-on, always-fast, and always accessible. And if we take this to the next stage, this could mean that we will not need Network Operations Centres (NOC) and / or Security Operations Centres (SOC), or even IT staff for incident management.

Why Incident Management is Needed

The following outages at mature organizations are examples of why incident management is still needed:

Azure outage due to overheating triggered by a cooling system failure in their UK South datacentre on 15th September 2020;

Verizon Fios service suffered a major outage causing a 12% traffic volume drop on 26th January 2021;

OVHCloud datacentre fire in Strasbourg on 10th March 2021;

Fastly, a cloud computing provider which underpins websites of major organizations such as Amazon, Twitch, Spotify, Stack Overflow, GitHub, gov.uk, Pinterest, Reddit, Shopify, Stripe, PayPal, Vimeo, and news outlets CNN, The Guardian, Financial Times, and The New York Times, was behind the outage on 8th June 2021;

Facebook suffered a major outage that affected its subsidiaries, including WhatsApp, Instagram, and Oculus, causing the largest communications outage in history on 4th October 2021;

Comcast experienced an outage that disrupted different parts of its network at different times, sometimes hours apart, affecting thousands of customers across the USA on 8th November 2021.

The CISO then asked me what might seem to some people to be a simple question for someone with a lifetime of experience within technology, “okay, what does good look like?”

What Does Good Look Like?

I thought that this was a really good question, more so as the CISO’s organization not only had a distributed architecture (across multiple third-party platforms as well as hosting with global cloud service providers, supporting a range of physical and logical components), but customers were now requesting the magic 100% availability, meaning NO downtime!

My reply began with industry best practice, based upon the following organizations:

International Organization for Standardization (ISO);
British Standards Institution (BSI / BS);
National Institute of Standards and Technology (NIST);
American Society for Industrial Security (ASIS) International;
National Emergency Crisis and Disasters Management Authority (NCEMA)

This led me to list several important standards which would demonstrate that standards had been considered, and against which not only the organization but also its service providers should be certified.

Well, we didn’t discuss all of these standards; however, the CISO accepted the need for standards but questioned whether they would be sufficient for the overall service that could be delivered by multiple service providers, with each one delivering specific components within a complex change control technology environment without joint ownership.
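To put a rough number on that concern, the sketch below combines the availability of several providers into a single end-to-end figure. It assumes that every component must be up for the service to work and that failures are independent, which is a simplification; the provider names and percentages are hypothetical and are not taken from the CISO's organization.

```python
import math

# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def end_to_end_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the individual availabilities
    (expressed as fractions between 0 and 1) are simply multiplied.
    """
    return math.prod(component_availabilities)


def annual_downtime_minutes(availability):
    """Minutes of downtime per year implied by a fractional availability."""
    return MINUTES_PER_YEAR * (1.0 - availability)


# Hypothetical providers and figures, for illustration only.
providers = {
    "cloud hosting": 0.9999,
    "network": 0.9999,
    "third-party platform": 0.999,
}

overall = end_to_end_availability(providers.values())
downtime = annual_downtime_minutes(overall)

if __name__ == "__main__":
    print(f"End-to-end availability: {overall:.4%}")
    print(f"Potential annual downtime: {downtime:.0f} minutes")
```

With these made-up figures the chain as a whole is less available than its weakest link, at roughly ten and a half hours of potential downtime a year, which is why certification of each provider in isolation does not answer the end-to-end question.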

This was relevant, as the CISO worked for a fast-moving, agile, and developing organization. With their DevOps teams making significant changes, ensuring that everyone (especially partners and service providers) was on board was a pretty big ask, but not impossible. And with this in mind, I moved our conversation onto the benefits of establishing a “state of readiness”, based on the organization’s planning scenarios.

At this stage it is worth reminding the reader that Information Technology products and services have become more sophisticated and are seen as an indispensable part of our daily life. Therefore, market-leading organizations using these resources and capabilities should really be asking questions of their suppliers, and those suppliers should be using their information technology service management (ITSM) capabilities as their first line of defense.

Now, what else should be on our list? For those who fall under the focus of financial regulators in the UK, Europe, and the USA, there is a range of subject areas now categorized as Operational Resilience.

Examples include the following:

Business Continuity Management (BCM)
Information Security and Cyber Resilience
Information Technology Disaster Recovery (ITDR)
Supply Chain Management (what if the supplier goes bankrupt and / or closes?)
Crisis Management and Communications

Certainly, the path being laid out by financial regulators establishes a stronger regulatory framework to promote the operational resilience of firms and financial market infrastructures (FMIs). This is achieved through a greater focus on third-party suppliers and, given recent technology outages, does seem a natural step forward.

Returning to “what does good look like?” — from my point of view one should also consider:

RACI – Responsible / Accountable / Consulted / Informed;
IT and Cyber Challenges (not least patching vulnerabilities);
Applicable Standards that are aligned with Industry Best Practice.

In summary, hopefully now you can understand the challenge that I had in answering the CISO’s question, and maybe all that I have achieved is to scratch the surface of what should be considered!

It is vital to understand the importance of technology, network and cloud service options, as well as the potential impacts that loss of any of these services would have on operations and important business services.

Your challenge is therefore to identify those threats and vulnerabilities that could cause significant disruptions and physically damaging events, and to ensure that for each one that has been identified the necessary resources and contingency capabilities have been made available.

Republished with permission from the Resilience Association.

Published in IT Availability & Security

What Does Good Look Like, for Information Technology and Cyber?

About the Author: Steve Yates
