What good incident management looks like

Managing a digital service is largely straightforward, until a third-party service causes chaos across large swathes of the internet. Here's how it impacted us.
Date posted: 10 June 2021
Reading time: 3 minutes
Stephen McCalden, Service Delivery Manager

I’ve been dreading a week like this for a long time.

Managing a digital service is largely straightforward: you keep a customer’s solution doing what it’s supposed to do and delivering value to its end users. Incidents will happen from time to time - that’s perfectly normal - but when a third party that your service depends on, and that you have no control over, has an issue, you are at the mercy of their service restoration plan. It’s awful to be impacted by something you have no power to correct.

That’s what happened this week to large swathes of the internet. A content provider had a problem that caused dozens of the world’s most well-known sites to fail, many of which my team and I provide live service support for. When multiple outages occur at the same time, multiple incidents get raised, and many concerned customers all come to you for answers, it becomes the perfect storm you hope never happens in service management.

Thankfully, in this instance the issue was identified and corrected quickly, and services were restored. I’m grateful to the team for applying good, ITIL-aligned incident management best practice to:

Be alert to the problem immediately
Understand the impact
Raise the appropriate incident ticket for tracking purposes
Make a calm and composed diagnosis
Design a prompt solution, including a temporary workaround if possible
Execute safe and efficient release management of the fix
Protect the service integrity throughout
Provide clear, regular, and concise communications at all times

Days like this don’t happen often, but when they do, falling back on robust procedures and following the plan lets you minimise the impact and restore the service as quickly and efficiently as possible.

Here at Kainos we are ISO 20000 certified, and all our Live Operations services follow mature and robust ITIL-aligned service management procedures. For over 30 years we have supported some of the most critical digital solutions across the public, health, and commercial sectors. We know incidents happen and we know how to react to them, so our customers can have peace of mind that their solutions are in safe hands.

About the author

Stephen is a service delivery manager working in Kainos’ Live Operations team. He manages teams of engineers who maintain and support many critical services on behalf of our customers, ensuring they continuously meet the needs of their end users.