Lessons from the CrowdStrike and Eurostar incidents: A study in disaster recovery and business continuity
Over this summer, two major incidents significantly disrupted critical infrastructure systems worldwide. The first was a faulty content update from CrowdStrike that crashed an estimated 8.5 million Windows systems, causing estimated global financial damage of at least US$10 billion. The second was a series of coordinated arson attacks on France’s high-speed rail network, strategically timed just before the Paris 2024 Olympics.
These incidents underscored the critical importance of practices such as post-incident reviews, which can shed light on existing issues and help prioritise them for resolution. The CrowdStrike incident, in particular, spotlighted several lapses in well-established software development practices, including insufficient testing and the use of languages that do not guarantee memory safety.
While we can't change the past, we can extract valuable lessons from it. We can never completely eliminate human error or the mismanagement of established IT practices, but we can take proactive steps to mitigate the impact when incidents do occur, regardless of their origin.
Positive example
From my perspective, the French rail operators demonstrated exemplary business continuity practices during the arson attacks. Their focus on delivering “value” - ensuring passengers reached their destinations - and their swift rerouting of services from the high-speed Eurostar route onto the classic line meant that the damage was contained and fewer people were adversely affected.
Room for improvement
The CrowdStrike issue, while technically simple to rectify, revealed a deeper cultural problem at its core. If the affected companies had had a comprehensive understanding of their dependencies, strategies in place to provision servers from versions predating the faulty release, and swift incident response practices, the issue could potentially have been resolved in minutes. The on-demand nature of cloud computing could have facilitated this. Even better, automated, proactive rollbacks could have made some disruptions virtually unnoticeable.
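As a concrete illustration, below is a minimal sketch of what such an automated rollback could look like. It is not CrowdStrike's mechanism or any specific vendor's API: the health endpoint, version identifiers and thresholds are assumptions made purely for illustration. The pattern it shows - deploy, watch a fleet-level health signal, and revert automatically when that signal degrades - is what can turn hours of manual remediation into minutes.

```python
import json
import time
import urllib.request

# Hypothetical endpoint and thresholds for illustration only; a real update
# pipeline would use the platform's own telemetry and deployment APIs.
HEALTH_URL = "https://example.internal/fleet/health"  # assumed fleet health endpoint
ERROR_RATE_THRESHOLD = 0.05   # roll back if more than 5% of hosts report failures
BAKE_TIME_SECONDS = 300       # how long to observe before trusting the release
POLL_INTERVAL_SECONDS = 30


def fleet_error_rate() -> float:
    """Fetch the share of hosts reporting crashes or boot failures since the update."""
    with urllib.request.urlopen(HEALTH_URL, timeout=10) as response:
        data = json.load(response)
    return data["failed_hosts"] / max(data["total_hosts"], 1)


def deploy(version: str) -> None:
    """Placeholder for pushing a given content/agent version to the fleet."""
    print(f"Deploying {version} ...")


def monitor_and_rollback(new_version: str, last_known_good: str) -> bool:
    """Deploy a new version, watch fleet health, and revert automatically on degradation."""
    deploy(new_version)
    deadline = time.time() + BAKE_TIME_SECONDS
    while time.time() < deadline:
        if fleet_error_rate() > ERROR_RATE_THRESHOLD:
            print("Fleet health degraded - rolling back automatically")
            deploy(last_known_good)
            return False
        time.sleep(POLL_INTERVAL_SECONDS)
    print("Release looks healthy")
    return True


if __name__ == "__main__":
    monitor_and_rollback(new_version="content-2024.07.19",
                         last_known_good="content-2024.07.18")
```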
However, it’s important to acknowledge that human error is inevitable, so having a clear plan for when major incidents occur is vital. This involves not only embracing a culture of continuous improvement and evaluating releases to understand the risks of big-bang versus phased approaches, but also regularly reviewing and practising techniques like chaos engineering. These practices can significantly reduce the damage caused by incidents and ensure that companies are better prepared to handle future disruptions.
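To make the big-bang versus phased contrast concrete, here is a minimal sketch of a ring-based (phased) rollout. The ring names, sizes, bake time and simulated health signal are all assumptions for illustration; the point is simply that each ring limits the blast radius before the next one is exposed, so a bad release is caught while it affects a small slice of the fleet rather than all of it at once.

```python
import random
import time

# Hypothetical rings, from smallest to largest cumulative share of the fleet.
ROLLOUT_RINGS = [
    ("canary", 0.01),  # first 1% of hosts
    ("early",  0.10),  # next 10%
    ("broad",  1.00),  # everyone else
]


def deploy_to_ring(version: str, ring: str, fraction: float) -> None:
    """Placeholder: push the release only to the hosts in this ring."""
    print(f"Deploying {version} to ring '{ring}' (~{fraction:.0%} of fleet)")


def ring_is_healthy(ring: str, bake_seconds: int = 5) -> bool:
    """Placeholder health signal: in reality this would query telemetry for the ring."""
    time.sleep(bake_seconds)        # let the release "bake" before judging it
    return random.random() > 0.05   # simulated 5% chance of a bad release


def phased_rollout(version: str, last_known_good: str) -> None:
    """Release ring by ring, halting and reverting at the first unhealthy ring."""
    for ring, fraction in ROLLOUT_RINGS:
        deploy_to_ring(version, ring, fraction)
        if not ring_is_healthy(ring):
            print(f"Ring '{ring}' unhealthy - reverting to {last_known_good} and stopping")
            deploy_to_ring(last_known_good, ring, fraction)
            return
    print(f"{version} rolled out to the full fleet")


if __name__ == "__main__":
    phased_rollout("sensor-7.12", "sensor-7.11")
```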
It’s crucial to emphasise that we have overcome these problems before. Businesses need to revisit and embed these practices, fostering a culture of continuous improvement and regularly assessing their incident response scenarios. This approach will bolster resilience in the face of future incidents.
At Kainos, we tackle these challenges by fostering a culture that places a high value on individual and psychological safety. This is complemented by our commitment to industry-leading processes and best practices. Our ISO 20000 certification and AWS Managed Service Provider status stand as a testament to the maturity of our cloud best practices, including DevSecOps and FinOps, and we cultivate an environment of innovation and agility within our teams.
In addition, it’s worth noting that our approach is not static. We believe in the power of continuous learning and improvement. We encourage our teams to stay abreast of the latest trends and technologies, ensuring that our practices evolve with the rapidly changing tech landscape. This adaptability not only enhances our service delivery but also ensures that we are prepared to effectively manage and mitigate any future incidents. Ultimately, our goal is to provide services that are resilient, reliable, and in line with the highest industry standards.