Improving zero downtime on Kubernetes
This post covers how we redesigned our deployment and release process on Kubernetes to:
- Improve the speed of our release process
- Allow us to run multiple versions of the application concurrently
Both were limitations of our initial deployment and release design, so I'm sharing the lessons we learnt in the hope they help you avoid the same issues.
Before jumping into the solution, I'll quickly describe our starting point.
Please note: this is a simplified view of the infrastructure; we'll only describe the elements relevant to the zero downtime aspects of the solution.
Configuration: before

1. Ingress controller reloads cause 1 second of dropped traffic
The redirection mechanism, illustrated in red in the diagram above, which switched live traffic between the 'blue' and 'green' namespaces, worked by updating the Ingress resource in the app publishing namespace to point at the cold blue/green redirector Service resource, as shown below.
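As a rough illustration (the hostname, namespace and Service names here are assumptions for the sketch, not our exact resources), the Ingress looked something like this, with the backend Service name edited between the blue and green redirectors on every switch:

```yaml
# Simplified sketch of the original redirecting Ingress.
# Each release edited backend.service.name between the blue and green
# redirector Services, and that edit forced an Nginx Ingress controller
# configuration reload (and ~1 second of dropped traffic).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-publishing
  namespace: app-publishing
spec:
  ingressClassName: nginx
  rules:
    - host: www.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: green-redirector   # switched to blue-redirector on release
                port:
                  number: 80
```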

This meant that, because each Ingress update triggered a controller reload and the resulting second of dropped traffic, performing a zero downtime release required removing Region1 from Traffic Manager, updating it while offline, adding it back into Traffic Manager and then repeating the process for Region2. This led to the next issue with this setup.
2. Unable to support multiple versions
The application needs to be able to route user traffic in two scenarios:
- New users landing on the homepage should be routed to the latest stable version of the application.
- Existing user sessions requesting version specific assets should be routed to the requested version.
In the original design, we had no way of routing based on the requested version and only had one Ingress flow for both traffic scenarios.
This meant that during the window in which Region1 was updated and re-introduced to Traffic Manager, and Region2 was being drained (typically 5 to 10 minutes), existing user sessions on Region2 would have no route to the version-specific assets on Region1, resulting in a 404.
3. Slow release process
The last issue with this setup was the amount of time it took to perform a release. Although the only manual step in the release process, as mentioned above, was removing and re-adding regions from Traffic Manager, waiting for connections to drain before switching to the new version drastically slowed down our releases: redirecting to a new, already tested release could take up to 20 minutes.
The fix
The plan for resolving this was quite simple and can be summarised by three key points:
1. Remove the need for Nginx Ingress controller reloads during redirection.
The root of our redirection issues is the need for Nginx Ingress controller reloads, and the 1 second of downtime this incurs. We therefore need to ensure all Ingress resources remain static, meaning their downstream config cannot change after initial configuration. In doing this, we need to find another way to redirect traffic. This is where label selectors come in.
Label selectors allow us to route traffic to target pods within a single namespace without incurring a connection drop. This was the key to unlocking zero downtime deployments without Traffic Manager manipulation.
To facilitate this, we had to move away from blue/green namespaces to a single production namespace, meaning our immutable unit of release was no longer the entire namespace but the individual Helm release, as sketched below.
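As a minimal sketch of the idea (the production namespace, labels and names are illustrative assumptions), the Service that the Ingress points at never changes; only its label selector moves from one Helm release's pods to the next:

```yaml
# The Ingress always targets this Service; a release goes live by updating
# the 'release' label in the selector to match the pods of the new Helm
# release, so no Ingress resource is ever modified.
apiVersion: v1
kind: Service
metadata:
  name: myapp-stable
  namespace: production
spec:
  selector:
    app: myapp
    release: myapp-1-42-0   # patched to e.g. myapp-1-43-0 to switch versions
  ports:
    - port: 80
      targetPort: 8080
```

Because the Ingress resources sitting above this Service stay static, the switch no longer triggers an Nginx Ingress controller reload.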
2. New default Ingress & service resources
We would introduce a new 'default' Ingress flow to handle new user traffic. The Ingress resources within this flow would remain static and we would instead utilise label selectors for zero downtime redirection.
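A simplified sketch of such a 'default' Ingress is below (names are illustrative); it is created once and never modified, since the actual version switch happens in the label selector of the Service it points to:

```yaml
# Static 'default' Ingress for new user traffic. It always routes to the
# stable Service; which pods sit behind that Service is controlled purely
# by the Service's label selector.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-default
  namespace: production
spec:
  ingressClassName: nginx
  rules:
    - host: www.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-stable
                port:
                  number: 80
```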
3. Restructure Ingress resources
To facilitate the second traffic-routing scenario, we would use URL path-based routing to introduce new version-specific Ingress flows for existing user sessions, as sketched below.
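Roughly sketched (the path scheme and names are assumptions rather than our exact layout), each Helm release would ship its own version-specific Ingress and Service, so existing sessions always have a route to the assets of the version they started on:

```yaml
# Version-specific Ingress installed alongside each Helm release. Requests
# for assets under /1.42.0 are routed to that release's own Service,
# independently of where the default flow currently points.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-1-42-0
  namespace: production
spec:
  ingressClassName: nginx
  rules:
    - host: www.example.com
      http:
        paths:
          - path: /1.42.0
            pathType: Prefix
            backend:
              service:
                name: myapp-1-42-0
                port:
                  number: 80
```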
Configuration: after

