These were two limitations of our initial deployment and release design, and in sharing this blog I hope the lessons learnt will help you avoid the same issues.
Before jumping into the solution, I'll quickly describe our starting point.
Please note: this is a simplified view of the infrastructure; we'll only describe the elements relevant to the zero downtime aspect of the solution.
Configuration: before
This meant that, to perform zero downtime releases, we had to remove Region1 from Traffic Manager, update it while offline, add it back into Traffic Manager, and repeat for Region2. This led to the next issue with this setup.
In the current design, we have no way of routing based on requested version and only have one Ingress flow for both scenarios of traffic.
This meant that while Region1 had been updated and re-introduced to Traffic Manager, and Region2 was still being drained (typically 5 to 10 minutes), existing user sessions on Region2 had no route to the version specific assets on Region1, resulting in a 404.
3. Slow release process
The last issue with this setup was the amount of time it took to perform a release. The only manual step in the release process, as mentioned above, was removing regions from Traffic Manager and re-adding them. However, waiting for connections to drain before switching to the new version drastically slowed down our release process, which meant redirecting to a new, already tested release could take up to 20 minutes.
The fix
The plan for resolving this was quite simple and can be summarised by three key points:
1. Remove the need for Nginx Ingress controller reloads during redirection.
The root of our redirection issues was the need for Nginx Ingress controller reloads, and the 1 second of downtime each reload incurs. We therefore needed to ensure all Ingress resources remain static, meaning their downstream config cannot change after initial configuration. With the Ingress resources fixed, we needed another way to redirect traffic. This is where label selectors come in.
Label selectors allow us to route traffic to target pods within a single namespace without incurring a connection drop. This was the key to unlocking zero downtime deployments without Traffic Manager manipulation.
In order to facilitate this, we had to move away from blue/green namespaces to a single production namespace, meaning our immutable component of release was no longer the entire namespace and was instead the Helm release.
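The label selector switch described above can be sketched as follows. This is a minimal illustration in Python rather than our actual configuration: the Service name `web-default`, the `production` namespace, the `app`/`release` label keys and the pod names are all hypothetical. It models a Service whose selector is repointed from one Helm release to another while the Service object itself (and any Ingress in front of it) stays the same.

```python
import copy
import json

# Hypothetical names throughout: "web-default", "production", and the
# "app"/"release" label keys are illustrative, not taken from the real setup.
DEFAULT_SERVICE = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "web-default", "namespace": "production"},
    "spec": {
        "selector": {"app": "web", "release": "v1"},
        "ports": [{"port": 80, "targetPort": 8080}],
    },
}


def selector_patch(release):
    """Build a merge-patch body that changes only the Service's release label."""
    return {"spec": {"selector": {"release": release}}}


def apply_patch(service, patch):
    """Return a copy of the Service with the patched selector keys merged in."""
    updated = copy.deepcopy(service)
    updated["spec"]["selector"].update(patch["spec"]["selector"])
    return updated


def selects(service, pod_labels):
    """Equality-based selector match: every selector pair must be on the pod."""
    selector = service["spec"]["selector"]
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Pods from two Helm releases living side by side in the same namespace.
pods = {
    "web-v1-abc": {"app": "web", "release": "v1"},
    "web-v2-def": {"app": "web", "release": "v2"},
}

switched = apply_patch(DEFAULT_SERVICE, selector_patch("v2"))
targets_before = sorted(name for name, labels in pods.items()
                        if selects(DEFAULT_SERVICE, labels))
targets_after = sorted(name for name, labels in pods.items()
                       if selects(switched, labels))

print(targets_before, targets_after)
print(json.dumps(selector_patch("v2")))
```

The printed JSON is the kind of body that could be passed to `kubectl patch service web-default -n production -p '...'`. Only the selector changes, so the Ingress resource in front of the Service is never modified and the Nginx Ingress controller has no reason to reload.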
2. New default Ingress & service resources
We would introduce a new 'default' Ingress flow to handle new user traffic. The Ingress resources within this flow would remain static and we would instead utilise label selectors for zero downtime redirection.
3. Restructure Ingress resources
To facilitate the second use case of traffic routing, we would utilise URL path based routing to introduce new version specific Ingress flows for existing user sessions.
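To make the two Ingress flows more concrete, here is a rough Python sketch of how a static default Ingress and per-version Ingress resources could sit side by side. The path convention (`/<version>/...`), the resource names and the version numbers are assumptions made for illustration. The `backend_for` function is a simplified model of prefix matching, not the Nginx Ingress controller's actual routing code.

```python
def make_ingress(name, path, service):
    """Build a networking.k8s.io/v1 Ingress with one prefix path rule."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name, "namespace": "production"},
        "spec": {"rules": [{"http": {"paths": [{
            "path": path,
            "pathType": "Prefix",
            "backend": {"service": {"name": service, "port": {"number": 80}}},
        }]}}]},
    }


def version_ingress(version):
    """Ingress for one release; the naming and path scheme are hypothetical."""
    slug = "web-" + version.replace(".", "-")
    return make_ingress(slug, "/" + version, slug)


# Static default flow for new user traffic (redirected via label selectors).
DEFAULT_INGRESS = make_ingress("web-default", "/", "web-default")


def backend_for(request_path, ingresses):
    """Simplified model of prefix routing: the longest matching path wins."""
    request_parts = [part for part in request_path.split("/") if part]
    best = None
    for ingress in ingresses:
        for rule in ingress["spec"]["rules"]:
            for entry in rule["http"]["paths"]:
                prefix = [part for part in entry["path"].split("/") if part]
                if request_parts[:len(prefix)] == prefix:
                    if best is None or len(prefix) > best[0]:
                        best = (len(prefix), entry["backend"]["service"]["name"])
    return best[1] if best else None


ingresses = [DEFAULT_INGRESS, version_ingress("1.4.0"), version_ingress("1.5.0")]

# An existing session pinned to 1.4.0 still reaches its own assets,
# while unversioned requests fall through to the default flow.
print(backend_for("/1.4.0/assets/app.js", ingresses))
print(backend_for("/login", ingresses))
```

Under these assumptions, each release adds one new Ingress and never edits an existing one, so existing sessions keep a route to their version specific assets for as long as that release remains deployed.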
Configuration: after