Disaster recovery: If my region is down, can I still work in another region?
The benefit of Azure PaaS and Azure DevOps is that you can plan and implement your disaster recovery easily and effectively. We recently had a scenario with a customer that showcased this perfectly!
The scenario
Our customer approached us and asked whether we could implement a disaster recovery scenario for them: not just a failover as a test, but actually running out of a different region for a week. The answer was yes, we could.
Azure PaaS setup:
- Traffic Manager directing traffic between regions
- App Service Environment (ASE)
- 2 App Service plans
- 5 WebApps with 5 APIs
- 1 function sending data to a third-party source
- Managed SQL within a failover group
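The two pieces that make a regional switch painless are the Traffic Manager profile and the SQL failover group. As a rough illustration of how such a setup might be wired together with the Azure CLI (resource names are hypothetical, and this assumes Azure SQL Database; for SQL Managed Instance the equivalent is `az sql instance-failover-group`):

```bash
RG=rg-prod   # hypothetical resource group

# Traffic Manager profile with priority routing: UK West primary, UK South secondary
az network traffic-manager profile create \
  --name tm-prod --resource-group $RG \
  --routing-method Priority --unique-dns-name contoso-prod

az network traffic-manager endpoint create \
  --profile-name tm-prod --resource-group $RG \
  --name ukwest --type azureEndpoints --priority 1 \
  --target-resource-id <uk-west-entry-point-resource-id>

az network traffic-manager endpoint create \
  --profile-name tm-prod --resource-group $RG \
  --name uksouth --type azureEndpoints --priority 2 \
  --target-resource-id <uk-south-entry-point-resource-id>

# The failover group pairs the two SQL servers behind one listener endpoint,
# so connection strings do not change when the primary moves region
az sql failover-group create \
  --name fg-prod --resource-group $RG \
  --server sql-ukwest --partner-server sql-uksouth \
  --add-db appdb
```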

Azure DevOps setup:
- All infrastructure written as ARM templates and updated through code rather than manually in the Azure portal
- A single ARM template for each app service, with parameter files for Prod UK West and UK South
- A pipeline for resource creation in UK South
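Because each app service has one template with a parameter file per region, deploying into the second region is simply a case of pointing the same template at the UK South parameters. A minimal sketch of what a single pipeline step boils down to, with hypothetical file and resource group names:

```bash
# Same template, region selected purely by the parameter file (names are illustrative)
az deployment group create \
  --resource-group rg-prod-uksouth \
  --template-file webapp.json \
  --parameters @webapp.parameters.uksouth.json
```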

The failover
The customer requested a controlled disaster recovery, which also meant downtime had to be kept to a minimum. The failover was scheduled for a Friday at 4:30 pm, so a few days earlier we began deploying the infrastructure to UK South.
- We started by deploying the vNet and network security groups
- The ASE followed, using the same static IP as the UK West ASE. This meant the one web app behind a VPN would not need to be updated with a new IP address
- Next, we began deploying the application gateway
- Once the ASE was deployed, the two App Service plans were created
- Then came the WebApps, APIs and the function
- Lastly, the latest developer code in main was deployed to the app services
At this stage, we had deployed the full infrastructure to our second region in 2 hours 13 minutes.
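Under the hood, that ordering is just a series of ARM deployments run in dependency order. The sketch below shows the idea with hypothetical template names; in practice this ran as pipeline stages, some steps ran in parallel, and the ASE build accounted for most of the elapsed time.

```bash
# Deploy templates into the secondary region in dependency order (names are illustrative)
RG=rg-prod-uksouth
for template in vnet nsg ase appgateway asp webapps apis function; do
  az deployment group create \
    --resource-group $RG \
    --template-file "$template.json" \
    --parameters "@$template.parameters.uksouth.json"
done
```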
Failover commences - Friday, 4:30 pm
The customer gave the green light to commence the failover:
- We stopped the app services in UK West so that users saw an application-down message
- In Traffic Manager, we disabled the UK West endpoint and enabled UK South
- This routed customer traffic to the secondary region
- We then initiated database failover to the secondary region
- Thanks to the failover group, the infrastructure did not need to be updated with the secondary SQL server name
This process took 15 minutes, from stopping the apps to restoring customer access to the site and allowing business to continue.
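The same three switches can equally be driven from the command line. A hedged sketch of what those steps look like in Azure CLI terms, again with hypothetical resource names and assuming Azure SQL Database failover groups:

```bash
RG=rg-prod

# 1. Stop the UK West app services so users see the application-down message
#    (repeated for each web app and API)
az webapp stop --name app-ukwest --resource-group $RG

# 2. Swap the active Traffic Manager endpoint from UK West to UK South
az network traffic-manager endpoint update \
  --profile-name tm-prod --resource-group $RG \
  --name ukwest --type azureEndpoints --endpoint-status Disabled
az network traffic-manager endpoint update \
  --profile-name tm-prod --resource-group $RG \
  --name uksouth --type azureEndpoints --endpoint-status Enabled

# 3. Promote the UK South SQL server; the failover-group listener name is
#    unchanged, so the applications keep their existing connection strings
az sql failover-group set-primary \
  --name fg-prod --resource-group $RG --server sql-uksouth
```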
Over the following week, the customer ran successfully out of the UK South site with business continuing as normal. The purpose of the exercise, however, was to prove it could be done in the event of a real incident. As smooth as it looked on paper and in practice, the exercise highlighted two errors, both of which have since been fixed. One of them, linked to the application still using local SQL accounts, has prompted us to move to managed identity and improve the system further.
The following Friday we shut down the apps in UK South, failed Traffic Manager and SQL back, and started the UK West apps. Within 15 minutes the customer was back on the production websites and business continued.
This was a successful disaster recovery exercise that gave us, and more importantly the customer, confidence in the system built by Kainos. The next test is already scheduled for six months' time.
If you would like to learn more about LiveOps and our cloud and engineering services, click here.