If you didn’t notice yesterday (July 18, 2024), the Azure US Central region had a pretty major outage in which a good portion of the services in that region were unavailable for a number of customers. The outage was bad enough that in the communications Microsoft sent out about the incident, they recommended activating your disaster recovery plans. Needless to say, this was a pretty bad outage, and various companies were affected because they were hosted in US Central and only US Central.
Some of DCAC’s services, like this website, were impacted by this outage. In our case, the Azure Database for MySQL and the App Services hosting our main website, www.dcac.com, stayed online. However, our Redis cache, which the website uses, went offline, breaking our website. For exactly this situation, we keep a second copy of our website running in North Europe with its own Redis cache. That website and its Redis cache were unaffected by the outage.
While this was happening, we found that our Traffic Manager global load balancer was still sending website requests to the US Central copy of the website. To fix that quickly, we reconfigured Traffic Manager and removed the US Central copy of the website from its configuration. This brought our website back online and quickly resolved any issues people would have had accessing it. While the Azure US Central outage lasted 8+ hours, our website was down in the US for less than 20 minutes.
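If you ever need to make the same quick fix, the change can be scripted as well as clicked through in the portal. Here’s a minimal sketch using the Azure Python SDK (azure-identity and azure-mgmt-trafficmanager); the subscription, resource group, profile, and endpoint names are placeholders, not our actual configuration.

```python
# Sketch: taking a failed Traffic Manager endpoint out of rotation.
# All names below are hypothetical examples, not DCAC's real resources.
from azure.identity import DefaultAzureCredential
from azure.mgmt.trafficmanager import TrafficManagerManagementClient
from azure.mgmt.trafficmanager.models import Endpoint

client = TrafficManagerManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

# Mark the US Central endpoint as Disabled so Traffic Manager stops handing
# out its address; the remaining endpoint(s) keep serving traffic.
client.endpoints.update(
    resource_group_name="web-rg",
    profile_name="www-profile",
    endpoint_type="azureEndpoints",
    endpoint_name="us-central-site",
    parameters=Endpoint(endpoint_status="Disabled"),
)
```

Disabling the endpoint rather than deleting it also means you can simply flip it back to Enabled once the region recovers.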
While the Azure outage happened, we found a configuration problem with our Traffic Manager profile. We had Traffic Manager configured for Geographic routing, which lets you map each endpoint of the Traffic Manager to specific geographic regions; for instance, the US goes to our US site, and Europe goes to our site in Europe. However, Geographic routing doesn’t fail over when a region goes down, so while our Europe site was up and viewable for people in Europe, anyone from the US got an error message on our site. We then configured Traffic Manager for Performance routing, which sends users to the closest healthy endpoint and fails over automatically if the website behind an endpoint goes offline. While we had the correct technology for our needs (Traffic Manager as opposed to Azure Front Door), one setting was configured incorrectly.
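For reference, here’s roughly what a Performance-routed profile with two App Service endpoints looks like when built with the same Python SDK; again, the profile name, DNS prefix, and resource IDs are illustrative placeholders, not our production setup.

```python
# Sketch: a Traffic Manager profile using Performance routing, so a healthy
# endpoint is picked automatically when another goes offline.
from azure.identity import DefaultAzureCredential
from azure.mgmt.trafficmanager import TrafficManagerManagementClient
from azure.mgmt.trafficmanager.models import (
    Profile, DnsConfig, MonitorConfig, Endpoint,
)

client = TrafficManagerManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.profiles.create_or_update(
    resource_group_name="web-rg",
    profile_name="www-profile",
    parameters=Profile(
        location="global",                     # Traffic Manager is a global resource
        traffic_routing_method="Performance",  # was "Geographic" before the change
        dns_config=DnsConfig(relative_name="www-example", ttl=60),
        monitor_config=MonitorConfig(protocol="HTTPS", port=443, path="/"),
        endpoints=[
            Endpoint(
                name="us-central-site",
                type="Microsoft.Network/trafficManagerProfiles/azureEndpoints",
                target_resource_id="/subscriptions/<sub>/resourceGroups/web-rg/"
                                   "providers/Microsoft.Web/sites/www-uscentral",
            ),
            Endpoint(
                name="north-europe-site",
                type="Microsoft.Network/trafficManagerProfiles/azureEndpoints",
                target_resource_id="/subscriptions/<sub>/resourceGroups/web-rg/"
                                   "providers/Microsoft.Web/sites/www-northeurope",
            ),
        ],
    ),
)
```

The same switch can be made in the portal by changing the profile’s routing method; the point is that the endpoints stay the same, only the routing behavior changes.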
What we can take away from a Microsoft Azure region becoming mostly unavailable for several hours is that things will happen. We need to plan for them via our Disaster Recovery plans, having systems and processes in place that can respond automatically when these sorts of disasters occur. In our case, almost everything went according to plan, with one easy-to-fix hiccup.
When building your Disaster Recovery plan, it’s essential to have a solid understanding of every piece of the application you are creating the recovery plan for. As IT professionals, we are all constantly learning and growing in our field. Anyone who says they have nothing left to learn in IT simply doesn’t realize how much they don’t know.
In our case, we understood what the two routing options in Traffic Manager did, but we had never tested failover while using Geographic routing. When we configured Traffic Manager for our environment, the documentation didn’t mention that Geographic routing doesn’t fail over when an endpoint goes down (it does now). We’ve now verified that the routing fails over as expected (granted, because of an actual disaster rather than a planned test).
If you’d like help creating your Disaster Recovery plan, contact us. We can help you build your plan and your disaster recovery systems so that you are ready for the next outage. While this was an Azure-specific outage, every other major cloud vendor has had outages as well (AWS’s US East region is notorious for them), and on-premises data centers have outages too (Joey shares a story about an on-premises outage in his article about the CrowdStrike outage). No matter where you host your IT systems, a Disaster Recovery plan and Disaster Recovery systems are crucial for getting things back up and running as fast as possible so that the outage doesn’t impact your customers.
Denny