I was reading this fascinating post about how Azure handled the increased demand brought about by the pandemic. 100% growth in demand created quite a challenge, and Microsoft had to respond across a range of teams to deliver capacity to Azure customers.
Here is an example of the kind of action taken:
Microsoft turned off the little “your co-worker is typing” and read receipt notifications for Teams users in peak-demand regions, reducing the CPU capacity required to process those functions by 30% and returning that capacity to Azure customers.
That got me thinking about architecture “Flexpoints”. Don’t worry, Flexpoints are not a DevOps concept; it’s just a term that I’m borrowing from math to describe the current phenomenon of having to drastically scale capacity up and down, which the COVID-19 pandemic of 2020 brought about.
We’ve seen demand drop dramatically for some services, like reservation systems for hotels, planes, concerts and restaurants. We’ve seen others torque up torrentially (video, collaboration, health care).
This reminded me of a session I attended about ten years ago on Facebook scaling. The use case then was Facebook introducing a major new service (personal URLs, I believe). The service would launch at a certain hour of the day and people could claim their own URL: a mad scramble that could overwhelm the service. One solution was to show a screenshot of the page when the sign-up dialog came up, so the user would see the page, but it was not the live page behind the dialog. The user experience was the same, but all the underlying microservices that the page would have called were freed up.
My takeaway is that today we need to architect, build and understand the flexpoints in our applications to deal with these sudden spikes and drops in capacity. By flexpoint, I mean the features or functions that can be turned off gracefully and quietly to free up resources when capacity is constrained. And I’m defining capacity here as both resource and financial: whether it’s Azure demand doubling overnight (resources) or demand dropping 90% overnight (financial).
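To make the idea concrete, here is a minimal sketch of how flexpoints might be declared in code. Everything in it is an assumption for illustration: the `Flexpoint` type, the feature names and the utilization thresholds are hypothetical and not taken from Microsoft’s or Facebook’s actual systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flexpoint:
    """A feature that can be switched off quietly when capacity is tight."""
    name: str
    shed_above: float  # utilization (0.0-1.0) above which the feature is turned off


# Hypothetical features and thresholds, for illustration only.
FLEXPOINTS = [
    Flexpoint("typing_indicator", shed_above=0.70),
    Flexpoint("read_receipts", shed_above=0.80),
    Flexpoint("live_page_behind_dialog", shed_above=0.90),
]


def enabled_features(utilization: float, flexpoints=FLEXPOINTS) -> list[str]:
    """Return the names of the features that stay on at the given utilization."""
    return [fp.name for fp in flexpoints if utilization <= fp.shed_above]


if __name__ == "__main__":
    for load in (0.50, 0.75, 0.95):
        print(f"{load:.0%} utilization -> {enabled_features(load)}")
```

The point of listing flexpoints in one place is that the decision about what to shed, and in what order, is made calmly in advance rather than in the middle of a spike.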
Here’s another example, this time financial. In my role, we provide an advanced cloud resource discovery service that is built entirely on a serverless architecture. Not a single VM is used; rather, a variety of PaaS, database and security services are consumed. It’s fast and discovers thousands of accounts to maintain cloud inventory for our organization.
As the estate has grown and the security requirements have increased, I saw costs continue to increase as well. In discussions with the app architects, we asked how to reduce costs without impacting service. Because the discovery was scheduled to run every 15 minutes, launching thousands of lambda functions and containers, the obvious flexpoint was to change the discovery schedule.
The team moved to run discovery every 30 minutes. The result? Costs dropped by 36%. Changing discovery to every 120 minutes would probably halve that cost again. That’s an example of a financial flexpoint.
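A rough way to sanity-check that estimate is to assume the bill is a fixed part plus a part that scales linearly with the number of discovery runs. That linear model is my assumption, not something the billing data confirms, and the function names below are made up for this sketch.

```python
def variable_share(saving: float, old_interval: float, new_interval: float) -> float:
    """Fraction of the old bill that scales with run count.

    Assumes cost = fixed + per_run * runs, so the observed saving equals
    variable_share * (1 - new_runs / old_runs).
    """
    run_ratio = old_interval / new_interval  # new_runs / old_runs
    return saving / (1 - run_ratio)


def relative_cost(interval: float, baseline_interval: float, var_share: float) -> float:
    """Bill at `interval`, as a fraction of the bill at `baseline_interval`."""
    run_ratio = baseline_interval / interval
    return (1 - var_share) + var_share * run_ratio


if __name__ == "__main__":
    # Observed: moving from 15 to 30 minutes cut costs by 36%.
    share = variable_share(saving=0.36, old_interval=15, new_interval=30)
    at_30 = relative_cost(30, baseline_interval=15, var_share=share)
    at_120 = relative_cost(120, baseline_interval=15, var_share=share)
    print(f"variable share of the 15-minute bill: {share:.0%}")
    print(f"30-minute bill vs 15-minute bill:  {at_30:.0%}")
    print(f"120-minute bill vs 15-minute bill: {at_120:.0%}")
    print(f"120-minute bill vs 30-minute bill: {at_120 / at_30:.0%}")
```

Under this model the 120-minute schedule comes out at about 37% of the original bill, roughly 42% below the 30-minute figure, so “halve it again” is in the right neighborhood. The fixed part of the bill is what stops the saving from being larger.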
We will continue to see volatility in our markets and our society, in a time when 100-year floods seem to happen every year. We should understand our architectural flexpoints so that we can respond to these black swan events.
Enjoy
PS: The actual Microsoft blog post with the fuller but drier details.