Microsoft Azure’s recent outage was caused by an embarrassingly trivial problem: expired SSL certificates. A lot of commentators are rolling their eyes. I saw one snarky comment that the “blue screen of death” is now available in the cloud as well as on your desktop.
I agree that this was a regrettable and embarrassing problem, but I draw a different conclusion. Amazon’s 2011 outage was caused by a human-initiated maintenance action that looked innocuous; I’m confident that many other cloud problems are traceable to human fallibility as well. If the best-organized IaaS offerings in the world are vulnerable to this sort of human error, then we haven’t yet learned the truth at the foundation of cloud computing:
Manual work that should be automated is a recipe for disaster.
Notice that I said “manual work that should be automated.” Plenty of manual work doesn’t fit the bill. I want as much skill and ingenuity as I can get from my doctor, my software designer, my teacher, my pilot, the tech support person fixing my internet problem, and the person on the other side of the dining table.
But the key claim of cloud computing is that management of commodity hardware and software no longer falls into the should-be-manual category. Smart human judgment should be encapsulated in a policy that expresses tradeoffs to optimize business value, and then we should let computers do the never-ending grunt work.
Computers can remember when SSL certificates need to be renewed much better than any human. Forget that, and all the fancy features of your “cloud” decay into an amorphous, chaotic fog that pleases nobody. In fact, I’d go so far as to say that policy-driven automation–not virtualization–is the true aim of cloud. (Virtualization is a means to the automation end, not an end unto itself.)
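To make the certificate point concrete, here is a minimal sketch of what "encapsulating judgment in a policy" might look like. Everything in it is hypothetical: the `Certificate` and `RenewalPolicy` types, the `certificates_due` function, and the 30-day window are illustrative assumptions, not anything Azure or any particular product actually uses. The human decides once how much lead time is prudent; the computer applies that decision on every run.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Certificate:
    """Hypothetical inventory record for one certificate."""
    name: str
    not_after: datetime  # moment the certificate expires


@dataclass(frozen=True)
class RenewalPolicy:
    """The human judgment, written down once: how early to renew."""
    renew_before: timedelta = timedelta(days=30)  # assumed lead time


def certificates_due(certs, policy, now):
    """Return certificates that expire within the policy window, soonest first."""
    deadline = now + policy.renew_before
    due = [c for c in certs if c.not_after <= deadline]
    return sorted(due, key=lambda c: c.not_after)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    inventory = [
        Certificate("storage-frontend", now + timedelta(days=5)),
        Certificate("management-api", now + timedelta(days=200)),
    ]
    for cert in certificates_due(inventory, RenewalPolicy(), now):
        print(f"renew {cert.name} before {cert.not_after:%Y-%m-%d}")
```

Run on a schedule against a real inventory, a check like this would flag a certificate weeks before it lapses, with nobody needing to remember a date.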
If you’re considering how to put the power of cloud computing–whether public or private–to work for your organization, then you should give serious thought to how robust the policy features of that solution are.
- Can you use policy to control and monitor what happens on “Patch Tuesday”?
- Can you decide in advance how hypervisor resources get committed and released as system workload ebbs and flows?
- Can you guarantee SLAs by policy, so the consumers on the other side of IaaS get predictable bang-for-the-buck out of an investment in hardware and software?
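As a rough illustration of the second question, deciding "in advance" how capacity is committed and released can be as simple as writing the thresholds down and letting a function apply them. The sketch below is an assumption-laden toy: `CapacityPolicy`, `desired_hosts`, and every threshold value are hypothetical names and numbers chosen for illustration, not the behavior of any real hypervisor manager or of Adaptive Computing's products.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityPolicy:
    """Hypothetical policy: thresholds a human chooses once, in advance."""
    scale_up_above: float = 0.80    # commit a host when utilization exceeds this
    scale_down_below: float = 0.30  # release a host when utilization falls below this
    min_hosts: int = 2              # never drop below this floor
    max_hosts: int = 10             # never exceed this budget ceiling


def desired_hosts(current_hosts, utilization, policy):
    """Apply the policy to one utilization sample and return the target host count."""
    if utilization > policy.scale_up_above:
        target = current_hosts + 1
    elif utilization < policy.scale_down_below:
        target = current_hosts - 1
    else:
        target = current_hosts
    # Clamp to the floor and ceiling the policy sets.
    return max(policy.min_hosts, min(policy.max_hosts, target))


if __name__ == "__main__":
    policy = CapacityPolicy()
    for load in (0.92, 0.55, 0.12):
        print(f"utilization {load:.0%}: target {desired_hosts(4, load, policy)} hosts")
```

The business tradeoff (cost ceiling versus headroom) lives in the policy values; the repetitive decision of when to add or remove a host is left to the machine.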
If you’d like to learn more about policy-based optimization, browse some of Adaptive Computing’s cloud resources. We are in the business of building the policy intelligence that will power industry’s cloud technology for decades to come, and we’d love to talk.