In 2008, I migrated the entire infrastructure of a small startup into Amazon’s EC2 cloud. It was a big job, and I was delighted when I finally sent an email to the sysadmin at the previous colo datacenter, announcing that he could power down our old servers.
We were going to save a ton of money.
Fast-forward a few months. I opened my monthly billing summary from Amazon, and saw surprise debits to the tune of several hundred dollars in the history.
What on earth?
A little poking around in Amazon’s control panel revealed the ugly truth: I’d forgotten to shut down a server that I no longer needed. I was probably distracted by a phone call between powering up its replacement and the moment I intended to power down the obsolete machine.
Result: a zombie that should have been dead, but lurked around and caused trouble and expense.
Theoretically, I could have made the same mistake when my servers were colocated. I think the reason I didn’t is that tangible servers are harder to turn on, and shutting them off has more implications. You have a chunk of iron sitting around, gathering dust; the need to plan its demise is a little more in-your-face.
Unfortunately, I didn’t learn my lesson about zombie clouds. A year or two later, I did a similar project, and once again had an unpleasant experience when bills arrived.
We talk about how easy it is to turn stuff on in the cloud. Infinite resources, ultimate flexibility… We don’t often discuss the flip side—allocation is so transparent that we can easily forget to turn stuff off.
We have a zombie problem.
I’m not the only one noticing this. IBMer Ethann Castell wrote about the problem of undead services just last week. He advocates training, auditing of cloud expense reports, and lots of follow-up to change human habits.
Adaptive Computing’s own Chad Harrington used the zombie metaphor in a different way recently (he was focused on VM sprawl), but I think his prescription is equally wise for the undead service problem: we need intelligent policy.
What I finally did—after my second big mistake—was implement a system where services never had infinite lifespans. Instead, on creation, a cloud-based service had to declare an initial lifetime, and 90 days was its maximum allowable value. The requester also had to identify one or more people who were authorized to extend the service’s lifetime. When the magic expiration day neared, the system would send an email notifying stakeholders that the grim reaper was about to harvest their machine; all they had to do to extend its lifetime was reply to the email.
These emails went out two weeks before, then three days before, then the day of expiration. Four days after the expiration (to account for holidays and long weekends), the system paused the machine that was scheduled for destruction. If nobody complained, then a few days later, true deletion took place.
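The schedule described above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the original system: the names `Lease`, `new_lease`, `extend`, and `action_for` are hypothetical, the three-day gap between pause and deletion is a guess (the text says only "a few days later"), and the rule that an extension restarts the clock from today is also an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from datetime import date, timedelta

MAX_LIFETIME_DAYS = 90         # maximum declared lifetime, from the post
REMINDER_DAYS_LEFT = (14, 3, 0)  # two weeks before, three days before, day of
PAUSE_AFTER_DAYS = 4           # pause four days past expiration, from the post
DELETE_AFTER_PAUSE_DAYS = 3    # ASSUMPTION: the post says only "a few days later"


@dataclass(frozen=True)
class Lease:
    """Hypothetical record of a service's declared lifetime."""
    service: str
    owners: list[str]   # people authorized to extend the lifetime
    expires: date


def _check_lifetime(lifetime_days: int) -> None:
    if not 1 <= lifetime_days <= MAX_LIFETIME_DAYS:
        raise ValueError(f"lifetime must be 1..{MAX_LIFETIME_DAYS} days")


def new_lease(service: str, owners: list[str], created: date,
              lifetime_days: int) -> Lease:
    """Create a lease; every service needs an owner and a bounded lifetime."""
    if not owners:
        raise ValueError("at least one owner must be named")
    _check_lifetime(lifetime_days)
    return Lease(service, list(owners), created + timedelta(days=lifetime_days))


def extend(lease: Lease, today: date, lifetime_days: int) -> Lease:
    """ASSUMPTION: an extension restarts the clock from today, same cap."""
    _check_lifetime(lifetime_days)
    return replace(lease, expires=today + timedelta(days=lifetime_days))


def action_for(lease: Lease, today: date) -> str:
    """Return what a once-a-day job should do for this lease today."""
    days_left = (lease.expires - today).days
    if days_left in REMINDER_DAYS_LEFT:
        return "remind"
    overdue = -days_left
    if overdue >= PAUSE_AFTER_DAYS + DELETE_AFTER_PAUSE_DAYS:
        return "delete"
    if overdue >= PAUSE_AFTER_DAYS:
        return "pause"
    return "none"
```

A daily job would call `action_for` on every lease and act on the result. Because reminders fire only on exact day counts, a skipped run means a skipped email; a sturdier version would record which notices have already been sent.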
(Tangent: I see a link between this system and the concept of pain receptors in code. Interesting that a problem with zombies is that they don’t sense pain…)
Mileage in other production deployments will probably vary. This particular policy might be quite different from what another cloud consumer needs. However, I’m convinced that the general principle—using policy and automation to manage details—is the best way to avoid problems with zombie services.
Adaptive Computing is in the business of intelligent, policy-driven cloud management. Given the ever-lurking threat of zombie apocalypse, it seems like a good business to be in.