How to Cloudify Your Software, Part 4: Check Those Vitals

[Author’s note: This is one of a series of posts about how developers can design and code differently to take advantage of the cloud. Check out the rest of the series: Part 1: Ender’s Homework, Part 2: Get Out Your CB, and Part 3: “Do you want fries with that?” Please subscribe, follow me on Twitter or Google+, and check back weekly to explore the topic in full.]

When software is sick

Earlier this week, I stumbled on an interesting discussion on the OpenStack mailing lists. A user wanted to know why he had observed a significant degradation in the performance of OpenStack’s Swift Object Store. He provided a few statistics and some troubleshooting output, and developers began to weigh in. Last time I checked, the leading theory was a suboptimal usage pattern by the caller of Swift’s APIs. Basically, a consumer of a web service was innocently working the system in a way it had not been designed to handle efficiently.

The most troubling aspect of this incident, in my opinion, is that the first person to notice a problem was an end user.

Software shouldn’t work this way. Especially cloudified software.

Imagine sitting at a table in a restaurant, and having a waitress approach you to take your order. As she introduces herself and whips out her notepad, you notice that her skin is covered with oozing sores that look unhygienic. She gives a little cough, and her hand is trembling. Yet she behaves as if all is normal.

“Do you feel okay?” you ask. “You look… under the weather.”

“I feel the same no matter how sick or healthy I am,” she says matter-of-factly. “Health is a non-issue for me. Shall we continue?”

Health is a big issue

Such a waitress has bigger problems than just losing a customer when she has an off day. Because she doesn’t really have any way to tell the difference between sickness and health, she can’t take preventive action to protect others. Her job performance is seriously compromised. I certainly wouldn’t want to order from her.

Surprisingly, most software is just as naive as our fictional waitress. It expends little or no effort on monitoring its own vital signs or the conditions of the environment, and it has no plan to react if things go wrong, other than to log an error and hope it gets the user’s attention.

Checking vitals regularly is a best practice for all kinds of health professionals. Photo credit: U.S. Pacific Fleet (Flickr)

When software is hosted in a cloud, the need to assess the health of the environment and the current running process is more important than ever, because:

  • Sysadmins in the host data center are unlikely to do this for you. Their focus is on pooled infrastructure. If the pooled infrastructure goes down, history demonstrates that they’ll have bigger problems than your tiny corner of their universe. They may not notify you for hours.
  • One of the value propositions you’re aiming for is flexibility. You can’t be flexible unless you perceive a need to adjust to changing circumstances.
  • Your route to the application may be much different from the route of users on the other side of the world. (Contrast the “old days,” when IT reached servers through the same network pipes as its users, and so noticed problems as soon as users did.) This means that everything might look great to you, but not to them. And to make matters worse…
  • Your users may not have an easy way to get your attention.

What does healthy look like?

If you want to cloudify your software, I recommend that you spend some time pondering this question. If your software depends on connectivity to a database, for example, what is a reasonable level of service to expect from that connection? If a suite of standard queries normally runs in 100 milliseconds, and one day it suddenly starts taking 3 seconds instead, perhaps you need to consider this a symptom of illness. If you normally have 100 GB of free disk space, and one day you notice that you only have 5 GB left, perhaps this is cause for concern as well.
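As a rough illustration of the two symptoms above, here is a minimal Python sketch that turns "what does healthy look like?" into explicit thresholds. The names (Vital, check_latency, check_free_disk) and the default limits (1 second, 20 GB) are placeholders I made up for this example; your own baselines should come from measuring your application, not from these numbers.

```python
import shutil
import time
from dataclasses import dataclass


@dataclass
class Vital:
    """One measurement plus a verdict (hypothetical type for this sketch)."""
    name: str
    value: float
    healthy: bool


def check_latency(run_queries, max_seconds=1.0):
    """Time a caller-supplied callable and flag it if it exceeds max_seconds."""
    start = time.monotonic()
    run_queries()
    elapsed = time.monotonic() - start
    return Vital("query_latency_seconds", elapsed, elapsed <= max_seconds)


def check_free_disk(path="/", min_free_gb=20.0):
    """Flag the volume holding `path` if free space drops below min_free_gb."""
    # Decimal gigabytes (1 GB = 10**9 bytes); the threshold is an assumption.
    free_gb = shutil.disk_usage(path).free / 1e9
    return Vital("free_disk_gb", free_gb, free_gb >= min_free_gb)
```

In practice, run_queries would be whatever standard suite of queries you already trust as a baseline, so a jump from 100 milliseconds to 3 seconds shows up as an unhealthy Vital instead of a user complaint.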

Consider running health checks on background threads at an interval that’s practical for your application. Perhaps once an hour is often enough to guard against disk overflow, but RAM overload needs to be monitored minute-to-minute.
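One way to run a check on a background thread at its own interval might look like the sketch below. PeriodicCheck is a hypothetical helper written for this post, not part of any library; it assumes each check is a callable that returns something truthy when the patient is healthy, and it treats an exception as a failed check.

```python
import threading


class PeriodicCheck:
    """Run `check` every `interval_seconds` on a daemon thread until stopped."""

    def __init__(self, name, check, interval_seconds):
        self.name = name
        self.check = check
        self.interval_seconds = interval_seconds
        self.last_result = None  # None until the first check has completed
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            try:
                self.last_result = bool(self.check())
            except Exception:
                # A check that blows up is itself a symptom.
                self.last_result = False
            # Sleep for the interval, but wake immediately if stop() is called.
            self._stop.wait(self.interval_seconds)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

With a helper like this, a disk check could be started with an interval of 3600 seconds and a memory check with 60, each on its own thread, and the rest of the application would read last_result to decide whether to react.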

Answer this question conscientiously, and make sure the software itself, plus all the people in your software’s value chain, can diagnose an unhealthy patient.

Techniques to stay healthy

I have previously blogged about building pain receptors into software, and about using the circuit breaker pattern to react to temporary problems in a way that doesn’t make matters worse, and that automatically reverts back to normal when a system self-corrects.
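For readers who have not seen the circuit breaker pattern, here is a deliberately small Python sketch of the idea. It is not the code from my earlier post; the class names, the failure limit, and the cool-down period are all assumptions chosen for illustration. After a run of consecutive failures the breaker refuses calls for a while, then lets one trial call through and reverts to normal if that call succeeds.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is refused because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock  # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                # Still cooling down: do not pile more load on a sick system.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on too many failures, or re-open if the trial call failed.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success: the dependency has recovered, so revert to normal.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each call to a flaky dependency in breaker.call(...) gives the software a way to notice that something is sick and to stop making matters worse, without a human having to flip a switch when the system self-corrects.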

So get out your sphygmomanometer and start measuring the vitals that are the best indicators of your application’s health. Don’t leave health monitoring to a customer; they might just lose their appetite.

This is part 4 of a 7-part series. Click the links below to view the rest:
How to Cloudify Your Software, Part 1: Ender’s Homework
How to Cloudify Your Software, Part 2: Get Out Your CB Radio
How to Cloudify Your Software, Part 3: “Do you want fries with that?”
How to Cloudify Your Software, Part 4: Check Those Vitals
How to Cloudify Your Software, Part 5: Auto Install
How to Cloudify Your Software, Part 6: Re-imagine Your Data
How to Cloudify Your Software, Part 7: Raise the Bar
