My First Software-Defined Supercompute Cluster

This week at Supercomputing 13, I created a demo of some cool new technology I’m working on, which should drastically improve throughput and scale for customers that have many short-lived jobs. (More on this in the near future.)

In order to demo effectively, I needed an HPC cluster—but shipping one to the exhibit hall in Denver was a bit impractical, so I turned to the cloud to solve my problem. This solution is becoming more and more common as Amazon introduces instances with specially tuned compute and network configurations to match the needs of HPC customers. Similar work has been widely reported in the recent past; see these links from The Register, insideHPC, High performance computing at Nomad-Labs, and Ars Technica.

If you’re like me, you have heard these reports in the press and wondered just how tough and how expensive it might be to do it yourself. When Cycle Computing fired up a petaflop cluster on EC2, how much effort was involved? Did it give an IT staff, somewhere, collective fits? Could this feat be duplicated by mere mortals?

Well, now I know.

The answer is “yes.”

It was amazingly easy.

Supercomputing hardware is cool, as this photo of the ALMA correlator in Chile demonstrates. But it’s hard to beat the amazing flexibility of software-based clusters in the cloud. Photo credit: ESO (Wikimedia Commons)

I enlisted the help of a devops expert to pick an AMI that could be easily puppet-ized and managed with standard IT tools. That was a 20-minute phone call. Then I selected Amazon’s c1.medium instance type and let the devops guru build a .deb that would fully configure the node as it came up. He went so far as to provide me with a dozen lines of bash script that I could paste into the instance’s user data at startup; this would convert generic AMIs into instances with all my HPC apps pre-installed. He made it all look simple (thanks, Jonathan!). I was then able to fire up a test instance, experiment for an hour, and see that I was ready. When I right-clicked my sample instance in the AWS console and chose “more instances like this one,” I found myself in a wizard where I could request 5 new machines, or 50, or 5000.
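For the curious, the startup script amounted to something like the sketch below. It isn’t Jonathan’s actual script (the repository URL and package name are placeholders I’ve invented for illustration), but it shows the general shape: a few lines of user data that point a stock AMI at a package repository and install a single .deb whose dependencies pull in everything else.

```bash
#!/bin/bash
# Hypothetical EC2 user-data sketch; runs as root on first boot.
# The repository URL and package name are illustrative placeholders.
set -e

# Point apt at the repository that hosts the demo's .deb
# (a real setup would also import the repository's signing key).
echo "deb http://repo.example.com/apt stable main" \
    > /etc/apt/sources.list.d/hpc-demo.list

# Install the one package that configures the node: its dependencies
# and postinst scripts bring in all of the HPC applications.
apt-get update
apt-get install -y hpc-demo-node
```

The console wizard is the easy path to fan-out, but the same request can be scripted with the EC2 command-line tools if you’d rather not click through a GUI.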

The only part of the process that wasn’t slick was connecting the nodes into a cluster once they’d been turned on. I experimented with Amazon’s VPC feature, but couldn’t make it work after 45 minutes of debugging; instances were not seeing one another the way I needed. I could probably have overcome this with another hour of careful study and experimentation, but I gave up and simply accepted the default networking instead. That was good enough for my purposes. I also had to do a bit of grunt work (scriptable) to build up a list of host names so the scheduler could send instructions, receive reports, and attribute work correctly.
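That grunt work boiled down to something like the following sketch. These aren’t my exact commands; the sketch assumes the AWS command-line tools are configured and that the cluster instances carry a Name tag of hpc-demo, a label I’m making up here for illustration.

```bash
#!/bin/bash
# Hypothetical sketch: gather the private hostnames of every running
# instance tagged "hpc-demo" into a hosts file for the scheduler.
# The tag value and output file are illustrative placeholders.

aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=hpc-demo" \
              "Name=instance-state-name,Values=running" \
    --query "Reservations[].Instances[].PrivateDnsName" \
    --output text | tr '\t' '\n' > hosts.txt

# hosts.txt now holds one internal hostname per line, which the
# scheduler can use to send instructions, receive reports, and
# attribute work to the right node.
```

From inside EC2, those private DNS names resolve to the instances’ internal addresses, which is what node-to-node traffic wants.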

After this experience, I believe even more strongly than I did before that the HPC/cloud convergence is the wave of the future. When you want a cluster, you’ll be able to plop down your credit card and rent a custom-sized, custom-built one for pocket change. In the world of enterprise IT, the pattern of defining everything in software has huge momentum—you hear folks talk about “software-defined networking,” “software-defined storage,” and “software-defined datacenters.” Will “software-defined supercomputing” be the next big thing?