High Throughput Computing Toolset

A couple of years ago I visited Arlington National Cemetery. It was a humbling experience to see so many graves of those who so bravely fought for and defended the freedoms that enabled a once oppressed people to found and build a great nation. I was impressed with the precision and respect on display at the Tomb of the Unknown Soldier. I heard a 21-gun salute off in the distance, part of a funeral taking place in another part of the cemetery. It was an experience I shared with a great number of other visitors.

Before visiting the cemetery, you have to pass through the front gate. Stewards there direct traffic to the various sections of parking and coordinate the arrival of the many buses of students visiting the cemetery. They need to direct large loads of people to the bus parking area and individual cars to the parking garage. They may even have to turn away visitors when the parking is full, or direct them to other parking areas. Thankfully, those buses full of children can be treated as a group. Imagine if the stewards had to check in every incoming visitor individually, and instead of granting access to a busload of students all at once, they had to unload the bus and check in each student one by one. By the time they got everyone through the gate, the last in line would find the cemetery closing in just a few minutes!

High throughput jobs are like those buses of school children: HPC schedulers handle them best when they are scheduled as a group rather than individually. We’ve been doing a lot of testing lately, consistently throwing over 100,000 jobs at Moab to ensure that it can handle that kind of scale. But scheduling that many small jobs one at a time creates a lot of unnecessary work, slows down scheduling for the entire workload, and lowers the overall utilization of the cluster. This is a problem every scheduler has to deal with. The solution is to group jobs together and execute them as a batch. You can do this yourself by writing a script that runs a bunch of jobs on one node and submitting that batch script to the scheduler. That’s a good start, but isn’t HPC really all about spreading the workload to get things done faster?
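As a rough illustration of that do-it-yourself approach, a batch wrapper along the lines of the sketch below (my own example, not part of Moab or Nitro; the file name tasks.txt and the worker count are made up) could run a whole file of task commands inside a single scheduler job on one node:

    #!/usr/bin/env python3
    """Batch wrapper sketch: run many small tasks inside ONE scheduler job
    instead of submitting each task to the scheduler individually."""
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # One shell command per line, e.g. "./calc_pi.py"
    with open("tasks.txt") as f:
        commands = [line.strip() for line in f if line.strip()]

    def run(cmd):
        # Launch the task in a shell and report its exit status
        return cmd, subprocess.run(cmd, shell=True).returncode

    # Size the pool to match the cores the scheduler gave us on this node (16 here as an example)
    with ThreadPoolExecutor(max_workers=16) as pool:
        for cmd, rc in pool.map(run, commands):
            print(f"exit {rc}: {cmd}")

You would then submit this one wrapper script as a single job, so the scheduler only has to place one job instead of thousands.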

That’s where Adaptive Computing’s Nitro comes in. Nitro is a tool that groups jobs together and executes them on a group of nodes. You can schedule a Nitro job through your job scheduler, lined up like the buses at the cemetery, or you can run Nitro standalone on a set of computers that are not attached to a cluster. Nitro handles parceling the jobs out to the various nodes and gives you feedback on how the batch is progressing. It will even give you performance information on each job, and if there are any problems you can get detailed information on what went wrong with that particular task.

I ran a test on my development machine using a program that calculates pi to a random number of digits between 2,000 and 10,000. I used three VM instances (nitro01-nitro03) and two desktop machines (perf01 and pv-dev-dt, the latter coordinating the jobs in addition to running jobs itself). The individual tasks took about 0.43 seconds on average. I threw in a job that would run long and put a time limit on it so that it would time out, showing Nitro’s ability to enforce user-configured walltime limits. I also put in a task with an error in the task definition format; that’s the “Invalid” entry in the job status screenshot below.
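The pi task itself is simple. As a rough stand-in (an illustrative sketch using Python’s mpmath library, not the exact program used in the test), each task could be a small script, call it calc_pi.py, along these lines:

    #!/usr/bin/env python3
    """Stand-in for the test workload: compute pi to a random number of
    decimal digits between 2,000 and 10,000."""
    import random
    from mpmath import mp

    digits = random.randint(2000, 10000)
    mp.dps = digits              # working precision in decimal digits
    pi_str = str(mp.pi)          # evaluate pi at that precision
    print(f"pi to {digits} digits ends in ...{pi_str[-10:]}")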

[Screenshot: Nitro job status summary]

You can also check the individual tasks to see when and where they ran, how long they took, and how much memory they consumed, as shown below. If a program exited with an error, you would be able to see its output as well.

[Screenshot: Nitro task status detail]

The Nitro job file consists of qsub-like commands. For each task you can specify a wall clock limit, a job name, and the command to be run.
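Purely as a hypothetical illustration of that idea (the option names below are qsub-style placeholders of my own, not verified Nitro syntax; see the Nitro documentation for the real format), a couple of task lines using the calc_pi.py stand-in from above might look something like:

    # One task per line: qsub-style options, then the command to run (illustrative syntax only)
    -N pi_task_1  -l walltime=00:00:30  ./calc_pi.py
    -N pi_task_2  -l walltime=00:00:30  ./calc_pi.py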

Whether you’ve got a busload of jobs to unload on your cluster, or you just need to get a big batch done overnight on all of the computers in your building, Nitro can help keep the traffic flowing!
