SC’14 Torque BoF Summary

Here it is February 2015 already, and we are still talking about Supercomputing 2014. But we promised to post our presentation and summarize what happened at this year's Torque Birds of a Feather (BoF). This year our BoF was up against some stiff competition: the "HPC Systems Engineering, Suffering and Administration" BoF was taking place at the same time right next door, and many people were torn between the two events. Hopefully this post will be useful to those who wanted to attend the Torque BoF but could not because of conflicts and other priorities.

Our topic this year was "How TORQUE is Changing to Meet the New Demands for HPC." The last year has seen great progress in the stability and reliability of Torque, along with marked improvements in performance and some important new features. The stability, reliability, and performance improvements are a direct result of what we call the Ascent project at Adaptive Computing. The Ascent project has been going on for a couple of years and will continue to be part of the engineering effort at Adaptive Computing. Its purpose is to improve reliability, performance, and user experience, and to identify and fix anything that keeps Torque from being the best resource manager possible.

This year we highlighted changes made through Ascent in Torque 5.0.0. The one most people will notice is a change we call bulkheading. Bulkheads are compartments in the hull of a ship that prevent a leak in one part of the boat from sinking the entire ship: the bulkheads stop the water from spreading, containing the leak. For Torque, we created new thread pools for pbs_server. Previously, there was only one pool of threads to handle user requests, communicate with MOMs, and run background tasks. When pbs_server became busy it would run out of threads and become completely unresponsive. The ship was sunk. Now there are separate thread pools for pbs_server-to-MOM communication, background tasks, and user requests. On top of that, if 95% of the user threads are in use, pbs_server replies to regular users that it is busy. This ensures that Torque administrators can always get a response from pbs_server when trying to diagnose a problem.
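The bulkheading idea can be illustrated with a short sketch. This is not Torque's actual implementation (pbs_server is written in C, and the class, names, and pool sizes below are invented for illustration); it just shows the two mechanisms described above: separate pools per request class, and a 95% busy threshold on the user pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BulkheadedServer:
    """Illustrative sketch of bulkheaded thread pools (not Torque's real code)."""

    def __init__(self, user_threads=20, mom_threads=10, background_threads=5):
        # Separate pools: exhausting one pool cannot starve the others,
        # just as a flooded compartment cannot sink the whole ship.
        self.user_pool = ThreadPoolExecutor(max_workers=user_threads)
        self.mom_pool = ThreadPoolExecutor(max_workers=mom_threads)
        self.background_pool = ThreadPoolExecutor(max_workers=background_threads)
        self._capacity = user_threads
        self._busy = 0
        self._lock = threading.Lock()

    def submit_user_request(self, fn, *args):
        # Refuse new work once 95% of user threads are in use, so the
        # caller gets a "busy" reply instead of silence.
        with self._lock:
            if self._busy >= 0.95 * self._capacity:
                return "pbs_server is busy"
            self._busy += 1
        future = self.user_pool.submit(fn, *args)
        future.add_done_callback(self._release)
        return future

    def _release(self, _future):
        with self._lock:
            self._busy -= 1
```

With a four-thread user pool, the fifth concurrent request gets the busy reply rather than hanging, which is the property that keeps the server diagnosable under load.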

We added one major new feature to Torque 5.0.0: energy management. Administrators are now able to set the power consumption levels of individual nodes to best meet the power demands of the data center. Users are also able to request different CPU clock rates for an individual job. For example, if a user has an I/O-intensive job that makes little use of the CPU, they can request a lower clock rate for the CPU resources they are using, reducing the energy the job consumes. Conversely, a user with a CPU-intensive job can request a higher clock rate so the job completes more quickly.
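As a sketch of what a per-job clock-rate request looks like, a user could pass a `cpuclock` resource request at submission time. The frequency value and job script name here are illustrative, and the accepted value formats vary by version and hardware, so check the Torque 5.0 documentation for your site.

```shell
# I/O-bound job: request a lower CPU clock rate to save energy
# (value in MHz is illustrative; exact accepted formats vary by version)
qsub -l cpuclock=1800 io_heavy_job.sh

# CPU-bound job: request a higher clock rate to finish sooner
qsub -l cpuclock=2600 compute_heavy_job.sh
```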

Torque 5.1.0 had not been released by Supercomputing, but we did introduce its main new feature, which we call elastic computing: the ability to dynamically add and delete nodes from the cluster as needed based on system demands. Torque has always been able to dynamically add and delete nodes, but the application interface lacked many of the features needed to adequately handle continual change in the number of compute nodes. Torque 5.1.0 now works with Moab 8.1.0 to automate the process of increasing or decreasing the number of nodes in a cluster based on demand.
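For reference, the manual form of adding and removing nodes has long been available through `qmgr`; elastic computing automates this kind of change rather than leaving it to an administrator. The node name below is hypothetical.

```shell
# Manually add a compute node to a running pbs_server
# (node099 is a hypothetical hostname)
qmgr -c "create node node099"

# Later, remove it when it is no longer needed
qmgr -c "delete node node099"
```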

For Adaptive Computing, the most valuable part of the Birds of a Feather is the feedback we get from the Torque community, and this year was no different. We had a 30-minute question-and-answer period and then opened the floor for discussion. As in past Torque BoFs, the ideas generated in the room will be showing up in Torque's feature set in the not-so-distant future.
