Reflecting on the SC14 TORQUE BoF

The first time I ever visited New Orleans was for Supercomputing 2010. There were many good things about that visit, but the cab drivers left such a bad taste in my mouth that I have to admit I let it overshadow most of the good memories. For Supercomputing 2014 I prepared myself for those New Orleans cabbies. If one of them told me he accepted Mastercard when I got in the cab, only to tell me the machine was broken when he dropped me off, I was going to tell him too bad and walk off. Well, something happened over the last four years. Not only did the taxis accept my Mastercard, but the drivers were much more cordial as well. It has gone a long way toward improving my impression of the Big Easy and helping me remember the good things that happened this year.

Jazz at SC14, New Orleans

SC14 was the sixth TORQUE Birds of a Feather in a row I have hosted, and in some ways my improved experience with the cab drivers in New Orleans parallels the improved user experience with TORQUE over the same time period. When I started at Adaptive Computing in 2009, TORQUE 2.x had been the workhorse resource manager for Moab and Maui for several years. But we knew it was not adequate for the ever-increasing demands for performance, scalability and overall user experience in a growing HPC data center. Our early attempts to improve performance, reliability and the user experience were not always successful, which made for some lively BoF discussions the last few years. But this year the discussion was eerily quiet.

At the start of the BoF we reviewed what we have done over the last year to improve TORQUE. These improvements are either in TORQUE 5.0.1, which is already released, or will be in TORQUE 5.1.0, which will be out soon. The areas of improvement are testing, new features and Ascent work.

The testing process at Adaptive Computing has matured immensely in just a couple of years. The ability to replicate user environments and keep adding to an ever-growing suite of regression tests has been key to getting a stable TORQUE out to our users. Without getting into details, testing has added a net of 5,421 new lines of tests in 320 commits to GitHub this last year. If you throw out weekends and holidays, that is more than a commit a day.

TORQUE 5.0.0 had one new feature added: Energy Management. Part of the significance of this feature is that it was requested at Supercomputing 2013 in Denver, and we were able to have it available for use only one year later.

Administrators are now able to select energy states for nodes, and users can choose energy states for an individual job. Administrators can choose among five power states on a node using pbsnodes -m. The power states are Running (the highest), Standby, Suspend, Hibernate and Shutdown (equivalent to running shutdown now).
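For example, to drop an idle node into a low-power state and later bring it back, an administrator might run something like the following (the node name node01 is hypothetical, and the exact option syntax should be checked against the pbsnodes documentation for your TORQUE version):

pbsnodes -m standby node01
pbsnodes -m running node01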

Users can manage the energy use of their individual jobs with the new cpuclock resource, passed to qsub when submitting a job. For example:

qsub -l cpuclock=1800,nodes=2 script.sh

The cpuclock resource option allows users to request the speed of the CPU using either a number (as in the example above), a Linux power governor policy name or a p-state. The details of each of these methods can be found in the Requesting Resources section of the TORQUE documentation.
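As a sketch of the other two forms, the same submission might look like this (the governor and p-state values below are illustrative; which values are valid depends on the node's hardware and kernel):

qsub -l cpuclock=ondemand,nodes=2 script.sh
qsub -l cpuclock=p3,nodes=2 script.sh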

TORQUE 5.1.0 incorporates a much-improved ability to dynamically add or delete nodes. This allows Moab to request new nodes on the fly as demand increases or to free nodes when demand decreases. It also gives users the ability to request a set of nodes for a fixed amount of time and then have those nodes automatically returned when the requested time expires.
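From the command line, nodes can be added to and removed from pbs_server through qmgr; a minimal sketch, assuming a hypothetical node named node05:

qmgr -c "create node node05"
qmgr -c "delete node node05"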

Starting with TORQUE 4.2.0, Adaptive Computing embarked on the Ascent project to improve reliability, performance and responsiveness, and to fix any issue that keeps TORQUE from being its best. The impact of this ongoing project has been keenly felt with TORQUE 5.0.0. The work done by the Ascent team has made it so pbs_server no longer hangs when things get busy, and it has reduced processing times for key functions, improving performance.

We wrapped up the BoF session by talking about things we want to do in the future and then asking the community about their concerns. Surprisingly, the discussion was relatively quiet. TORQUE 5.0 seems to be working pretty well. We even saw some nice comments on the TORQUE mailing list from people running Moab 8.0/TORQUE 5.0 who said it was the best TORQUE ever. So my guess is that, with fewer TORQUE problems in the data center, the folks in attendance were thinking they might be able to catch the rest of “HPC Systems Engineering, Suffering and Administration” next door, or they were ready to get on with the annual TORQUE dinner. The dinner, by the way, was the best-attended TORQUE dinner we have ever had, and of course it was New Orleans, so the food was excellent. I will be remembering that part of SC14.

