TORQUE BoF 2014: Past, Present and Future

This entry is part 3 of 7 in the series #HPCMatters

Supercomputing 2014 is almost here. And for the fifth time in the last six years, that means it is also time for the Supercomputing TORQUE Birds of a Feather (BoF). Aside from the stress of preparing for the meeting, I really love this event. My first TORQUE BoF was at Supercomputing 2009 in Portland, Oregon. This was my first chance to meet face to face with the many people I had been corresponding with over the TORQUE users mailing list. To be honest, I was pretty nervous. I had been working on the TORQUE project for not quite a year, and during that time I made more than my share of mistakes. And the TORQUE community was not shy about pointing out those mistakes either. But after meeting several people from the community, I learned that the criticism was meant to be constructive and to help make TORQUE a better product to use.

That first BoF in 2009 set the tone for TORQUE development. Many of the advancements and new features of TORQUE are a result of what has been discussed in the Supercomputing BoFs. Some of these advancements have been in TORQUE so long that they are now taken for granted. At the 2010 BoF we were able to talk about the addition of Cygwin support thanks to contributions from Cyfronet. We also added enhanced job arrays (thank you Glen Beane), job logging, MUNGE authentication and an XML-based serverdb. In that same BoF we announced the addition of GPU support and the intention to add NUMA support for the SGI Ultra-Violet platform. Most of these ideas came out of the 2009 BoF. In the 2010 TORQUE BoF we also discussed plans to make TORQUE multi-threaded. It was the beginning of making TORQUE more scalable.

We did not have a BoF at Supercomputing 2011 in Seattle. We were busy working on TORQUE 4.0, which was not much to write home about. But in 2012 in Salt Lake City the community came together again and helped out as we worked to stabilize TORQUE 4.0. This was a critical release for TORQUE. TORQUE 2.5.x had been a workhorse, but its architecture limited its use in larger systems. At the 2012 BoF, users came and shared their stories and problems. There were some successes, which gave everyone hope that we were doing the right thing. Best of all, the community did not give up on us, and we were able to use the 2012 BoF as a springboard to get things working better.

At the Supercomputing 2012 TORQUE BoF, Michael Jennings from Berkeley Lab was able to introduce the Warewulf Node Health Check (NHC) Project. The people with the Warewulf NHC project noticed that even though many sites created their own node health scripts, most places reused scripts and hacked them up to work with their systems, often with less than optimal results. The Warewulf developers have provided a rich set of tools that bring reliability, speed, flexibility, extensibility and reusability to the node health check script. Warewulf NHC scripts are commonplace in TORQUE HPC installations today.

At Supercomputing 2013 in Denver we were able to have a more upbeat BoF because of the advances made for TORQUE 4.x in the previous year. TORQUE was moved to C++ because it offered better type checking, and C++ classes helped alleviate the deadlock problems that were so common in TORQUE 4.0. We moved the TORQUE code to GitHub, which improved transparency and made it easier for the community to add fixes and share problems. Because TORQUE is an open source project, Coverity added it to its long list of open source projects that it scans with static code analysis. When the code was first scanned, it revealed well over 1 error per thousand lines of code. Today TORQUE has only 0.17 errors per thousand lines of code, well below the average for an open source project of its size. At Adaptive Computing we have improved our testing process, which has also led to better releases of TORQUE.

Today TORQUE 4.2.8 has become the workhorse for TORQUE, showing great stability and improvements in scalability. To our pleasant surprise, TORQUE 5.0 has already been deployed successfully in many places. TORQUE 5.0 adds even more improvements that we will talk about at this year's TORQUE BoF. We will also probably mention a new open source project which looks like it could make TORQUE administration easier as well. It is a GUI for TORQUE named TORQUEView, developed by David Marsh. But we know all of this won’t be enough for the ever-changing environment of HPC and Big Data. So we hope you will join us this year in New Orleans on Tuesday, November 18, 2014 at 5:30 p.m. in room 293 for our 2014 TORQUE Birds of a Feather. This year's session is titled “How TORQUE is Changing to Meet the New Demands for HPC”. We will be presenting some of our ideas for how to meet the new demands, but it will take the community interjecting its ideas into the conversation to make sure we get it right.
