What Doesn’t Need to be Said When Moving Forward

Looming over the skiing village of Kleine Scheidegg on the northern edge of the Swiss Alps is an iconic mountain of brittle limestone, covered with snow and ice and blasted by frequent, ferocious storms. The Eiger is a legendary climbing destination that has seen triumph and tragedy, heroism and heartbreak, daring feats of skill and just plain bad luck. The first ascent of the Eiger was made along its western flank by British mountaineer Charles Barrington in 1858. In the Victorian era, the only point was to arrive at the summit, not to conquer all of the challenges that the mountain presented. In August of 1935, the first attempt to climb the north face was made by the German climbing team of Karl Mehringer and Max Sedlmeyer. They began at two o’clock in the morning and made great progress, climbing around 50% of the mile-high face that first day. Two more days followed with marginal gains, then two days of storms set in. They were seen climbing on the fifth day, again making marginal gains, but then more storms rolled in and they were trapped. The first successful attempt came three years later by a team of four climbers. They finally conquered the forbidding north face in three days.

Climbing techniques and equipment have changed dramatically since the early twentieth century. In 2008, after training for a year specifically to beat his previous record, Swiss climber Ueli Steck began his ascent not in August, but in February. He carried a single rope with him, a small daypack, and two specially designed ice axes. Instead of leaving at two o’clock in the morning, he set off at nine and reached the summit before lunchtime, two hours and 47 minutes later.

The history of HPC scheduling systems seems to be taking a parallel path to the climbing history of the Eiger. Some schedulers take the easy route to get a job done, but don’t take advantage of the full capabilities of the cluster. In the past, Moab has found ways to conquer the most difficult scheduling problems while using as much of the cluster as possible. Today, like Ueli Steck, we are conquering the challenges of scheduling much faster than our predecessors.

The Adaptive Computing Ascent team has been busy creating two new methods to achieve the speed increases in the next version of Moab:

  1. Reduce unnecessary communication between TORQUE and Moab
  2. Move resource manager updates to background threads

First, consider a job queue with 100,000 jobs on a cluster that has 2000 nodes. The cluster can only handle a fraction of the total jobs at one time, so a lot of jobs (probably more than 90,000) are just sitting in the queue. Most users submit jobs to TORQUE with qsub (which is currently faster than submitting them to Moab directly). At the beginning of each scheduling iteration, Moab polls the resource managers to get the current state of the cluster, including the status of running and waiting jobs. With 100,000 jobs in the queue, that’s a lot of information that has to be transmitted. What’s tragic about this situation is that none of the jobs waiting in the queue have changed since the last time Moab asked about them. Our TORQUE architect, David Beer, came up with the brilliant idea to reduce all of this unnecessary communication: if nothing has changed with a job within a specified time period, then just send the bare essentials to let Moab know that the job is still there.

TORQUE has implemented a new configuration option: “job_full_report_time”. The default time is 45 seconds but can be changed using qmgr. TORQUE notes the time of state changes on a job. During the “Full Report” window, TORQUE will report the full job information to Moab. After the window has passed, the condensed format will be reported. After a large job queue is submitted, Moab will have a couple of polls to TORQUE that will take the usual amount of time, but after that tens of thousands of jobs will be able to report the condensed information. To get the most out of this change, make sure that Moab’s RMPOLLINTERVAL setting is less than or equal to TORQUE’s “job_full_report_time” window.
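The reporting rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not TORQUE's actual code: the `Job` class, the `report` and `poll_interval_ok` functions, and the choice of "id plus state" as the condensed format are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Assumed value: mirrors the 45-second default described in this post.
FULL_REPORT_WINDOW = 45


@dataclass
class Job:
    job_id: str
    state: str
    last_change: float               # time of the last state change, in seconds
    details: dict = field(default_factory=dict)


def report(job, now, window=FULL_REPORT_WINDOW):
    """Return the full record inside the window, a condensed one after it."""
    if now - job.last_change <= window:
        # Recently changed: send everything.
        return {"id": job.job_id, "state": job.state, **job.details}
    # Unchanged for a while: send only the (hypothetical) bare essentials.
    return {"id": job.job_id, "state": job.state}


def poll_interval_ok(poll_interval, window=FULL_REPORT_WINDOW):
    """True when every change is still inside the window at the next poll."""
    return poll_interval <= window
```

The second helper captures why the poll interval matters: if Moab polled less often than the window length, a job could change and fall back to the condensed format before Moab ever saw its full report.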

You can also use this feature when manually checking TORQUE’s status by using the “-C” option with qstat.

Secondly, we changed Moab’s polling algorithm to poll all of the resource managers (if you have more than one) at the same time. We also do this on a background thread so we can continue processing client commands while the polling is in progress. This makes Moab more responsive to users during the beginning of the scheduling iteration. Unlike the first change, this technique is used on all resource managers, not just TORQUE.
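As a rough picture of what "poll everything at once, in the background" looks like, here is a minimal Python sketch. It is not Moab's implementation: `fake_poll`, `serve_while_polling`, and the resource manager names and delays are hypothetical stand-ins, with `time.sleep` playing the part of a slow network poll.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_poll(name, delay):
    """Hypothetical stand-in for one resource manager poll."""
    time.sleep(delay)                      # pretend this is network latency
    return f"{name}: cluster state"


def serve_while_polling(resource_managers, handle_client_command):
    """Poll every resource manager concurrently while serving clients."""
    handled = 0
    workers = max(1, len(resource_managers))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Start all polls at the same time on background threads.
        futures = [pool.submit(fake_poll, name, delay)
                   for name, delay in resource_managers.items()]
        # Keep answering client commands until every poll has finished.
        while not all(f.done() for f in futures):
            handle_client_command()
            handled += 1
        results = [f.result() for f in futures]
    return results, handled


if __name__ == "__main__":
    rms = {"torque": 0.05, "other-rm": 0.03}   # made-up names and delays
    results, handled = serve_while_polling(rms, lambda: time.sleep(0.005))
    print(results, "client commands handled:", handled)
```

Because the polls overlap, the wait is roughly as long as the slowest resource manager rather than the sum of all of them, and the main thread is never blocked on any single poll.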

To test the changes, we set up a system with Moab and TORQUE on separate servers and used 1500 nodes. We put 100,000 small jobs in the queue, and whenever the total number of jobs in the queue dropped below 90,000 we submitted another 15,000 jobs. We measured the scheduling iteration times and found that, with the new changes, the maximum scheduling iteration was 3.5 times faster than with Moab 8.0/TORQUE 5.0, and the average iteration was 2.3 times faster.
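The refill rule used in that test is simple enough to write down. The sketch below only models the queue depth; the `simulate_queue` function and the number of jobs completed per iteration are assumptions chosen for illustration, not measurements from our test system.

```python
def simulate_queue(iterations, completed_per_iteration,
                   start=100_000, threshold=90_000, batch=15_000):
    """Track queue depth under the 'top up below 90,000' rule."""
    depth = start
    history = []
    for _ in range(iterations):
        # Some jobs finish during the iteration (hypothetical rate).
        depth = max(0, depth - completed_per_iteration)
        # Whenever the queue falls below the threshold, submit another batch.
        if depth < threshold:
            depth += batch
        history.append(depth)
    return history


if __name__ == "__main__":
    print(simulate_queue(iterations=3, completed_per_iteration=6_000))
```

The point of the rule is that the scheduler never gets an easy iteration: the queue stays large for the whole run.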

Sometimes it’s better to leave some things unsaid, and as Ueli Steck said of his record attempt, you just have to keep moving forward!
