SC13 Panel: Programming and Managing Systems at Scale

I had the distinct honor of moderating a great panel at Supercomputing 2013 called Programming and Managing Systems at Scale. I was a little nervous, being by far the most junior person present; the intellectual caliber of the panel was remarkable, even by Supercomputing standards. The panel was full of heavy hitters in the HPC industry, on both the system management side and the vendor side. The panel included:

  • Don Maxwell, HPC Team Leader: Oak Ridge National Laboratory
  • Bill Kramer, Deputy Director of Blue Waters: NCSA
  • Larry Kaplan, Chief Software Architect: Cray
  • Jim Cownie, Principal Engineer & ACM Distinguished Engineer: Intel
  • Michael Jackson, Founder and President: Adaptive Computing
  • Moe Jette, Founder and CTO: SchedMD

The most difficult part, in the end, was getting everyone together to discuss topics and strategy. It ultimately worked out: we met in Adaptive Computing's conference rooms at SC13 to finalize our topics. We had an interesting mix on the panel, and one topic I had wanted to discuss, the problem of measuring a supercomputer's performance by its LINPACK results, would have drawn special attention from one panelist in particular (Bill Kramer). It had already been addressed elsewhere at the conference, however, so we set it aside. Even so, we had some great topics that spanned both the software and the hardware side of the equation.

In addressing complex networks and network topologies, Adaptive and Cray may have had a special "advantage" on the panel, given that they are working together to develop topology-based scheduling for Cray's 3D torus topology. That scheduling work was tested on the two large systems represented on the panel: Oak Ridge's Titan and NCSA's Blue Waters. Adaptive ran a live demo, visualized on our All-Spark Cube, showing the results of the topology-based scheduling. During the panel, multiple panelists noted that it will become more important to measure a supercomputer by its output, the work actually done, than by floating-point operations per second (FLOPS).

The topic of power management drew quite a few questions from the audience. Jim Cownie noted that Intel has made advances in managing power consumption at the processor level, but that the greatest gains won't come from the processor, given its relatively low power consumption compared to other components, such as memory. Power is an ever-increasing concern in Europe, so it wasn't surprising to get questions from our international audience, and there appears to be no great answer to this very difficult question, even when managing it at the software level. Michael Jackson addressed the question from the perspective of the job scheduler, in our case Moab. He explained that an advanced scheduler can schedule with many metrics in mind, perhaps even power consumption, but that every additional switch or configuration option enabled increases the complexity of the system, and with it the risk to reliability and performance. His answer was that automating idle states, so machines take advantage of low-power modes or are turned off entirely during periods of rest, is what Moab can do best to help with the power consumption issue. Moe Jette was unable to attend the first hour of the panel, but when asked a similar question, he noted development work on plugins for SLURM by a customer site that should improve the power management capabilities of the SLURM resource manager.
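As a concrete illustration of the idle-state automation Michael described, and the kind of capability the SLURM work aims at, here is a minimal sketch using SLURM's built-in power-saving parameters in slurm.conf. The parameters are documented SLURM options, but the script paths and node names here are hypothetical placeholders, not a recommended site configuration:

    # slurm.conf power-saving sketch (hypothetical paths and node names)
    SuspendTime=600                        # power down a node after 10 idle minutes
    SuspendProgram=/etc/slurm/suspend.sh   # site-specific script to power nodes off
    ResumeProgram=/etc/slurm/resume.sh     # site-specific script to power nodes on
    SuspendRate=10                         # at most 10 nodes powered down per minute
    ResumeRate=10                          # at most 10 nodes powered up per minute
    SuspendExcNodes=login[01-02]           # never power down the login nodes

When a job is scheduled onto a powered-down node, SLURM runs the resume script and holds the job until the node reports back in, trading a little scheduling latency for power saved during periods of rest.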

Overall, the panelists and I felt it was a successful panel discussion, with some of the best possible panelists available to us. While we shied away from talking specifically about exascale, LINPACK measurement, or future milestones, it was obvious that the members of this panel, who manage some of the largest systems in the world, including Titan and Blue Waters, and the vendors responsible for those systems, including Cray, Intel, and Adaptive Computing, have these problems and challenges in mind for future development, ensuring that future systems and software will answer the problems of programming and managing large-scale systems.
