More than just a simple scheduler, Moab HPC Suite provides Cambridge with all the tools it needs for workload management, meeting SLAs, and showback and chargeback activities across its heterogeneous environment.
Founded 800 years ago, the University of Cambridge in the U.K. is one of the oldest and most respected universities in the world. It is also on the cutting edge of technology: it is home to a high-performance computing (HPC) center that provides advanced computing and data management services to internal and external customers from both academia and industry. The Cambridge HPC Service is the only pay-per-use sustainable HPC service in the U.K. and, as such, requires a leading-edge workload management system. Recently, the service began an initiative to improve its computing capacity, which would allow it to better serve the university and also offer its services to outside businesses.
In 2006, the Cambridge HPC Service underwent a major restructuring, with a large capital investment and a change in business model to become a charge-at-point-of-use, self-sustaining cost center. Its cluster, called Darwin, was the largest high-performance system in the U.K. when it was installed in 2007, and the 20th fastest machine in the world. It had 2,340 Intel Woodcrest cores, peaking at 28.08 Tflops. It has been an invaluable tool for the university, and the center has been able to fund itself by selling cycles on the system. To serve its customers effectively, the center has strict SLAs to meet, which makes job scheduling vital. To make that happen, and to find a system that would keep track of everything automatically, Cambridge began to research HPC solutions providers.
The workload management system also had to keep pace with the growing business requirements of the HPC center and with large growth in system heterogeneity and size. The Cambridge HPC systems have grown significantly since 2006 and now consist of three large systems: a 16 TF, 1,536-core Intel Westmere cluster; a 128 TF (single-precision) GPU cluster; and a new 183 TF, 9,728-core Intel Sandy Bridge system. The university was looking for a resource management system that could meet the demands of its tough, service-oriented HPC business model. More importantly, it was looking for a long-term HPC partner to help provide this key component of its HPC business as its needs grew.
The university investigated various commercial offerings. Many of the products from other providers were simply too expensive. In other cases, the vendor provided scheduling only as an afterthought, which would not give Cambridge the level of support and the collaborative relationship it wanted. Those vendors did not have the right mentality to meet Cambridge’s needs.
So the university turned to Adaptive to learn more about its commercial products, and found that Adaptive had the product it needed. More importantly, Adaptive was truly focused on helping Cambridge get the most out of its HPC system. Adaptive’s Moab HPC Suite is a commercially supported and greatly enhanced descendant of the Maui scheduler. Moab looked like exactly what Cambridge needed, scaling to work with its larger system.
More than just a simple scheduler, Moab HPC Suite could provide Cambridge with all the tools they needed for workload management, showback and chargeback activities. It would also work with the customized reporting software the university had developed. But, most importantly, Moab would help them manage their ever-growing workload, maximizing the use of their new clusters through its self-optimization capabilities.
Adaptive was very responsive to needs expressed by the university, working collaboratively every step of the way. With the new system in place, Cambridge was ready to take their HPC center to the next level within the community.
Moab proved to be just what Cambridge needed to get the most out of its initial 2006 hardware installation. The center broke new ground in the U.K. with the first charge-at-point-of-use HPC service, using Moab as the key resource control system governing all user SLAs. Moab allows the center to apply customized sets of policies to different jobs, which is an important upgrade over the Maui system. Moab’s rich policies place workloads so as to keep throughput as high as possible and speed up jobs. More importantly, Adaptive’s support and Moab’s ability to adapt and grow with the business over the last seven years have been invaluable.
The center currently runs over 11,000 cores as three heterogeneous HPC systems, serving more than 800 users from 30 departments. Because Moab can easily scale well beyond these levels, it manages the workload seamlessly, reducing errors while allowing the center to serve more users and complete more jobs. This helps the center provide a level of service beyond what it was able to offer with the Maui system. Large parallel jobs that have been optimized for either the Woodcrest or Westmere systems can even overlap between CUs.
In meeting its SLAs, Cambridge has also benefited from Moab’s ability to prioritize jobs based on deadline, rather than simply the order of submission. Moab adjusts job priorities in real time to better meet the center’s service guarantees.
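To illustrate the difference between deadline-based ordering and plain submission order, here is a minimal sketch. It is not Moab’s actual priority calculation; the `Job` class, the `slack` and `order_by_deadline` helpers, and the sample jobs are all hypothetical, and the rule used (least slack first) is just one simple way to rank by deadline.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    submitted: float  # hours since an arbitrary reference time
    deadline: float   # hours since the same reference time
    runtime: float    # estimated run time in hours


def slack(job: Job, now: float) -> float:
    """Hours to spare if the job were started right now."""
    return job.deadline - now - job.runtime


def order_by_deadline(jobs: list[Job], now: float) -> list[Job]:
    """Rank jobs by least slack; ties fall back to submission order."""
    return sorted(jobs, key=lambda j: (slack(j, now), j.submitted))


# Hypothetical queue, listed in submission order.
queue = [
    Job("render", submitted=0.0, deadline=48.0, runtime=4.0),
    Job("cfd", submitted=1.0, deadline=12.0, runtime=8.0),
    Job("genome", submitted=2.0, deadline=24.0, runtime=6.0),
]

# Re-evaluated at now=3.0; calling this again with a later `now`
# mimics re-ranking the queue as time passes.
ordered = [job.name for job in order_by_deadline(queue, now=3.0)]
print(ordered)
```

In this made-up queue, the job submitted first has the most slack, so a deadline-aware ordering moves it behind the two jobs that are closer to missing their deadlines.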
Of particular importance is Moab’s ability to optimize utilization rates, which largely determine how much value Cambridge gets from the system. Cambridge was thrilled to see its utilization rates increase to an average of 87 percent over an extended period of time, which was exactly what it was hoping for. Jobs are also able to run with very little need for human attention.
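For readers unfamiliar with the metric, utilization is commonly computed as core-hours consumed divided by core-hours available over a period. The sketch below assumes that definition; the cluster size, the time window, and the job list are invented for illustration (chosen so the result lands at 87 percent) and are not Cambridge’s data.

```python
def utilization(jobs, total_cores, window_hours):
    """Fraction of available core-hours consumed by jobs.

    `jobs` is a list of (cores, hours) pairs for work that ran
    inside the window.
    """
    used = sum(cores * hours for cores, hours in jobs)
    available = total_cores * window_hours
    return used / available


# Hypothetical figures: a 100-core cluster observed for 24 hours.
sample_jobs = [(64, 20), (32, 24), (4, 10)]  # (cores, hours)
rate = utilization(sample_jobs, total_cores=100, window_hours=24)
print(f"{rate:.0%}")
```

Because the center sells cycles, every point of utilization corresponds directly to core-hours that can be charged for rather than left idle.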
Adaptive has continued to work closely with the university to monitor and handle any issues that arise. Adaptive’s 24-hour service and continuous support make it easy for Cambridge to keep up with changes. This level of attention, just as much as Adaptive’s excellent products, is what will keep Cambridge working with Adaptive.