Job Schedulers and Malleable/Evolving Jobs – 03

This entry is part 3 of 4 in the series Malleable and Evolving Jobs

Author: Gary D. Brown, HPC Product Manager

Introduction

In Part 2 of this 4-part blog series, we took an in-depth look at the malleable, evolving, and adaptive job types. If you missed Part 1 or Part 2, click here to read Part 1 and here for Part 2.

This blog, Part 3, discusses the benefits of scheduling malleable, evolving, and adaptive job types, all of which are dynamic job resource allocations, versus scheduling only rigid and moldable job types, both static job resource allocations.

Benefits

There are many benefits to a scheduler handling malleable and evolving (and adaptive) job types. A few are higher system utilization, job resiliency, proactive system fault-tolerance, gentler job preemption, and an alternative to backfill.

System Utilization

System utilization is a metric that identifies the percentage of compute resources in use in an HPC system. This metric often takes two forms: current and time-based.

Current system utilization is the utilization of an HPC system’s resources at the current instant in time. A system administrator can look at existing compute node resources and determine from the quantity in use by jobs the “current” system utilization. For example, an HPC system with 1000 compute nodes and 853 in use by jobs at this very moment has a current system utilization of 85.3%.

Time-based system utilization involves the percentage of system resources in use over time. For example, the space/time diagram in Figure 1 shows there are 8 compute node resources available over a 240-minute interval, yielding 8 × 240 = 1920 node-minutes of available capacity during the interval. If jobs used only six of the compute nodes during the entire interval, the jobs used 6 × 240 = 1440 node-minutes for a system utilization of 75.0%.
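The two forms of the metric can be sketched in a few lines of Python. The function names below are only illustrative and do not come from any scheduler; the numbers are the ones used in the examples above.

```python
def current_utilization(nodes_in_use: int, total_nodes: int) -> float:
    """Percentage of compute nodes in use at this instant."""
    return 100.0 * nodes_in_use / total_nodes


def time_based_utilization(used_node_minutes: float,
                           total_nodes: int,
                           interval_minutes: float) -> float:
    """Percentage of the available node-minutes that jobs used."""
    capacity = total_nodes * interval_minutes  # e.g. 8 x 240 = 1920
    return 100.0 * used_node_minutes / capacity


# 853 of 1000 nodes busy right now
print(f"{current_utilization(853, 1000):.1f}%")           # 85.3%
# 6 of 8 nodes busy for the whole 240-minute interval
print(f"{time_based_utilization(6 * 240, 8, 240):.1f}%")  # 75.0%
```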

To illustrate a major use case for schedulers supporting malleable, evolving, and adaptive jobs, let’s look at the problem of system utilization when a scheduler only supports rigid jobs (moldable jobs are rigid jobs when they execute) versus when it supports malleable and evolving jobs as well.

Rigid Job Scheduling

Figure 1 shows jobs A through G are rigid jobs. Note the diagram starts with rigid jobs A, B, and C already executing. Users submit additional jobs “D” through “H” at the times indicated by their respective circles atop the space/time diagram.

Note job D actually has the capability to evolve since it uses 1 node for 30 minutes and then 7 nodes for another 30 minutes, as illustrated by the cross-hatched area indicating 6 nodes not in use during the first 30 minutes of the job. However, since the scheduler in this example does not support evolving jobs, the user must submit it as a rigid job requiring 7 compute nodes for 60 minutes.
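To see what the rigid submission costs, here is a small sketch that compares job D's rigid allocation with what it actually needs. The phase list and function names are assumptions made for illustration; only the node counts and durations come from the description above.

```python
# Job D's phases as (nodes, minutes): 1 node for 30 minutes, then 7 nodes for 30.
JOB_D_PHASES = [(1, 30), (7, 30)]


def rigid_node_minutes(phases):
    """A rigid job must hold its peak node count for its whole runtime."""
    peak_nodes = max(nodes for nodes, _ in phases)
    total_minutes = sum(minutes for _, minutes in phases)
    return peak_nodes * total_minutes


def evolving_node_minutes(phases):
    """An evolving job holds only what each phase needs."""
    return sum(nodes * minutes for nodes, minutes in phases)


rigid = rigid_node_minutes(JOB_D_PHASES)        # 7 x 60 = 420
needed = evolving_node_minutes(JOB_D_PHASES)    # 30 + 210 = 240
print(rigid, needed, rigid - needed)            # 420 240 180
```

The 180 node-minute difference is the cross-hatched area in the diagram: 6 nodes idle for 30 minutes.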

System Utilization for Rigid Jobs

Figure 1 – Rigid Jobs System Utilization Space/Time Diagram

Figure 1 shows jobs A-G were allocated 75.8% of the system's 1920 node-minutes of capacity. However, the evolving job D actually used only one compute node for its first 30 minutes and then all 7 compute nodes for its last 30 minutes. This means the system utilization is really 66.4%, since job D does not actually use 180 node-minutes (6 nodes × 30 minutes) of its 420 node-minute allocation.

Note the rigid jobs-only scheduling completed 7 jobs in the 240-minute interval.

Rigid + Malleable Job Scheduling

When an HPC system’s job scheduler supports malleable jobs, the job scheduler can adjust a malleable job’s resource allocation so the HPC system operates at or near 100% system utilization, as demonstrated by the diagram in Figure 2.

System Utilization for Malleable Job with Rigid Jobs

Figure 2 – Rigid + Malleable Jobs System Utilization Space/Time Diagram

Figure 2 shows that jobs A-G and job M, the malleable job, were allocated 100% of the system's 1920 node-minutes of capacity. However, the evolving job D still did not use 180 of its allocated node-minutes, meaning the system utilization is really 90.6%.
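The "real" utilization figures quoted for Figures 1 and 2 follow from the same arithmetic: subtract the share of capacity that job D holds but does not use. A minimal sketch, with a hypothetical helper name:

```python
CAPACITY = 8 * 240  # 1920 node-minutes in the 240-minute interval


def real_utilization(allocated_pct: float,
                     unused_node_minutes: float,
                     capacity: float = CAPACITY) -> float:
    """Allocated utilization minus the share that was allocated but idle."""
    return allocated_pct - 100.0 * unused_node_minutes / capacity


job_d_unused = 6 * 30  # 6 idle nodes for job D's first 30 minutes = 180

print(f"{real_utilization(75.8, job_d_unused):.1f}%")   # Figure 1: 66.4%
print(f"{real_utilization(100.0, job_d_unused):.1f}%")  # Figure 2: 90.6%
```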

Note job G started and finished later than with rigid-only job scheduling because job M had to have at least one compute node, which job G required.

Notwithstanding, this rigid + malleable jobs scheduling completed 8 jobs, since the malleable job finished during the 240-minute interval, making the HPC system more productive because of its increased job throughput.

Rigid + Malleable + Evolving Job Scheduling

When an HPC system’s job scheduler supports evolving jobs, such jobs can request changes in their node allocation so they use only what they need. The user no longer has to request the maximum resources for the job’s entire lifespan and then leave some of them idle for part of it. This allows a job scheduler to start a growing evolving job earlier than it could if it had to schedule the job as a rigid job. In addition, a shrinking evolving job permits the scheduler to use resources the job no longer needs to start other jobs earlier than it otherwise could.

 

System Utilization for Malleable and Evolving Jobs with Rigid Jobs

Figure 3 – Rigid + Malleable + Evolving Jobs System Utilization Space/Time Diagram

Figure 3 shows that jobs A-H and M were allocated 97.4% of the system's 1920 node-minutes of capacity; the figure falls short of 100% because job M finished and there were no other malleable jobs to use the remaining idle compute nodes. However, job D had to wait 15 minutes between finishing its 30 minutes on one compute node and receiving the additional 6 compute nodes it required. Job D therefore did not actually use 15 node-minutes (1 node × 15 minutes), and the real system utilization is 96.6%.

Note the scheduler started evolving job D immediately since it only needed one compute node to start, which permitted job D to finish earlier and therefore jobs E, F, H and M to finish earlier as well.

Note further job G still started and finished later than with rigid-only job scheduling because job M still had to have at least one compute node, which job G required.

This rigid + malleable + evolving jobs scheduling completed 9 jobs, since job H started and finished during the 240-minute interval, making the HPC system even more productive because of an even higher job throughput rate.

Job Resiliency to Hardware Failures

One of the best use cases for scheduling malleable and evolving jobs is the potential for job resiliency. With a rigid job, if a node fails, the job must terminate since it normally has no way to recover from the failure due to its statically allocated resources, on which its programming model and algorithms usually depend.

If a job has an adaptive programming model, one that can handle variable quantities of resources during the job’s execution, handling a node failure is as easy as removing the node from the job’s resource allocation. Either the scheduler or the job’s application or runtime environment can make the adjustment. In fact, some applications and runtime environments can detect a node failure faster than the job scheduler can.
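The difference between the two cases can be sketched as follows. The `Job` class and `handle_node_failure` function are hypothetical and exist only for this illustration; a real scheduler or runtime would also have to redistribute the work that was on the failed node.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: set        # names of the compute nodes allocated to the job
    adaptive: bool    # True if the programming model tolerates node-count changes
    state: str = "running"


def handle_node_failure(job: Job, failed_node: str) -> Job:
    """Drop a failed node from an adaptive job; terminate a rigid one."""
    if failed_node not in job.nodes:
        return job                        # the failure does not affect this job
    if job.adaptive and len(job.nodes) > 1:
        job.nodes.discard(failed_node)    # shrink the allocation and keep running
    else:
        job.state = "terminated"          # static allocation: no way to recover
    return job


rigid = handle_node_failure(Job("rigid", {"n1", "n2"}, adaptive=False), "n2")
adaptive = handle_node_failure(Job("adaptive", {"n1", "n2"}, adaptive=True), "n2")
print(rigid.state, adaptive.state, sorted(adaptive.nodes))
# terminated running ['n1']
```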

Proactive System Fault-tolerance

The scheduler for an HPC system can usually examine many different factors related to the health of the system, such as node temperature, fan speed, etc. If an administrator sets up the appropriate policies, the scheduler can detect an impending node failure and take proactive action to keep jobs off the node. If a malleable or adaptive job is using the node in question, the scheduler can tell the job to stop using it, so that jobs never need to deal with the failed hardware at all.
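A policy of this kind might look like the sketch below. Everything in it is an assumption for illustration: the metric names (`temp_c`, `fan_rpm`), the threshold values, and the function names are not taken from any particular scheduler.

```python
def failing_soon(metrics: dict,
                 max_temp_c: float = 90.0,
                 min_fan_rpm: float = 2000.0) -> bool:
    """Hypothetical health policy: too hot, or fan too slow."""
    return metrics["temp_c"] > max_temp_c or metrics["fan_rpm"] < min_fan_rpm


def proactive_shrink(job_nodes: set, health: dict, malleable: bool):
    """Return (nodes the job keeps, nodes it is told to stop using)."""
    suspect = {n for n in job_nodes if n in health and failing_soon(health[n])}
    # Only a malleable/adaptive job can give up nodes mid-run, and it
    # must be left with at least one healthy node.
    if malleable and suspect and len(job_nodes) > len(suspect):
        return job_nodes - suspect, suspect
    return job_nodes, set()


health = {
    "n1": {"temp_c": 95.0, "fan_rpm": 3000.0},  # overheating
    "n2": {"temp_c": 60.0, "fan_rpm": 3000.0},  # healthy
}
kept, released = proactive_shrink({"n1", "n2"}, health, malleable=True)
print(sorted(kept), sorted(released))  # ['n2'] ['n1']
```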

Job Preemption

One common technique used to run high-priority jobs that arrive in the job queue after the scheduler starts a low-priority job is for the scheduler to “preempt” the low-priority job in order to vacate its resources and use them instead for the high-priority job. This can result in wasted resources if the preempted job cannot restart from where it left off at the time of its preemption but, instead, must start over when the scheduler starts the job again later.

If the low-priority job is malleable and the high-priority job requires fewer resources than the low-priority job is using, the scheduler can instruct the low-priority job to contract its resource allocation to free the resources needed by the high-priority job. This permits both the high- and low-priority jobs to execute simultaneously and still satisfy the high-priority job’s service-level agreement guarantees.
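The scheduler's choice between contracting and preempting can be sketched as a small decision function. The function name, its parameters, and the one-node minimum for the malleable job are assumptions for illustration.

```python
def plan_for_high_priority(low_nodes: int,
                           low_is_malleable: bool,
                           high_nodes_needed: int,
                           idle_nodes: int = 0,
                           low_min_nodes: int = 1):
    """Return (action, nodes the low-priority job must give up)."""
    shortfall = max(0, high_nodes_needed - idle_nodes)
    if shortfall == 0:
        return ("none", 0)             # enough idle nodes already
    if low_is_malleable and low_nodes - shortfall >= low_min_nodes:
        return ("contract", shortfall)  # both jobs keep running
    return ("preempt", low_nodes)       # vacate the whole allocation


print(plan_for_high_priority(8, True, 3))   # ('contract', 3)
print(plan_for_high_priority(8, False, 3))  # ('preempt', 8)
print(plan_for_high_priority(4, True, 4))   # ('preempt', 4)
```

In the first case the low-priority job continues on 5 nodes; in the other two it loses its whole allocation and may have to start over.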

Backfill Alternative

Another common technique a job scheduler may use to utilize a system more fully is “backfill”. When a scheduler detects there is or will be a set of unused resources for a period of time, it can schedule smaller and/or shorter jobs that will fit on the resources within the time period to increase system utilization and job throughput. However, sometimes this technique backfires when the smaller/shorter jobs run longer than expected, which has the unwanted consequence of pushing back the start of larger or higher-priority jobs.

With malleable job scheduling, a scheduler can expand a running malleable job’s resource allocation so it uses the set of unused resources for just the length of the time period, increasing both system utilization and job throughput without delaying the expected or guaranteed start time of larger and/or higher-priority jobs.
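As a rough sketch, the scheduler's plan for an idle window amounts to an expand event at the start of the window and a contract event at its end. The function names and the event tuple layout below are hypothetical.

```python
def expansion_plan(current_nodes: int, idle_nodes: int,
                   window_start: int, window_end: int):
    """Events (minute, action, node count) for a running malleable job."""
    if idle_nodes <= 0 or window_end <= window_start:
        return []  # nothing to fill
    return [
        (window_start, "expand", current_nodes + idle_nodes),
        # Contract exactly when the window closes, so the job that
        # reserved these nodes starts on time.
        (window_end, "contract", current_nodes),
    ]


def node_minutes_gained(idle_nodes: int, window_start: int, window_end: int) -> int:
    """Capacity that would otherwise sit idle during the window."""
    return max(0, idle_nodes) * max(0, window_end - window_start)


# A 2-node malleable job borrows 3 idle nodes from minute 60 to minute 90.
print(expansion_plan(2, 3, 60, 90))     # [(60, 'expand', 5), (90, 'contract', 2)]
print(node_minutes_gained(3, 60, 90))   # 90
```

Unlike a backfilled job, which may overrun its estimate, the malleable job gives the nodes back at a time the scheduler chooses.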

Schedulers, Applications, and the Need for a Standard Scheduler and Malleable / Evolving Application Dialog API

The next blog will discuss how schedulers and malleable, evolving, and adaptive applications need to communicate in a standard way in order to support applications created using new programming models and techniques such as task over-decomposition with dynamic runtime environments, adaptive mesh refinement, etc. Stay tuned!
