Job Schedulers and Malleable/Evolving Jobs


Author: Gary D. Brown, Adaptive Computing HPC Product Manager

Introduction

The HPC/technical-computing world has many different kinds of applications that users submit as “jobs” to job schedulers, traditionally known as “batch schedulers”. Just as there are many different HPC applications, there are also many different job schedulers in use.

Common commercial schedulers in use on HPC systems include Adaptive Computing Moab, IBM Platform LSF, Altair PBS Pro, and Univa Grid Engine, with open-source schedulers such as Maui and Slurm also in use. Some sites, both academic and government, have developed their own schedulers, such as the Cobalt scheduler in use at the U.S. Dept. of Energy’s Argonne National Laboratory.

 


In the beginning, HPC applications usually solved a problem using an algorithm that required a fixed set of HPC resources, which over the last two decades have increasingly become distributed-memory “compute nodes” (servers designed for computation). Parallel applications and their algorithms executing on these hardware resources needed a way to exchange data between nodes, which the Message Passing Interface (MPI) API successfully standardized. This standard API permitted applications to scale up in size as HPC systems grew over the years.

The Challenge of Scalability and Resiliency

As ever larger HPC systems became available, applications encountered scalability issues with their original algorithms. Application developers optimized their algorithms or developed new algorithms to permit increased scalability, but the new or modified algorithms still required a fixed set of resources. This led to scheduling issues that could result in lower HPC system utilization, especially when schedulers attempted to optimize job resource allocations based on network topologies.

In addition, larger HPC systems suffered more frequent hardware failures that caused jobs to fail whenever a compute node in a job’s fixed set of resources stopped working. Thus, large and/or long-running fixed-resource jobs became increasingly susceptible to failures, resulting in requeued or restarted jobs due to their lack of resiliency.

New Programming Models and Frameworks

The lower system utilization and job resiliency problems, among other issues, caused application developers and researchers to look for alternative application implementations. Many people came to a similar conclusion: dividing the work an application must perform into many small, separately schedulable “tasks” and executing them via a runtime environment (RTE) could alleviate these scalability and resiliency problems. Thus were born RTEs, such as UIUC’s Charm++, Utah’s Uintah, and others, that gave developers new programming models and/or frameworks within which to create dynamic, adaptable applications.

Using these RTEs, HPC jobs acquired the ability to adapt dynamically to internal and/or external conditions by expanding and/or contracting their resource allocations. This adaptability makes job resiliency and full or nearly full system utilization possible; however, it also has serious implications for job schedulers that schedule only “rigid” jobs with a fixed set of resources. These benefits are possible only if job schedulers and the new adaptable RTEs perform a dance of cooperation.
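To make the task-decomposition idea concrete, here is a loose, hypothetical analogy in plain Python (this is not how Charm++, Uintah, or any HPC RTE actually works): over-decompose a computation into many small tasks and let a worker pool of whatever size is available run them. Because no task is pinned to a particular worker, the pool can in principle grow or shrink between tasks without changing the result.

from concurrent.futures import ThreadPoolExecutor

def small_task(chunk):
    # One separately schedulable unit of work (here, a trivial partial sum).
    return sum(chunk)

# Over-decompose the problem into far more tasks than workers.
data = list(range(1_000_000))
chunks = [data[i:i + 10_000] for i in range(0, len(data), 10_000)]  # 100 small tasks

# The worker count is a runtime choice; the tasks do not care how many there are.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(small_task, chunks))

print(sum(partial_sums))  # identical result whether 1, 4, or 40 workers ran the tasks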

Blog Series Purpose

This 3-blog series will describe the basic use cases and nature of RTE- and/or application-based job adaptability so readers understand the new job types that RTEs and/or applications enable and, in general terms, what job schedulers must do to handle them.

Job Taxonomy

Understanding how adaptive jobs and job schedulers must interact requires an understanding of the different types of jobs possible; i.e., the taxonomy of jobs.

In 1996 the paper “Toward Convergence in Job Schedulers for Parallel Supercomputers”[1], presented at the Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP) at the IEEE 10th International Parallel Processing Symposium, defined four job types: Rigid, Moldable, Malleable, and Evolving. The following table, reproduced from the paper, identifies the most distinctive characteristics of each job type.

 

 

                     When it is Decided
Who Decides    At Job Submission         During Job Execution
               (Static Allocation)       (Dynamic Allocation)
User           Rigid                     Evolving
Scheduler      Moldable                  Malleable

 

The following table describes additional characteristics of each job type. Note the addition of a fifth job type, Adaptive, which is a convenient term for a job that is both Malleable and Evolving.

 

Rigid: A job that requires a fixed set of resources, all of which the job scheduler must allocate to the job before it starts (fixed static resource allocation).

Moldable: A job that allows a variable set of resources for scheduling but requires a fixed set of resources to execute, which the job scheduler must allocate before it starts the job and which the job must discover in order to execute properly (variable static resource allocation).

Malleable: A job that allows a variable set of resources, which the job scheduler dynamically allocates and deallocates and of which allocation/deallocation the scheduler must inform the running job so it can adapt to the new resource allocation (unidirectional, scheduler-initiated and application-executed, variable dynamic resource allocation).

Evolving: A job that dynamically requests and/or relinquishes resources during its runtime and of which the job scheduler must be informed while the job is running so it can allocate or deallocate the resources, respectively (unidirectional, application-initiated and scheduler-executed, variable dynamic resource allocation).

Adaptive: A job where either the scheduler or the job can dynamically initiate resource allocation changes during the job’s runtime and of which the other party must be informed while the job is running so both keep the job’s resource allocations synchronized (bidirectional, application- or scheduler-initiated and scheduler- or application-executed, respectively, variable dynamic resource allocation).
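As a concrete illustration of the taxonomy, the following minimal Python sketch (all names are hypothetical and not part of any real scheduler) classifies a job request into one of the five types described above.

from dataclasses import dataclass

@dataclass
class JobRequest:
    # Hypothetical job description, used only to illustrate the taxonomy.
    alternatives: int            # resource allocation alternatives offered at submission
    scheduler_can_resize: bool   # scheduler may grow/shrink the job while it runs
    job_can_resize: bool         # job may request/release resources while it runs

def classify(job: JobRequest) -> str:
    # Map the taxonomy questions (who decides, and when) onto a job type.
    if job.scheduler_can_resize and job.job_can_resize:
        return "Adaptive"    # bidirectional dynamic allocation
    if job.scheduler_can_resize:
        return "Malleable"   # scheduler-initiated dynamic allocation
    if job.job_can_resize:
        return "Evolving"    # application-initiated dynamic allocation
    return "Moldable" if job.alternatives > 1 else "Rigid"  # static allocation

# Example: three alternatives offered at submission, no resizing during execution.
print(classify(JobRequest(alternatives=3, scheduler_can_resize=False, job_can_resize=False)))
# -> Moldable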

 

The remainder of this blog will discuss the Rigid and Moldable job types, with the other job types reserved for the next blogs in the series.

[1] Downloadable at http://www.cs.huji.ac.il/~feit/parsched/jsspp96/p-96-1.pdf

Rigid Jobs

A user submits a rigid job by specifying the resources and quantities it needs for a specific duration, referred to as walltime or wallclock time.

The following job submission in PBS/TORQUE syntax requests four compute nodes for 60 minutes.

qsub -l nodes=4,walltime=01:00:00 myJobScript
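In TORQUE/PBS the same request can also be embedded in the job script itself as a directive, so the script carries its resource requirements with it. A minimal sketch of such a script (the application launch line is a hypothetical placeholder) might look like:

#!/bin/bash
#PBS -l nodes=4,walltime=01:00:00
cd $PBS_O_WORKDIR
mpirun ./myApplication

The user would then submit it with just “qsub myJobScript”.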

A job submission is representable on a space/time graph where the Y-axis represents compute nodes and the X-axis represents time. Figure 1 illustrates the job submission’s resource request as a space/time graph.

Figure 1 – Rigid Job Space/Time Diagram

Note that for rigid jobs, the user decides the static compute node allocation quantity at job submission time, and the scheduler cannot change the allocation.

All job schedulers can schedule rigid jobs; essentially, scheduling such job space/time boxes is like playing Tetris on an infinitely wide strip of compute nodes (space) that scrolls from right to left (future to present to past). From a computational algorithm perspective, scheduling such jobs is similar to a “bin packing” problem with a bin of fixed floor area (space) and infinite height (time rising into the future).
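As a rough illustration of this packing problem, the following Python sketch (hypothetical and greatly simplified; real schedulers also handle priorities, backfill, topology, and much more) places each rigid job’s space/time box at the earliest time where enough nodes are free.

# A minimal first-fit sketch of rigid-job placement on a node/time grid.
# Each job needs a fixed number of nodes for a fixed number of time steps.

TOTAL_NODES = 8

def first_fit_start(busy, nodes_needed, walltime, horizon=1000):
    # Earliest start t such that nodes_needed nodes are free for [t, t + walltime).
    for t in range(horizon):
        if all(busy.get(step, 0) + nodes_needed <= TOTAL_NODES
               for step in range(t, t + walltime)):
            return t
    return None

def schedule(jobs):
    # jobs: list of (name, nodes, walltime) tuples; returns {name: start_time}.
    busy = {}      # time step -> nodes already allocated
    starts = {}
    for name, nodes, walltime in jobs:
        t = first_fit_start(busy, nodes, walltime)
        if t is None:
            continue  # could not place within the horizon
        starts[name] = t
        for step in range(t, t + walltime):   # reserve the job's space/time box
            busy[step] = busy.get(step, 0) + nodes
    return starts

print(schedule([("A", 4, 6), ("B", 6, 3), ("C", 2, 3)]))
# A starts at 0; B (6 nodes) cannot overlap A (4 + 6 > 8 nodes) so it starts when A ends;
# C (2 nodes) fits alongside A and also starts at 0.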

Moldable Jobs

Moldable jobs are the other static resource allocation job type, but with a twist relative to the Rigid job type: the user can specify multiple alternative resource allocation quantities and, optionally, corresponding walltimes. Some job schedulers, such as Moab, can schedule moldable jobs.

The following job submission in TORQUE syntax requests four compute nodes for 60 minutes, or two nodes for 120 minutes, or one node for 240 minutes. Note “trl” stands for “Task Resource List”.

qsub -l nodes=1,trl=4@60:2@120:1@240

Figure 2 illustrates the job submission’s resource request as a space/time graph with three different space/time boxes, each representing one of the three alternatives.

Figure 2 – Moldable Job Space/Time Diagram

Note that for moldable jobs, the user decides the alternative static compute node allocation quantities and their walltime estimates at job submission time; the scheduler must choose one of the alternatives and, after starting the job, cannot change the job’s resource allocation.

Some job schedulers can schedule moldable jobs. Essentially, such scheduling plays Tetris with several differently shaped boxes for the same job, chooses one to place in the bin, and thereafter cannot change its shape.
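Building on the hypothetical first_fit_start helper from the rigid-job sketch above, a scheduler handling a moldable job might simply evaluate every (nodes, walltime) alternative and commit to the one that completes earliest.

def choose_moldable(busy, alternatives):
    # alternatives: list of (nodes, walltime) pairs offered at submission time.
    # Returns (nodes, walltime, start, finish) for the alternative completing earliest.
    best = None
    for nodes, walltime in alternatives:
        start = first_fit_start(busy, nodes, walltime)
        if start is None:
            continue  # this shape never fits within the horizon
        finish = start + walltime
        if best is None or finish < best[3]:
            best = (nodes, walltime, start, finish)
    return best  # the scheduler then reserves exactly this one space/time box

# Example: the three alternatives from the qsub command above, on an empty system.
print(choose_moldable({}, [(4, 60), (2, 120), (1, 240)]))
# -> (4, 60, 0, 60): four nodes for 60 minutes finishes first.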

Malleable and Evolving Jobs

The next two blogs will discuss the Malleable and Evolving job types, respectively. Stay tuned to find out what they can do, what their benefits are, and the challenges they pose for job schedulers.
