Job Schedulers and Malleable/Evolving Jobs – Part 4

This entry is part 4 of 4 in the series Malleable and Evolving Jobs

Author: Gary D. Brown, HPC Product Manager

Introduction

In Part 3 of this 4-part blog series, we discussed and took an in-depth look at the benefits of scheduling malleable, evolvable, and adaptive job types.

This blog, Part 4, discusses the minimal basic interactions between a scheduler and applications or run-time environments (RTEs) necessary to support malleable, evolvable, and adaptive job types. In addition, we will look in a little more depth at possible variations on the basic interactions and why they are important and desirable. We will then conclude with a short introduction to the PMIx project and how you can help by taking a survey about your application(s) and/or RTE(s).

In this blog I will use the term “adaptive” in the generic sense to refer to malleable, evolvable and/or adaptive applications, RTEs and interactions, and then where necessary will use the terms malleable and evolvable to refer to specific individual situations applicable only to malleable or evolvable job type scheduling interactions. I will also use the term “application” to refer in the generic sense to applications and RTEs.

Adaptive Jobs and Scheduler Interactions

Malleable and evolving jobs present different scheduling situations to schedulers. The difference basically lies with who, the scheduler or the job, initiates the dialog for resource negotiation.

When a scheduler initiates a dialog with a malleable application, the application must respond to an external condition that is internal to the scheduler. When an evolving application initiates a dialog with the scheduler, the scheduler must respond to an internal condition of the application, about which it knew nothing until the initiation of the dialog and to which it must respond with resource scheduling.

Note there is nothing in a scheduler/application resource negotiation dialog that indicates only compute nodes are the resources of interest; in fact, any resource can be the target of a dialog (e.g. licenses, bandwidth, etc.).

Scheduler/Malleable Application Interaction Dialogs

Schedulers can handle and schedule malleable applications the most easily since the scheduler initiates the resource negotiation dialog with the application and “knows” exactly what it wants to do.

Malleable “Expand” Dialog

In the malleable “Expand” operation, the scheduler notifies the malleable application it has an increase in its resource allocation, the application adapts its internal operation to the request, and then responds to the request in the affirmative or negative. Figure 1 illustrates the functional flow of this interaction dialog between a scheduler and a malleable application.

Expand - Basic

Figure 1 – Expand Malleable Job Functional Flow Diagram

The scheduler has resources become available (1), tentatively allocates some or all of them to the malleable job (2), and notifies the malleable application of the additional resources (3). The malleable application “starts” the resources (for an MPI job this could be the wire-up of the additional resources; for an application with a master/worker architecture, the new workers contacting the master; etc.) and, on successful completion of the “start”, notifies the scheduler it has successfully started the resources. The scheduler then commits the additional resources to the job’s allocation and notifies the application the resources are committed to the job, at which point the application actually uses them.
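As a rough illustration of the expand dialog described above, here is a minimal Python sketch. The class and method names (`ExpandScheduler`, `MalleableApp`, `start`, `on_commit`) are hypothetical and do not come from any real scheduler or from PMIx; the "start" step always succeeds here, and failure handling is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class MalleableApp:
    """Hypothetical application side of the expand dialog."""
    in_use: set = field(default_factory=set)

    def start(self, nodes):
        # "Start" the offered resources (MPI wire-up, workers contacting a
        # master, ...). This toy version always reports success.
        return True

    def on_commit(self, nodes):
        # Only after the scheduler commits does the application use them.
        self.in_use |= set(nodes)


@dataclass
class ExpandScheduler:
    """Hypothetical scheduler side of the expand dialog."""
    free: set
    tentative: set = field(default_factory=set)
    committed: set = field(default_factory=set)

    def expand(self, app, count):
        offer = set(sorted(self.free)[:count])  # (1) resources became available
        self.free -= offer
        self.tentative |= offer                 # (2) tentative allocation
        if app.start(offer):                    # (3) notify; application starts them
            self.tentative -= offer
            self.committed |= offer             # commit to the job's allocation
            app.on_commit(offer)                # application may now use them
        return offer


sched = ExpandScheduler(free={"n1", "n2", "n3"})
app = MalleableApp()
sched.expand(app, 2)
print(sorted(app.in_use))  # ['n1', 'n2']
```

The point of the two-phase shape is that the resources stay in the `tentative` set until the application confirms the start, so nothing is charged to the job before it can actually use it.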

Error Conditions and Possible Responses

In dialogs between a scheduler and a malleable application it is entirely possible for error conditions to occur. Obviously there must be dialogs that handle such errors. Figure 2 illustrates one possible dialog for an “expand” operation where the application’s error handling policy is “all or nothing”, which means if no error occurs the application keeps and uses all of the additional resources but if an error occurs it returns all of them to the scheduler.

Expand - Failure

Figure 2 – Expand Malleable Job “All or Nothing” Error Handling Functional Flow Diagram

Figure 3 illustrates a possible dialog for an application error handling policy of “use what I can”, in which the application keeps the additional resources it successfully “starts” and gives back those it does not.

Expand - Partial

Figure 3 – Expand Malleable Job “Use What I Can” Error Handling Functional Flow Diagram

How a malleable application handles error conditions will depend on the application’s design. A scheduler/malleable dialog should handle commonly expected error handling policies such as the two mentioned above.
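The two error handling policies can be expressed as a small decision function. This is only a sketch: the names `ExpandPolicy` and `resolve_expand` are made up for illustration, and a real dialog would carry this decision in messages between the application and the scheduler.

```python
from enum import Enum


class ExpandPolicy(Enum):
    ALL_OR_NOTHING = "all or nothing"
    USE_WHAT_I_CAN = "use what I can"


def resolve_expand(offered, started_ok, policy):
    """Return (kept, returned) sets of resources after an expand attempt.

    offered    -- resources the scheduler tentatively allocated
    started_ok -- subset of `offered` the application managed to "start"
    """
    offered = set(offered)
    started_ok = set(started_ok) & offered
    failed = offered - started_ok
    if not failed:
        return offered, set()        # no error: keep everything
    if policy is ExpandPolicy.ALL_OR_NOTHING:
        return set(), offered        # any error: give everything back
    return started_ok, failed        # keep what started, return the rest


kept, returned = resolve_expand(
    {"n1", "n2", "n3"}, {"n1", "n2"}, ExpandPolicy.USE_WHAT_I_CAN
)
print(sorted(kept), sorted(returned))  # ['n1', 'n2'] ['n3']
```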

Malleable “Contract” Dialog

In the malleable “Contract” operation, the scheduler notifies the malleable application it has a decrease in its resource allocation, the application adapts its internal operation to the request, and then responds to the request in the affirmative or negative. Figure 4 illustrates the functional flow of this interaction dialog between a scheduler and a malleable application.

Contract - Basic

Figure 4 – Contract Malleable Job Functional Flow Diagram

The scheduler needs resources for other jobs (1), identifies the resources it needs from the malleable job (2), and notifies the malleable application of the resources it wants (3). The malleable application “stops” using the resources and on successful completion of the “stop” (4), notifies the scheduler it has successfully stopped the resources and is no longer using them (5). The scheduler then deallocates the identified resources from the job’s allocation (6) and then allocates them to other jobs (7).
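The seven contract steps above can be traced in a few lines of Python. The function name `contract` and the `stop_resources` callback are hypothetical stand-ins for the application side; what happens on a negative response (the allocation is left untouched) is an assumption, since the text only says the application may answer in the negative.

```python
def contract(job_allocation, wanted, stop_resources):
    """Walk the contract dialog; return (allocation, reclaimed, trace)."""
    allocation = set(job_allocation)
    trace = ["1 scheduler needs resources for other jobs"]
    targets = set(wanted) & allocation
    trace.append("2 scheduler identifies resources to take")
    trace.append("3 scheduler notifies application")
    stopped = stop_resources(targets)  # application-side callback
    trace.append("4 application attempts to stop the resources")
    if not stopped:
        # Assumption: a negative response leaves the allocation unchanged.
        trace.append("5 application responds in the negative")
        return allocation, set(), trace
    trace.append("5 application notifies scheduler of the stop")
    allocation -= targets
    trace.append("6 scheduler deallocates the resources")
    trace.append("7 scheduler allocates them to other jobs")
    return allocation, targets, trace


remaining, reclaimed, trace = contract(
    {"n1", "n2", "n3"}, {"n3"}, stop_resources=lambda nodes: True
)
print(sorted(remaining), sorted(reclaimed))  # ['n1', 'n2'] ['n3']
```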

Evolvable Application/Scheduler Interactions

Scheduling a growing evolvable application is the most difficult scheduling to perform since the scheduler has no idea when an evolving application wants to grow or by how much. In contrast, handling shrinking evolving applications is quite easy.

Evolvable “Grow” Dialog

In the evolvable “Grow” operation the application notifies the scheduler it requires an increase in its resource allocation, the scheduler finds and allocates additional resources and then notifies the application. The application starts the resources and notifies the scheduler it has successfully started the resources, after which the scheduler commits the resources to the application’s job and informs the application of such. Figure 5 illustrates the functional flow of this interaction dialog between a scheduler and an evolvable application.

Grow - Basic

Figure 5 – Grow Evolving Job Functional Flow Diagram

The application requires more resources to continue the job (1) and asks the scheduler for the additional resources. The scheduler allocates the resources to the job and notifies the evolving application. The application starts the resources and informs the scheduler of its success, after which the scheduler commits the resources to the job and notifies the application in return.
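A minimal sketch of the grow dialog follows, this time counting resources by kind to reflect that a dialog can target licenses or bandwidth as well as nodes. `GrowScheduler`, `grow`, and the `start` callback are hypothetical names. Two behaviors are assumptions not covered by the text: the scheduler simply declines a request it cannot meet immediately, and a failed start returns the tentative resources to the free pool.

```python
from collections import Counter


class GrowScheduler:
    """Hypothetical scheduler tracking free counts per resource kind."""

    def __init__(self, free):
        self.free = Counter(free)
        self.tentative = Counter()
        self.committed = Counter()

    def allocate(self, request):
        """Tentatively allocate the request, or return None if it cannot be met."""
        request = Counter(request)
        if any(self.free[kind] < n for kind, n in request.items()):
            return None  # assumption: decline rather than queue the request
        self.free -= request
        self.tentative += request
        return request

    def commit(self, grant):
        self.tentative -= grant
        self.committed += grant

    def rollback(self, grant):
        # Assumption: a failed start returns everything to the free pool.
        self.tentative -= grant
        self.free += grant


def grow(scheduler, request, start):
    """Application-initiated grow: ask, start, then wait for the commit."""
    grant = scheduler.allocate(request)  # application asks, scheduler allocates
    if grant is None:
        return False
    if start(grant):                     # application "starts" the resources
        scheduler.commit(grant)          # scheduler commits and notifies
        return True
    scheduler.rollback(grant)
    return False


sched = GrowScheduler({"nodes": 4, "licenses": 1})
print(grow(sched, {"nodes": 2, "licenses": 1}, start=lambda g: True))  # True
print(grow(sched, {"nodes": 3}, start=lambda g: True))                 # False
```

The second request fails because only two nodes remain free, which hints at why growing is the hard case: the scheduler learns of the need only when the request arrives.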

Evolvable “Shrink” Dialog

In the evolvable “Shrink” operation the application notifies the scheduler it no longer needs part of its resource allocation and the scheduler shrinks the job’s resource allocation accordingly. Figure 6 illustrates the functional flow of this interaction dialog between a scheduler and an evolvable application.

Shrink - Basic

Figure 6 – Shrink Evolving Job Functional Flow Diagram

The application has idle resources it has stopped (1) and notifies the scheduler it can take the resources back (2). The scheduler deallocates the resources and notifies the evolvable application it has reclaimed them, after which the scheduler allocates them to other jobs.
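The basic shrink dialog is short enough to sketch in one function. The name `shrink` and the first-fit hand-off to waiting jobs are illustrative assumptions; the text only says the scheduler allocates the reclaimed resources to other jobs, not how it chooses them.

```python
def shrink(job_allocation, idle, waiting_jobs):
    """Application returns idle resources; scheduler reuses them.

    waiting_jobs -- list of (job_name, nodes_needed) in queue order
    Returns (new_allocation, assignments, notifications).
    """
    allocation = set(job_allocation)
    released = set(idle) & allocation        # (1)-(2) app offers stopped resources
    allocation -= released                   # scheduler deallocates them
    notifications = [("scheduler->app", "reclaimed", sorted(released))]
    pool = sorted(released)
    assignments = {}
    for job, need in waiting_jobs:           # toy first-fit hand-off
        if need <= len(pool):
            assignments[job], pool = pool[:need], pool[need:]
    return allocation, assignments, notifications


alloc, assigned, notes = shrink(
    {"n1", "n2", "n3", "n4"}, {"n3", "n4"}, [("jobB", 1), ("jobC", 2)]
)
print(sorted(alloc), assigned)  # ['n1', 'n2'] {'jobB': ['n3']}
```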

Note it is entirely possible to have a different dialog if the scheduler desires to tell the evolvable application which resources it wants to take back, which it may want to do for several reasons, such as choosing resources topologically close to the resources of another job to which it can give the returned resources. Such a dialog would obviously require an additional two notifications. The application would have to inform the scheduler of how many of each resource it wants to return, the scheduler would notify the application in a new message of the specific resources it should return, the application would notify the scheduler in a new message it has stopped the resources, and the scheduler would notify the application it has deallocated the resources from the evolvable application’s job.
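The four-message variant can be sketched as follows. Everything here is an assumption made for illustration: nodes are plain integers on a toy one-dimensional topology, "close" means a small numeric distance to `prefer_near`, and the message tuples are an invented format.

```python
def negotiated_shrink(allocation, count, prefer_near):
    """Shrink dialog where the scheduler chooses which resources come back.

    allocation  -- integer node ids held by the job (toy 1-D topology)
    count       -- how many nodes the application offers to return
    prefer_near -- node id of the job that should receive them
    Returns (remaining, chosen, messages).
    """
    messages = [("app->sched", "can return", count)]
    # Scheduler picks the nodes closest to the receiving job.
    chosen = sorted(allocation, key=lambda n: (abs(n - prefer_near), n))[:count]
    messages.append(("sched->app", "return these", chosen))
    messages.append(("app->sched", "stopped", chosen))
    remaining = sorted(set(allocation) - set(chosen))
    messages.append(("sched->app", "deallocated", chosen))
    return remaining, chosen, messages


remaining, chosen, messages = negotiated_shrink([1, 2, 3, 4], 2, prefer_near=5)
print(remaining, chosen, len(messages))  # [1, 2] [4, 3] 4
```

Compared with the basic shrink, the application no longer decides which resources to give up; it only states how many, and stops the specific ones the scheduler names.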

Scheduler/Adaptive Application Interactions and Race Conditions

Since the scheduler and an adaptive application can both initiate separate dialogs simultaneously, a “race” condition can exist where both are trying to effect a change in the application’s resource allocation. Obviously only one can prevail, and it would be wise to decide in advance which of the two, the scheduler or the application, always prevails. Figure 7 illustrates simultaneous expand and grow dialogs where the application’s grow request prevails.

Adaptive - Race

Figure 7 – Simultaneous Expand and Grow Dialogs (Race Condition) Functional Flow Diagram
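One simple way to resolve such a race is a fixed rule about which initiator wins. The sketch below assumes that rule is a configuration value; `Initiator` and `arbitrate` are hypothetical names, and a real protocol would also need to tell the losing side that its dialog was rejected.

```python
from enum import Enum


class Initiator(Enum):
    SCHEDULER = "scheduler"
    APPLICATION = "application"


def arbitrate(pending, winner=Initiator.APPLICATION):
    """Decide which dialogs proceed when several are pending at once.

    pending -- list of (Initiator, operation) tuples
    Returns (proceeding, rejected).
    """
    initiators = {who for who, _ in pending}
    if len(initiators) < 2:
        return list(pending), []  # no race: everything proceeds
    proceeding = [d for d in pending if d[0] is winner]
    rejected = [d for d in pending if d[0] is not winner]
    return proceeding, rejected


race = [(Initiator.SCHEDULER, "expand"), (Initiator.APPLICATION, "grow")]
proceeding, rejected = arbitrate(race)
print(proceeding[0][1], rejected[0][1])  # grow expand
```

With the default `winner`, the application's grow request prevails over the scheduler's expand, matching the case shown in Figure 7; passing `winner=Initiator.SCHEDULER` flips the outcome.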

Need for a Standard Scheduler/Application Interaction API for Applications

Today with my wife I visited the National Museum of the Pacific War in Fredericksburg, Texas before the start of the Supercomputing 2015 conference/tradeshow in Austin, Texas. This most excellent museum described, showed, illustrated and explained the Pacific theater of World War II. One interesting part of the exhibits about each battle for an island was the “Lessons Learned” by the militaries of the Allies. Going through the exhibits I became aware that each progressive battle involved more and more cooperation between the different military branches of the United States.

At first the Navy, Marines, and later the Army and Army Air Corps (predecessor to the Air Force), each worked pretty independently (stay out of my sandbox) and battles were long and difficult. Later as the different branches of the military began to work together and coordinate their efforts, the battles generally seemed shorter and less difficult, although there were exceptions. In the end they seemed to be working fairly well together toward the best overall military outcome instead of the best outcome (control and bragging rights) for an individual branch of the military.

I believe this shows what can happen when different groups with vested interests in something work together for the greater good.

The HPC industry has passed the point where malleable and evolvable jobs are merely interesting; they are now necessary to take full advantage of HPC systems and get more science and engineering done more quickly. To this end, adaptive applications need to interact with schedulers in a standard manner so developers code such interactions only once. No developer wants to waste time and money coding the same or slightly different interactions with different schedulers, and several development groups have told me this personally.

It is time for the scheduler developers and vendors and the adaptive application developers and vendors to come together for the greater good of the HPC industry by developing and adopting a single standard API usable by adaptive applications to interact with schedulers for dynamic resource management.

PMIx Project

This blog has discussed basic scheduler/adaptive job interactions, why schedulers may want more flexible advanced interactions, and the need for a standard scheduler/adaptive job interaction API for applications and RTEs.

Ralph Castain of Intel started the open-source PMIx (Process Management Interface for eXascale) project (https://github.com/pmix) to solve scalability issues associated with exascale systems. Major technology and scheduler vendors collaborating in this open-source project are Intel, Mellanox (InfiniBand), IBM (Platform LSF), Adaptive Computing (Moab and TORQUE), and SchedMD (Slurm). The current collaborators expect other major technology and scheduler vendors and developers to join over time.

In the interest of disclosure, I participate in the PMIx project on behalf of Adaptive Computing and out of personal interest in seeing a standard API for resource management dialog with schedulers that adaptive application developers would use.

The PMIx project is not a part of any MPI project and has no overlap with MPI. It is an open-source project in its own right with the goal of providing exascale-capable software capabilities anyone can use for any purpose without restriction. Examples are packing/unpacking data transmitted and received as a “blob” to speed communications, “instant on” job startups measured in very few seconds instead of many minutes, extremely fast and efficient broadcast and collective operations, etc.

The PMIx project has as one of its goals a standard scheduler/adaptive application interaction API so adaptive application developers code to a single API only once, regardless of the scheduler an HPC site may use to schedule adaptive jobs.

To illustrate this goal with a current, very successful example, note that the Message-Passing Interface (MPI) API is a standard that is the same for all MPI-based applications. MPI library developers and vendors each develop their own implementation that adheres to the MPI API standard and thereby compete solely on their implementations, not on API differences, which makes life much easier for the developers of parallel applications that must communicate.

The goal for a standard scheduler/adaptive application API is to function for adaptive applications and their resource management interactions with all schedulers in the same manner as MPI does for applications and their message communications. This means adaptive application developers will use a single standard application-side API to interact with a scheduler for the purpose of adaptive resource management. Scheduler developers/vendors will implement the scheduler-side of the standard API with which the application-side API will interact and thereby compete based on their implementations and scheduler capabilities, not on API differences.
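The "one application-side API, many scheduler-side implementations" idea can be sketched with an abstract base class. To be clear, this is not the PMIx API or any proposed standard: `AdaptiveResourceAPI`, `request_grow`, `release`, and both backends are invented purely to show the shape of the arrangement.

```python
from abc import ABC, abstractmethod


class AdaptiveResourceAPI(ABC):
    """Hypothetical application-side API; each scheduler supplies a backend."""

    @abstractmethod
    def request_grow(self, count: int) -> int:
        """Ask for `count` more resources; return how many were granted."""

    @abstractmethod
    def release(self, count: int) -> int:
        """Give `count` resources back; return how many were reclaimed."""


class GreedyBackend(AdaptiveResourceAPI):
    """Toy scheduler that grants as much as it has free."""

    def __init__(self, free):
        self.free = free

    def request_grow(self, count):
        granted = min(count, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count
        return count


class CappedBackend(GreedyBackend):
    """Toy scheduler with a per-request cap (a different implementation policy)."""

    def __init__(self, free, cap):
        super().__init__(free)
        self.cap = cap

    def request_grow(self, count):
        return super().request_grow(min(count, self.cap))


def run_phase(api: AdaptiveResourceAPI, needed: int) -> int:
    """Application code written once against the API, not against a scheduler."""
    granted = api.request_grow(needed)
    # ... the application would do its extra work on `granted` resources here ...
    api.release(granted)
    return granted


print(run_phase(GreedyBackend(free=8), 5))         # 5
print(run_phase(CappedBackend(free=8, cap=2), 5))  # 2
```

`run_phase` never changes, yet the two backends behave differently, which is the sense in which vendors would compete on implementation and scheduler capability rather than on API differences.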

Malleable and Evolving Application/RTE Use Case Survey

The PMIx project desires first to understand the current adaptive applications’ and RTEs’ “interaction domain space” before attempting to collaboratively define a standard Scheduler and Malleable/Evolvable Application Dialog (SMEAD) API for adaptive resource management; i.e., the PMIx project wants to “look before it leaps”.

To help the PMIx project design a standard scheduler/adaptive application interaction or dialog API, we would like to know and understand the characteristics and types of interactions adaptive applications are currently capable of performing, will perform in the future, and/or desire to perform. In other words, we want to gather use case information relative to adaptive applications/RTEs and their resource management needs, desires, and interactions, which is the reason for a survey I have created on behalf of the PMIx project.

If you are a developer of adaptive applications or RTEs you can assist the PMIx project in gathering use case information by participating in the survey. If you are not a developer or associated with an adaptive application or RTE but know someone who is, please ask them to participate in the survey.

The survey is available at http://goo.gl/forms/lq85y3SkV3 (Google Form).

Conclusion

I hope you have enjoyed this 4-part blog series and learned a thing or two about adaptive job types and scheduler interactions and the need for a standard API for adaptive resource management dialogs. And if you have use case information for adaptive applications and desired resource management interactions with schedulers, you can contribute to the forward progress of the HPC industry by taking the survey.

Thank you for your attention and I hope to connect with you again in the future!
