Understanding Moab Scheduling: Part I

Moab Scheduling Cycle

With MoabCon, Adaptive Computing’s yearly user conference, just around the corner, I thought I’d revisit the subject of a well-received talk I gave two years ago at the conference. This will be done in three parts covering the Moab scheduling cycle, the proper use of mdiag -S and finally a simple example of scheduling in action.

Hopefully, this will give you a better understanding of how Moab does scheduling and how the different policies affect its behavior.

The Scheduling Cycle

Moab’s main task is to schedule workload. This is its number one priority at all times. In order to do this, it goes through a series of steps called the scheduling cycle. While there are technically over ten steps in this process, we are going to simplify things by categorizing all of them into five main steps. Each of these will be discussed separately below.

The Polling Interval

Before we get into the steps, however, there is one other important topic to quickly cover. If you take a look in your moab.cfg file, you will likely find a line that looks like the following:

RMPollInterval          30

or…

RMPollInterval          5,30

This is the RMPollInterval, which many believe is used to specify how long a full cycle is supposed to take. This assumption is mostly correct, but let’s get into what it really means.

The RMPollInterval can be specified as either one or two numbers, both in seconds. If only one number is specified, it represents the maximum amount of time Moab will wait before forcing a new iteration to start. This is important, as many people think this single number sets a fixed length for each iteration. That is technically incorrect, though on a properly balanced system it is the behavior one would expect to see. The reason it is incorrect is that several different events can cause an iteration to start early. For example, if an administrator issues an mschedctl -r command, a new cycle is forced to start regardless of the RMPollInterval setting.

When two numbers are specified, the second has the same meaning as in the single-number case: the maximum wait before a new iteration is forced. The first number represents the minimum time allowed for a scheduling iteration. On some systems, the events discussed above that can trigger an early iteration become a problem if they happen too often. In those cases, the system gets bogged down constantly scheduling because it is always starting a new cycle, and it becomes unresponsive to user requests. By adding a minimum time to RMPollInterval, the administrator is telling Moab that a cycle must take at least that long, even if an event attempts to start a new one early.
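
To make the interplay between the two values concrete, here is a minimal sketch of the decision in Python. It assumes a simplified model of the scheduler; the names (min_poll, max_poll, event_pending) are illustrative only and are not Moab internals.

def should_start_iteration(elapsed, event_pending, min_poll=5, max_poll=30):
    """Decide whether a new scheduling iteration may begin.

    elapsed       -- seconds since the current iteration started
    event_pending -- True if something (e.g. an 'mschedctl -r') asked for an early cycle
    """
    if elapsed >= max_poll:
        return True                      # never wait longer than the maximum
    if event_pending and elapsed >= min_poll:
        return True                      # an early start is allowed once the minimum has passed
    return False

print(should_start_iteration(elapsed=3, event_pending=True))    # False: minimum not yet reached
print(should_start_iteration(elapsed=12, event_pending=True))   # True: event allowed after the minimum
print(should_start_iteration(elapsed=31, event_pending=False))  # True: maximum reached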

Now that we’ve talked about the time of the cycle, let’s look at the individual stages and what happens during them.

Step 1. Update Information from Resource Managers

The very first thing Moab does during a scheduling cycle (also known as a scheduling iteration) is attempt to get a coherent understanding of the world around it. The more accurate the information it has, the better scheduling decisions it can make. In order to do this, Moab contacts each of its resource managers. So important is this information that the calls to the resource managers are actually blocking: Moab will do nothing else until it has contacted the resource managers or a predefined timeout is reached. It's that important.
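
As a rough illustration of this "block until every resource manager has answered or timed out" behavior, consider the sketch below. The function, the timeout value, and the resource manager names are placeholders for illustration, not Moab's actual interfaces.

import time

RM_TIMEOUT = 30  # hypothetical per-call timeout, in seconds

def query_resource_manager(rm_name, timeout):
    """Placeholder for a blocking query to a single resource manager."""
    # A real implementation would contact the actual resource manager here
    # and honor the timeout.
    return {"rm": rm_name, "polled_at": time.time()}

def update_cluster_state(resource_managers):
    """Step 1: build a coherent picture of the cluster before anything else happens."""
    state = {}
    for rm in resource_managers:
        # Blocking: the loop does not move on until this resource manager
        # answers or its timeout is reached.
        state[rm] = query_resource_manager(rm, timeout=RM_TIMEOUT)
    return state

print(update_cluster_state(["torque", "license-manager"]))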

Step 2. Manage Workload

In this next step, Moab decides which jobs to start, in what order, and where to place them. Essentially, this is the core of scheduling.

Job Ordering

Jobs in Moab are ordered according to their priority. System administrators are able to determine which factors are taken into account when this priority is calculated. Additionally, different weights can be applied to these factors, which allows administrators to target specific factors as being more important than others.

These priority factors utilize a two-tier system, where all factors (sub-components) are grouped into categories (components). Weights can be applied at both tiers. The following table lists the different components and sub-components available. See the documentation for specifics.

Components                Sub-Components
Job Credentials           User, Group, Account, QoS, Class
Fairshare Usage           FSUser, FSGroup, FSAccount, FSQoS, FSClass, FSGUser, FSGGroup, FSGAccount, FSJPU, FSPPU, FSPSPU, WCAccuracy
Requested Job Resources   Node, Proc, Mem, Swap, Disk, PS, PE, Walltime
Current Service Levels    QueueTime, XFactor, Bypass, StartCount, Deadline, SPViolation, UserPrio
Target Service Levels     TargetQueueTime, TargetXFactor
Consumed Resources        Consumed, Remaining, Percent, ExecutionTime
Job Attributes            AttrAttr, AttrState, AttrGres

Each time through the scheduling cycle, Moab will use the values of the selected components, sub-components and weights to calculate a numeric priority value for each one of the jobs. The following is the mathematical function used:

Σ (component-weight) * (sub-component-weight) * (sub-component-value)
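
To see how the formula plays out, here is a toy calculation in Python. The chosen components, weights, and values are invented for illustration and do not correspond to a real moab.cfg.

component_weights = {"Current Service Levels": 1, "Job Credentials": 10}

# (component, sub-component, sub-component weight, value for this job)
sub_components = [
    ("Current Service Levels", "QueueTime", 1,   120),
    ("Current Service Levels", "XFactor",   100, 1.5),
    ("Job Credentials",        "QoS",       5,   3),
]

priority = sum(component_weights[comp] * sub_weight * value
               for comp, _sub, sub_weight, value in sub_components)

print(priority)  # 1*1*120 + 1*100*1.5 + 10*5*3 = 420.0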

Moab then orders the jobs by their calculated priority score, from highest to lowest. This is the order in which it will attempt to start them. It begins at the top of the list and moves down, starting each job it can. Once it reaches a job that can't currently be started, a reservation is created for that job, Moab switches to backfill mode, and it continues working its way down the list.
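
The following sketch captures that ordering pass in simplified form. The job fields and the pass itself are illustrative; real backfill decisions also depend on walltime estimates and the reservation's start time, which are omitted here.

def schedule_pass(jobs):
    started, reserved = [], None
    for job in sorted(jobs, key=lambda j: j["priority"], reverse=True):
        if reserved is None:
            if job["can_start"]:
                started.append(job["name"])
            else:
                reserved = job["name"]       # highest-priority blocked job gets a reservation
        elif job["can_start"] and job["fits_backfill"]:
            started.append(job["name"])      # backfill: must not delay the reserved job
    return started, reserved

jobs = [
    {"name": "A", "priority": 500, "can_start": True,  "fits_backfill": True},
    {"name": "B", "priority": 400, "can_start": False, "fits_backfill": False},
    {"name": "C", "priority": 300, "can_start": True,  "fits_backfill": True},
]
print(schedule_pass(jobs))   # (['A', 'C'], 'B')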

For jobs Moab decides to start, it then needs to decide where to place them.

Job Placement

Job placement is determined through a simple process of two filters and a sort. Basically, it does the following:

  1. Start with all nodes in the cluster.
  2. Filter 1 – Geometry Check: All nodes that cannot physically run the job are removed from consideration (e.g., too few cores or memory).
  3. Filter 2 – Policy Check: All nodes that cannot run the job because of policy are removed from consideration (e.g., reservations or other running jobs).
  4. Sort: Using the Node Allocation Policy, the nodes are sorted and the top results are chosen.
  5. The job is sent to the selected nodes.

This process is then repeated for each job that can be started. A simplified sketch of the idea follows.
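
Here is a rough Python sketch of the two filters and the sort. The node fields and the "tightest fit first" sort are stand-ins for whatever Node Allocation Policy is configured; they are not Moab's actual implementation.

def place_job(job, nodes):
    # Filter 1 - geometry: drop nodes that physically cannot run the job.
    candidates = [n for n in nodes
                  if n["cores"] >= job["cores"] and n["mem_gb"] >= job["mem_gb"]]

    # Filter 2 - policy: drop nodes blocked by reservations or other jobs.
    candidates = [n for n in candidates if not n["blocked"]]

    # Sort: order the remaining nodes by the allocation policy and take the top ones.
    candidates.sort(key=lambda n: n["cores"] - job["cores"])   # tightest fit first
    return candidates[:job["node_count"]]

nodes = [
    {"name": "n01", "cores": 16, "mem_gb": 64, "blocked": False},
    {"name": "n02", "cores": 8,  "mem_gb": 32, "blocked": True},
    {"name": "n03", "cores": 8,  "mem_gb": 64, "blocked": False},
]
job = {"cores": 8, "mem_gb": 32, "node_count": 1}
print([n["name"] for n in place_job(job, nodes)])   # ['n03']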

Step 3. Refresh Reservations

Following workload management, Moab takes a look at each reservation in the system and updates it as necessary.

Step 4. Update Statistics

Moab keeps some statistics internally, and these are updated at this point in the cycle.

Step 5. Handle User Requests

In the final step of the scheduling iteration, Moab waits for and handles user requests. These include blocking command-line programs and interactions with other external systems.

In the case where Moab does not have enough time to accomplish everything within the specified poll interval, it is this final step that is sacrificed. In other words, Moab ensures the first four steps are always completed, while this last one is optional. The practical effect is that user commands begin timing out as a problem starts to develop. Users of the system tend to notice this, which gives administrators a chance to investigate and correct the problem before it affects Moab's number one priority: the scheduling of workload.

Conclusion

Hopefully this has been a useful overview of Moab’s scheduling iteration.

In a future blog post, we will continue to explore some of the ways one can use the scheduling cycle to understand what is happening with the scheduler and ways to diagnose problems.
