Using Moab Job Priorities – Creating a Prioritization Strategy

This entry is part 1 of 3 in the series Using Moab Job Priorities

Just over a year ago, I wrote my first post for this blog. It was (only in my opinion) a quaint little flight of fancy dealing with job prioritization in the world of TRON. Today, I want to be a little more grounded. Let’s talk about how Moab calculates job priorities, or in other words, how Moab interprets job priority configuration to create a prioritization strategy.

The Scheduling Cycle

[Figure: The Moab scheduling cycle]

In the recent “Understanding Moab Scheduling” series, I briefly discussed the topic of job prioritization.

Job prioritization happens as part of the second stage in the Moab scheduling cycle. In review, those five stages are:

  • Update Information from Resource Managers
  • Manage Workload
  • Refresh Reservations
  • Update Statistics
  • Handle User Requests

As the first part of the second stage, job priorities are continually reevaluated: every scheduling iteration recalculates the priorities of all running jobs and all jobs considered eligible. Let’s look at exactly how that happens.

The Mathematics

There are 41 different attributes (known as sub-components) that can be used for calculating job priorities. Only one of these, QueueTime, is enabled by default. So, without any configuration, Moab’s default behavior is that of a FIFO scheduler with backfill enabled. Each of these sub-components has a numerical value that is recalculated every scheduling iteration in which it is used.

These sub-components are then divided into seven different logical buckets (known as components), as can be seen in the table below:

Components                       Sub-Components
Job Credentials (CRED)           User, Group, Account, QoS, Class
Fairshare Usage (FS)             FSUser, FSGroup, FSAccount, FSQoS, FSClass, FSGUser, FSGGroup, FSGAccount, FSJPU, FSPPU, FSPSPU, WCAccuracy
Requested Job Resources (RES)    Node, Proc, Mem, Swap, Disk, PS, PE, Walltime
Current Service Levels (SERV)    QueueTime, XFactor, Bypass, StartCount, Deadline, SPViolation, UserPrio
Target Service Levels (TARGET)   TargetQueueTime, TargetXFactor
Consumed Resources (USAGE)       Consumed, Remaining, Percent, ExecutionTime
Job Attributes (ATTR)            AttrAttr, AttrState, AttrGres

To enable or “turn on” a specific sub-component for job priority calculation, entries must be placed in the moab.cfg file setting the associated component and sub-component weights to be non-zero. For example, let’s say we want to disable QueueTime and replace it with XFactor. Both are part of the Current Service Level component (SERV). So, the entries in moab.cfg would be the following:

# Enable the SERV component, disable QueueTime, and enable XFactor
SERVICEWEIGHT    1
QUEUETIMEWEIGHT  0
XFACTORWEIGHT    1

The calculation of the job priority is done using the following function:

Priority = Σₙ ( Cwₙ × Swₙ × Svₙ )

Where,

  • Cwₙ — Component weight for the nth attribute
  • Swₙ — Sub-Component weight for the nth attribute
  • Svₙ — Sub-Component value for the nth attribute

For each running and eligible job, Moab goes through all of the sub-components and adds up the Component-Weight x Sub-Component-Weight x Sub-Component-Value products, resulting in a numeric value capped at 1,000,000,000 (one billion). So, using the above configuration (XFactor only), the job’s priority would be:

Priority = 1 x 1 x XFactorValue

The first 1 comes from SERVICEWEIGHT and the second from XFACTORWEIGHT.

Using this numeric score, Moab then orders the jobs from the highest number (priority) to the lowest. In other words, a priority score of 1 is probably very low.
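To make that arithmetic concrete, below is a minimal Python sketch of the weighted sum, wired up with the XFactor-only configuration from the moab.cfg example above. The job names and XFactor values are invented for illustration; this is a sketch of the calculation, not Moab’s actual implementation.

# Illustrative sketch of Moab's priority arithmetic (not Moab source code).
# The weights mirror the moab.cfg example above; the per-job sub-component
# values are invented for demonstration.

PRIORITY_CAP = 1_000_000_000  # Moab caps the final score at one billion

# weights[component] = (component weight, {sub-component: sub-component weight})
weights = {
    "SERV": (1, {"XFACTOR": 1, "QUEUETIME": 0}),
}

def job_priority(sub_values):
    """Sum component weight x sub-component weight x sub-component value."""
    total = 0
    for comp_weight, subs in weights.values():
        for name, sub_weight in subs.items():
            total += comp_weight * sub_weight * sub_values.get(name, 0)
    return min(total, PRIORITY_CAP)

# Two hypothetical jobs; XFactor tends to grow faster for short jobs that
# have been waiting, which is why it "favors" them.
jobs = {
    "short_job": {"XFACTOR": 5.0, "QUEUETIME": 600},
    "long_job":  {"XFACTOR": 1.2, "QUEUETIME": 600},
}

# Order jobs from highest priority to lowest, as Moab does each iteration.
for name, prio in sorted(((j, job_priority(v)) for j, v in jobs.items()),
                         key=lambda item: item[1], reverse=True):
    print(f"{name}: priority {prio}")

With only XFACTORWEIGHT non-zero, each job’s score reduces to 1 x 1 x its XFactor value, matching the worked example above.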

A Little Experiment

As this is simply mathematics, it is possible to create very specific, specialized and complex functions for calculating priority. In other words, it gives one very fine-grained control. However, I was curious to see how these were actually being used at our customer sites.

I decided to do a little (non-scientific) experiment.

As part of our support process, customers have the opportunity to upload a “snapshot” of their system configuration. Part of this is the mdiag -p output, which contains their configured Component and Sub-Component weights. Going back through the archive, I was able to identify 119 unique systems and extract what they had configured. The results were interesting, and are presented below (ordered from most used to least used):

Component Sub-Component Usage % Using
SERV QUEUETIME 110 92.44%
CRED CLASS 62 52.10%
CRED QOS 53 44.54%
FS FSUSER 46 38.66%
SERV XFACTOR 29 24.37%
CRED USER 28 23.53%
RES PROC 23 19.33%
FS FSGROUP 21 17.65%
FS FSACCOUNT 19 15.97%
FS FSCLASS 13 10.92%
CRED GROUP 12 10.08%
CRED ACCOUNT 9 7.56%
SERV USERPRIO 9 7.56%
FS FSQOS 7 5.88%
RES NODE 7 5.88%
TARGET TARGETQUEUETIME 5 4.20%
ATTR ATTRATTR 5 4.20%
FS WCACCURACY 4 3.36%
FS FSPPU 3 2.52%
RES MEM 3 2.52%
SERV BYPASS 2 1.68%
ATTR ATTRSTATE 2 1.68%
FS FSGUSER 1 0.84%
FS FSGGROUP 1 0.84%
FS FSJPU 1 0.84%
RES PE 1 0.84%
FS FSGACCOUNT 0 0.00%
FS FSPSPU 0 0.00%
RES SWAP 0 0.00%
RES DISK 0 0.00%
RES PS 0 0.00%
RES WALLTIME 0 0.00%
SERV STARTCOUNT 0 0.00%
SERV DEADLINE 0 0.00%
SERV SPVIOLATION 0 0.00%
TARGET TARGETXFACTOR 0 0.00%
USAGE CONSUMED 0 0.00%
USAGE REMAINING 0 0.00%
USAGE PERCENT 0 0.00%
USAGE EXECUTIONTIME 0 0.00%
ATTR ATTRGRES 0 0.00%


Let’s take a quick look at some of the most heavily used sub-components from the “survey.” Again, because of the data-gathering approach, this isn’t scientific, but it is interesting. It should also be noted that the weights themselves are not factored into this data, only whether or not the sub-component is being used.

QUEUETIME In my opinion, there really isn’t a whole lot of surprise here. QUEUETIME is the only one of the attributes that is turned on by default. Naturally, this results in a high position on this list. What is interesting is that nine systems have gone out of their way to turn it off. My guess is that they, like the example above, have swapped it out for XFACTOR.
CLASS Many traditional HPC schedulers are heavily queue based. As such, I wasn’t too surprised to see CLASS, which is just another name for a queue, this high in the list. Many of us in the industry, admins and users alike, are comfortable thinking about jobs in terms of queues. Moab’s priority system is very flexible, allowing this traditional mindset to be readily modeled.
QOS I’m fairly certain this one earned its position in the list for several different reasons, including QoS’s ability to easily modify the priorities given by Classes inside of Moab. End-users use the basic queuing facilities of CLASS and then modify them in some way through QOS. It makes sense.
FSUSER Here we have our first Fairshare entry in the list. Fairshare is a great way to softly balance the cluster with different usage targets. I’m not overly surprised the most common approach is to choose User as the credential for doing the balancing.
XFACTOR To be honest, I was a little surprised by the number of sites using XFACTOR. However, in retrospect, as it is similar to QUEUETIME, but favors short (i.e., short wallclock limit) jobs, it does make sense on certain large systems.
USER Again, like with Fairshare, it appears the most common credential on which to base priorities is USER. No surprises here.
PROC Finally, we have PROC. This is almost certainly being used to favor large (i.e., many-processor) jobs. The idea here is that if one places the large jobs first, it leaves open spaces/holes that can then be backfilled by the smaller jobs (see the sketch just below). It is similar to Stephen R. Covey’s “The Big Rocks of Life” analogy.
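As a toy illustration of that effect, here is a small Python sketch showing how a non-zero PROC weight pushes wide jobs to the front of the queue. The weights and processor counts are invented, not taken from a real site configuration.

# Toy illustration of favoring large jobs via the RES component's PROC
# sub-component (invented weights and processor counts, not a real setup).
RES_WEIGHT = 1
PROC_WEIGHT = 10

jobs = {"wide_job": 256, "narrow_job": 4}  # requested processor counts

priorities = {name: RES_WEIGHT * PROC_WEIGHT * procs
              for name, procs in jobs.items()}

# The 256-proc job scores 2560 versus 40, so it is placed first and the
# narrow job becomes a candidate to backfill the leftover holes.
for name, prio in sorted(priorities.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: priority {prio}")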


We could go on through the list, but I think that’s sufficient for now.

What’s Next?

This is the first in a three-part series on Moab’s job prioritization. Part II will deal with using mdiag -p and understanding its output. Part III will then cover some of the less-used and less-understood sub-components in more detail.

Until next time…
