Using Moab Job Priorities – Exploring Priority Sub-Components

This entry is part 3 of 3 in the series Using Moab Job Priorities

In this third and final installment on Moab job prioritization, we are going to explore several job priority sub-components I feel are often overlooked when people are building their job prioritization strategy. However, being a firm believer in the power of simplicity, I by no means suggest one should go and add all of these into their job priority function.

Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius—and a lot of courage—to move in the opposite direction.
~ E.F. Schumacher

With that being said, I do believe there are some potential gems here that could benefit many HPC administrators, as they can solve specific problems that are not uncommon.

Let’s dive in…

USERPRIO — User-Specified Priority


Administrators are naturally (and understandably) quite wary of allowing users to pick their own priority. Not surprisingly, given the choice, I would add as many priority points to my jobs as I would be allowed. That’s just human nature.

But, what about allowing users to self-lower their job’s priority?

That’s exactly what the USERPRIO sub-component is designed to accomplish: it allows users to lower the priority of their own jobs by up to 1024 points. Why?

Well, the thought process behind this is fairly simple. Any given user is expected to have multiple jobs, and those jobs will likely not all be of equal importance. Some jobs simply matter more than others. In other words, users want the ability to self-organize, or self-order, their own jobs.

Normally, there really isn’t an easy way to do this. True, they can space their job submissions out in time if the QUEUETIME sub-component is enabled (which it is by default). But, waiting around on a Friday night to “manually” order their own workload isn’t such a great solution. They most certainly have somewhere else they’d rather be.

Enter USERPRIO. With USERPRIO enabled in the moab.cfg file, users are able to order their jobs by subtracting priority points. As they can subtract up to 1024 points, it is important that the priority spread (i.e., highest priority – lowest priority) is sufficiently large that a 1024-point shift isn’t too drastic. For example, if the highest priority of any job in the system is 5000, a shift of 1024 points is massive. If the highest priority is 500,000, then it’s likely much less of an issue.
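
For reference, here is a minimal sketch of what enabling this might look like in the moab.cfg file. The weight values are illustrative, not recommendations; as I understand it, a non-zero USERPRIOWEIGHT is what switches the sub-component on:

# moab.cfg
SERVICEWEIGHT   1
USERPRIOWEIGHT  10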

Users are able to supply their priority modification through the use of the -p switch for msub, as seen below:

> msub -p -100 job_script.sh

In this example, the user is submitting their job and requesting that 100 points be permanently subtracted from the job’s priority. It is likely the user has other more pressing jobs they wish to run on the system. However, at the same time, they want to get this one, which they are currently working on, queued and ready to go at some later point.

As one final side note, for those who wish to also grant users the ability to add priority, setting ENABLEPOSUSERPRIORITY to TRUE in the moab.cfg file will allow them to add up to 1023 positive priority points. Happy day for the user!
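
For the curious, a quick sketch of both pieces together (values illustrative):

# moab.cfg
ENABLEPOSUSERPRIORITY  TRUE
USERPRIOWEIGHT         1

The user can then request extra points at submission time:

> msub -p 500 job_script.sh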

As always, setting SERVICEWEIGHT or USERPRIOWEIGHT to a value other than 1 will change the effective range, as these weights are multiplied by the user-supplied value. For example, with SERVICEWEIGHT 2 and USERPRIOWEIGHT 5, a job submitted with msub -p -100 would lose 2 × 5 × 100 = 1,000 priority points.

XFACTOR — Expansion Factor

In many cases, the QUEUETIME sub-component is sufficient for priority related to how long a job has waited in the queue. However, some administrators replace or supplement it with another of the sub-components: XFACTOR.

Designed to better align queue wait time with the run length of a job, XFACTOR causes the scheduler to favor shorter jobs (in terms of their declared walltime) in the job priority calculation. Traditionally, XFACTOR has been calculated as follows:

XFactor = 1 + (QueueTime / ExecutionTime)

Unfortunately, this function actually has a few issues:

  • QueueTime – An unmodified “time in queue” metric is subject to certain undesirable user actions, such as queue stuffing.
     
  • ExecutionTime – A job’s true execution time isn’t known until the job completes—far too late to use in a priority calculation.

Consequently, Moab uses a slightly modified function:

XFactor = 1 + (EffQueueTime / WallClockLimit)

Under this function, the terms are defined as follows:

  • EffQueueTime – The effective time in the queue is the amount of time where the job was actually eligible to run (i.e., wasn’t blocked by some policy or usage limit). See JOBPRIOACCRUALPOLICY for additional configuration options.
     
  • WallClockLimit – This is the declared run time for the job. Naturally, this can also be gamed, though the configuration of WCVIOLATIONACTION or WCACCURACYWEIGHT can help control users attempting to take advantage of this.
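
To make this concrete, a minimal sketch of enabling XFACTOR alongside the default QUEUETIME contribution might look like this in the moab.cfg file (weights illustrative only):

# moab.cfg
SERVICEWEIGHT    1
QUEUETIMEWEIGHT  1
XFACTORWEIGHT    100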

To get an idea of how this works, let’s take a look at a table showing two different jobs, one requesting a run time of 1 hour and the other 4 hours. The value for XFACTOR will then be shown for queue wait times of 1, 2, 4, 8 and 16 hours.

             | 1 Hour Wait        | 2 Hour Wait        | 4 Hour Wait        | 8 Hour Wait        | 16 Hour Wait
  1 Hour Job | 1 + (1 / 1) = 2.00 | 1 + (2 / 1) = 3.00 | 1 + (4 / 1) = 5.00 | 1 + (8 / 1) = 9.00 | 1 + (16 / 1) = 17.00
  4 Hour Job | 1 + (1 / 4) = 1.25 | 1 + (2 / 4) = 1.50 | 1 + (4 / 4) = 2.00 | 1 + (8 / 4) = 3.00 | 1 + (16 / 4) = 5.00

It can clearly be seen that the longer job does not accrue priority through XFACTOR nearly as fast as the shorter job, and the gap only widens as the wait time increases.

It should be noted that XFACTOR grows without bound as a job waits. For jobs with a very, very short declared run time, the small denominator makes the value climb steeply, which can cause undesirable behavior. In these cases, there are two other parameters that should be considered.

  • XFACTORCAP – The maximum value the XFACTOR sub-component can contribute to the calculation, which is then multiplied by the component and sub-component weights.
     
  • XFMINWCLIMIT – The minimum value allowed for WallClockLimit. If the declared value is less than XFMINWCLIMIT, then XFMINWCLIMIT will be used instead of the user-supplied value.

Both of these parameters allow for better control of the job priority values.
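
For example, one might rein XFACTOR in with something like the following in the moab.cfg file (values illustrative):

# moab.cfg
XFACTORWEIGHT  100
# cap the sub-component's value at 1000 points
XFACTORCAP     1000
# treat any declared walltime under 5 minutes as 5 minutes
XFMINWCLIMIT   00:05:00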

So, XFACTOR can be an effective supplement or replacement for QUEUETIME as the priority sub-component responsible for providing priority based on the wait time in the queue.

WALLTIME — Requested Walltime

As mentioned above, XFACTOR grows continually as a job waits in the queue. This can be beneficial in many cases, but what happens when one only wants to apply priority based on the expected run time of the job, independent of how long it has been in the queue? The WALLTIME sub-component is perfect for that situation.

Using this sub-component, one can add or subtract priority points based solely on the expected run time of the job. Because there is no scaling based on queued time, it is easier to use WALLTIME in cases where points are being subtracted instead of being added. So, for example, if an administrator wants to favor “short” jobs, they have two options:

  • Add points to jobs with a short expected run time
  • Remove points from jobs with a long expected run time

Either way works just fine. The setup of the other job priorities in the system will inform the choice as to which approach to take.
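
As a sketch, the subtraction approach might look like this; note that WALLTIME sits under the resource (RES) component, so RESWEIGHT is the multiplier that applies (values illustrative):

# moab.cfg
RESWEIGHT       1
# negative weight: the longer the requested walltime, the bigger the penalty
WALLTIMEWEIGHT  -1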

PE — Requested Processor-Equivalents

HPC jobs are diverse. Some are compute intensive, while others are memory hogs. The concept of a processor-equivalent (PE) arose from the need for a common unit to determine how much of a node is effectively being used. For example, if a job uses only one processor but all the memory on a node, it effectively uses the entire node, even though three processors sit unused. The approach is to translate each request into a number of processor-equivalents, representing how much of the overall node would be consumed if all of its resources were divided evenly among its individual processors. This is important to Moab, which inherently sees the world through a processor-centric paradigm.

The actual calculation is as follows:

PE = MAX( (ProcsRequestedByJob  / TotalConfiguredProcs ),  
  (MemoryRequestedByJob / TotalConfiguredMemory),  
  (DiskRequestedByJob   / TotalConfiguredDisk  ),  
  (SwapRequestedByJob   / TotalConfiguredSwap  ) ) × TotalConfiguredProcs

Any attribute that isn’t in the system will be skipped to avoid a divide-by-zero error.

Let’s do a sample calculation. Let’s assume we have a 100-node cluster. Each node is a single quad-core with 8GB memory. We’ll assume a job comes in requesting 2 processors and 6GB memory:

3 = MAX( (   2 / (100 × 4   )),  
  (6144 / (100 × 8192)),  
  (      Skipped      ),  
  (      Skipped      ) ) × (100 × 4)

This makes sense. The job is using ¾ of a node’s memory, so it is using 3 PE (i.e., ¾ of the resources on a four-processor node).

From a job priority point of view, PE can be a wiser choice than some of the other sub-components when trying to obtain better system usage through leveraging backfill. The idea basically follows Stephen R. Covey’s “The Big Rocks of Life” analogy. By placing (via prioritization) large jobs first, the overall system achieves better utilization because the small jobs can then backfill in around the already-placed large ones.

The question then becomes how one defines largeness. While one can use PROC or NODE for this, PE gives the most well-rounded view of the largeness of a job. It is a good sub-component to consider if one wishes to use this particular strategy to improve utilization.
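
A minimal sketch of that strategy (PE also sits under the resource component; weights illustrative):

# moab.cfg
RESWEIGHT  1
# larger jobs (by PE) rise to the top; smaller jobs backfill around them
PEWEIGHT   10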

SPVIOLATION — Soft Policy Violation

An extension of one’s usage limits policy set, SPVIOLATION adds one additional layer of control regarding what to do when a job is in a soft policy violation state. Essentially, if the job is in soft policy violation, this sub-component will have a value of 1 (otherwise 0). With an appropriate sub-component weight (generally negative in value), an administrator can impose a job priority penalty for the duration of the violation.
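
For example, something like the following would dock violating jobs a flat penalty for as long as the violation lasts (value illustrative):

# moab.cfg
SERVICEWEIGHT      1
# sub-component value is 1 during a violation, so this subtracts 5000 points
SPVIOLATIONWEIGHT  -5000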

DEADLINE — Deadline Proximity

This particular sub-component is mainly targeted for use on what I’ll describe as a “closed” system, meaning the end-users are not able to directly (i.e., via the command line or web services) submit jobs to the cluster. Instead, jobs are submitted into the cluster by automated systems. The reason it’s mainly used in this circumstance should quickly become clear.

If the DEADLINE sub-component is enabled, it allows jobs to have a specified deadline. From a job priority point of view, as this deadline gets closer, this sub-component’s value will increase linearly. Below is an example of submitting a job with a relative deadline (two hours after submission):

> msub -l deadline=2:00:00 job_script.sh

The following example is for a job with a deadline of July 20, 2015 at 9:30 AM:

> msub -l deadline=09:30:00_07/20/15 job_script.sh

Clearly, if end-users can arbitrarily set a deadline on their jobs, the potential for abuse exists. Hence, this particular sub-component is generally only used on “closed” systems. Perhaps you have one.
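
If you do, a sketch of weighting the sub-component might look like this (value illustrative):

# moab.cfg
SERVICEWEIGHT   1
DEADLINEWEIGHT  100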

CONSUMED & REMAINING — Active Resource Usage

For the final entry here, we are going to actually look at two sub-components: CONSUMED and REMAINING. These are both part of the USAGE priority component, which is unique in that it only affects the priority of active, running jobs. One may ask why one should care about priority after the job has started. One word:

Preemption.

If preemption is enabled on the system, Moab takes more into account than just whether or not a job is a preemptor or a preemptee. When Moab considers which job a preemptor will preempt, it compares the priority of the related jobs. The following rules apply:

  • A preemptor can only preempt a job that has a lower priority than itself.
  • When multiple preemptees are available, the one with the lowest priority should be chosen.

These rules can lead to some interesting consequences. One of the most common is seeing a job that is nearly complete get preempted instead of a job that has just started. By the above rules, this behavior is expected, but it may not be desirable. The CONSUMED and REMAINING priority sub-components are part of a set that, when weighted appropriately, can make a job a less desirable preemption target as it nears completion. The whole set of usage-based job priority sub-components is worth a quick review.
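
As a rough sketch, a weighting like the following leans in that direction; the signs are chosen so a running job’s priority climbs as it progresses (values illustrative and worth testing against your own preemption setup):

# moab.cfg
USAGEWEIGHT           1
# reward resource-time already consumed
USAGECONSUMEDWEIGHT   10
# negative weight: the penalty shrinks as the remaining time drops
USAGEREMAININGWEIGHT  -10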

Tying Everything Together

There are many job priority sub-components. Each is designed to adjust job priorities in order to meet different organizational goals. The best approach is to pick only a handful that make the most sense for what your organization is trying to accomplish. These should then be ranked by importance, with the weights set accordingly.

Hopefully, this post and its series on Moab job priorities have provided you with some deeper insight into how you can fine-tune your cluster. Take some time to review your job prioritization strategy and the tools that are available to help you accomplish it.

Good luck, and happy computing!
