Using Moab Scheduling to Limit Downtime by Creating a Rolling Update Strategy for Applications and Libraries

We are often asked, “What are the best practices for updating applications or libraries on the compute nodes of my cluster?” Typically, a customer’s plan involves scheduling downtime on nodes through a complex set of reservations, draining the nodes, and then manually updating the desired software. Moab uses these reservations to make sure a set of nodes is idle; the administrator then updates the applications as needed and manually releases the reservation. From there, the administrator moves on to the next set of nodes.

This approach works, but it is neither efficient nor simple, and it puts a heavy burden on the administrator. Luckily, a little-known feature in Torque and Moab goes a long way toward automating this process.

To initiate a rolling upgrade of your application/library, take advantage of the varattr parameter in Torque. A varattr is similar to the widely used Moab Node Feature or Torque Node Property. A Node Feature in Moab is a static parameter: a user submits a job and requests a feature, which directs the job to specific types of nodes. For example, nodes that are faster than the normal nodes might carry a feature called “Fast”. If a user requests the “Fast” feature, the job will only run on a “Fast” node. In Torque, Node Features are called Node Properties. If set in Torque, the properties are imported into Moab as Node Features.

A varattr is also a way of directing jobs to sets of nodes based on special attributes assigned to a node. What differentiates it from a Node Feature is that a varattr is dynamic: the attributes set on a node can change over time. It is configured in the Torque MOM config file, where it names a script to execute at a configurable interval. The output of the script defines the attribute.

 

How does varattr fit into a rolling application/library upgrade strategy? Let’s walk through the solution. The documentation for varattr, which provides another example, is found here:

http://docs.adaptivecomputing.com/9-0-0/enterprise/MWM/help.htm#topics/moabWorkloadManager/topics/resourceManagers/rmextensions.html#dynamicFeatures

  1. Create a script that queries the version of the application/library, and configure the varattr in each MOM’s config file.
    • For example, we query Matlab every 30 seconds to discover its version.
    • $varattr 30 /var/spool/torque/matlab_version.py
    • The script outputs the version of Matlab: “matlab=8.1”
    • The value of the varattr appears in pbsnodes -a and checknode.
  2. When a job is submitted requesting the attribute, the job will go to a node where that attribute is set.
    • > qsub -l reqattr=matlab=8.1 matlab_update.sh
  3. To do the “rolling” upgrade of Matlab, submit a separate job for every node in the system.
    • If we have 500 nodes, we submit 500 jobs.
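The varattr script from step 1 could look something like the sketch below. This is an illustration under stated assumptions, not the script from a real deployment: it assumes the Matlab install records its version in a plain-text file, and the path /opt/matlab/VERSION and the helper names parse_version and varattr_line are hypothetical. Only the output format, a single line such as matlab=8.1, comes from the steps above.

```python
#!/usr/bin/env python3
"""Hypothetical varattr script: print the installed Matlab version as name=value."""
import re
from pathlib import Path

# Assumption: the install writes its version string to this file.
# Replace with however your site actually discovers the version.
VERSION_FILE = Path("/opt/matlab/VERSION")


def parse_version(text):
    """Return the first major.minor number found in text, or None."""
    match = re.search(r"\d+\.\d+", text)
    return match.group(0) if match else None


def varattr_line(text, name="matlab"):
    """Build the 'name=version' line, or None if no version was found."""
    version = parse_version(text)
    return f"{name}={version}" if version else None


def main():
    try:
        text = VERSION_FILE.read_text()
    except OSError:
        return  # nothing printed, so no attribute is reported this interval
    line = varattr_line(text)
    if line is not None:
        print(line)


if __name__ == "__main__":
    main()
```

Saved as /var/spool/torque/matlab_version.py and made executable, a script along these lines would be run by the $varattr line shown in step 1. Printing nothing when the version cannot be determined is a design choice of this sketch; it keeps update jobs away from nodes whose state is unknown.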

Each job requires three main things. First, it requests the Matlab 8.1 dynamic attribute. Second, it requests the whole node. This guarantees that only one job runs on the node at a time, so the update won’t affect other jobs on the same node. Third, the job itself executes whatever process is needed to update the application/library.
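Submitting one update job per node is easy to script. The sketch below builds the qsub command from step 2 and repeats it once per node. The function names are hypothetical, and the whole-node request is written as -l naccesspolicy=singlejob, which is an assumption about how your site asks for exclusive node access; substitute whatever your site uses. By default the sketch only prints the commands instead of running them.

```python
import subprocess


def build_qsub_command(old_version, script="matlab_update.sh",
                       whole_node="naccesspolicy=singlejob"):
    """Return the qsub argument list for a single update job."""
    return [
        "qsub",
        "-l", f"reqattr=matlab={old_version}",  # only nodes still on the old version
        "-l", whole_node,                       # assumption: whole-node request
        script,                                 # the job body performs the update
    ]


def submit_update_jobs(node_count, old_version, dry_run=True):
    """Submit (or, in a dry run, just print) one update job per node."""
    commands = []
    for _ in range(node_count):
        command = build_qsub_command(old_version)
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)  # raises if qsub fails
        commands.append(command)
    return commands
```

For the 500-node example, calling submit_update_jobs(500, "8.1") prints the 500 commands for review, and passing dry_run=False actually submits them.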

Once the application/library is updated, the next run of the varattr script picks up the new version of Matlab and reports “matlab=8.2” instead of “matlab=8.1”. With the “update job” in a complete state, the node is free to accept new jobs. Moab will not schedule another “matlab=8.1” job to that node because the node now reports “matlab=8.2”. Over time, as these update jobs are scheduled and run, the nodes are updated until every node returns the “matlab=8.2” attribute.

We recommend submitting all of the “update jobs” to a special Quality of Service (QoS) credential to group them together. Through a QoS, you can give the “update jobs” higher priority so they push through the system faster, limit how many nodes are updating at once, or tie them to a reservation for nights or weekends. Reservations restrict when “update jobs” are allowed to run and minimize downtime during busy periods of the day or week.

Varattr is one of many Moab/Torque policies that are easy to overlook among the thousands of parameters. However, administrators can make life easier by using it to update their environment while also limiting downtime, which leads to a better-utilized environment. I hope you find this useful and that it opens your eyes to other use cases for varattr in your environment.

This approach is presented as a foundation for your update strategy, of which there are many possible variations. Combined with the other capabilities of Moab, this strategy can help you accomplish what is needed to successfully update your environment. If you have more questions on how to implement this in your environment, please contact us at [email protected]
