Adding cgroups to Torque 6.0

One of the main advancements in the upcoming release of Torque 6.0 is the addition of cgroups for managing job resources. This improves accounting, gives better memory enforcement and makes it easier to partition hardware to get the maximum efficiency for each job. In recent Linux kernels cgroups are often enabled by default and manage process resource usage without you even knowing it. Several tools are available so users or administrators can configure their systems to put different processes into different cgroups. The point of using a cgroup is to limit or manage how processes access certain resources. For example, cgroups can be configured so users can only use a maximum amount of resident memory, or to give one group higher network bandwidth while reducing another group's bandwidth. Cgroups can also be used programmatically, which is how Torque 6.0 uses them to manage resources on a per-job basis.

cgroup subsystems

For a detailed explanation of how cgroups work you can read Red Hat's Resource Management Guide. One aspect of cgroups is the concept of subsystems. A subsystem represents a single resource such as CPU time, memory or network I/O. Cgroups manage several subsystems, but Torque 6.0 only uses five of them: cpu, cpuacct, cpuset, devices and memory. More subsystems are likely to be used in the future but these are the first to be exploited.

Description of subsystems used by Torque

These are the descriptions of the subsystems used by Torque, as given in Red Hat's Resource Management Guide:

  • cpu — this subsystem uses the scheduler to provide cgroup tasks access to the CPU.
  • cpuacct — this subsystem generates automatic reports on CPU resources used by tasks in a cgroup.
  • cpuset — this subsystem assigns individual CPUs (on a multicore system) and memory nodes to tasks in a cgroup.
  • devices — this subsystem allows or denies access to devices by tasks in a cgroup.
  • memory — this subsystem sets limits on memory use by tasks in a cgroup, and generates automatic reports on memory resources used by those tasks.
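To see which of these subsystems a compute node's kernel actually offers, you can read /proc/cgroups, which lists one subsystem per line with an "enabled" flag in the fourth column. The following is a minimal sketch of such a check, not something Torque itself ships; the function names are made up for illustration.

```python
from pathlib import Path

# The five subsystems this article says Torque 6.0 manages.
TORQUE_SUBSYSTEMS = ("cpu", "cpuacct", "cpuset", "devices", "memory")


def parse_proc_cgroups(text):
    """Return {subsystem_name: enabled} from the text of /proc/cgroups.

    Each data line has four whitespace-separated columns:
    subsys_name, hierarchy, num_cgroups, enabled.
    """
    enabled = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore anything that does not look like a data row
        name, flag = fields[0], fields[3]
        enabled[name] = (flag == "1")
    return enabled


def missing_subsystems(text, wanted=TORQUE_SUBSYSTEMS):
    """List the wanted subsystems that are absent or disabled."""
    enabled = parse_proc_cgroups(text)
    return [name for name in wanted if not enabled.get(name, False)]


if __name__ == "__main__":
    proc_file = Path("/proc/cgroups")
    if proc_file.exists():
        print("missing or disabled:", missing_subsystems(proc_file.read_text()))
    else:
        print("/proc/cgroups not found on this system")
```

An empty list means all five subsystems are present and enabled in the kernel; it does not say whether they are mounted or how Torque has been configured.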

In Torque 6.0 the cpu and devices subsystems are not completely implemented, but cpuacct, cpuset and memory are all used to help manage and limit job resource usage. Torque's ability to manage job resources comes in part from a new resource request syntax. David Beer will be blogging about this new syntax in the coming weeks. With this syntax users can request part or all of the compute and memory resources of a compute node.

For example, a user could request all of the compute and memory resources of a NUMA node, which might have 24 GB of memory and 8 processing units. The cgroups allow Torque to partition off the NUMA node's hardware so no other jobs can get access to its processors or memory. This can improve job performance and predictability, since the job does not have to share any processors or memory in the NUMA node with other jobs. The job itself does not have to use all of the CPUs allocated in the NUMA node; it only needs to bind the number of CPUs it needs. This lets the user bind to non-adjacent CPUs, which can reduce cache contention and so improve performance and predictability.
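As an illustration of binding to non-adjacent CPUs, the sketch below works with the compact list format the kernel uses for cpusets (for example "0-3,8,10-11"). It expands an allocated list, picks an evenly spaced subset, and formats the result back into the same notation. The helper names are hypothetical, and even spacing by CPU number is only a stand-in: which CPUs actually share a cache depends on the node's topology.

```python
def parse_cpu_list(spec):
    """Expand a kernel CPU list such as "0-3,8,10-11" into sorted CPU ids.

    Only plain ids and "lo-hi" ranges are handled here.
    """
    cpus = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def format_cpu_list(cpus):
    """Collapse CPU ids back into the compact "lo-hi,id" form."""
    ranges = []
    start = prev = None
    for cpu in sorted(set(cpus)):
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu  # extend the current run
        else:
            ranges.append((start, prev))
            start = prev = cpu
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def pick_spread(allocated, count):
    """Pick `count` CPUs from `allocated`, spaced as evenly as possible."""
    if count <= 0 or count > len(allocated):
        raise ValueError("count must be between 1 and len(allocated)")
    step = len(allocated) / count
    return [allocated[int(i * step)] for i in range(count)]


if __name__ == "__main__":
    allocated = parse_cpu_list("0-7")   # the 8 processing units of the example
    chosen = pick_spread(allocated, 4)  # the job only needs 4 of them
    print(format_cpu_list(chosen))
```

With the 8-CPU NUMA node from the example and a job that needs 4 CPUs, the sketch selects every other CPU rather than the first four.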

The memory subsystem can be used to limit the amount of resident memory and swap memory a task can use. Limiting resident memory causes any memory used over the limit to be swapped out. Limiting the swap memory will cause the process to terminate if it exceeds the swap memory allocated.
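For readers curious what such limits look like at the cgroup level, here is a minimal sketch assuming the cgroup v1 memory controller with swap accounting enabled. There, memory.limit_in_bytes caps resident memory and memory.memsw.limit_in_bytes caps memory plus swap combined, so the sketch adds the two figures for the second file. The function name and the directory argument are hypothetical; this is not Torque's implementation, and writing to a real cgroup directory requires root.

```python
from pathlib import Path


def set_memory_limits(cgroup_dir, resident_bytes, swap_bytes):
    """Write resident and swap limits into a cgroup v1 memory directory.

    cgroup_dir is a hypothetical path to an existing memory cgroup.
    Returns the two values written: (memory limit, memory+swap limit).
    """
    if resident_bytes <= 0 or swap_bytes < 0:
        raise ValueError("resident_bytes must be > 0 and swap_bytes >= 0")
    cgroup_dir = Path(cgroup_dir)
    memsw_bytes = resident_bytes + swap_bytes

    # On a fresh cgroup both limits start out unlimited, so the resident
    # limit is lowered first; the kernel rejects a memory+swap limit that
    # is below the current memory limit.
    (cgroup_dir / "memory.limit_in_bytes").write_text(str(resident_bytes))
    (cgroup_dir / "memory.memsw.limit_in_bytes").write_text(str(memsw_bytes))
    return resident_bytes, memsw_bytes
```

For instance, a job allowed 4 GB of resident memory and 2 GB of swap would get a memory limit of 4 GB and a combined memory plus swap limit of 6 GB.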

The memory subsystem is also used to collect accounting information on the amount of memory a job uses. Usage of both resident memory and swap memory is recorded. Per-task memory is now recorded as well, so users can see more precisely where resources are used in a job.

pbs_mom crash no longer means loss of data

With cgroups, a MOM crash no longer means losing the resource usage of running jobs. You still cannot get the exit status, but you can get the CPU time, resident memory and swap memory used by the job.
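This works because the kernel, not the MOM, keeps the counters. As a rough sketch, assuming cgroup v1 with swap accounting enabled, the figures can be read straight from the job's cgroup directories: cpuacct.usage holds CPU time in nanoseconds, memory.max_usage_in_bytes holds peak memory, and memory.memsw.max_usage_in_bytes holds peak memory plus swap. The function names are hypothetical, and the directories Torque creates for a job are not described in this post, so they are passed in as parameters.

```python
from pathlib import Path


def read_int(path):
    """Read a single integer from a cgroup file, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def job_usage(cpuacct_dir, memory_dir):
    """Collect CPU time and peak memory figures for one job's cgroups.

    cpuacct_dir and memory_dir are hypothetical paths to the job's
    directories under the cpuacct and memory hierarchies.
    """
    cpu_ns = read_int(Path(cpuacct_dir) / "cpuacct.usage")
    return {
        # cpuacct.usage is reported in nanoseconds
        "cpu_seconds": None if cpu_ns is None else cpu_ns / 1e9,
        "peak_memory_bytes": read_int(
            Path(memory_dir) / "memory.max_usage_in_bytes"
        ),
        # the memsw counter covers memory and swap together
        "peak_memory_plus_swap_bytes": read_int(
            Path(memory_dir) / "memory.memsw.max_usage_in_bytes"
        ),
    }
```

A missing file yields None for that entry, which is what you would see for the memsw counter on a kernel without swap accounting.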

Adding cgroups to Torque 6.0 has improved accounting, created better memory enforcement and made it easier to partition hardware to get the maximum efficiency for each job. Look for even more improvements in future versions of Torque as the other management features of cgroups are exploited.


