Torque 6.0 NUMA Use Cases: Task Placement (Numa Nodes)


The big new feature of the 2016 Ford Mustang Shelby GT350 was not its 5.2-liter, 526-horsepower engine, its 429 lb-ft of torque, or its 4.3-second 0-60 time. It was its handling.

Ford finally made a muscle car that could hold its own through the corners. All of that power could now be applied with precision, without losing performance.

Torque 6.0 is a bit like the 2016 Shelby GT350: it lets users control much more precisely where their jobs will run on the hardware. In November I started a blog series on Torque 6.0 use cases, and this post, the second in that series, shows how to use the new -L resource request syntax in Torque 6.0.

A key element of the Torque 6.0 -L resource request is the place value. It gives users a wide range of options for getting the most out of the available hardware by specifying how a single task is placed on it: the hardware locality level at which the task is placed and how many resources at that level it receives. Placement at a given locality level is always exclusive, meaning a job task has exclusive use of all logical processors and physical memory at the specified level of resource locality, even if it does not use them all. The levels of locality are socket, numanode, core, thread and node.
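To make the syntax concrete, here are a few illustrative request forms. The task counts, lprocs values and the script name job.sh are placeholders, and which combinations make sense depends on your hardware and on the Torque 6.0 documentation for your site.

    # one task with exclusive use of an entire socket (all of its cores, threads and memory)
    qsub -L tasks=1:lprocs=8:place=socket job.sh

    # four tasks, each placed on its own physical core
    qsub -L tasks=4:lprocs=1:place=core job.sh

    # two tasks, each placed on its own numa node (the example used later in this post)
    qsub -L tasks=2:lprocs=6:place=numanode job.sh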

The diagram below shows a dual-socket node with 16 cores and two threads per core, for a total of 32 logical processors. As is common with Intel designs, there is one numa node per socket. Each numa node has its own memory; all of the memory in the system is usable by any processor, but when a processor has to access memory that does not belong to its own numa node, the access time goes up and performance suffers. This is where the place value helps.

[Diagram: 32-core, dual-socket node with two numa nodes]
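To see this layout on your own compute nodes, standard Linux tools (not part of Torque) will report it; the commands below are one way to check.

    # sockets, cores per socket, threads per core and numa nodes on this host
    lscpu | grep -E 'Socket|Core|Thread|NUMA'

    # which logical processors and how much memory belong to each numa node
    numactl --hardware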
Let’s say I have a job which requires two nodes with 6 processes per node. In traditional PBS syntax the request would be qsub -l nodes=2:ppn=6 <job.sh>. Moab would process this request and would likely allocate all 12 processors on a single physical node: you may get processors 0-5 for the first virtual node and processors 6-11 for the second. In the diagram below, notice that the second virtual node spans two separate numa nodes and has to access memory from both. Even worse, Moab may pack the job and allocate processors 0-7 and 16-19, which leaves the first four processors of the first virtual node sharing cores with the last four processors of the second.

[Diagram: 32-core node with the nodes=2:ppn=6 request packed onto one physical node]
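For reference, a minimal job script for the traditional request might look like the sketch below; the #PBS directive is equivalent to passing -l on the qsub command line, the job name is arbitrary, and the script body simply prints the processor slots Torque assigned.

    #!/bin/bash
    #PBS -l nodes=2:ppn=6
    #PBS -N classic_placement
    #PBS -j oe

    # $PBS_NODEFILE lists one line per allocated processor slot; with the
    # traditional syntax all 12 slots may land on a single physical node
    cat $PBS_NODEFILE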

With the -L syntax we can now be specific about the hardware configuration we want to use. For the same job we can instead submit qsub -L tasks=2:lprocs=6:place=numanode <job.sh>. Moab now knows that the job has two tasks and that each task needs 6 processors and a numanode. In the diagram below, one task of the job is allocated processors 0-5 on Numa Node 0 and the second task is allocated processors 8-13 on Numa Node 1. Each task has its own local memory to use and will not contend with the other task for resources. Furthermore, place=numanode makes the use of each numanode exclusive, so no other jobs will be scheduled onto these resources.

[Diagram: 32-core node with one task on each numa node using place=numanode]
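To confirm the placement from inside the job, the script can report its own binding. Below is a minimal sketch; taskset and numactl are standard Linux tools rather than Torque commands, the exact processor numbers depend on the hardware, and how each task's processes are started (for example with pbsdsh or an MPI launcher) is up to the job.

    #!/bin/bash
    # job.sh - report the processors and memory binding this process was given
    echo "host: $(hostname)"
    taskset -cp $$     # logical processors this process is allowed to run on
    numactl --show     # cpu and memory binding policy in effect

Submitted with qsub -L tasks=2:lprocs=6:place=numanode job.sh, the affinity list reported on the node above would be expected to stay within a single numa node, for example 0-5 or 8-13.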

Sure, users are going to need to modify their scripts a little to take advantage of the better handling provided in Torque 6.0, but the increased control and better performance are more than worth the effort.

 
