TORQUE Resource Manager 4.2.6 Release Notes
The following is a summary of key new features in TORQUE 4.2.6.
Cray ALPS Basil 1.3 protocol support
Support for Cray ALPS Basil protocol has been added to TORQUE.
Adding features to Cray compute nodes
The ability to add features to Cray compute nodes has been implemented in 4.2.6.
This section contains differences in previously existing features that require a change in configuration or routine.
pbs_mom can handle two naming conventions for cpuset files
pbs_mom can handle cpuset files with either of the following naming conventions: those with the cpuset. prefix and those without it.
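As an illustration of the two naming conventions, the following minimal Python sketch looks up a cpuset control file by trying the prefixed name first (for example, cpuset.cpus) and then the unprefixed name (cpus). The function name find_cpuset_file and its parameters are hypothetical and are not part of TORQUE; this is not how pbs_mom itself is implemented.

```python
from pathlib import Path
from typing import Optional


def find_cpuset_file(cpuset_dir: str, name: str) -> Optional[Path]:
    """Return the path of a cpuset control file, trying both naming conventions.

    `name` is the unprefixed file name, such as "cpus" or "mems".
    (Hypothetical helper for illustration; not a TORQUE function.)
    """
    base = Path(cpuset_dir)
    # Try the prefixed name first ("cpuset.cpus"), then the bare name ("cpus").
    for candidate in (f"cpuset.{name}", name):
        path = base / candidate
        if path.is_file():
            return path
    # Neither convention is present in this directory.
    return None
```

A caller would pass the directory of a mounted cpuset and the unprefixed file name, and treat a None result as "file not present under either name".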
The qstat command has been significantly refactored, and many of its error codes have changed. You can check the new error codes against pbs_error_db.h for descriptions of their meanings.
trqauthd has been improved in the following three ways:
It can now be terminated by running trqauthd -d.
It remembers which is the active server in HA mode.
It has the ability to retry actions and thereby decrease failures.
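The retry behavior described above can be pictured with a generic retry loop. The following Python sketch is an assumption-laden illustration of the general technique (retry a failing action a bounded number of times, with a growing pause between tries); it is not trqauthd's actual code, and the function name retry, its parameters, and the choice of OSError as the retryable error are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 3, delay: float = 0.5,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `action` until it succeeds or `attempts` tries are used up.

    Hypothetical helper for illustration; not part of TORQUE.
    Only OSError (which includes ConnectionError) is treated as retryable.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            if attempt == attempts:
                raise  # out of tries: surface the last error to the caller
            sleep(delay * attempt)  # linear backoff before the next try
```

Bounding the number of attempts keeps a persistent failure from blocking the caller indefinitely, while the pause gives a transient failure time to clear.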
The following is a list of some key bugs fixed in TORQUE 4.2.6. Following each issue description is an associated issue number in parentheses.
- pbs_server used popen to send mail using the email addresses specified on the command line, which posed a security risk. TORQUE no longer allows you to run root commands in the email portion of qsub (TRQ-2310, CVE-2013-4495).
- pbs_sched did not return the correct syntax for the RM protocol. pbs_sched now works as expected (TRQ-2318).
- Client command failure handling frequently produced errors. Client commands have been made more robust and the failure rate reduced (TRQ-2268).
- MOMs leaked large amounts of memory. These large memory leaks no longer occur (TRQ-2253).
- trqauthd -d did not verify identity, allowing any user to terminate it. Unprivileged users can no longer terminate trqauthd (TRQ-2250).
- TORQUE did not kill prologue scripts after the hard-coded 5-minute timeout. Prologue scripts now time out in less than 5 minutes (TRQ-2273).
- In rare cases, TORQUE would delete jobs without freeing their resources. TORQUE now frees jobs' resources when the jobs are deleted (TRQ-2111).
- Running qhold on a BLCR job completed the job rather than holding it. pbs_mom no longer uses trqauthd when it checkpoints a job, resolving the qhold problem (TRQ-2208).
- For multi-node jobs TORQUE gave inflated memory stats to Moab. vmem is no longer being stored with mem (and vice versa) to correct the problem (TRQ-2259).
- Cray features were not written to the nodes file. Properties added to Cray compute nodes are now saved in the nodes file when pbs_server overwrites it (TRQ-2280).
- Some jobs did not progress from the OBIT state, becoming stuck in the MOM login. Jobs now complete when expected (TRQ-2333).
- TORQUE did not follow child processes that had changed their session IDs and did not record their resource usage, so it reported incorrect memory usage for jobs. TORQUE now reports the correct memory usage of its jobs (TRQ-2321).
- The stdout and stderr files were not deleted from $TORQUE_HOME/spool after being copied to the directory from which the job was submitted. The stderr and stdout files are automatically removed from the /spool directory unless the job is purged manually (TRQ-2317).
- Job queues disappeared after TORQUE restart. Queues no longer disappear after restarting TORQUE (TRQ-2289).
- trqauthd did not perceive which was the active server in a high availability environment and did not switch to the inactive server as needed. trqauthd now switches to the inactive server when the active one fails (TRQ-2265).
- A client could close a connection early and cause trqauthd to terminate. When a client closes a connection early, trqauthd continues to run (TRQ-2252).
- The TORQUE server would crash on an invalid string. TORQUE validates strings to prevent the crashes from occurring (TRQ-2244).
- Client commands would sometimes cause a deadlock. These deadlocks no longer occur (TRQ-2337).
- TORQUE would not honor jobs with -j -o -e in the job script when FSISREMOTE was enabled in Moab. These jobs are now processed correctly, the -j taking precedence over oe and eo (TRQ-2234).
- When pbs_server could not find the connection from the client in the connection table before trqauthd sent the credentials, TORQUE returned an "invalid credentials" error message. TORQUE now returns a more accurate error message that says "Client connection not found. Please retry the command." in this scenario (TRQ-2198).
- When several qdel all commands were run consecutively, the qstat -Q output returned a negative job number. qstat -Q now returns the correct number of jobs (TRQ-2187, TRQ-2007).
- pbs_server crashed when a job in a long dependency chain was deleted. Deleting a job in a long dependency chain now causes TORQUE to delete all subsequent jobs in the chain, and qstat reports the deleted job, and any jobs before it, as completed (TRQ-2169).
- Deleting jobs from a node that was down caused the server to hang. The server no longer hangs when jobs are deleted from a node that is down (TRQ-2138).
- TORQUE did not track how much memory was committed to other jobs. TORQUE now keeps track of how much memory is already allocated (TRQ-2124).
- trqauthd could not authenticate users due to intermittent LDAP failures. trqauthd now retries retrieving user credentials from the system (TRQ-2070).
- The cpuset reading on MOMs would fail due to incompatibility with the newer Linux kernels' file structure. TORQUE has been updated to work well with the new Linux kernels (TRQ-2022).
- Jobs did not start promptly. The time it takes to start a job has decreased substantially.
- Unit test coverage of TORQUE has been increased in 4.2.6.
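The mail-related fix in the list above (TRQ-2310) concerns a general class of problem: user-supplied text, such as an email address, reaching a shell command line. The following Python sketch illustrates one common way to avoid that class of problem, by validating each address against a conservative allow-list and passing arguments as a list so that no shell interprets them. It is not how TORQUE implements its fix; the function names, the allow-list pattern, and the default mailer path /usr/sbin/sendmail are assumptions made for illustration.

```python
import re
import subprocess
from typing import List

# Conservative allow-list for illustration only; real address syntax is broader.
# The first character must be alphanumeric so an address cannot look like a flag.
_SAFE_ADDRESS = re.compile(r"[A-Za-z0-9][A-Za-z0-9._%+-]*@[A-Za-z0-9.-]+")


def build_mail_argv(recipients: List[str],
                    mailer: str = "/usr/sbin/sendmail") -> List[str]:
    """Return an argument list for the mailer, rejecting suspicious addresses.

    Hypothetical helper; the mailer path is an assumed default.
    """
    for address in recipients:
        if not _SAFE_ADDRESS.fullmatch(address):
            raise ValueError(f"rejected mail address: {address!r}")
    return [mailer, *recipients]


def send_mail(recipients: List[str], message: str,
              mailer: str = "/usr/sbin/sendmail") -> None:
    """Send `message` on stdin to the mailer without involving a shell."""
    argv = build_mail_argv(recipients, mailer)
    # A list argument with the default shell=False means metacharacters
    # such as ';' or backticks are never interpreted as commands.
    subprocess.run(argv, input=message, text=True, check=True)
```

With this approach, an address containing shell metacharacters is rejected before any process is started, and even an address that passes validation is handed to the mailer as a single argument.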
The online help for TORQUE Resource Manager 4.2.6 is available in HTML and PDF format on the Adaptive Computing Documentation page.
© Copyright 2013, Adaptive Computing Enterprises, Inc.