TORQUE Protocol 4: Inter-Server Communication

This entry is part 4 of 4 in the series TORQUE Protocols

The TORQUE inter-server (IS) protocol is for inter-server communication. “What do you mean by inter-server?” you might say. “Isn’t there only one server in the TORQUE cluster?” Like so many other things in the computer arts, the term server can have an ambiguous meaning. In the case of the IS protocol, a server is pbs_server or any of the MOMs. Yes, a MOM can be a server too. (Just ask your mom. She will verify that moms are servers too.) There are only three commands you will see for the IS protocol: IS_CLUSTER_ADDRS (2), IS_UPDATE (3), and IS_STATUS (4).
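As a quick reference, the three command codes can be written down as a small Python enumeration. This is only an illustrative sketch: the numeric values come from the list above, while the class name `ISCommand` and the helper `describe` are made up for this example and are not part of TORQUE.

```python
from enum import IntEnum


class ISCommand(IntEnum):
    """IS protocol command codes as listed in this post (hypothetical class name)."""
    IS_CLUSTER_ADDRS = 2  # pbs_server -> MOMs: names of the MOMs in the cluster
    IS_UPDATE = 3         # MOM -> pbs_server: node state changed
    IS_STATUS = 4         # MOM -> pbs_server: full status report


def describe(code):
    """Return the command name for a numeric code, or a note if it is unknown."""
    try:
        return ISCommand(code).name
    except ValueError:
        return f"unknown IS command ({code})"


print(describe(4))  # prints "IS_STATUS"
```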

The IS_CLUSTER_ADDRS command is issued by pbs_server to all of the MOMs in the cluster at startup time. This command is sent to port 15003 on each MOM. In this command pbs_server sends the names of all of the other MOMs that are in the cluster. Each MOM then adds each name and address to a trusted client list, which it uses to verify incoming communication. If a request comes to a MOM from an address that is not in the trusted list, the request is rejected.
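The trusted client check described above can be pictured with a few lines of Python. This is a minimal sketch of the idea, not the way pbs_mom actually stores its list: the class `TrustedClients`, the function `handle_request`, and the node names and addresses are all hypothetical.

```python
class TrustedClients:
    """Toy model of a MOM's trusted client list, keyed by address."""

    def __init__(self):
        self._by_address = {}  # address -> node name

    def add(self, name, address):
        """Record one node from a cluster address update."""
        self._by_address[address] = name

    def is_trusted(self, address):
        """True if the address was supplied in a cluster address update."""
        return address in self._by_address


def handle_request(trusted, source_address):
    """Accept a request only when its source address is on the trusted list."""
    return "accepted" if trusted.is_trusted(source_address) else "rejected"


trusted = TrustedClients()
# Hypothetical node names and documentation-range addresses.
trusted.add("node01", "192.0.2.11")
trusted.add("node02", "192.0.2.12")

print(handle_request(trusted, "192.0.2.12"))   # prints "accepted"
print(handle_request(trusted, "203.0.113.5"))  # prints "rejected"
```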

Historically, the IS_CLUSTER_ADDRS command has been sent out only when pbs_server starts up or when a MOM requests a cluster update. But in upcoming releases of TORQUE 5 you may see this command more often, as Moab 8 and TORQUE team up to allow for dynamic creation and deletion of nodes. TORQUE has long had the ability to add nodes on the fly, but the process was clunky and required a restart of pbs_server in order to get the change in nodes out to all of the MOMs. In the next release of Moab and TORQUE there will be the ability to add and delete nodes and also update the cluster automatically.

The IS_UPDATE command is used by the MOM to tell pbs_server when its state changes while it is running. Node states include free, offline, down, job_exclusive, and even state-unknown. When the state changes on the MOM, it can use the IS_UPDATE command to let pbs_server know.

The final command, IS_STATUS, sends all of the information about a MOM to pbs_server. By default each MOM updates its status every 45 seconds. This value can be tuned with the $status_update_time parameter in the MOM config file. The following screenshot shows the header and the first part of the information sent from the MOM to the server. This is the same information that is displayed in pbsnodes output.
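As an illustration, a MOM config file entry that changes the interval might look like the line below. The value 60 is an arbitrary example, and the file is assumed to live in the usual mom_priv/config location; check the documentation for your TORQUE version before relying on either.

```
$status_update_time 60
```

With this setting the MOM would report to pbs_server every 60 seconds instead of every 45.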

A status update packet sent from a MOM to pbs_server

The highlighted area in the screenshot is the packet data, which starts with +4+3+45+150025+15003+8node=kmn… The + symbols are part of the DIS encoding. The +4+3+4 indicates that this is the IS_PROTOCOL (4), version 3, and the final 4 is the IS_STATUS command. In this encoding a multi-digit number is preceded by its digit count, so 5+15002 reads as the five-digit value 15002, and +8 is the length of the string node=kmn. 15002 and 15003 are the ports that this MOM is listening on, and node=kmn is the node name for this MOM.

With TORQUE 4.x and later there are sometimes problems getting MOMs up in a cluster because of naming. Starting in TORQUE 4, blanket acceptance of short names was dropped. That is to say, the name presented by the MOM has to match exactly the name in the server_priv/nodes file of pbs_server in order for the node to be accepted into the cluster. In 2.5.x and earlier, if a short name could be matched from a canonical name, the name was accepted. For instance, if the nodes file listed a node under its canonical name (“ np=16”) and the MOM sent node=kmn in its status update, in 2.5.x and earlier the node would be accepted because the first part of the host name could be matched. However, it is rejected in 4.x and later.

To fix this problem you can either change the nodes file or start pbs_mom with the -A option and give the canonical name as the argument. The point to remember is that what pbs_mom reports as its name in the status update needs to match what is in the pbs_server nodes file.
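To make the packet header less cryptic, here is a small Python sketch that decodes the sample string from the status update. It assumes the DIS integer layout of an optional digit-count prefix, a sign, and then the digits, and it assumes a string is sent as its length followed by its characters. The function names (`read_int`, `read_string`, `parse_status_header`) and the dictionary keys are invented for this example; the real decoding lives in TORQUE's C code and handles many more cases.

```python
def read_int(buf, pos=0):
    """Decode one DIS-style integer at buf[pos]; return (value, next_pos)."""
    count = 1  # how many characters the next field occupies
    while True:
        if pos >= len(buf):
            raise ValueError("ran out of data while reading an integer")
        ch = buf[pos]
        if ch in "+-":
            # A sign means the next `count` characters are the value itself.
            digits = buf[pos + 1:pos + 1 + count]
            if len(digits) != count or not digits.isdigit():
                raise ValueError("malformed integer digits")
            value = int(digits)
            return (-value if ch == "-" else value), pos + 1 + count
        # No sign yet, so these `count` characters give the next field's length.
        prefix = buf[pos:pos + count]
        if len(prefix) != count or not prefix.isdigit():
            raise ValueError("malformed digit-count prefix")
        count = int(prefix)
        pos += len(prefix)


def read_string(buf, pos):
    """Decode a length-prefixed string; return (text, next_pos)."""
    length, pos = read_int(buf, pos)
    text = buf[pos:pos + length]
    if len(text) != length:
        raise ValueError("string is shorter than its declared length")
    return text, pos + length


def parse_status_header(buf):
    """Pull protocol, version, command, two ports and the first item from a packet."""
    protocol, pos = read_int(buf, 0)
    version, pos = read_int(buf, pos)
    command, pos = read_int(buf, pos)
    port_a, pos = read_int(buf, pos)
    port_b, pos = read_int(buf, pos)
    first_item, pos = read_string(buf, pos)
    return {
        "protocol": protocol,
        "version": version,
        "command": command,
        "ports": [port_a, port_b],
        "first_item": first_item,
    }


header = parse_status_header("+4+3+45+150025+15003+8node=kmn")
print(header["command"], header["ports"], header["first_item"])
# prints: 4 [15002, 15003] node=kmn
```

Reading the header this way shows why the raw bytes look like 45+150025: the trailing 4 of the command runs straight into the 5 that announces a five-digit port number.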

That is it. The IS protocol is short and sweet. The IS_STATUS command is also one of the most used commands in a cluster because it is what the MOMs use to keep pbs_server up to date. When a node is having problems, make sure the status update is getting to the server. Also make sure that the node name the MOM sends in its status update matches the name in the pbs_server nodes file. If both of these things are happening, your cluster will be running more like you expect.
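The name check recommended above can be sketched in Python to show the difference between the old short-name matching and the exact matching used since TORQUE 4. This is a simplified model of the behavior described in this post, not TORQUE's actual code. The function names and the host name node01.example.com are hypothetical.

```python
def nodes_file_name(line):
    """Return the node name from a nodes file line such as 'host np=16'."""
    return line.split()[0]


def accepted_in_2_5(reported, entry_name):
    """Older rule: an exact match, or a match on the first part of the host name."""
    return reported == entry_name or reported == entry_name.split(".", 1)[0]


def accepted_in_4_x(reported, entry_name):
    """Newer rule: the reported name must match the nodes file entry exactly."""
    return reported == entry_name


entry = nodes_file_name("node01.example.com np=16")  # hypothetical entry

print(accepted_in_2_5("node01", entry))              # prints "True"
print(accepted_in_4_x("node01", entry))              # prints "False"
print(accepted_in_4_x("node01.example.com", entry))  # prints "True"
```

A check like this, run against the name seen in the status update and the names in the nodes file, is a quick way to spot a node that is being rejected for a naming mismatch.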
