node has more than one instance of each type of temperature sensor, the maximum of
their values is recorded.
Temperature information is recorded as a list of attribute-value pairs, for example:
ambient=15 cpu=40 psu=20
Note that not all node types support all types of thermistor reading. The environment
field may contain only a subset of this information.
If an error occurs, the environment string contains details of what has failed. For
example, the following string indicates that CPU fan number 1 has failed on the
node:
cpu fan 1 failure
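The two forms of the environment string described above can be told apart mechanically. The following Python fragment is a minimal sketch, not part of RMS: the function name parse_environment is hypothetical, and it assumes that readings are whitespace-separated attribute=value pairs with integer values (as in the example) and that any other content is error text.

```python
def parse_environment(env):
    """Split an environment string into sensor readings or an error message.

    Returns (readings, error). readings maps each attribute name to an
    integer value; error is None unless the string is not made up purely
    of attribute=value pairs, in which case the whole string is returned
    as the error text and readings is empty.
    """
    readings = {}
    for token in env.split():
        name, sep, value = token.partition("=")
        if not sep or not name:
            # Not an attribute=value pair: assume the string is an error report.
            return {}, env.strip()
        try:
            readings[name] = int(value)  # assumption: values are integers
        except ValueError:
            return {}, env.strip()
    return readings, None
```

Because not all node types report every sensor, callers of such a function would look up attributes with a default (for example readings.get("psu")) rather than assume that every key is present.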
B.6 Node Status Values
The current state of a node is found in the status and runlevel fields of the nodes
table. State changes are logged in the events table; entries in the events table are
identified by class=node and name=N, where N is the name of the node as entered in
the name field of the nodes table.
Provided a node is configured in (configured field set to 1), the status field contains
one of the values shown in Table B.4.
Table B.4: Node Status Values
Status          Description
not responding  Machine Manager cannot get response from node
active          Node responds to IP requests but RMS is not running
running         RMS is active on this node
RMS does not monitor the state of a node while it is configured out (configured=0). It
determines the status again when the node is configured back in.
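The rules above (status is meaningful only while configured=1, and node events are identified by class=node and name=N) can be sketched as two queries. The Python fragment below uses an in-memory SQLite database purely as a stand-in for the RMS database. Only the table names (nodes, events) and the fields name, configured, status and class come from the text; the detail column, the node names, the sample rows and both function names are hypothetical.

```python
import sqlite3

# In-memory stand-in for illustration only; this is NOT the real RMS schema.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE nodes  (name TEXT, configured INTEGER, status TEXT);
    CREATE TABLE events (class TEXT, name TEXT, detail TEXT);  -- detail is hypothetical
    INSERT INTO nodes  VALUES ('node0', 1, 'running');
    INSERT INTO nodes  VALUES ('node1', 0, 'active');
    INSERT INTO events VALUES ('node',  'node0', 'status changed to active');
    INSERT INTO events VALUES ('node',  'node0', 'status changed to running');
    INSERT INTO events VALUES ('other', 'node0', 'unrelated entry');
""")

def monitored_status(db, node):
    """Return the status of a node, or None if it is configured out or unknown."""
    row = db.execute(
        "SELECT status FROM nodes WHERE name = ? AND configured = 1", (node,)
    ).fetchone()
    return row[0] if row else None

def node_events(db, node):
    """Return the event entries for one node, selected by class and name."""
    rows = db.execute(
        "SELECT detail FROM events WHERE class = 'node' AND name = ? ORDER BY rowid",
        (node,),
    ).fetchall()
    return [r[0] for r in rows]
```

In this sketch the status stored for node1 is ignored because the node is configured out, which mirrors the statement that RMS does not monitor such nodes.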
As a node boots, its status progresses from not responding to active and on to
running. A long delay in reaching the running state indicates a problem with booting
that should be investigated further. If a node changes from the running state to the
active state and stays there, there is a problem with RMS on that node. If a node
changes from the running state to the not responding state, either the node has
crashed or IP communications to the node are failing. When a node stops responding
in this way, RMS runs the rmsevent_node event handler script, which attempts to
determine what went wrong.
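The interpretation of status changes given above can be summarized as a small lookup. The Python fragment below is an illustrative sketch only: the function name diagnose and the wording of the returned messages are hypothetical, and transitions that the text does not discuss are reported as such rather than guessed at.

```python
# Status values in the order a booting node passes through them (Table B.4).
BOOT_ORDER = ["not responding", "active", "running"]

def diagnose(previous, current):
    """Interpret a change of the status field from previous to current."""
    if previous not in BOOT_ORDER or current not in BOOT_ORDER:
        raise ValueError("unknown status value")
    if previous == current:
        return "no change"
    if previous == "running" and current == "active":
        return "problem with RMS on the node if it stays in this state"
    if previous == "running" and current == "not responding":
        return "node has crashed or IP communications are failing"
    if BOOT_ORDER.index(current) > BOOT_ORDER.index(previous):
        return "normal boot progression"
    # For example active -> not responding, which the text does not cover.
    return "transition not described"
```

For example, diagnose("running", "not responding") returns the crash or communications message, which corresponds to the case in which RMS runs the rmsevent_node script.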