IBM SG24-5131-00 Laptop User Manual


 
7.3.4 Changing the Failure Detection Rate
Use the SMIT Change/Show a Cluster Network Module screen to change the
failure detection rate for your network module only if enabling I/O pacing
or extending the syncd frequency did not resolve deadman problems in your
cluster. By changing the failure detection rate to “Slow”, you can extend the
time required before the deadman switch is invoked on a hung node and
before a takeover node detects a node failure and acquires a hung node’s
resources. See the HACMP for AIX, Version 4.3: Administration Guide,
SC23-4279 for more information and instructions on changing the Failure
Detection Rate.
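The effect of a slower failure detection rate can be pictured with a small
sketch. It assumes, purely for illustration, that a node is declared failed
once a fixed number of consecutive keepalives have been missed. The rate
table, the function names, and every number below are hypothetical
placeholders, not HACMP defaults.

```python
# Hypothetical settings: (seconds between keepalives, missed keepalives tolerated).
# These values are illustrative only and are NOT taken from HACMP.
RATES = {
    "Fast": (1.0, 5),
    "Normal": (1.0, 10),
    "Slow": (2.0, 10),
}


def detection_time(rate: str) -> float:
    """Seconds of silence before a peer is declared failed at the given rate."""
    interval, misses = RATES[rate]
    return interval * misses


def survives_hang(rate: str, hang_seconds: float) -> bool:
    """True if a node hung for hang_seconds recovers before it is declared failed."""
    return hang_seconds < detection_time(rate)


# Compare how each assumed rate treats a node that hangs for 12 seconds.
for name in RATES:
    print(name, detection_time(name), survives_hang(name, 12.0))
```

Under these assumed values, only the slowest setting tolerates the temporary
hang, which is the trade-off described above: a longer window before the
deadman switch fires, at the cost of slower detection of a real failure.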
7.4 Node Isolation and Partitioned Clusters
Node isolation occurs when all networks connecting nodes fail but the nodes
remain up and running. One or more nodes can then be completely isolated
from the others. A cluster in which this has happened is called a
partitioned cluster. A partitioned cluster has two groups of nodes (one or
more in each), neither of which can communicate with the other. Let’s
consider a two node cluster where all networks have failed between the two
nodes, but each node remains up and running.
The problem with a partitioned cluster is that each node interprets the
absence of keepalives from its partner to mean that the other node has failed,
and then generates node failure events. Once this occurs, each node
attempts to take over resources from a node that is still active and therefore
still legitimately owns those resources. These attempted takeovers can cause
unpredictable results in the cluster—for example, data corruption due to a
disk being reset.
To guard against a TCP/IP subsystem failure causing node isolation, the
nodes should also be connected by a point-to-point serial network. This
connection reduces the chance of node isolation by allowing the Cluster
Managers to communicate even when all TCP/IP-based networks fail.
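The reasoning above can be sketched as a toy model. It is a simplification
under stated assumptions, not the Cluster Manager's actual logic: a node is
taken to hear its partner if at least one network between them is up, and to
attempt a takeover otherwise. The node, resource, and network names are
hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Network:
    name: str
    up: bool = True


@dataclass
class Node:
    name: str
    owned: set = field(default_factory=set)


def keepalive_arrives(networks) -> bool:
    """A keepalive gets through if at least one network is still up."""
    return any(net.up for net in networks)


def react(node, peer, networks):
    """Return the resources `node` tries to acquire from `peer`.

    The peer is assumed to be alive throughout; only the networks fail.
    """
    if keepalive_arrives(networks):
        return set()        # peer is heard, so no node failure event
    return set(peer.owned)  # peer presumed dead, takeover attempted


def contested(networks):
    """Resources that a node tries to seize from a partner that is still active."""
    a = Node("nodeA", {"vg_a"})  # hypothetical node and resource names
    b = Node("nodeB", {"vg_b"})
    # Each node decides independently, based only on what it can hear.
    return react(a, b, networks) | react(b, a, networks)


# All TCP/IP-based networks have failed (hypothetical network names).
tcpip_only = [Network("ether", up=False), Network("token_ring", up=False)]
# Same failure, but a point-to-point serial network is still up.
with_serial = tcpip_only + [Network("rs232_serial", up=True)]

print(sorted(contested(tcpip_only)))
print(sorted(contested(with_serial)))
```

With only the failed TCP/IP networks, both nodes attempt a takeover of
resources their partner still owns. Adding the serial link lets keepalives
through, so neither node generates a node failure event.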
It is important to understand that the serial network does not carry TCP/IP
communication between nodes; it only allows nodes to exchange keepalives
Note: I/O pacing must be enabled before completing these procedures; it
regulates the number of I/O data transfers. Also, keep in mind that the
Slow setting for the Failure Detection Rate is network specific.