IBM SG24-5131-00 Laptop User Manual


 
hang. After a certain amount of time, by default 360 seconds, the cluster
manager will issue a config_too_long message into the /tmp/hacmp.out
file.
The message issued looks like this:
The cluster has been in reconfiguration too long;Something may be wrong.
In most cases, this is because an event script has failed. You can find out
more by analyzing the /tmp/hacmp.out file. The error messages in the
/var/adm/cluster.log file may also be helpful. You can then fix the problem
identified in the log file and run the clruncmd command, either on the
command line or from the SMIT Cluster Recovery Aids screen. The clruncmd
command signals the Cluster Manager to resume cluster processing.
Note, however, that sometimes scripts simply take too long, so the message
showing up isn’t always an error, but sometimes a warning. If the message is
issued, that doesn’t necessarily mean that the script failed or never finished.
A script running for more than 360 seconds can still be working on something
and eventually get the job done. Therefore, it is essential to look at the
/tmp/hacmp.out file to find out what is actually happening.
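As a rough aid for that inspection, the following sketch lists every place
where the message appears in a copy of the log, together with the lines just
before it. It assumes only that the string config_too_long or the message
text quoted above appears somewhere on the line; the function name, the
context parameter, and the idea of scanning the file with Python are
illustrative assumptions, not part of HACMP.

```python
from pathlib import Path

# Marker strings taken from the text above. The exact layout of real
# hacmp.out lines is NOT assumed, only that these substrings appear.
MARKERS = (
    "config_too_long",
    "The cluster has been in reconfiguration too long",
)


def find_config_too_long(log_path, context=5):
    """Return (line_number, preceding_lines) for each line containing a marker.

    preceding_lines holds up to `context` lines that come just before the hit,
    which is where the output of a slow or failed event script would be.
    """
    lines = Path(log_path).read_text(errors="replace").splitlines()
    hits = []
    for index, line in enumerate(lines):
        if any(marker in line for marker in MARKERS):
            start = max(0, index - context)
            hits.append((index + 1, lines[start:index]))
    return hits


if __name__ == "__main__":
    # Hypothetical usage against the default log location named in the text.
    for number, before in find_config_too_long("/tmp/hacmp.out"):
        print(f"line {number}: config_too_long reported; preceding lines:")
        for text in before:
            print("    " + text)
```

The preceding lines only show what ran before the message was issued; whether
the script failed or is simply still running has to be judged from their
content.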
7.3 Deadman Switch
The term “deadman switch” describes the AIX kernel extension that causes a
system panic and dump under certain cluster conditions if it is not reset. The
deadman switch halts a node when it enters a hung state that extends
beyond a certain time limit. This enables another node in the cluster to
acquire the hung node’s resources in an orderly fashion, avoiding possible
contention problems.
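The reset-or-expire behavior can be pictured with a small conceptual model.
This is only a sketch of the general watchdog idea, not the AIX kernel
extension: the class name, the limit_seconds parameter, and the injectable
clock are all hypothetical, and no actual time limit is implied.

```python
import time


class DeadmanTimer:
    """Conceptual model of a timer that must be reset before it expires."""

    def __init__(self, limit_seconds, clock=time.monotonic):
        # limit_seconds is a made-up parameter for illustration only.
        self.limit_seconds = limit_seconds
        self._clock = clock
        self._last_reset = clock()

    def reset(self):
        """Called periodically by the monitored process while it is healthy."""
        self._last_reset = self._clock()

    def expired(self):
        """True once more than limit_seconds have passed since the last reset."""
        return self._clock() - self._last_reset > self.limit_seconds
```

In this model a process that is starved of CPU time stops calling reset(),
and expired() eventually becomes true. In the real system, expiry results in
the node being halted rather than in a return value.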
If the deadman switch is halting a node, and it isn’t obvious why the cluster
manager was kept from resetting the timer (for example, because some
application ran at a higher priority than the clstrmgr process),
customizations related to performance problems should be performed in the
following order:
1. Tune the system using I/O pacing.
2. Increase the syncd frequency.
3. If needed, increase the amount of memory available for the
communications subsystem.
4. Change the Failure Detection Rate.
Each of these options is described in the following sections.