
IBM SG24-5131-00 Laptop User Manual


Cluster Troubleshooting 145

hang. After a certain amount of time, by default 360 seconds, the cluster manager writes a config_too_long message to the /tmp/hacmp.out file.

The message issued looks like this:

The cluster has been in reconfiguration too long;Something may be wrong.

In most cases, this is because an event script has failed. You can find out more by analyzing the /tmp/hacmp.out file. The error messages in the /var/adm/cluster.log file may also be helpful. You can then fix the problem identified in the log file and run the clruncmd command, either from the command line or through the SMIT Cluster Recovery Aids screen. The clruncmd command signals the Cluster Manager to resume cluster processing.

Note, however, that sometimes scripts simply take too long, so the message is not always an error; sometimes it is only a warning. If the message is issued, that doesn’t necessarily mean that the script failed or never finished. A script running for more than 360 seconds can still be working on something and eventually get the job done. Therefore, it is essential to look at the /tmp/hacmp.out file to find out what is actually happening.
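As a starting point for that inspection, the following is a minimal sketch that lists the lines of a log where the timeout was reported. It assumes only the two strings quoted above (the config_too_long name and the message text); the function names are hypothetical, and nothing else about the layout of /tmp/hacmp.out is assumed.

```python
from typing import Iterable, List, Tuple

# Strings taken from the text above; any other log content is ignored.
MARKERS = ("config_too_long", "reconfiguration too long")


def find_config_too_long(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line mentioning the timeout."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path: str = "/tmp/hacmp.out") -> List[Tuple[int, str]]:
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_config_too_long(handle)
```

Calling scan_log() on a node returns the matching line numbers, which can then be used to read the surrounding event script output and judge whether the script failed or was merely slow.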

7.3 Deadman Switch

The term “deadman switch” describes the AIX kernel extension that causes a system panic and dump under certain cluster conditions if it is not reset. The deadman switch halts a node when it enters a hung state that extends beyond a certain time limit. This enables another node in the cluster to acquire the hung node’s resources in an orderly fashion, avoiding possible contention problems.
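The reset-or-halt behavior described above can be illustrated with a small watchdog timer. This is a conceptual sketch only, not the AIX kernel extension: the class name, the time limit, and the explicit `now` arguments are all hypothetical and chosen to keep the example deterministic.

```python
class DeadmanTimer:
    """Toy model of a timer that must be reset before a limit elapses."""

    def __init__(self, limit_seconds: float, now: float = 0.0) -> None:
        self.limit_seconds = limit_seconds  # hypothetical limit, not an AIX default
        self.last_reset = now

    def reset(self, now: float) -> None:
        """Called periodically by the monitored process while it is healthy."""
        self.last_reset = now

    def expired(self, now: float) -> bool:
        """True once the process has gone longer than the limit without a reset."""
        return now - self.last_reset > self.limit_seconds


timer = DeadmanTimer(limit_seconds=10.0)
timer.reset(now=5.0)
still_alive = not timer.expired(now=12.0)  # 7 seconds since reset: within limit
would_halt = timer.expired(now=16.0)       # 11 seconds since reset: limit exceeded
```

In this model, a process that is starved of CPU and cannot call reset in time looks exactly like a hung one, which is why the tuning steps that follow focus on keeping the cluster manager scheduled.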

If the deadman switch is halting a node, and it isn’t obvious why the cluster manager was kept from resetting the timer (for example, because some application ran at a higher priority than the clstrmgr process), customizations related to performance problems should be performed in the following order:

1. Tune the system using I/O pacing.

2. Increase the syncd frequency.

3. If needed, increase the amount of memory available for the communications subsystem.

4. Change the Failure Detection Rate.

Each of these options is described in the following sections.
