IBM 755 Server User Manual

IBM United States Hardware Announcement 110-008

IBM is a registered trademark of International Business Machines Corporation

Memory error correction extensions

The memory has single-bit-error correction and double-bit-error detection ECC
circuitry. The ECC code is also designed such that the failure of any one specific
memory module within an ECC word by itself can be corrected absent any other
fault.
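
As an illustration of single-bit-error correction with double-bit-error detection, the following is a minimal sketch using an extended Hamming(8,4) code. It is a textbook code chosen for brevity, not the ECC code used in the POWER7 memory subsystem, and it does not model the correction of a whole failed memory module. The names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (list of bits).

    Positions 1..7 form a Hamming(7,4) code; position 0 is an overall
    parity bit that makes double-bit errors detectable.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall parity
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    overall = sum(code) % 2                 # 0 when the overall parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: treat as a single-bit error.
        # A syndrome of 0 here points at the overall parity bit itself.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of `encode(n)` still decodes to `n` with status `corrected`, while flipping any two bits is reported as `uncorrectable` rather than silently miscorrected.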

Memory protection features include scrubbing to detect errors, a means to call for
the deallocation of memory pages for a pattern of correctable errors detected, and
signaling deallocation of a logical memory block when an error occurs that cannot be
corrected by the ECC code.

Redundancy for array self-healing

Although the most likely failure event in a processor is a soft single-bit error in
one of its caches, other events can occur, and they need to be distinguished from
one another. For caches and their directories, hardware and firmware keep track
of whether errors are being corrected beyond a threshold. If the threshold is
exceeded, a deferred repair error log is created.

Caches and directories on the POWER7 chip are manufactured with spare bits
in their arrays that can be accessed via programmable steering logic to replace
faulty bits in the respective arrays. This is analogous to the redundant bit steering
employed in main storage as a mechanism that is designed to help avoid physical
repair, and is also implemented in POWER7 systems. The steering logic is activated
during processor initialization and is initiated by the built-in self-test (BIST) at
power-on time.
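
The idea of steering around a faulty bit can be sketched in software as follows. This is a toy model only: the class `SteeredArray`, its stuck-at fault model, and its `self_test` method are hypothetical names introduced for illustration and do not describe the POWER7 hardware implementation.

```python
class SteeredArray:
    """Toy model of one array row with spare bit columns and steering logic."""

    def __init__(self, width, spares):
        self.width = width
        self.cells = [0] * (width + spares)    # physical columns, spares at the end
        self.stuck = {}                         # fault model: column -> stuck-at value
        self.steer = list(range(width))         # logical bit -> physical column
        self.free_spares = list(range(width, width + spares))

    def _write_cell(self, col, bit):
        # A stuck column ignores the value written to it.
        self.cells[col] = self.stuck.get(col, bit)

    def _column_ok(self, col):
        # Write both values and read them back.
        for bit in (0, 1):
            self._write_cell(col, bit)
            if self.cells[col] != bit:
                return False
        return True

    def write(self, value):
        for i, col in enumerate(self.steer):
            self._write_cell(col, (value >> i) & 1)

    def read(self):
        return sum(self.cells[col] << i for i, col in enumerate(self.steer))

    def self_test(self):
        """Test every steered column; re-steer faulty ones to a spare.

        Destroys the array contents, so it is meant to run at initialization.
        Returns the list of logical bits that were re-steered.
        """
        repaired = []
        for i in range(self.width):
            while not self._column_ok(self.steer[i]):
                if not self.free_spares:
                    raise RuntimeError("out of spare bits")
                self.steer[i] = self.free_spares.pop(0)
                if i not in repaired:
                    repaired.append(i)
        return repaired
```

For example, after setting `arr.stuck[2] = 0` on a `SteeredArray(width=4, spares=1)`, a call to `arr.self_test()` re-steers logical bit 2 to the spare column, and subsequent writes read back intact.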

When correctable cache errors exceed a set threshold, systems using the POWER7
processor invoke a dynamic cache line delete function, which enables them to stop
using the bad cache line and eliminates exposure to greater problems.
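
The threshold behavior described above can be sketched as a per-line error counter. This is a minimal illustration under assumed semantics: the class `CacheLineMonitor`, its method names, and the idea of counting per line index are hypothetical, and the source does not state the actual threshold value or counting policy.

```python
from collections import Counter


class CacheLineMonitor:
    """Counts correctable errors per cache line and retires lines past a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold   # assumed: errors tolerated before deletion
        self.errors = Counter()
        self.deleted = set()

    def record_correctable_error(self, line):
        """Record one corrected error; return True if this call deletes the line."""
        if line in self.deleted:
            return False
        self.errors[line] += 1
        if self.errors[line] > self.threshold:
            self.deleted.add(line)   # stop using the line from now on
            return True
        return False

    def usable(self, line):
        return line not in self.deleted
```

With `CacheLineMonitor(threshold=3)`, the first three errors on a line leave it in service and the fourth removes it, while other lines are unaffected.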

Fault monitoring functions

• When a POWER7-based system is powered on, BIST and POST (power-on self-test)
check processor, cache, memory, and associated hardware required for
proper booting of the operating system. If a noncritical error is detected or if the
errors occur in resources that can be removed from the system configuration, the
restarting process is designed to proceed to completion. The errors are logged in
the system nonvolatile RAM (NVRAM).

• Disk drive fault tracking is designed to alert the system administrator of an
impending disk drive failure before it impacts customer operation.

Mutual surveillance

The Service Processor monitors the operation of the firmware during the boot
process, and also monitors the Hypervisor™ for termination. The Hypervisor
monitors the Service Processor and will perform a reset/reload if it detects the loss
of the Service Processor. If the reset/reload does not correct the problem with the
Service Processor, the Hypervisor will notify the operating system and the operating
system can take appropriate action, including calling for service.
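
The Hypervisor side of this surveillance (detect loss, attempt a reset/reload, escalate to the operating system if that fails) can be sketched as below. The class `ServiceProcessor`, the function `hypervisor_check`, and the `notify_os` callback are hypothetical stand-ins for illustration; the real interfaces are not described in this announcement.

```python
class ServiceProcessor:
    """Stand-in for a service processor that may or may not survive a reset."""

    def __init__(self, healthy=True, recovers_on_reset=True):
        self.healthy = healthy
        self.recovers_on_reset = recovers_on_reset

    def heartbeat(self):
        return self.healthy

    def reset_reload(self):
        self.healthy = self.recovers_on_reset


def hypervisor_check(sp, notify_os):
    """One surveillance pass; returns 'ok', 'recovered', or 'escalated'."""
    if sp.heartbeat():
        return "ok"
    sp.reset_reload()                # first response to a lost service processor
    if sp.heartbeat():
        return "recovered"
    # Reset/reload did not help: hand the problem to the operating system.
    notify_os("service processor lost; reset/reload failed")
    return "escalated"
```

A caller might pass `notify_os=print` for experimentation, or a function that opens a service call.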

Environmental monitoring functions

POWER7-based servers include a range of environmental monitoring functions:

• Temperature monitoring warns the system administrator of potential
environmental-related problems by monitoring the air inlet temperature. When
the inlet temperature rises above a warning threshold, the system initiates an
orderly shutdown. When the temperature exceeds the critical level, or if the
temperature remains above the warning level for too long, the system will shut
down immediately.

• Fan speed is controlled by monitoring actual temperatures on critical components
and adjusting accordingly. If internal component temperatures reach critical
levels, the system will shut down immediately, regardless of fan speed. When a
