IBM P5 570 Server User Manual


 
Chapter 3. Capacity on Demand, RAS, and manageability 57
operating system has lost control. Mutual surveillance also enables the operating system to
monitor for service processor activity and can request a service processor repair action if
necessary.
Environmental monitoring
Environmental monitoring related to power, fans, and temperature is performed by the
System Power Control Network (SPCN). Environmental critical and non-critical conditions
generate Early Power-Off Warning (EPOW) events. Critical events (for example, a Class 5 AC
power loss) trigger appropriate signals from hardware to affected components so as to
prevent any data loss without operating-system or firmware involvement. Non-critical
environmental events are logged and reported using Event Scan.
The operating system cannot program or access the temperature threshold using the service
processor (SP).
EPOW events can trigger the following actions:
- Temperature monitoring increases fan rotation speed when the ambient temperature is
  above a preset operating range.
- Temperature monitoring warns the system administrator of potential environment-related
  problems. It also performs an orderly system shutdown when the operating temperature
  exceeds a critical level.
- Voltage monitoring provides a warning and an orderly system shutdown when the voltage is
  out of operational specification.
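
The mapping from environmental readings to the actions listed above can be sketched in a few lines. This is a minimal illustration only, not SPCN or firmware code: the threshold values, the Reading type, the action names, and the epow_actions function are all hypothetical, chosen to mirror the three actions described in the text.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration; real values are set by the platform.
FAN_SPEEDUP_C = 35.0     # upper edge of the preset operating range
CRITICAL_C = 45.0        # critical operating temperature
VOLT_TOLERANCE = 0.05    # allowed deviation from nominal voltage (5%)


@dataclass
class Reading:
    """One hypothetical environmental sample."""
    ambient_c: float
    voltage: float
    nominal_voltage: float


def epow_actions(r):
    """Return the ordered list of actions a reading would trigger."""
    actions = []

    def add(action):
        # Keep each action once, in the order it was first requested.
        if action not in actions:
            actions.append(action)

    if r.ambient_c > CRITICAL_C:
        # Critical temperature: warn, then shut down in an orderly way.
        add("warn_admin")
        add("orderly_shutdown")
    elif r.ambient_c > FAN_SPEEDUP_C:
        # Above the preset range but not critical: speed up fans and warn.
        add("increase_fan_speed")
        add("warn_admin")

    deviation = abs(r.voltage - r.nominal_voltage) / r.nominal_voltage
    if deviation > VOLT_TOLERANCE:
        # Voltage out of operational specification.
        add("warn_admin")
        add("orderly_shutdown")

    return actions
```

For example, epow_actions(Reading(38.0, 12.0, 12.0)) yields the fan speed-up and warning actions, while a reading above the critical temperature or outside the voltage tolerance yields a warning followed by an orderly shutdown.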
3.2.4 Self-healing
For a system to be self-healing, it must be able to recover from a failing component by first
detecting and isolating the failed component, taking it offline, fixing or isolating it, and
reintroducing the fixed or replacement component into service without any application
disruption. Examples include:
- Bit steering to redundant memory in the event of a failed memory module to keep the
  server operational
- Bit-scattering, thus allowing for error correction and continued operation in the presence
  of a complete chip failure (Chipkill™ recovery)
- Single-bit error correction using ECC without reaching error thresholds for main, L2, and
  L3 cache memory
- L3 cache line deletes extended from 2 to 10 for additional self-healing
- ECC extended to inter-chip connections on fabric and processor bus
- Memory scrubbing to help prevent soft-error memory faults
- Dynamic processor deallocation, in which a deallocated processor can be replaced by an
  unused CoD processor to keep the system operational
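
To make the ECC and bit-scattering items above concrete, the following toy sketch uses an extended Hamming code (4 data bits plus 4 check bits) that corrects single-bit errors and detects double-bit errors. Each bit of a codeword is stored on a different simulated chip, so a whole-chip failure corrupts at most one bit per word. The code width, the eight-chip layout, and all function names are assumptions made for illustration; they do not describe the p5-570's actual ECC word format.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (index 0 = overall parity)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d   # data bit positions
    code[1] = code[3] ^ code[5] ^ code[7]    # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]    # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]    # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2              # makes the whole word even parity
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                  # XOR of positions holding a 1
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1                  # single-bit error at that position
        status = "corrected"
    else:
        return "uncorrectable", None         # double-bit error: detect only
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


def store(nibbles):
    """Scatter each codeword across 8 chips: chips[c][w] is bit c of word w."""
    words = [encode(n) for n in nibbles]
    return [[w[c] for w in words] for c in range(8)]


def read(chips):
    """Gather one bit per chip for every word and decode it."""
    n_words = len(chips[0])
    return [decode([chips[c][w] for c in range(8)]) for w in range(n_words)]


data = [0x0, 0x5, 0xA, 0xF]
chips = store(data)
chips[6] = [1] * len(data)                   # simulate chip 6 failing stuck-at-1
results = read(chips)                        # every word still decodes to its data

damaged = encode(0x9)
damaged[1] ^= 1
damaged[7] ^= 1                              # two flipped bits in one word
double_bit_result = decode(damaged)          # detected, not corrected
```

Because the failed chip contributes only one bit to each word, every word is either unaffected or repaired by single-bit correction. Had a full word been stored on the failed chip instead, the error would have exceeded what the code can correct.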
Memory reliability, fault tolerance, and integrity
The p5-570 uses Error Checking and Correcting (ECC) circuitry for system memory to correct
single-bit memory failures and to detect double-bit memory failures. Detection of double-bit
memory failures helps maintain data integrity. Furthermore, the memory chips are organized
such that the failure of any specific memory module affects only a single bit within a
four-bit ECC word (bit-scattering), thus allowing for error correction and continued
operation in the presence of a complete chip failure (Chipkill recovery). The memory DIMMs
also use memory scrubbing and thresholding to determine when spare memory modules within
each bank of memory should be used to replace ones that have exceeded their threshold of
error count