IBM F80 Server User Manual


 
IBM RS/6000 7025 Model F80 Server
http://techsupport.services.ibm.com/rs6k/fixes.html
If you have problems downloading the latest maintenance level, ask your IBM
Business Partner or IBM representative.
Investment Protection and Expansion
The following sections discuss how configurations, upgrades, and design features
help you lower your cost of ownership.
High Availability
Reliability of the system is further hardened by using the HACMP clustering
solution available across the entire range of RS/6000 servers. The HACMP
solution exploits redundancy between server resources and provides application
uptime. The Model F80 is available in a high-availability cluster solution package
named the HA-F80. This solution consists of the following components:
Two Model 7025-F80 Enterprise Servers
AIX Version 4.3.3 operating system (unlimited user license), or later
HACMP 4.3.1 cluster software, or later
One 7133-T40 SSA disk subsystem with at least four disk drives
All necessary redundant hardware and cables
This solution is sold at a price lower than the sum of its parts. Ask your IBM
Business Partner or IBM representative for further information.
Reliability, Availability, and Serviceability (RAS) Features
Some RAS features, such as redundant power supplies and N+1 hot-plug fans, have
already been discussed. Additional topics are covered in the following sections.
Error Recovery for Caches and Memory
The RS64 III processor L1 cache, the L2 cache, the system buses, and the memory
are protected by error correction code (ECC) logic. The ECC logic provides
single-bit error correction and double-bit error detection for the L2 cache and
the memory. All recovered error events are reported by an attention interrupt to
the service processor, where they are monitored for threshold conditions.
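The behavior described above (correct any single-bit error, detect any double-bit error) can be illustrated with a small extended Hamming (8,4) code. This is only a sketch: the source does not specify the code or word width used in the F80 hardware, so the 4-bit data width and the names `encode` and `decode` are assumptions made for illustration.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                       # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]    # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]    # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]    # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2              # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions of all set bits
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd total parity: exactly one bit flipped, at index `syndrome`
        # (index 0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even total parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

In this sketch, a "corrected" result corresponds to a recovered error event that would be reported for threshold monitoring, while "uncorrectable" corresponds to a detected double-bit error.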
The standard memory card has single-error-correct, double-error-detect ECC
circuitry to correct single-bit memory failures. The double-bit detection helps
maintain data integrity by detecting and reporting multiple errors beyond what the
ECC circuitry can correct. In many cases (for example, when using DIMMs with 18
DRAM chips and when memory is configured in quads), memory chips are organized
such that the failure of any specific memory module affects only a single bit
within an ECC word (bit scattering), thus allowing for error correction and
continued operation in the presence of a complete chip failure (chip kill recovery).
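To see why bit scattering lets single-bit correction survive a whole-chip failure, the following sketch compares two ways of wiring chip output bits into ECC words. The geometry (18 chips, 4 bits per chip, 18-bit words) and the function names are hypothetical and chosen only for illustration; the source does not describe the actual F80 memory layout.

```python
# Hypothetical geometry for illustration only; not the F80 memory layout.
CHIPS = 18          # DRAM chips feeding one group of ECC words
BITS_PER_CHIP = 4   # assumed number of data pins per chip


def scattered(chip, pin):
    """Bit scattering: each pin of a chip feeds a different ECC word."""
    return pin, chip                     # (ecc_word, bit_position_in_word)


def packed(chip, pin):
    """Naive layout: neighboring bits of an ECC word come from the same chip."""
    flat = chip * BITS_PER_CHIP + pin
    return flat // CHIPS, flat % CHIPS   # (ecc_word, bit_position_in_word)


def worst_case_bad_bits(layout, failed_chip):
    """Largest number of bits a dead chip corrupts in any single ECC word."""
    hits = {}
    for pin in range(BITS_PER_CHIP):
        word, _ = layout(failed_chip, pin)
        hits[word] = hits.get(word, 0) + 1
    return max(hits.values())
```

With the scattered layout, `worst_case_bad_bits` is 1 for every chip, which stays within what single-bit correction can repair. With the packed layout, a dead chip corrupts several bits of the same word, which single-bit correction cannot repair.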
Another function, named memory scrubbing, is a built-in hardware function that
performs continuous background reads of data from memory, checking for
correctable errors. Correctable errors are corrected and rewritten to memory,
and a threshold counter is maintained that signals the service processor with a
special attention when the threshold is exceeded.
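The scrubbing loop can be modeled in software as follows. This is a simplified sketch, not the hardware design: the `Scrubber` class, the `notify` callback standing in for the service processor attention, and the use of a 7-bit Hamming code as the correction step are all assumptions made for illustration.

```python
def hamming74_correct(word):
    """Check a 7-bit Hamming codeword; return (word, had_error) after repair."""
    syndrome = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            syndrome ^= pos              # XOR of the positions of all set bits
    if syndrome == 0:
        return word, False
    return word ^ (1 << (syndrome - 1)), True   # flip the faulty bit back


class Scrubber:
    """Background reader that repairs correctable errors and counts them."""

    def __init__(self, memory, correct, threshold, notify):
        self.memory = memory        # list of stored codewords
        self.correct = correct      # function: word -> (fixed_word, had_error)
        self.threshold = threshold  # corrected-error count tolerated silently
        self.notify = notify        # stand-in for the service processor signal
        self.corrected = 0

    def scrub_pass(self):
        """Read every location once, rewriting any corrected word."""
        for addr, word in enumerate(self.memory):
            fixed, had_error = self.correct(word)
            if had_error:
                self.memory[addr] = fixed
                self.corrected += 1
                if self.corrected > self.threshold:
                    self.notify(addr, self.corrected)
```

In this model, `notify` fires on every corrected error once the count exceeds the threshold; real hardware may signal only once or reset the counter, which the source does not specify.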