IBM F80 Server User Manual

IBM RS/6000 7025 Model F80 Server

http://techsupport.services.ibm.com/rs6k/fixes.html

If you have problems downloading the latest maintenance level, ask your IBM Business Partner or IBM representative.

Investment Protection and Expansion

The following sections discuss how configurations, upgrades, and design features help you lower your cost of ownership.

High Availability

System reliability is further strengthened by the HACMP clustering solution, which is available across the entire range of RS/6000 servers. HACMP exploits redundancy between server resources to keep applications available. The Model F80 is available in a high-availability cluster solution package named the HA-F80. This solution consists of the following components:

• Two Model 7025-F80 Enterprise Servers

• AIX Version 4.3.3 operating system (unlimited user license), or later

• HACMP 4.3.1 cluster software, or later

• One 7133-T40 SSA disk subsystem with at least four disk drives

• All necessary redundant hardware and cables

This solution is sold at a price lower than the sum of its parts. Ask your IBM Business Partner or IBM representative for further information.

Reliability, Availability, and Serviceability (RAS) Features

Some RAS features, such as redundant power supplies and N+1 hot-plug fans, have already been discussed. Additional topics are covered in the following sections.

Error Recovery for Caches and Memory

The RS64 III processor L1 cache, the L2 cache, the system buses, and the memory are protected by error correction code (ECC) logic. For the L2 cache and the memory, ECC provides single-bit error correction and double-bit error detection. All recovered error events are reported to the service processor by an attention interrupt, where they are monitored for threshold conditions.
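The single-bit correction and double-bit detection behavior described above can be illustrated with a small sketch. It uses an extended Hamming code over 4 data bits, which is a textbook stand-in chosen for brevity and not the code implemented in the F80 hardware. The function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SEC-DED) word."""
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8                      # index 0 holds the overall parity bit
    word[3], word[5], word[6], word[7] = d
    word[1] = word[3] ^ word[5] ^ word[7]   # parity over positions 1, 3, 5, 7
    word[2] = word[3] ^ word[6] ^ word[7]   # parity over positions 2, 3, 6, 7
    word[4] = word[5] ^ word[6] ^ word[7]   # parity over positions 4, 5, 6, 7
    word[0] = sum(word[1:]) % 2             # overall parity over positions 1..7
    return word


def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    overall_odd = sum(word) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd overall parity: treat as a single flipped bit and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble


stored = encode(0b1011)

one_error = list(stored)
one_error[5] ^= 1                       # single-bit failure
print(decode(one_error))                # ('corrected', 11)

two_errors = list(stored)
two_errors[3] ^= 1
two_errors[6] ^= 1                      # double-bit failure
print(decode(two_errors))               # ('uncorrectable', None)
```

In this sketch a "corrected" result corresponds to a recovered error event that would be reported and counted, while an "uncorrectable" result corresponds to a detected double-bit error.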

The standard memory card has single-error-correct, double-error-detect ECC circuitry to correct single-bit memory failures. Double-bit detection helps maintain data integrity by detecting and reporting multiple-bit errors beyond what the ECC circuitry can correct. In many cases (for example, when using DIMMs with 18 DRAM chips and when memory is configured in quads), memory chips are organized so that the failure of any specific memory module affects only a single bit within an ECC word (bit scattering), allowing error correction and continued operation in the presence of a complete chip failure (chip kill recovery).
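The effect of bit scattering can be shown with a small counting sketch. The geometry below (4 ECC words of 8 bits, served by 8 chips that are each 4 bits wide) is an assumption picked to keep the numbers small and does not describe the actual F80 memory card layout. The function names are hypothetical.

```python
WORDS = 4          # ECC words fetched together (assumed)
BITS_PER_WORD = 8  # bits per ECC word (assumed)
CHIP_WIDTH = 4     # bits each chip supplies per access (assumed)


def packed_layout():
    """Map (word, bit) -> chip with neighboring bits of a word on the same chip."""
    return {(w, b): (w * BITS_PER_WORD + b) // CHIP_WIDTH
            for w in range(WORDS) for b in range(BITS_PER_WORD)}


def scattered_layout():
    """Map (word, bit) -> chip so each chip holds one bit of every word."""
    return {(w, b): b
            for w in range(WORDS) for b in range(BITS_PER_WORD)}


def worst_case_bits_lost(layout, failed_chip):
    """Largest number of bits any single ECC word loses if failed_chip dies."""
    hits = [0] * WORDS
    for (word, _bit), chip in layout.items():
        if chip == failed_chip:
            hits[word] += 1
    return max(hits)


print(worst_case_bits_lost(packed_layout(), failed_chip=0))     # 4
print(worst_case_bits_lost(scattered_layout(), failed_chip=0))  # 1
```

With the packed layout, one dead chip takes out four bits of the same ECC word, which single-bit correction cannot repair. With the scattered layout, the same failure costs each word only one bit, so every word stays within what single-bit correction can handle.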

Another function, named memory scrubbing, is a built-in hardware function that performs continuous background reads of data from memory, checking for correctable errors. Correctable errors are corrected and rewritten to memory, and a threshold counter is maintained that signals the service processor with a special attention when the threshold is exceeded.
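The scrubbing loop described above can be sketched in software as follows. This is only an illustration of the idea: the real function runs in hardware, and the Hamming(7,4) code, the `Scrubber` class, the `notify` callback, and the threshold value are all assumptions introduced for the example.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits as a Hamming(7,4) word (index 0 is unused)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    w = [0] * 8
    w[3], w[5], w[6], w[7] = d
    w[1] = w[3] ^ w[5] ^ w[7]
    w[2] = w[3] ^ w[6] ^ w[7]
    w[4] = w[5] ^ w[6] ^ w[7]
    return w


def syndrome(word):
    """Return 0 for a clean word, else the position of a single flipped bit."""
    s = 0
    for pos in range(1, 8):
        if word[pos]:
            s ^= pos
    return s


class Scrubber:
    """Background reader that repairs correctable errors and counts them."""

    def __init__(self, memory, threshold, notify):
        self.memory = memory        # list of stored ECC words
        self.threshold = threshold  # corrected-error count that triggers a signal
        self.notify = notify        # stands in for the attention to the service processor
        self.corrected = 0

    def scrub_pass(self):
        """Read every word once; correct, rewrite, and count any single-bit error."""
        for addr, word in enumerate(self.memory):
            s = syndrome(word)
            if s:
                word[s] ^= 1        # correct the bit and write it back in place
                self.corrected += 1
                if self.corrected > self.threshold:
                    self.notify(addr, self.corrected)


memory = [hamming74_encode(n) for n in (1, 2, 3, 4)]
memory[0][3] ^= 1                   # inject three single-bit soft errors
memory[1][6] ^= 1
memory[3][1] ^= 1

alerts = []
scrubber = Scrubber(memory, threshold=2,
                    notify=lambda addr, count: alerts.append((addr, count)))
scrubber.scrub_pass()
print(scrubber.corrected, alerts)   # 3 [(3, 3)]
```

Because each error is repaired as soon as the scrubber reads it, single-bit errors are less likely to accumulate into an uncorrectable double-bit error in the same word, and the counter gives early warning of a module that is degrading.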
