IBM 750 Server User Manual

IBM United States Hardware Announcement

110-009

IBM is a registered trademark of International Business Machines Corporation

integration testing processes. During system manufacturing, systems go through a

thorough testing process to help ensure the highest level of product quality.

The system cache and memory offer ECC (error checking and correcting) fault-

tolerant features. ECC is designed to correct environmentally induced, single-bit,

intermittent memory failures and single-bit hard failures. With ECC, the likelihood of

memory failures will be substantially reduced. ECC also provides double-bit memory

error detection that helps protect data in the event of a double-bit memory failure.

The AIX and IBM i operating systems provide disk drive mirroring and disk drive

controller duplexing. The Linux operating system supports disk drive mirroring

(RAID 1) through software, while other RAID protection schemes are provided via

hardware RAID adapters.

The Journaled File System, also known as JFS or JFS2, helps maintain file system

consistency and reduces the likelihood of data loss when the system is abnormally

halted due to a power failure. JFS, the recommended file system for 32-bit kernels,

now supports extents on the Linux operating system. This feature is designed

to substantially reduce or eliminate fragmentation. Its successor, JFS2, is the

recommended file system for 64-bit kernels.

With 64-bit addressing, a maximum file system size of 32 TB and maximum file

size of 16 TB, JFS2 is highly recommended for systems running the AIX operating

system.

Memory error correction extensions

The memory has single-bit-error correction and double-bit-error detection ECC

circuitry. The ECC code is also designed such that the failure of any one specific

memory module within an ECC word by itself can be corrected absent any other

fault.
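
The single-bit-correct, double-bit-detect behavior can be illustrated with a toy extended Hamming code. This is a minimal sketch only: the 4-data-bit word size and the names `encode` and `decode` are assumptions made for illustration, and the ECC words actually used in the memory subsystem are wider and are not described in this document.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED word (extended Hamming code)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    w = [0] * 8
    # Data bits occupy positions 3, 5, 6, 7; check bits occupy 1, 2, 4.
    w[3], w[5], w[6], w[7] = d
    w[1] = w[3] ^ w[5] ^ w[7]
    w[2] = w[3] ^ w[6] ^ w[7]
    w[4] = w[5] ^ w[6] ^ w[7]
    # Position 0 is an overall parity bit: the full word has even parity.
    w[0] = sum(w[1:]) % 2
    return w


def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    w = list(word)
    # The syndrome is the XOR of the positions of all bits that are set.
    syndrome = 0
    for pos in range(1, 8):
        if w[pos]:
            syndrome ^= pos
    odd_parity = sum(w) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # One bit flipped; the syndrome is its position (0 = overall parity bit).
        w[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, detect only.
        return "uncorrectable", None
    nibble = w[3] | (w[5] << 1) | (w[6] << 2) | (w[7] << 3)
    return status, nibble
```

Flipping any one bit of `encode(n)` still decodes to `n` with status `corrected`, while flipping any two bits is reported as `uncorrectable` rather than silently returning wrong data.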

Memory protection features include scrubbing to detect errors, a means to call for

the deallocation of memory pages for a pattern of correctable errors detected, and

signaling deallocation of a logical memory block when an error occurs that cannot be

corrected by the ECC code.
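
The two deallocation paths described above can be sketched as a small policy object. This is an illustration under stated assumptions, not the firmware's actual logic: the page size, logical memory block size, error threshold, and all names (`MemoryErrorPolicy`, `record_error`) are hypothetical.

```python
from collections import Counter

PAGE_SIZE = 4096                 # assumed page size in bytes
LMB_SIZE = 16 * 1024 * 1024      # assumed logical memory block size in bytes
CE_THRESHOLD = 3                 # assumed correctable-error count per page


class MemoryErrorPolicy:
    """Tracks ECC events reported by scrubbing or normal memory reads."""

    def __init__(self, ce_threshold=CE_THRESHOLD):
        self.ce_threshold = ce_threshold
        self.ce_counts = Counter()
        self.deallocated_pages = set()
        self.deallocated_blocks = set()

    def record_error(self, address, correctable):
        """Return the action requested for one ECC event at `address`."""
        if not correctable:
            # Uncorrectable error: signal deallocation of the whole block.
            self.deallocated_blocks.add(address // LMB_SIZE)
            return "deallocate_block"
        page = address // PAGE_SIZE
        self.ce_counts[page] += 1
        if self.ce_counts[page] >= self.ce_threshold:
            # A pattern of correctable errors: ask for the page to be retired.
            self.deallocated_pages.add(page)
            return "deallocate_page"
        return "corrected"
```

For example, three correctable errors reported inside the same page yield `corrected`, `corrected`, then `deallocate_page`, whereas a single uncorrectable error yields `deallocate_block` immediately.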

Redundancy for array self-healing

Although the most likely failure event in a processor is a soft single-bit error in

one of its caches, other events can occur, and they need to be distinguished from

one another. For caches and their directories, hardware and firmware keep track

of whether errors are being corrected beyond a threshold. If the threshold is exceeded, a deferred

repair error log is created.

Caches and directories on the POWER7 chip are manufactured with spare bits

in their arrays that can be accessed via programmable steering logic to replace

faulty bits in the respective arrays. This is analogous to the redundant bit steering

employed in main storage as a mechanism that is designed to help avoid physical

repair, and is also implemented in POWER7 systems. The steering logic is activated

during processor initialization and is initiated by the built-in self-test (BIST) at

power-on time.
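
The idea of steering a faulty bit to a spare can be modeled in a few lines. This is a simplified sketch with hypothetical names (`SteeredRow`, `self_test`), a single array row, and a simulated stuck-at fault; the actual steering logic and BIST patterns on the chip are not described in this document.

```python
class SteeredRow:
    """One array row with spare cells and a programmable steering map."""

    def __init__(self, width, spares):
        self.width = width
        self.cells = [0] * (width + spares)   # physical cells, spares last
        self.stuck = {}                       # simulated faults: cell -> stuck value
        self.free_spares = list(range(width, width + spares))
        self.steer = {}                       # logical bit -> spare cell

    def _physical(self, bit):
        return self.steer.get(bit, bit)

    def write(self, bits):
        for bit, value in enumerate(bits):
            self.cells[self._physical(bit)] = value

    def read(self):
        out = []
        for bit in range(self.width):
            cell = self._physical(bit)
            # A stuck cell always returns its stuck value.
            out.append(self.stuck.get(cell, self.cells[cell]))
        return out

    def self_test(self):
        """Write all-zeros and all-ones; steer any bit that misreads to a spare."""
        repaired = []
        for pattern in (0, 1):
            self.write([pattern] * self.width)
            for bit, value in enumerate(self.read()):
                if value != pattern and self.free_spares:
                    self.steer[bit] = self.free_spares.pop(0)
                    repaired.append(bit)
        return repaired
```

With bit 2 stuck at 0, writing `[1, 1, 1, 1]` reads back with bit 2 wrong; after `self_test()` remaps bit 2 to the spare cell, subsequent writes read back correctly without any physical repair.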

When the number of correctable cache errors exceeds a set threshold, systems using the POWER7

processor invoke a dynamic cache line delete function, which enables them to stop

using the faulty cache lines and eliminates exposure to greater problems.

Fault monitoring functions

• When a POWER7 processor-based system is powered on, BIST and POST (power-

on self-test) check processor, cache, memory, and associated hardware required

for proper booting of the operating system. If a noncritical error is detected or if

the errors occur in resources that can be removed from the system configuration,

the restarting process is designed to proceed to completion. The errors are logged

in the system nonvolatile RAM (NVRAM).
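
The power-on decision described above can be sketched as follows. This is an illustrative model only: `CheckResult`, `power_on_checks`, the resource names, and the list standing in for NVRAM are hypothetical and do not reflect the actual firmware interfaces.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    resource: str
    passed: bool
    critical: bool = False    # failure would normally prevent booting
    removable: bool = False   # resource can be dropped from the configuration


def power_on_checks(results):
    """Return (boot_proceeds, deconfigured_resources, nvram_log)."""
    nvram_log = []        # stand-in for the error log kept in NVRAM
    deconfigured = []
    for result in results:
        if result.passed:
            continue
        nvram_log.append(f"{result.resource}: self-test failed")
        if result.removable:
            # Remove the failing resource and keep going.
            deconfigured.append(result.resource)
        elif result.critical:
            # Critical and not removable: the boot cannot complete.
            return False, deconfigured, nvram_log
    return True, deconfigured, nvram_log
```

A failing but removable memory module is logged and deconfigured while the boot continues; only a failure that is both critical and not removable stops the process in this model.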
