IBM 710 Server User Manual

IBM Europe, Middle East, and Africa Hardware Announcement ZG10-0214

IBM is a registered trademark of International Business Machines Corporation

The system cache and memory offer ECC (error checking and correcting) fault-tolerant features. ECC is designed to correct environmentally induced single-bit intermittent memory failures as well as single-bit hard failures, which reduces the likelihood of memory failures. ECC also provides double-bit memory error detection, which helps protect data in the event of a double-bit memory failure.
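
The single-error-correct, double-error-detect (SEC-DED) behavior described above can be illustrated with a minimal sketch of an extended Hamming(8,4) code. This is an assumption chosen for clarity: real server memory protects much wider ECC words with codes this document does not specify, and the function names `encode` and `decode` are hypothetical.

```python
from typing import List, Optional, Tuple

# Toy SEC-DED example: an extended Hamming(8,4) code. Real server memory
# protects much wider ECC words; this only illustrates the behavior.


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit codeword (index 0 = overall parity)."""
    code = [0] * 8
    # Data bits sit at positions 3, 5, 6, 7; parity bits at 1, 2, 4.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # overall parity enables double-bit detection
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return ('ok' | 'corrected' | 'uncorrectable', data or None)."""
    c = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    odd_parity = sum(c) % 2 == 1
    if odd_parity:
        # Exactly one bit flipped; syndrome 0 means the overall-parity bit itself.
        c[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even overall parity but nonzero syndrome: two bits flipped.
        return "uncorrectable", None
    else:
        status = "ok"
    data = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return status, data
```

For example, `decode(encode(0b1011))` returns `('ok', 11)`. Flipping any one bit of that codeword still decodes to 11 with status `'corrected'`, while flipping any two bits yields `('uncorrectable', None)`: the error is detected but not repaired.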

The AIX and IBM i operating systems provide disk drive mirroring and disk drive controller duplexing. The Linux operating system supports disk drive mirroring (RAID 1) through software, while other RAID protection schemes are provided by hardware RAID adapters.
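
The idea behind disk drive mirroring (RAID 1) can be shown with a toy sketch: every write goes to both member disks, and a read can be served from any surviving member. The `Mirror` class and its methods are hypothetical and do not represent the AIX, IBM i, or Linux mirroring implementations.

```python
from typing import List, Optional


class Mirror:
    """Toy RAID 1 sketch: each block is written to two member disks."""

    def __init__(self, nblocks: int) -> None:
        self.disks: List[List[Optional[bytes]]] = [[None] * nblocks, [None] * nblocks]
        self.failed = [False, False]

    def healthy(self) -> List[int]:
        return [i for i in range(2) if not self.failed[i]]

    def write(self, block: int, data: bytes) -> None:
        members = self.healthy()
        if not members:
            raise OSError("both mirror members have failed")
        for i in members:
            self.disks[i][block] = data  # same data lands on every healthy member

    def read(self, block: int) -> Optional[bytes]:
        members = self.healthy()
        if not members:
            raise OSError("both mirror members have failed")
        return self.disks[members[0]][block]  # any healthy copy will do

    def fail(self, i: int) -> None:
        self.failed[i] = True

    def rebuild(self, i: int) -> None:
        """Resynchronize a replaced member from the surviving copy."""
        source = 1 - i
        if self.failed[source]:
            raise OSError("no healthy member to rebuild from")
        self.disks[i] = list(self.disks[source])
        self.failed[i] = False
```

Because each block exists on both members, losing one disk leaves the data readable; `rebuild` models resynchronizing a replacement disk from the survivor.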

The Journaled File System (JFS, and its successor JFS2) helps maintain file system consistency and reduces the likelihood of data loss when the system is abnormally halted due to a power failure. JFS, the recommended file system for 32-bit kernels, now supports extents on the Linux operating system; extents are designed to reduce or eliminate fragmentation. JFS2 is the recommended file system for 64-bit kernels.
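
Journaling keeps a file system consistent across an abnormal halt by recording an update in a log before applying it in place. The sketch below illustrates generic write-ahead logging only; it is not the JFS/JFS2 on-disk log format, and the `JournaledStore` class, its crash flags, and the key names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, int, Optional[Dict[str, int]]]


class JournaledStore:
    """Toy write-ahead journal; NOT the JFS/JFS2 on-disk log format."""

    def __init__(self) -> None:
        self.journal: List[Record] = []     # assumed to survive a crash
        self.metadata: Dict[str, int] = {}  # the state being protected

    def update(self, changes: Dict[str, int],
               crash_before_commit: bool = False,
               crash_before_apply: bool = False) -> None:
        txid = len(self.journal)
        self.journal.append(("begin", txid, dict(changes)))  # 1. log the intent
        if crash_before_commit:
            return  # simulated power failure: no commit record written
        self.journal.append(("commit", txid, None))          # 2. commit point
        if crash_before_apply:
            return  # committed in the log, not yet applied in place
        self.metadata.update(changes)                        # 3. apply in place

    def recover(self) -> None:
        """Replay committed transactions in log order; drop uncommitted ones."""
        pending: Dict[int, Dict[str, int]] = {}
        for kind, txid, changes in self.journal:
            if kind == "begin" and changes is not None:
                pending[txid] = changes
            elif kind == "commit":
                self.metadata.update(pending.pop(txid))
```

After a simulated power failure, `recover()` reapplies every transaction that reached its commit record and ignores any that did not, so the protected state never reflects a half-finished update.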

With 64-bit addressing, a maximum file system size of 32 TB, and a maximum file size of 16 TB, JFS2 is highly recommended for systems running the AIX operating system.
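
A short calculation shows why these limits call for 64-bit addressing. It assumes that "TB" here means binary terabytes (2**40 bytes); the document does not say which unit is meant.

```python
# Assumes "TB" in the text means binary terabytes (2**40 bytes).
TIB = 2 ** 40
max_fs_bytes = 32 * TIB    # 2**45 bytes
max_file_bytes = 16 * TIB  # 2**44 bytes

# Bits needed to hold the largest byte offset (size - 1).
fs_offset_bits = (max_fs_bytes - 1).bit_length()      # 45
file_offset_bits = (max_file_bytes - 1).bit_length()  # 44

print(fs_offset_bits, file_offset_bits)  # 45 44
print(fs_offset_bits > 32)   # True: a 32-bit offset tops out at 4 GiB
print(fs_offset_bits <= 63)  # True: fits in a signed 64-bit offset
```

Both limits need more than 32 bits of byte offset but fit comfortably within a 64-bit offset.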

Memory error correction extensions

The memory has single-bit-error correction and double-bit-error detection ECC circuitry. The ECC code is also designed so that the failure of any single memory module within an ECC word can be corrected, provided no other fault is present.

Memory protection features include scrubbing to detect errors, a means to request deallocation of memory pages when a pattern of correctable errors is detected, and signaling for deallocation of a logical memory block when an error occurs that the ECC code cannot correct.
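
The policy above can be sketched as a small state tracker: scrubbing reports errors, a page is retired once it accumulates a pattern of correctable errors, and an uncorrectable error flags its whole logical memory block. The page size, logical memory block size, and "three correctable errors per page" threshold are hypothetical values, and `MemoryHealth` is not IBM firmware logic.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Iterable, Set


class MemoryHealth:
    """Toy page-retirement sketch; thresholds and sizes are hypothetical."""

    def __init__(self, page_size: int = 4096,
                 lmb_size: int = 256 * 1024 * 1024,
                 ce_threshold: int = 3) -> None:
        self.page_size = page_size        # hypothetical page size
        self.lmb_size = lmb_size          # hypothetical logical memory block size
        self.ce_threshold = ce_threshold  # hypothetical "pattern" = N errors per page
        self.ce_counts: DefaultDict[int, int] = defaultdict(int)
        self.retired_pages: Set[int] = set()
        self.deallocated_lmbs: Set[int] = set()

    def on_correctable_error(self, addr: int) -> None:
        page = addr // self.page_size
        self.ce_counts[page] += 1
        if self.ce_counts[page] >= self.ce_threshold:
            self.retired_pages.add(page)  # request page deallocation

    def on_uncorrectable_error(self, addr: int) -> None:
        self.deallocated_lmbs.add(addr // self.lmb_size)  # signal LMB deallocation

    def scrub(self, addresses: Iterable[int],
              check: Callable[[int], str]) -> None:
        """Walk memory; `check(addr)` returns 'ok', 'ce', or 'ue'."""
        for addr in addresses:
            result = check(addr)
            if result == "ce":
                self.on_correctable_error(addr)
            elif result == "ue":
                self.on_uncorrectable_error(addr)
```

A single correctable error leaves its page in service; only a repeated pattern retires it, while any uncorrectable error removes the surrounding block.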

Fault monitoring functions

• When a POWER7 processor-based system is initially powered on, BIST (built-in self-test) and POST (power-on self-test) check the processors, cache, memory, and associated hardware required to boot the operating system. If a noncritical error is detected, or if errors occur in resources that can be removed from the system configuration, the boot process is designed to proceed to completion. The errors are logged in the system nonvolatile RAM (NVRAM).

• Disk drive fault tracking is designed to alert the system administrator to an impending disk drive failure before it affects customer operations.
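
The power-on behavior in the first item can be modeled as a loop over self-test results: failed resources that can be removed are left out of the configuration and logged, and only a failure the boot depends on stops it. The `critical` flag, the `Resource` and `BootResult` types, and the list standing in for NVRAM are hypothetical simplifications.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    name: str
    critical: bool  # simplification: True if boot cannot proceed without it
    passed: bool    # outcome of BIST/POST for this resource


@dataclass
class BootResult:
    booted: bool = True
    configured: List[str] = field(default_factory=list)
    nvram_log: List[str] = field(default_factory=list)  # stand-in for NVRAM


def power_on_self_test(resources: List[Resource]) -> BootResult:
    result = BootResult()
    for r in resources:
        if r.passed:
            result.configured.append(r.name)
            continue
        result.nvram_log.append(f"self-test failed: {r.name}")
        if r.critical:
            result.booted = False  # cannot deconfigure it, so stop here
            return result
        # Removable or noncritical: leave it out of the configuration and go on.
    return result
```

For instance, a failed memory module marked noncritical is logged and dropped from the configuration, and the boot still completes with the remaining resources.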

Mutual surveillance

The Service Processor monitors the operation of the firmware during the boot process and also monitors the Hypervisor™ for termination. The Hypervisor monitors the Service Processor and performs a reset/reload if it detects the loss of the Service Processor. If the reset/reload does not correct the problem, the Hypervisor notifies the operating system, which can then take appropriate action, including calling for service.
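
The Hypervisor's side of this escalation can be sketched as a function that checks a heartbeat, attempts a reset/reload, and notifies the operating system only if the reset does not help. The callback-based interface, the one-attempt retry budget, and `FakeServiceProcessor` are hypothetical and exist only to make the flow testable.

```python
from typing import Callable


def check_service_processor(
    heartbeat_ok: Callable[[], bool],
    reset_reload: Callable[[], None],
    notify_os: Callable[[str], None],
    max_attempts: int = 1,  # hypothetical retry budget
) -> str:
    """Hypervisor-side check of the Service Processor (toy sketch)."""
    if heartbeat_ok():
        return "healthy"
    for _ in range(max_attempts):
        reset_reload()  # try to bring the Service Processor back
        if heartbeat_ok():
            return "recovered"
    # Reset/reload did not help: let the OS decide, e.g. call for service.
    notify_os("Service Processor lost; reset/reload failed")
    return "escalated"


class FakeServiceProcessor:
    """Hypothetical stand-in whose reset either works or does not."""

    def __init__(self, reset_fixes_it: bool) -> None:
        self.alive = False
        self.reset_fixes_it = reset_fixes_it
        self.resets = 0

    def heartbeat(self) -> bool:
        return self.alive

    def reset_reload(self) -> None:
        self.resets += 1
        self.alive = self.reset_fixes_it
```

Passing `messages.append` as `notify_os` collects the notifications: a recoverable Service Processor produces none, while one that stays down after the reset produces a single message for the operating system.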

Environmental monitoring functions

POWER7 processor-based servers include a range of environmental monitoring functions:

• Temperature monitoring warns the system administrator of potential environment-related problems by monitoring the air inlet temperature. When the inlet temperature rises above a warning threshold, the system initiates an orderly shutdown. When the temperature exceeds the critical level, or remains above the warning level for too long, the system shuts down immediately.
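
The temperature rules above reduce to a small decision function. The threshold values and the "too long" limit are hypothetical placeholders, since the document does not give them, and `thermal_action` is an illustrative name.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    ORDERLY_SHUTDOWN = "orderly shutdown"
    IMMEDIATE_SHUTDOWN = "immediate shutdown"


WARNING_C = 35.0             # hypothetical warning threshold (deg C)
CRITICAL_C = 40.0            # hypothetical critical threshold (deg C)
MAX_WARNING_SECONDS = 300.0  # hypothetical "too long" limit


def thermal_action(inlet_c: float, seconds_above_warning: float) -> Action:
    # Critical level, or warning level sustained too long: shut down now.
    if inlet_c > CRITICAL_C or (
        inlet_c > WARNING_C and seconds_above_warning > MAX_WARNING_SECONDS
    ):
        return Action.IMMEDIATE_SHUTDOWN
    # Above the warning threshold: begin an orderly shutdown.
    if inlet_c > WARNING_C:
        return Action.ORDERLY_SHUTDOWN
    return Action.NONE
```

With these placeholder values, an inlet reading of 37 °C starts an orderly shutdown, but the same reading held for ten minutes, or any reading above 40 °C, triggers an immediate one.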
