IBM 755 Server User Manual


 
IBM United States Hardware Announcement
110-008
IBM is a registered trademark of International Business Machines Corporation
redundant fan fails, the system calls out the failing fan and continues running.
When a nonredundant fan fails, the system shuts down immediately.
Availability enhancement functions
The POWER7 family of systems continues to offer and introduce significant
enhancements designed to increase system availability.
POWER7 processor functions
As in POWER6™, the POWER7 processor has the ability to do processor instruction
retry and alternate processor recovery for a number of core-related faults. This
significantly reduces exposure to both hard (logic) and soft (transient) errors in
the processor core. Soft failures in the processor core are transient (intermittent)
errors, often due to cosmic rays or other sources of radiation, and generally are
not repeatable. With this function, when an error is encountered in the core, the
POWER7 processor will first automatically retry the instruction. If the source of the
error was truly transient, the instruction will succeed and the system will continue as
before. On IBM systems prior to POWER6, this error would have caused a checkstop.
Hard failures are more difficult, being true logical errors that will be replicated
each time the instruction is repeated. Retrying the instruction will not help in this
situation because the instruction will continue to fail. As in POWER6, POWER7
processors have the ability to extract the failing instruction from the faulty core
and retry it elsewhere in the system for a number of faults, after which the failing
core is dynamically deconfigured and called out for replacement. The entire process
is transparent to the partition owning the failing instruction. These systems are
designed to avoid a full system outage.
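The retry and alternate processor recovery sequence described above can be sketched as a small simulation. This is an illustrative model only, not IBM firmware or hardware logic: the names Core, CoreFault, and run_with_recovery, the single retry attempt, and the event strings are all assumptions introduced for the example.

```python
from dataclasses import dataclass


class CoreFault(Exception):
    """Raised by a simulated core when an instruction fails (hypothetical)."""


@dataclass
class Core:
    name: str
    hard_fault: bool = False    # hypothetical: fault repeats on every attempt
    transient_faults: int = 0   # hypothetical: one-off faults still pending
    configured: bool = True

    def execute(self, instruction):
        if self.hard_fault:
            raise CoreFault(f"{self.name}: hard fault")
        if self.transient_faults > 0:
            self.transient_faults -= 1
            raise CoreFault(f"{self.name}: transient fault")
        return instruction()


def run_with_recovery(instruction, core, spares):
    """Run an instruction; return (result, list of recovery events)."""
    events = []
    try:
        return core.execute(instruction), events
    except CoreFault:
        events.append("retry")  # first response: retry on the same core
    try:
        return core.execute(instruction), events
    except CoreFault:
        # The retry failed too, so treat the fault as hard and move the work.
        events.append("alternate processor recovery")
    spare = next((s for s in spares if s.configured), None)
    if spare is None:
        raise RuntimeError("no spare core available")
    result = spare.execute(instruction)
    core.configured = False  # deconfigure the failing core for replacement
    events.append(f"deconfigure {core.name}")
    return result, events
```

In this model a core with one pending transient fault completes the instruction on the retry and stays configured, while a core with a hard fault has its instruction moved to a spare and is then deconfigured. The caller sees only the result, which mirrors the transparency to the owning partition described above.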
POWER7 single processor checkstopping
As in POWER6, POWER7 provides single processor checkstopping. This significantly
reduces the probability of any one processor affecting total system availability.
Partition availability priority
Also available is the ability to assign availability priorities to partitions. If an
alternate processor recovery event requires spare processor resources in order
to protect a workload, when no other means of obtaining the spare resources is
available, the system will determine which partition has the lowest priority and
attempt to claim the needed resource. On a properly configured POWER7
processor-based server, this allows that capacity to be first obtained from, for example, a test
partition instead of a financial accounting system.
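As a rough illustration of the priority-based selection, the sketch below picks the lowest-priority partition as the donor of one processor. The Partition class, the claim_processor function, and the numeric priority scale (larger means more important) are assumptions made for this example; they do not reflect the actual system interface.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    priority: int     # hypothetical scale: larger value = more important
    processors: int   # processors currently assigned


def claim_processor(partitions, protected):
    """Take one processor from the lowest-priority partition.

    The partition being protected is never chosen as the donor.
    Returns the donor partition, or None if no partition can donate.
    """
    donors = [p for p in partitions
              if p is not protected and p.processors > 0]
    if not donors:
        return None
    donor = min(donors, key=lambda p: p.priority)
    donor.processors -= 1
    return donor
```

With a test partition at priority 10 and an accounting partition at priority 200, a recovery event in a third partition would take its spare processor from the test partition and leave accounting untouched.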
POWER7 cache availability
The POWER® processor-based line of servers continues to be at the forefront of
cache availability enhancements. The L3 cache is now integrated on the POWER7
processor. The POWER7 processor provides both L2 and L3 cache line delete
functions.
Special uncorrectable error handling
Uncorrectable errors are difficult for any system to tolerate, although there are
some situations where they can be shown to be irrelevant. For example, if an
uncorrectable error occurs in cached data that will never again be read or where
a fresh write of the data is imminent, it would be unwise to "protect" the user by
forcing an immediate reboot.
Special Uncorrectable Error (SUE) handling was an IBM innovation introduced for
POWER5™ processors, where an uncorrectable error in memory or cache does not
immediately cause the system to terminate. Rather, the system tags the data and
determines whether it will ever be used again. If the error is irrelevant, it will not
force a checkstop.
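The tag-and-defer idea behind SUE handling can be modeled in a few lines. This is a conceptual sketch under stated assumptions, not the hardware mechanism: CacheLine, PoisonedDataError, and the method names are hypothetical, and the exception raised on a read of tagged data only stands in for whatever response the system actually takes, which this section does not describe.

```python
class PoisonedDataError(Exception):
    """Hypothetical stand-in for the system's response to using bad data."""


class CacheLine:
    def __init__(self, data):
        self.data = data
        self.sue = False  # tag: contents hit an uncorrectable error

    def mark_uncorrectable(self):
        # Tag the data instead of stopping the system immediately.
        self.sue = True

    def write(self, data):
        # A fresh write replaces the bad contents, so the error is irrelevant.
        self.data = data
        self.sue = False

    def read(self):
        # The error matters only if the tagged data is actually used.
        if self.sue:
            raise PoisonedDataError("tagged data was read")
        return self.data
```

A line that is tagged and then overwritten reads back normally, and a tagged line that is never read again causes no interruption at all, which is the case the text calls irrelevant.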