Configuring and Deconfiguring Processors or Memory
All failures that crash the system with a machine check or check stop, even if
intermittent, are reported as a diagnostic callout for service repair. To prevent the
recurrence of intermittent problems and improve the availability of the system until a
scheduled maintenance window, processors and memory books with a failure history
are marked “bad” to prevent their being configured on subsequent boots.
A processor or memory book is marked “bad” under the following circumstances:
- A processor or memory book fails built-in self-test (BIST) or power-on self-test
  (POST) testing during boot (as determined by the service processor).
- A processor or memory book causes a machine check or check stop during runtime,
  and the failure can be isolated specifically to that processor or memory book (as
  determined by the processor runtime diagnostics in the service processor).
- A processor or memory book reaches a threshold of recovered failures that results in
  a predictive callout (as determined by the processor runtime diagnostics in the
  service processor).
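
The three circumstances listed above can be read as a simple decision rule. The
following Python sketch is only an illustration of that rule, not service processor
code: the names Failure, FailureReport, and should_mark_bad are hypothetical and
do not appear in the firmware or in this guide.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Failure(Enum):
    """Hypothetical labels for the three circumstances in the list."""
    SELF_TEST = auto()            # BIST or POST failure during boot
    MACHINE_CHECK = auto()        # machine check or check stop during runtime
    RECOVERED_THRESHOLD = auto()  # recovered-failure threshold, predictive callout


@dataclass
class FailureReport:
    unit: str                      # for example "cpu0" or "memory-book-1"
    kind: Failure
    isolated_to_unit: bool = True  # only meaningful for MACHINE_CHECK


def should_mark_bad(report: FailureReport) -> bool:
    """Return True if the unit would be marked "bad" under the listed rules."""
    if report.kind is Failure.MACHINE_CHECK:
        # A runtime machine check or check stop counts only when the failure
        # can be isolated specifically to this processor or memory book.
        return report.isolated_to_unit
    # Self-test failures and predictive callouts mark the unit directly.
    return True
```

Under these assumptions, a machine check that cannot be isolated to a single unit
does not mark any unit "bad", while the other two circumstances always do.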
During boot time, the service processor does not configure processors or memory
books that are marked “bad.”
If a processor or memory book is deconfigured, the processor or memory book remains
offline for subsequent reboots until it is replaced or Repeat Gard is disabled. The Repeat
Gard function also provides the user with the option of manually deconfiguring a
processor or memory book, or re-enabling a previously deconfigured processor or
memory book. For information on configuring or deconfiguring a processor, see the
Processor Configuration/Deconfiguration Menu on page 33.
For information on configuring or deconfiguring a memory book, see the Memory
Configuration/Deconfiguration Menu on page 35. Both of these menus are submenus
under the System Information Menu.
You can enable or disable CPU Repeat Gard or Memory Repeat Gard using the
Processor Configuration/Deconfiguration Menu.
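
As a rough model of the behavior described in this section, the Python sketch below
shows how a record of deconfigured units could decide what is brought online at
boot. It is a simplification under stated assumptions: the function names are
hypothetical, and it assumes that disabling Repeat Gard causes the stored records to
be ignored and that replacing a unit clears its record.

```python
def units_to_configure(units, deconfigured, repeat_gard_enabled=True):
    """Return the units that would be configured at the next boot.

    units: names of all installed processors or memory books
    deconfigured: set of names marked "bad" or manually deconfigured
    """
    if not repeat_gard_enabled:
        # Assumption: with Repeat Gard disabled, earlier records are ignored.
        return list(units)
    return [u for u in units if u not in deconfigured]


def replace_unit(deconfigured, unit):
    """Model a hardware replacement: the new part starts with no record."""
    deconfigured.discard(unit)


def manually_deconfigure(deconfigured, unit):
    """Model the menu option that takes a unit offline by hand."""
    deconfigured.add(unit)


def manually_reenable(deconfigured, unit):
    """Model the menu option that re-enables a deconfigured unit."""
    deconfigured.discard(unit)
```

For example, with units cpu0 through cpu3 and cpu2 deconfigured, the model returns
cpu0, cpu1, and cpu3 while Repeat Gard is enabled, and all four units once it is
disabled or cpu2 is replaced.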
Run-Time CPU Deconfiguration (CPU Gard)
L1 instruction cache recoverable errors, L1 data cache correctable errors, and L2 cache
correctable errors are monitored by the processor runtime diagnostics (PRD) code
running in the service processor. When a predefined error threshold is met, an error log
with warning severity and threshold exceeded status is returned to AIX. At the same
time, PRD marks the CPU for deconfiguration at the next boot. AIX will attempt to
migrate all resources associated with that processor to another processor and then stop
the defective processor.
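
The threshold behavior described above can be sketched in Python as follows. This
is an illustration only: the class name PrdMonitor, the error-type labels, and the
threshold value are hypothetical, since this guide says only that the threshold is
predefined. The sketch also assumes a single count per CPU pooled across the three
monitored error types, which the guide does not specify.

```python
from collections import Counter

# Hypothetical labels for the three monitored error types.
MONITORED = {"L1I_RECOVERABLE", "L1D_CORRECTABLE", "L2_CORRECTABLE"}


class PrdMonitor:
    def __init__(self, threshold=3):  # 3 is an arbitrary illustrative value
        self.threshold = threshold
        self.counts = Counter()            # recovered errors seen per CPU
        self.marked_for_deconfig = set()   # CPUs to leave out at next boot
        self.error_log = []                # entries that would go to AIX

    def record(self, cpu, error_type):
        """Count one recovered error; return True when the threshold is met."""
        if error_type not in MONITORED:
            raise ValueError(f"not a monitored error type: {error_type}")
        self.counts[cpu] += 1
        if self.counts[cpu] >= self.threshold and cpu not in self.marked_for_deconfig:
            # Mark the CPU for deconfiguration at the next boot and, at the
            # same time, log a warning with threshold exceeded status.
            self.marked_for_deconfig.add(cpu)
            self.error_log.append(
                {"cpu": cpu, "severity": "warning", "status": "threshold exceeded"}
            )
            return True
        return False
```

In this model, the caller that receives True plays the role of AIX: it would attempt
to migrate the resources of that processor to another processor and then stop it.
The migration itself is outside the scope of the sketch.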