IBM P5 570 Server User Manual


 
p5-570 Technical Overview and Introduction
3.2.2 First Failure Data Capture
If a problem occurs, the ability to diagnose it correctly is a fundamental requirement
upon which improved availability is based. The p5-570 incorporates advanced capability in
start-up diagnostics and in run-time First Failure Data Capture (FFDC) based on strategic
error checkers built into the chips.
Any errors detected by the pervasive error checkers are captured into Fault Isolation
Registers (FIRs, shown in Figure 3-1), which can be interrogated by the service processor
(SP). The SP in the p5-570 can access system components either through
special-purpose service processor ports or by accessing the error registers.
Figure 3-1 Schematic of Fault Isolation Register implementation
The FIRs are important because they enable an error to be uniquely identified, thus enabling
the appropriate action to be taken. Appropriate actions might include such things as a bus
retry, ECC correction, or system firmware recovery routines. Recovery routines could include
dynamic deallocation of potentially failing components.
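As an illustration of how a FIR value can identify an error and select a response, the following is a minimal sketch in Python. The bit layout, the names Fir, ACTIONS, and decode_fir, and the pairing of bits with actions are all hypothetical; actual FIR bit assignments are chip-specific and are not given in this document.

```python
from enum import IntFlag


class Fir(IntFlag):
    """Hypothetical FIR bit layout (illustration only, not the real hardware layout)."""
    BUS_PARITY = 0x1
    MEMORY_ECC_CORRECTABLE = 0x2
    CACHE_UNCORRECTABLE = 0x4


# Hypothetical mapping from each error bit to one of the actions named in the text.
ACTIONS = {
    Fir.BUS_PARITY: "bus retry",
    Fir.MEMORY_ECC_CORRECTABLE: "ECC correction",
    Fir.CACHE_UNCORRECTABLE: "firmware recovery: dynamic deallocation",
}


def decode_fir(value: int) -> list:
    """Return the action for every error bit that is set in a FIR value."""
    return [action for bit, action in ACTIONS.items() if value & bit]
```

Because each checker owns its own bit, a single register read tells the service processor which checkers fired, which is what allows the response to be matched to the specific error.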
Errors are logged into the system non-volatile random access memory (NVRAM) and the SP
event history log, along with a notification of the event to AIX for capture in the operating
system error log. Diagnostic Error Log Analysis (diagela) routines analyze the error log
entries and invoke a suitable action such as issuing a warning message. If the error can be
recovered, or after suitable maintenance, the service processor resets the FIRs so that they
can accurately record any future errors.
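The capture, log, analyze, and reset sequence described above can be sketched as a toy model. This is only an illustration under stated assumptions: the ServiceProcessor class, its fields, and the analyze function are hypothetical stand-ins, and they do not represent the actual service processor firmware or the diagela routines.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceProcessor:
    """Toy model of the logging path; every name here is hypothetical."""
    fir: int = 0                                       # latched error bits
    nvram_log: list = field(default_factory=list)      # stands in for system NVRAM
    event_history: list = field(default_factory=list)  # stands in for the SP event history log
    os_error_log: list = field(default_factory=list)   # stands in for the AIX error log

    def capture(self, error_bits: int) -> None:
        """An error checker latches its bits into the FIR."""
        self.fir |= error_bits

    def log_and_notify(self) -> None:
        """Record the latched FIR value in each of the three logs."""
        if self.fir == 0:
            return  # nothing captured, nothing to log
        entry = {"fir": self.fir}
        self.nvram_log.append(entry)
        self.event_history.append(entry)
        self.os_error_log.append(entry)

    def reset_fir(self) -> None:
        """Clear the FIR after recovery or maintenance."""
        self.fir = 0


def analyze(os_error_log: list) -> list:
    """Stand-in for error log analysis: produce one warning per logged entry."""
    return [f"warning: FIR=0x{entry['fir']:x}" for entry in os_error_log]
```

In this model, skipping reset_fir would leave old bits latched, so a later error could not be told apart from the earlier one; that is the reason the text gives for resetting the FIRs.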
The ability to correctly diagnose any pending or firm errors is a key requirement before any
dynamic or persistent component deallocation or any other reconfiguration can take place.
3.2.3 Permanent monitoring
The SP included in the p5-570 provides a way to monitor the system even when the
main processor is inoperable. The next subsection offers a more detailed description of
the monitoring functions in the p5-570.
Mutual surveillance
The SP can monitor the operation of the firmware during the boot process, and it can monitor
the operating system for loss of control. This enables the service processor to take
appropriate action, including calling for service, when it detects that the firmware or the
[Figure 3-1 labels: CPU, L1 cache, L2/L3 cache, memory, error checkers, Fault Isolation Register (FIR; unique fingerprint of each error captured), service processor, log error, non-volatile RAM, disk]