IBM DS8000 Computer Drive User Manual


 
64 DS8000 Series: Concepts and Architecture
Reliability, availability, and serviceability
Excellent quality and reliability are inherent in all aspects of the IBM eServer p5 design and
manufacturing. The fundamental objective of the design approach is to minimize outages.
The RAS features help to ensure that the system performs reliably and efficiently handles
any failures that may occur. This is achieved by using capabilities that are provided by
the hardware, AIX 5L, and RAS code written specifically for the DS8000. The following
sections describe the RAS leadership features of IBM eServer p5 systems in more detail.
Fault avoidance
POWER5 systems are built to keep errors from ever happening. This quality-based design
includes such features as reduced power consumption and cooler operating temperatures for
increased reliability, enabled by the use of copper chip circuitry, SOI (silicon on insulator), and
dynamic clock-gating. It also uses mainframe-inspired components and technologies.
First Failure Data Capture
If a problem should occur, the ability to diagnose it correctly is a fundamental requirement
upon which improved availability is based. The p5 570 incorporates advanced capability in
start-up diagnostics and in run-time First Failure Data Capture (FFDC) based on strategic
error checkers built into the chips.
Any errors that are detected by the pervasive error checkers are captured into Fault Isolation
Registers (FIRs), which can be interrogated by the service processor (SP). The SP in the p5
570 can access system components either through special-purpose service processor ports
or by reading the error registers.
The FIRs are important because they enable an error to be uniquely identified, thus enabling
the appropriate action to be taken. Appropriate actions might include such things as a bus
retry, ECC (error checking and correction), or system firmware recovery routines. Recovery
routines could include dynamic deallocation of potentially failing components.
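
The mapping from a captured FIR value to a recovery action can be illustrated with a minimal sketch. The bit positions, the FIR_BIT_ACTIONS table, and the decode_fir function are assumptions made for illustration only; real FIR layouts are chip-specific and are not described in this document.

```python
from enum import Enum
from typing import List


class Action(Enum):
    BUS_RETRY = "bus retry"
    ECC_CORRECT = "ECC correction"
    FIRMWARE_RECOVERY = "firmware recovery routine"


# Hypothetical bit assignments; actual FIR layouts differ per chip.
FIR_BIT_ACTIONS = {
    0: Action.BUS_RETRY,
    1: Action.ECC_CORRECT,
    2: Action.FIRMWARE_RECOVERY,
}


def decode_fir(fir_value: int) -> List[Action]:
    """Return the action for every set bit that has a known meaning."""
    return [
        action
        for bit, action in sorted(FIR_BIT_ACTIONS.items())
        if fir_value & (1 << bit)
    ]
```

Because each bit identifies one error source, a value with several bits set decodes to several independent actions, which is the sense in which the FIRs allow an error to be uniquely identified.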
Errors are logged into the system non-volatile random access memory (NVRAM) and the SP
event history log, along with a notification of the event to AIX for capture in the operating
system error log. Diagnostic Error Log Analysis (diagela) routines analyze the error log
entries and invoke a suitable action, such as issuing a warning message. Once the error has
been recovered, or after suitable maintenance, the service processor resets the FIRs so that
they can accurately record any future errors.
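
The capture, log, and reset cycle described above can be sketched as a small model. The class ServiceProcessorModel, its single integer FIR, and the three in-memory lists are hypothetical stand-ins for the NVRAM log, the SP event history log, and the AIX error log; they are not part of any IBM interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServiceProcessorModel:
    fir: int = 0  # hypothetical single FIR, one bit per error checker
    nvram_log: List[int] = field(default_factory=list)
    sp_event_log: List[int] = field(default_factory=list)
    os_error_log: List[int] = field(default_factory=list)

    def record_error(self, bit: int) -> None:
        """An error checker latches its bit in the FIR."""
        self.fir |= 1 << bit

    def service(self, recovered: bool) -> None:
        """Log the captured FIR value; clear the FIR only after recovery."""
        if self.fir == 0:
            return
        snapshot = self.fir
        self.nvram_log.append(snapshot)
        self.sp_event_log.append(snapshot)
        self.os_error_log.append(snapshot)  # stands in for the notification to AIX
        if recovered:
            self.fir = 0  # reset so that future errors are recorded accurately
```

In this sketch an unrecovered error leaves the FIR latched, modeling the case where the reset waits for maintenance.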
The ability to correctly diagnose any pending or firm errors is a key requirement before any
dynamic or persistent component deallocation or any other reconfiguration can take place.
Permanent monitoring
The SP that is included in the p5 570 provides a way to monitor the system even when the
main processor is inoperable. The next subsection offers a more detailed description of the
monitoring functions in the p5 570.
Mutual surveillance
The SP can monitor the operation of the firmware during the boot process, and it can monitor
the operating system for loss of control. This enables the service processor to take
appropriate action when it detects that the firmware or the operating system has lost control.
Mutual surveillance also enables the operating system to monitor for service processor
activity and to request a service processor repair action if necessary.
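
One common way to implement this kind of two-way monitoring is a pair of heartbeat timers, sketched below. The document does not state how the p5 570 implements surveillance, so the Heartbeat class, the surveil function, the timeout values, and the action strings are all illustrative assumptions.

```python
from typing import List


class Heartbeat:
    """Tracks the last time a party reported that it was alive."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen = 0.0

    def beat(self, now: float) -> None:
        self.last_seen = now

    def lost(self, now: float) -> bool:
        # Considered lost once the silence exceeds the timeout.
        return now - self.last_seen > self.timeout_s


def surveil(os_hb: Heartbeat, sp_hb: Heartbeat, now: float) -> List[str]:
    """Each side watches the other; return the actions that are due."""
    actions = []
    if os_hb.lost(now):
        actions.append("SP: operating system lost control, take recovery action")
    if sp_hb.lost(now):
        actions.append("OS: request service processor repair action")
    return actions
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to exercise.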
Environmental monitoring
Environmental monitoring related to power, fans, and temperature is performed by the
System Power Control Network (SPCN). Environmental critical and non-critical conditions