3-2 SPARC Enterprise T1000 Server Service Manual • April 2007
■ ALOM CMT firmware – Is the system firmware that runs on the system
controller. In addition to providing the interface between the hardware and OS,
ALOM CMT also tracks and reports the health of key server components. ALOM
CMT works closely with POST and Solaris Predictive Self-Healing technology to
keep the system up and running even when there is a faulty component.
■ Power-on self-test (POST) – Performs diagnostics on system components upon
system reset to ensure the integrity of those components. POST is configurable
and works with ALOM CMT to take faulty components offline if needed and
blacklist them in the asr-db.
■ Solaris OS Predictive Self-Healing (PSH) – This technology continuously
monitors the health of the CPU and memory, and works with ALOM CMT to take
a faulty component offline if needed. The Predictive Self-Healing technology
enables systems to accurately predict component failures and mitigate many
serious problems before they occur.
■ Log files and console messages – Provide the standard Solaris OS log files and
investigative commands that can be accessed and displayed on the device of your
choice.
■ SunVTS™ – An application that exercises the system, provides hardware
validation, and discloses possible faulty components with recommendations for
repair.
The LEDs, ALOM CMT, Solaris OS PSH, and many of the log files and console
messages are integrated. For example, a fault detected by the Solaris PSH software
displays the fault, logs it, passes information to ALOM CMT where it is logged, and,
depending on the fault, might illuminate one or more LEDs.
The flow chart in FIGURE 3-1 and the descriptions in TABLE 3-1 outline an approach for using the server
diagnostics to identify a faulty field-replaceable unit (FRU). The diagnostics you use,
and the order in which you use them, depend on the nature of the problem you are
troubleshooting, so you might perform some actions and not others.
The flow chart assumes that you have already performed some basic troubleshooting,
such as verifying proper installation, visually inspecting cables and power, and
possibly resetting the server (refer to the SPARC Enterprise T1000 Server
Installation Guide and SPARC Enterprise T1000 Server Administration Guide for
details).
FIGURE 3-1 is a flow chart of the diagnostics available to troubleshoot faulty
hardware.
TABLE 3-1 provides more information about each diagnostic described in this chapter.
Note – POST is configured with ALOM CMT configuration variables (TABLE 3-6). If
diag_level is set to max (diag_level=max), POST reports all detected FRUs,
including memory devices with errors that are correctable by Predictive Self-Healing (PSH).
Thus, not all memory devices detected by POST need to be replaced. See
Section 3.4.5, “Correctable Errors Detected by POST” on page 3-35.
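As a rough illustration of why a POST report produced with diag_level=max needs triage, the following Python sketch separates reported memory devices whose errors are correctable by PSH from those that may be replacement candidates. The record type (PostFinding), its fields, and the device names are hypothetical and do not reflect the actual POST output format; the sketch only models the rule stated in the note above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PostFinding:
    """Hypothetical record for one memory device reported by POST."""
    device: str        # illustrative device name, not a real POST identifier
    correctable: bool  # True if the error is correctable by PSH


def triage(findings: List[PostFinding]) -> Tuple[List[str], List[str]]:
    """Split POST findings into (replacement candidates, monitor only)."""
    replace: List[str] = []
    monitor: List[str] = []
    for finding in findings:
        if finding.correctable:
            # PSH can correct these errors, so the device is not
            # automatically a candidate for replacement.
            monitor.append(finding.device)
        else:
            replace.append(finding.device)
    return replace, monitor


if __name__ == "__main__":
    report = [
        PostFinding("dimm0", correctable=True),   # hypothetical device
        PostFinding("dimm1", correctable=False),  # hypothetical device
    ]
    to_replace, to_watch = triage(report)
    print("replace:", to_replace)  # replace: ['dimm1']
    print("monitor:", to_watch)    # monitor: ['dimm0']
```

On a real system, use the procedure in Section 3.4.5 rather than this simplified rule to decide whether a device reported by POST must be replaced.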