Chapter 1
Overview
Server Errors
To support high availability (HA), the new chipset includes functionality for error detection, correction,
and recovery. Errors in the new chipset are divided into the following categories:
- Protection domain access
- Hardware correctable
- Global shared memory
- Hardware uncorrectable
- Fatal
- Blocking timeout
- Deadlock recovery errors
These categories are listed in increasing severity, ranging from protection domain (PD) access errors, which
are caused by software or hardware running in another PD, to deadlock recovery errors, which indicate a
serious hardware failure that requires a reset of the cell to recover. The term "software" refers to privileged
code, such as PDC or the OS, but not to user code. The sx2000 chipset supports the PD concept, where user
and software errors in one PD cannot affect another PD.
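The severity ordering above can be sketched as an ordered enumeration. This is only an illustration: the names `ErrorCategory` and `most_severe` are hypothetical, and the integer values express relative order only, not any encoding used by the sx2000 chipset.

```python
from enum import IntEnum


class ErrorCategory(IntEnum):
    """Error categories in increasing severity.

    The integer values are assumptions chosen to preserve the order
    given in the text; they are not hardware encodings.
    """
    PD_ACCESS = 0
    HW_CORRECTABLE = 1
    GLOBAL_SHARED_MEMORY = 2
    HW_UNCORRECTABLE = 3
    FATAL = 4
    BLOCKING_TIMEOUT = 5
    DEADLOCK_RECOVERY = 6


def most_severe(categories):
    """Return the most severe category from a non-empty iterable."""
    return max(categories)
```

With this ordering, a comparison such as `ErrorCategory.PD_ACCESS < ErrorCategory.FATAL` holds, so a list of logged categories can be reduced to its worst case with `most_severe`.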
Protection Domain Access Errors
PD access errors are caused by disallowed transactions that originate outside the PD. Packets from outside the
coherency set should not impact the interface, and some packets from within the coherency set but outside
the PD are handled as PD access errors. These errors typically result from a software error or faulty
hardware in another PD. They do not indicate a hardware failure in the reporting cell.
An example of a PD access error is an interrupt from a cell outside the PD that is not part of the interrupt
protection set. For these errors, the sx2000 chipset typically drops the transaction or converts it to a harmless
transaction, and logs the error. No error is signaled. PD access level errors themselves do not result in the
block entering No_shared mode or fatal error mode.
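The interrupt example can be modeled in a few lines of software. This is a conceptual sketch, not the chipset's implementation: `deliver_interrupt`, the set-based protection set, and the list used as an error log are all hypothetical stand-ins.

```python
def deliver_interrupt(source_cell, protection_set, error_log):
    """Model the handling of an incoming interrupt.

    Returns True if the interrupt is delivered, or False if it is
    dropped because the source cell is outside the interrupt
    protection set. All names here are illustrative assumptions.
    """
    if source_cell in protection_set:
        return True
    # Outside the protection set: drop the transaction and log it.
    # No error is signaled, so nothing is raised to the caller.
    error_log.append(("pd_access", source_cell))
    return False
```

In this model, a dropped interrupt leaves only a log entry behind, which mirrors the point that a PD access error is recorded but not signaled.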
Hardware Correctable Errors
Hardware correctable errors are errors that the hardware can correct on its own. A typical example is a
single-bit ECC error. For these errors, the sx2000 chipset corrects and logs the error. No
direct notification is given to software that an error has occurred (no LPMC is generated). For firmware or
software to detect that an error has occurred, the error logs must be read.
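Because no notification is generated, software that wants to know about corrected errors has to poll. The sketch below models that pattern under stated assumptions: `CorrectedErrorLog` is a hypothetical software stand-in for the hardware error logs, and the read-and-clear behavior is an assumption made for the example, not a documented property of the sx2000 logs.

```python
class CorrectedErrorLog:
    """Hypothetical model of a hardware log of corrected errors."""

    def __init__(self):
        self._entries = []

    def record(self, address):
        # Models the hardware silently correcting and logging an error.
        self._entries.append(address)

    def read_and_clear(self):
        # Assumption: reading returns pending entries and clears them.
        entries, self._entries = self._entries, []
        return entries


def poll_corrected_errors(log):
    """Return the corrected errors logged since the previous poll."""
    return log.read_and_clear()
```

A caller would invoke `poll_corrected_errors` periodically; an empty result means no corrected errors were logged since the last read.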
Global Shared Memory Errors
Global shared memory (GSM) is a high performance mechanism for communication between separate PDs
using GNI memory without exposing one PD to hardware or software failures in the other PD. Each PD
supports eight sharing ranges. Each of these ranges is readable and writable within the PD, and
programmable to be read_only or readable and writable to other PDs. Ranges of memory, called sharing windows,