Chapter 1
Overview
CPUs and Memories
Memory Error Protection
All of the CC cache lines are protected in memory by an error correction code (ECC). The sx2000 memory ECC
scheme is significantly different from the sx1000 memory ECC scheme. An ECC code word is contained in
each pair of 144-bit chunks. The memory data path (MDP) block is responsible for checking for and, if
necessary, correcting any correctable errors.
DRAM Erasure
A common cause of a correctable memory error is a DRAM failure, and the ability to correct this type of
memory failure in hardware is sometimes known as chip kill. A common cause of such a DRAM failure is an
address or control bit failure. Chip kill ECC schemes have added hardware logic that allows them to detect
and correct more than a single data bit error when the hardware is programmed to do so. A common
implementation of traditional chip kill is to scatter data bits from each DRAM component across multiple
ECC code words, such that only one bit from each DRAM is used per ECC code word.
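The scattering idea can be illustrated with a minimal sketch. It is not the sx2000 layout: the DRAM count and width (NUM_DRAMS, DRAM_WIDTH) and all function names are assumptions chosen for illustration. The sketch only shows that when bit j of every DRAM is routed to code word j, the loss of an entire DRAM appears as a single-bit error in each code word.

```python
# Hypothetical geometry; the real sx2000 DIMM layout is not described here.
NUM_DRAMS, DRAM_WIDTH = 18, 4

def scatter(dram_bits):
    """dram_bits[i][j] is bit j driven by DRAM i.
    Returns code words where code word j holds bit j of every DRAM."""
    width = len(dram_bits[0])
    return [[dram[j] for dram in dram_bits] for j in range(width)]

def fail_dram(dram_bits, failed):
    """Model a whole-chip failure by inverting every bit of one DRAM."""
    return [[b ^ 1 for b in bits] if i == failed else list(bits)
            for i, bits in enumerate(dram_bits)]

def errors_per_codeword(good, bad):
    """Count the bits that differ in each code word after scattering."""
    return [sum(g != b for g, b in zip(cw_good, cw_bad))
            for cw_good, cw_bad in zip(scatter(good), scatter(bad))]

good = [[0] * DRAM_WIDTH for _ in range(NUM_DRAMS)]
bad = fail_dram(good, failed=5)
print(errors_per_codeword(good, bad))  # [1, 1, 1, 1]
```

Because each code word sees at most one bad bit, an ECC that corrects single-bit errors is enough to recover from the failed component in this simplified model.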
Double chip kill is an extension to memory chip kill that enables the system to correct multiple ECC errors in
an ECC code word. HP Labs developed the ECC algorithm, and the first implementation of this technology is in
platforms using the sx2000 chipset. Double chip kill is also known as DRAM erasure.
DRAM erasure is invoked when the number of correctable memory errors exceeds a threshold and can be
invoked on a memory subsystem, bus, rank or bank. PDC tracks the errors that are seen on a memory
subsystem, bus, rank and bank in addition to the error information it tracks in the PDT.
PDC Functional Changes
There are three primary threads of control in the processor-dependent code (PDC): the bootstrap, the error
code, and the PDC procedures. The bootstrap is the primary thread of control until the OS is launched. The
boot console handler (BCH) acts as a user interface for the bootstrap, but HP support can also use it to
diagnose problems with the system.
The PDC procedures are the primary thread of control once the OS has launched. Once the OS has launched,
the PDC code is only active when the OS calls a PDC procedure or there is an error that causes the error code
to be called.
If a correctable memory error occurs during run time, the new chipset logs the error and corrects it in memory
(reactive scrubbing). Diagnostics periodically read memory module states to read the error logs. When this
PDC call is made, system firmware updates the PDT, and deletes entries older than 24 hours in the structure
that counts how many errors have occurred for each memory subsystem, bus, rank or bank. When the counts
exceed the thresholds, PDC will invoke DRAM erasure on the appropriate memory subsystem, bus, rank or
bank. Invoking DRAM erasure does not interrupt the operation of the OS.
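The counting and aging behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the PDC implementation: the class name ErrorCounter, its methods, the region tuple, and the THRESHOLD value are all hypothetical. Only the 24-hour window comes from the text.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 60 * 60  # from the text: entries older than 24 hours are deleted
THRESHOLD = 3                  # hypothetical; the actual thresholds are not given

class ErrorCounter:
    """Counts correctable errors per memory region within a 24-hour window."""

    def __init__(self, window=WINDOW_SECONDS, threshold=THRESHOLD):
        self.window = window
        self.threshold = threshold
        self.events = defaultdict(deque)  # region -> timestamps, oldest first

    def record(self, region, timestamp):
        """Log one correctable error; timestamps are assumed to arrive in order."""
        self.events[region].append(timestamp)

    def poll(self, now):
        """Delete entries older than the window, then return the regions
        whose remaining count exceeds the threshold."""
        over = []
        for region, stamps in self.events.items():
            while stamps and now - stamps[0] > self.window:
                stamps.popleft()
            if len(stamps) > self.threshold:
                over.append(region)
        return over

counter = ErrorCounter()
region = ("subsystem0", "bus1", "rank0", "bank2")  # hypothetical identifiers
for t in (0, 100, 200, 300):
    counter.record(region, t)
print(counter.poll(now=400))                   # four errors in the window: over threshold
print(counter.poll(now=WINDOW_SECONDS + 150))  # two errors aged out: back under threshold
```

In this sketch, poll stands in for the periodic read of memory module states: aging and the threshold check happen only when the call is made, not when each error is logged.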
When PDC invokes DRAM erasure, the information returned by reading memory module states indicates the
scope of the invocation and provides information to allow diagnostics to determine why it was invoked. PDC
also sends IPMI events indicating that DRAM erasure is in use. When PDC invokes DRAM erasure, the
correctable errors that caused DRAM erasure are removed from the PDT. Because invoking DRAM erasure
increases the latency of memory accesses and reduces the ability of ECC to detect multi-bit errors, it is
important to notify the customer that the memory subsystem needs to be serviced. HP recommends that the
memory subsystem be serviced within a month of invoking DRAM erasure on a customer machine.
The thresholds for invoking DRAM erasure are incremental so that PDC invokes DRAM erasure on the
smallest part of the memory subsystem necessary to protect the system against another bit being in error.
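One way to picture incremental thresholds is sketched below. The threshold values, the (subsystem, bus, rank, bank) location tuple, and the function names are assumptions made for illustration; the actual PDC thresholds and data structures are not documented here. The sketch checks the smallest scope first and moves to a larger scope only when no smaller one has exceeded its threshold.

```python
from collections import Counter

SCOPES = ("bank", "rank", "bus", "subsystem")  # smallest to largest
DEPTH = {"subsystem": 1, "bus": 2, "rank": 3, "bank": 4}
# Hypothetical thresholds that grow with the size of the scope.
THRESHOLDS = {"bank": 2, "rank": 4, "bus": 8, "subsystem": 16}

def scope_key(location, scope):
    """location is (subsystem, bus, rank, bank); return the prefix naming the scope."""
    return location[:DEPTH[scope]]

def smallest_scope_to_erase(error_locations):
    """Return (scope, key) for the smallest scope whose error count exceeds
    its threshold, or None if no threshold is exceeded."""
    for scope in SCOPES:
        counts = Counter(scope_key(loc, scope) for loc in error_locations)
        for key, count in counts.items():
            if count > THRESHOLDS[scope]:
                return scope, key
    return None

# Three errors in one bank: erasure is limited to that bank.
print(smallest_scope_to_erase([(0, 0, 0, 1)] * 3))
# One error in each of five banks of the same rank: the rank is the smallest
# scope that covers the pattern.
print(smallest_scope_to_erase([(0, 0, 0, b) for b in range(5)]))
```

Under these assumed values, errors concentrated in one bank trigger erasure on that bank alone, while errors spread across a rank escalate to the rank only after the larger rank threshold is crossed.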