C-6 PPC405 Core User’s Manual
C.2.5 Scalar Store Instructions
Cachable stores that miss in the DCU, and noncachable stores, are queued in the data cache so that
the store appears to execute in a single cycle if operand-aligned. Under certain conditions, the DCU
can pipeline up to three store instructions. (See Chapter 4, “Cache Operations,” for more information.)
stwcx. instructions that do not cause alignment errors execute in two cycles.
C.2.6 Alignment in Scalar Load and Store Instructions
The PPC405 requires an extra cycle to execute scalar loads and stores having unaligned big or little
endian data (except for lwarx and stwcx., which require word-aligned operands). If the target data is
not operand aligned, and the sum of the least two significant bits of the effective address (EA) and the
byte count is greater than four, the PPC405 decomposes a load or store scalar into two load or store
operations. That is, the PPC405 never presents the DCU with a request for a transfer that crosses a
word boundary. For example, a lwz with an EA of 0b11 causes the PPC405 to decompose the lwz
into two load operations. The first load operation is for a byte at the starting effective address; the
second load operation is for three bytes, starting at the next word address.
C.2.7 String and Multiple Instructions
Calculating execution times for string and multiple instructions (lmw and stmw) instructions requires
an understanding of data alignment, and of the behavior of the string instructions with respect to
alignment.
In the following example, the string contains 21 bytes. The first three bytes do not begin on a word
boundary, and the final two bytes do not end on a word boundary. The PPC405 handles any
unaligned leading bytes as a special case, then moves as many bytes as aligned words as possible,
and finally handles any unaligned trailing bytes as a special case.
In the following example, arrows indicate word boundaries (the address is an exact multiple of four);
shaded boxes represent unaligned bytes.
The execution time of the string instruction is the sum of the:
1. Cycles required to handle unaligned leading bytes; if any, add one clock cycle.
In the example, there are unaligned leading bytes; this transfer adds one clock cycle.
2. Cycles required to handle the number of word-aligned transfers required. Assuming data cache
hits, each word-aligned transfer requires one clock cycle.
In the example, there are four aligned words; this transfer requires four clock cycles.
3. Cycles required to handle unaligned trailing bytes; if any, add one clock cycle.
In the example, there are unaligned trailing bytes; this transfer adds one clock cycle.
A string instruction operating on the example 21-byte string requires six clock cycles.