B-38 March, 2003 Developer’s Manual
Intel® 80200 Processor based on Intel® XScale™ Microarchitecture
Optimization Guide
B.5.1.2. Scheduling Load and Store Multiple (LDM/STM)
LDM and STM instructions have an issue latency of 2-20 cycles, depending on the number of
registers being loaded or stored. Assuming a data cache hit, the issue latency is typically 2 cycles
plus one additional cycle for each register loaded or stored. The instruction following an LDM
stalls whether or not it depends on the results of the load. An LDRD or STRD instruction does not
suffer from this drawback (except when followed by a memory operation) and should be used
where possible. Consider the task of adding two 64-bit integer values. Assume that the addresses
of these values are aligned on an 8-byte boundary. This can be achieved using LDM instructions
as shown below:
; r0 contains the address of the first 64-bit value
; r1 contains the address of the second 64-bit value
ldm r0, {r2, r3}
ldm r1, {r4, r5}
adds r0, r2, r4
adc r1, r3, r5
Assuming all accesses hit the cache, the code shown above takes 11 cycles to complete. Rewriting
it with LDRD instructions, as shown below, takes only 7 cycles. Performance improves further if
other instructions can be scheduled after each LDRD to reduce the stalls caused by the result
latencies of the LDRD instructions.
; r0 contains the address of the first 64-bit value
; r1 contains the address of the second 64-bit value
ldrd r2, [r0]
ldrd r4, [r1]
adds r0, r2, r4
adc r1, r3, r5
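
To make the arithmetic explicit, the following Python sketch models what the ADDS/ADC pair computes: ADDS adds the low words and records the carry, and ADC adds the high words plus that carry. The names `add64_via_words` and `split64` are hypothetical, and the sketch assumes that r2 and r4 hold the low words and r3 and r5 the high words, as the register usage in the listing implies. It illustrates the data flow only and says nothing about cycle counts.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def add64_via_words(a_lo, a_hi, b_lo, b_hi):
    """Add two 64-bit values given as (low, high) 32-bit word pairs.

    Mirrors the ADDS/ADC pair: the low-word add produces a carry
    that the high-word add consumes. The result wraps at 64 bits.
    """
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 32                          # carry flag set by ADDS
    result_lo = lo_sum & MASK32                   # adds r0, r2, r4
    result_hi = (a_hi + b_hi + carry) & MASK32    # adc  r1, r3, r5
    return result_lo, result_hi


def split64(value):
    """Split a 64-bit integer into (low, high) 32-bit words."""
    return value & MASK32, (value >> 32) & MASK32
```

For example, `add64_via_words(0xFFFFFFFF, 0, 1, 0)` returns `(0, 1)`: the low words wrap to zero and the carry increments the high word.
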
Stores behave similarly: the STM sequence shown below takes 5 cycles to complete.
stm r0, {r2, r3}
add r1, r1, #1
The alternative version shown below, which uses STRD, takes only 3 cycles to complete.
strd r2, [r0]
add r1, r1, #1
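
The cycle counts for the store example are consistent with the rule of thumb given at the start of this section. The sketch below encodes that rule in a hypothetical helper, `ldm_stm_issue_latency`, and assumes (this is not stated above) that the following ADD issues in a single cycle. It covers only the typical cache-hit case and is not a cycle-accurate model of the core.

```python
def ldm_stm_issue_latency(num_registers):
    """Typical LDM/STM issue latency on a data cache hit.

    Rule of thumb from the text: 2 cycles plus 1 cycle per register.
    """
    if not 1 <= num_registers <= 16:
        raise ValueError("LDM/STM transfers between 1 and 16 registers")
    return 2 + num_registers


ADD_ISSUE_CYCLES = 1  # assumption: a simple ADD issues in one cycle

# stm r0, {r2, r3} followed by add r1, r1, #1
stm_sequence_cycles = ldm_stm_issue_latency(2) + ADD_ISSUE_CYCLES

STRD_SEQUENCE_CYCLES = 3  # figure quoted in the text for the STRD version
cycles_saved = stm_sequence_cycles - STRD_SEQUENCE_CYCLES
```

With two registers the helper gives 4 cycles for the STM, so the STM sequence comes to 5 cycles, matching the figure quoted above, and the STRD version saves 2 cycles.
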