B-38 March, 2003 Developer’s Manual

Intel® 80200 Processor based on Intel® XScale™ Microarchitecture

Optimization Guide

B.5.1.2. Scheduling Load and Store Multiple (LDM/STM)

LDM and STM instructions have an issue latency of 2-20 cycles, depending on the number of
registers being loaded or stored. The issue latency is typically 2 cycles plus one additional
cycle for each register being loaded or stored, assuming a data cache hit. The instruction
following an LDM stalls whether or not it depends on the results of the load. An LDRD or STRD
instruction does not suffer from this drawback (except when followed by a memory operation)
and should be used where possible.

Consider the task of adding two 64-bit integer values. Assume that the addresses of these
values are aligned on an 8-byte boundary. This can be achieved using LDM instructions as
shown below:

; r0 contains the address of the first 64-bit value
; r1 contains the address of the second 64-bit value
; on exit, r0 holds the low word of the sum and r1 the high word
ldm r0, {r2, r3}
ldm r1, {r4, r5}
adds r0, r2, r4
adc r1, r3, r5

If the code were written as shown above, and assuming all the accesses hit the cache, it would
take 11 cycles to complete. Rewriting the code as shown below using LDRD instructions would
take only 7 cycles. Performance would improve further if other, independent instructions can
be scheduled after the LDRD instructions to reduce the stalls caused by their result latencies.

; r0 contains the address of the first 64-bit value
; r1 contains the address of the second 64-bit value
; on exit, r0 holds the low word of the sum and r1 the high word
ldrd r2, [r0]
ldrd r4, [r1]
adds r0, r2, r4
adc r1, r3, r5
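
For readers less familiar with the adds/adc idiom, the following is a minimal C sketch of the arithmetic that both assembly sequences perform: the low words are added first, and the carry out of that addition is folded into the sum of the high words. The type u64_halves and the function add64_by_halves are hypothetical names introduced only for illustration, and the sketch assumes a little-endian layout in which the lower-numbered register of each pair (r2, r4) holds the low word.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit words, mirroring a register pair
 * such as r2 (low word) and r3 (high word). Hypothetical helper type. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_halves;

/* Add two 64-bit values one 32-bit word at a time. */
u64_halves add64_by_halves(u64_halves a, u64_halves b)
{
    u64_halves sum;
    uint32_t carry;

    sum.lo = a.lo + b.lo;          /* like adds: wraps modulo 2^32       */
    carry = (sum.lo < a.lo);       /* 1 if the low-word addition wrapped */
    sum.hi = a.hi + b.hi + carry;  /* like adc: add with carry in        */
    return sum;
}
```

The choice between LDM and LDRD does not change this arithmetic; it only changes how quickly the four input words reach the registers.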

Similarly, the code sequence shown below takes 5 cycles to complete.

stm r0, {r2, r3}
add r1, r1, #1

The alternative version shown below takes only 3 cycles to complete.

strd r2, [r0]
add r1, r1, #1
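
The cycle counts for the two store sequences can be checked with a little arithmetic. The C sketch below applies the rule of thumb given at the start of this section (2 cycles plus 1 cycle per register for LDM/STM, assuming cache hits). The 2-cycle STRD issue latency and the 1-cycle issue latency for the add are assumptions chosen to be consistent with the 3-cycle total quoted above, and all function and constant names are hypothetical. The model deliberately ignores result latencies, so it is not applied to the load examples, where the adds must wait for loaded data.

```c
#include <assert.h>

/* ASSUMED per-instruction issue latencies, in cycles (see lead-in). */
enum {
    STRD_ISSUE_CYCLES = 2,  /* assumption: STRD not followed by a memory op */
    ALU_ISSUE_CYCLES  = 1   /* assumption: simple add                       */
};

/* Typical LDM/STM issue latency under the rule of thumb in the text:
 * 2 cycles plus 1 cycle per register transferred, on a cache hit. */
int ldm_stm_issue_cycles(int num_regs)
{
    assert(num_regs >= 1 && num_regs <= 16);
    return 2 + num_regs;
}

/* stm of num_regs registers followed by one independent add. */
int stm_then_add_cycles(int num_regs)
{
    return ldm_stm_issue_cycles(num_regs) + ALU_ISSUE_CYCLES;
}

/* strd followed by one independent add. */
int strd_then_add_cycles(void)
{
    return STRD_ISSUE_CYCLES + ALU_ISSUE_CYCLES;
}
```

With two registers, stm_then_add_cycles(2) gives 4 + 1 = 5 cycles and strd_then_add_cycles() gives 2 + 1 = 3 cycles, matching the figures stated for the two sequences.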
