

 
Intel® 80200 Processor based on Intel® XScale Microarchitecture
Developer's Manual, March 2003
Optimization Guide
B.5 Instruction Scheduling
This section discusses instruction scheduling optimizations. Instruction scheduling refers to the
rearrangement of a sequence of instructions for the purpose of minimizing pipeline stalls. Reducing
the number of pipeline stalls improves application performance. While making this rearrangement,
care should be taken to ensure that the rearranged sequence of instructions has the same effect as
the original sequence of instructions.
B.5.1 Scheduling Loads
On the Intel® 80200 processor, an LDR instruction has a result latency of 3 cycles, assuming the
data being loaded is in the data cache. If the instruction after the LDR needs to use the result of the
load, then it would stall for 2 cycles. If possible, the instructions surrounding the LDR instruction
should be rearranged to avoid this stall. Consider the following example:
add r1, r2, r3
ldr r0, [r5]
add r6, r0, r1
sub r8, r2, r3
mul r9, r2, r3
In the code shown above, the ADD instruction following the LDR would stall for 2 cycles because
it uses the result of the load. The code can be rearranged as follows to prevent the stalls:
ldr r0, [r5]
add r1, r2, r3
sub r8, r2, r3
add r6, r0, r1
mul r9, r2, r3
Note that this rearrangement may not always be possible. Consider the following example:
cmp r1, #0
addne r4, r5, #4
subeq r4, r5, #4
ldr r0, [r4]
cmp r0, #10
In the example above, the LDR instruction cannot be moved before the ADDNE or the SUBEQ
instructions because the LDR instruction depends on the result of these instructions. The code can,
however, be rewritten to run faster, at the expense of increased code size, by folding the address
computation into two conditionally executed loads:
cmp r1, #0
ldrne r0, [r5, #4]
ldreq r0, [r5, #-4]
addne r4, r5, #4
subeq r4, r5, #4
cmp r0, #10
The optimized code takes six cycles to execute compared to the seven cycles taken by the
unoptimized version.
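
The cycle counts quoted in this section can be checked with a small model. The following Python
sketch is not part of the processor documentation. It assumes a single-issue, in-order pipeline in
which LDR has a result latency of 3 cycles (a data-cache hit, as stated above) and every other
instruction has a result latency of 1 cycle. That second figure is a simplification: MUL, for
instance, has a longer result latency, but its result is never consumed in these examples. Condition
flag dependencies are ignored, and the names count_cycles, LOAD_LATENCY, and ALU_LATENCY are
hypothetical.

```python
# Toy model of an in-order, single-issue pipeline. This is an illustrative
# assumption, not a cycle-accurate simulator of the 80200.
LOAD_LATENCY = 3  # LDR result latency on a data-cache hit (from the text)
ALU_LATENCY = 1   # assumed for every non-load instruction


def count_cycles(program):
    """Return (total_cycles, stall_cycles) for a list of
    (mnemonic, destination register or None, source registers) tuples."""
    ready = {}   # register -> first cycle in which its new value can be used
    cycle = 0    # earliest cycle in which the next instruction can issue
    stalls = 0
    for mnemonic, dest, sources in program:
        # An instruction waits until all of its source registers are ready.
        issue = max([cycle] + [ready.get(reg, 0) for reg in sources])
        stalls += issue - cycle
        latency = LOAD_LATENCY if mnemonic.startswith("ldr") else ALU_LATENCY
        if dest is not None:
            ready[dest] = issue + latency
        cycle = issue + 1
    return cycle, stalls


# First example: the ADD directly after the LDR consumes r0.
load_use = [
    ("add", "r1", ["r2", "r3"]),
    ("ldr", "r0", ["r5"]),
    ("add", "r6", ["r0", "r1"]),
    ("sub", "r8", ["r2", "r3"]),
    ("mul", "r9", ["r2", "r3"]),
]

# First example after rescheduling: two independent instructions fill the gap.
load_use_scheduled = [
    ("ldr", "r0", ["r5"]),
    ("add", "r1", ["r2", "r3"]),
    ("sub", "r8", ["r2", "r3"]),
    ("add", "r6", ["r0", "r1"]),
    ("mul", "r9", ["r2", "r3"]),
]

# Second example: the load address depends on ADDNE/SUBEQ.
dependent_load = [
    ("cmp", None, ["r1"]),
    ("addne", "r4", ["r5"]),
    ("subeq", "r4", ["r5"]),
    ("ldr", "r0", ["r4"]),
    ("cmp", None, ["r0"]),
]

# Second example rewritten with conditionally executed loads.
conditional_loads = [
    ("cmp", None, ["r1"]),
    ("ldrne", "r0", ["r5"]),
    ("ldreq", "r0", ["r5"]),
    ("addne", "r4", ["r5"]),
    ("subeq", "r4", ["r5"]),
    ("cmp", None, ["r0"]),
]

for name, program in [
    ("load_use", load_use),
    ("load_use_scheduled", load_use_scheduled),
    ("dependent_load", dependent_load),
    ("conditional_loads", conditional_loads),
]:
    total, stalled = count_cycles(program)
    print(f"{name}: {total} cycles, {stalled} stall cycles")
```

Under these assumptions the model reports seven cycles (two of them stalls) for the version with the
dependent load and six cycles with no stalls for the version with conditional loads, which agrees
with the figures quoted above. It also shows the two-cycle stall in the first example disappearing
once the LDR is moved ahead of the independent ADD and SUB.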
The result latency for an LDR instruction is significantly higher if the data being loaded is not in the
data cache. To minimize the number of pipeline stalls in such a situation, the LDR instruction should
be moved as far away as possible from the instruction that uses the result of the load. Note that this may
at times cause certain register values to be spilled to memory due to the increase in register pressure.
In such cases, use a preload instruction or a preload hint to ensure that the data access in the LDR
instruction hits the cache when it executes. A preload hint should be used in cases where we cannot be
sure whether the load instruction would be executed. A preload instruction should be used in cases
where we can be sure that the load instruction would be executed. Consider the following code sample: