Intel Processor Computer Hardware User Manual


 
Developers Manual March, 2003 B-39
Intel® 80200 Processor based on Intel® XScale Microarchitecture
Optimization Guide
B.5.2 Scheduling Data Processing Instructions
Most Intel® 80200 processor data processing instructions have a result latency of 1 cycle. This
means that an instruction can use the result of the immediately preceding data processing
instruction without stalling. However, the result latency is 2 cycles if the current instruction uses
that result as the operand of a shift by immediate. As a result, the mov instruction in the following
code segment incurs a 1 cycle stall:
sub r6, r7, r8
add r1, r2, r3
mov r4, r1, LSL #2
The code above can be rearranged as follows to remove the 1 cycle stall:
add r1, r2, r3
sub r6, r7, r8
mov r4, r1, LSL #2
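
The effect of the reordering can be checked with a small model. The following Python sketch is an illustration only, not a cycle-accurate simulator: it assumes one instruction issues per cycle and applies just the two result latencies stated above (1 cycle normally, 2 cycles when the consumer shifts the result by an immediate). The function name count_stalls and the tuple encoding of instructions are hypothetical, chosen for this example.

```python
def count_stalls(program):
    """Count stall cycles under the simplified latency model described above.

    program: list of (dest, plain_sources, shifted_source) tuples, where
    shifted_source is the register used in a shift by immediate, or None.
    """
    produced_at = {}  # register -> issue cycle of the instruction that last wrote it
    cycle = 0         # earliest cycle at which the next instruction could issue
    stalls = 0
    for dest, plain_sources, shifted_source in program:
        earliest = cycle
        for reg in plain_sources:
            if reg in produced_at:
                # Result latency of 1 cycle for an ordinary source operand.
                earliest = max(earliest, produced_at[reg] + 1)
        if shifted_source is not None and shifted_source in produced_at:
            # Result latency of 2 cycles when the result feeds a shift by immediate.
            earliest = max(earliest, produced_at[shifted_source] + 2)
        stalls += earliest - cycle
        produced_at[dest] = earliest
        cycle = earliest + 1
    return stalls


# sub r6, r7, r8 / add r1, r2, r3 / mov r4, r1, LSL #2
original = [
    ("r6", ["r7", "r8"], None),
    ("r1", ["r2", "r3"], None),
    ("r4", [], "r1"),
]

# add r1, r2, r3 / sub r6, r7, r8 / mov r4, r1, LSL #2
rearranged = [
    ("r1", ["r2", "r3"], None),
    ("r6", ["r7", "r8"], None),
    ("r4", [], "r1"),
]

print(count_stalls(original))    # 1
print(count_stalls(rearranged))  # 0
```

In the rearranged sequence the independent sub fills the cycle that the mov would otherwise spend waiting for r1.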
All data processing instructions have an issue latency of 2 cycles and a result latency of 2 cycles
when the shifter operand is a shift/rotate by a register or is RRX. Because the next instruction
always waits 1 extra cycle before it can issue, reordering cannot hide this stall; the only way to
avoid it is to rewrite the assembler instruction so that it does not use such a shifter operand.
Consider the following segment of code:
mov r3, #10
mul r4, r2, r3
add r5, r6, r2, LSL r3
sub r7, r8, r2
The sub instruction incurs a 1 cycle stall because of the 2 cycle issue latency of the add
instruction, whose shifter operand is a shift by a register. Because r3 holds the constant 10, the
stall can be avoided by using a shift by immediate instead:
mov r3, #10
mul r4, r2, r3
add r5, r6, r2, LSL #10
sub r7, r8, r2
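
The issue latency effect can be illustrated with a second small model. This Python sketch is an assumption-laden simplification: it tracks only the issue latencies stated in this section (2 cycles for a shift/rotate by register or RRX, 1 cycle otherwise) and ignores data dependencies and multiply latencies. The names ISSUE_LATENCY and issue_cycles and the shifter-kind labels are hypothetical.

```python
# Issue latency in cycles for each kind of shifter operand, as stated in this section.
ISSUE_LATENCY = {"none": 1, "imm": 1, "reg": 2, "rrx": 2}


def issue_cycles(shift_kinds):
    """Return the cycle in which each instruction issues.

    shift_kinds: one label per instruction ("none", "imm", "reg", or "rrx").
    Data dependencies are deliberately ignored in this sketch.
    """
    cycles = []
    next_free = 0
    for kind in shift_kinds:
        cycles.append(next_free)
        # The following instruction cannot issue until this one has
        # finished occupying the issue stage.
        next_free += ISSUE_LATENCY[kind]
    return cycles


# mov / mul / add ..., LSL r3 / sub
with_register_shift = ["none", "none", "reg", "none"]
# mov / mul / add ..., LSL #10 / sub
with_immediate_shift = ["none", "none", "imm", "none"]

print(issue_cycles(with_register_shift))   # [0, 1, 2, 4]
print(issue_cycles(with_immediate_shift))  # [0, 1, 2, 3]
```

In this model the sub issues one cycle earlier once the add uses a shift by immediate, which matches the 1 cycle stall described above.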