Intel Processor Computer Hardware User Manual


 
B-36 March, 2003 Developers Manual
Intel® 80200 Processor based on Intel® XScale Microarchitecture
Optimization Guide
; all other registers are in use
sub r1, r6, r7
mul r3, r6, r2
mov r2, r2, LSL #2
orr r9, r9, #0xf
add r0, r4, r5
ldr r6, [r0]
add r8, r6, r8
add r8, r8, #4
orr r8, r8, #0xf
; The value in register r6 is not used after this
In the code sample above, the ADD and LDR instructions can be moved before the MOV instruction.
This prevents pipeline stalls if the load hits the data cache. However, if the load is likely to miss
the data cache, the LDR instruction should execute as early as possible, ideally before the SUB
instruction. Moving the LDR instruction before the SUB instruction would change the program
semantics, because SUB and MUL both read the value of r6 that the LDR overwrites. It is still
possible to move the ADD and LDR instructions before the SUB instruction if the contents of
register r6 are spilled to the stack and restored afterwards, as shown below:
; all other registers are in use
str r6, [sp, #-4]!      ; spill r6 to the stack
add r0, r4, r5
ldr r6, [r0]
orr r9, r9, #0xf
add r8, r6, r8
ldr r6, [sp], #4        ; restore the original r6
add r8, r8, #4
orr r8, r8, #0xf
sub r1, r6, r7
mul r3, r6, r2
mov r2, r2, LSL #2      ; kept after MUL, which must read the unshifted r2
; The value in register r6 is not used after this
As shown above, the contents of register r6 are spilled to the stack and later loaded back into
r6 to retain the program semantics. Another way to optimize the code above is to use the preload
instruction (PLD), as shown below:
; all other registers are in use
add r0, r4, r5
pld [r0]
sub r1, r6, r7
mul r3, r6, r2
mov r2, r2, LSL #2
orr r9, r9, #0xf
ldr r6, [r0]
add r8, r6, r8
add r8, r8, #4
orr r8, r8, #0xf
; The value in register r6 is not used after this
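The preload approach can also be expressed in C. The following is a minimal sketch, not taken from
this manual, and it assumes a GCC- or Clang-compatible compiler: __builtin_prefetch is a compiler
builtin that is typically lowered to PLD on ARM targets that support it. The struct state type and
the function name update_with_preload are hypothetical and only mirror the registers used in the
assembly sample above.

```c
#include <stdint.h>

/* Hypothetical state mirroring the registers used in the assembly sample. */
struct state {
    uint32_t r1, r2, r3, r6, r7, r8, r9;
};

/* addr plays the role of r0 (r4 + r5) in the assembly sample. */
void update_with_preload(struct state *s, const uint32_t *addr)
{
    /* Assumption: GCC/Clang builtin; a hint only, it never faults. */
    __builtin_prefetch(addr);              /* pld [r0] */

    /* Independent work that can overlap with the cache fill. */
    s->r1 = s->r6 - s->r7;                 /* sub r1, r6, r7 */
    s->r3 = s->r6 * s->r2;                 /* mul r3, r6, r2 */
    s->r2 <<= 2;                           /* mov r2, r2, LSL #2 */
    s->r9 |= 0xf;                          /* orr r9, r9, #0xf */

    /* The load is issued last; the preload may already have filled the line. */
    s->r6 = *addr;                         /* ldr r6, [r0] */
    s->r8 = (s->r6 + s->r8 + 4) | 0xf;     /* add, add, orr on r8 */
}
```

The compiler is free to reschedule these statements, so the generated assembly should be inspected
to confirm that the preload is actually issued ahead of the independent work.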
The Intel® 80200 processor has four fill buffers that are used to fetch data from external memory
when a data-cache miss occurs. The processor stalls when all fill buffers are in use, which happens
when more than four loads are outstanding and are being fetched from memory. Code should therefore
ensure that no more than four loads are outstanding at the same time; for example, the number of
loads issued sequentially should not exceed four. Note also that a preload instruction may cause a
fill buffer to be used, so outstanding preload instructions must be counted together with loads
when determining the number of loads that are outstanding.
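As an illustration of this budget, the sketch below issues preloads in batches of at most four
cache lines before consuming them. It is an assumption-laden example rather than code from this
manual: it assumes a GCC- or Clang-compatible compiler (__builtin_prefetch), a 32-byte cache line
(eight 32-bit words), and that no other loads are outstanding. The names LINE_WORDS,
MAX_OUTSTANDING and sum_lines_batched are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    LINE_WORDS = 8,       /* assumption: 32-byte cache line, 4-byte words */
    MAX_OUTSTANDING = 4   /* one preload per fill buffer */
};

/* Sums n_lines cache lines of data, preloading at most four lines at a time. */
uint32_t sum_lines_batched(const uint32_t *data, size_t n_lines)
{
    uint32_t total = 0;

    for (size_t line = 0; line < n_lines; line += MAX_OUTSTANDING) {
        size_t batch_end = line + MAX_OUTSTANDING;
        if (batch_end > n_lines)
            batch_end = n_lines;

        /* Issue no more than MAX_OUTSTANDING preloads in a row. */
        for (size_t l = line; l < batch_end; l++)
            __builtin_prefetch(&data[l * LINE_WORDS]);

        /* Consume the batch before any further preloads are issued. */
        for (size_t i = line * LINE_WORDS; i < batch_end * LINE_WORDS; i++)
            total += data[i];
    }
    return total;
}
```

The batch size caps the number of preloads issued back to back; whether this avoids stalls in
practice depends on memory latency and on how the compiler schedules the loop.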
Similarly, the number of write buffers limits the number of successive writes that can be issued
before the processor stalls: no more than eight stores can be issued in succession without
stalling. Note also that if the data cache uses a write-allocate, write-back policy, a load may
cause stores to external memory when the read evicts a dirty (modified) cache line. This can
further limit the number of sequential stores.