B-26 March, 2003 Developer’s Manual
Intel® 80200 Processor based on Intel® XScale™ Microarchitecture
Optimization Guide
The Intel® 80200 processor needs seven bus clocks to process a memory request to the SDRAM (N_processor). Typical SDRAM needs 2 to 3 bus clocks to select the memory locations, provided that the current SDRAM memory page is selected (N_memwait). If the current SDRAM memory page is not selected, then an additional 3 to 4 bus cycles are required to look up the memory data locations (N_mempagewait). Thus the lookup time can range from 9 to 14 bus clock cycles. Translating this to core cycles at a ratio of six to one means between 54 and 84 core clocks.
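The lookup arithmetic above can be sketched in a few lines. This is only an illustration of the figures quoted in the text: the function and constant names are hypothetical, and the six to one core to bus ratio is the one assumed by the examples here.

```python
# Illustrative sketch of the lookup-time arithmetic; names are hypothetical.
CORE_TO_BUS_RATIO = 6   # core clocks per bus clock, as assumed in the examples
N_PROCESSOR_BUS = 7     # bus clocks the processor needs to process the request


def lookup_core_clocks(memwait_bus, mempagewait_bus=0, ratio=CORE_TO_BUS_RATIO):
    """Return the lookup time in core clocks.

    memwait_bus:     SDRAM wait in bus clocks with the page selected (2 to 3).
    mempagewait_bus: extra bus clocks when the page is not selected (3 to 4),
                     or 0 when the current page is already selected.
    """
    bus_clocks = N_PROCESSOR_BUS + memwait_bus + mempagewait_bus
    return bus_clocks * ratio


# Best case: page selected, 2 wait states. Worst case: page miss, slow SDRAM.
best_case = lookup_core_clocks(memwait_bus=2)
worst_case = lookup_core_clocks(memwait_bus=3, mempagewait_bus=4)
print(best_case, worst_case)  # 54 84
```

On a system with a different core to bus ratio, passing another `ratio` value rescales both bounds.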
N_cwfxfer
This is the number of core clocks required to transfer the first critical word of a cache line
fill operation. It takes one bus clock to transfer the first word if the data is in the lower
word address of the transfer and one additional core clock if the word is in the upper
word address range of the transfer. Thus for the examples presented here this would be 6
or 7 core clock cycles.
N_cwf for the Intel® 80200 processor works out to be 60 core clocks, or about 60 instructions, assuming 2 wait state SDRAM and that the current SDRAM memory page is selected. The second 64 bits of data are available at the next bus cycle, or 6 core clocks later.
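As a worked check of the 60 clock figure, the arithmetic can be written out as follows. The relation between N_cwf, the lookup time, and N_cwfxfer is an assumption here, inferred from the quoted numbers rather than stated in this section; the lookup term uses seven processor bus clocks plus two SDRAM wait states at the six to one ratio.

```latex
\begin{aligned}
N_{lookup} &= (7 + 2) \times 6 = 54 \text{ core clocks} \\
N_{cwf}    &= N_{lookup} + N_{cwfxfer} = 54 + 6 = 60 \text{ core clocks}
\end{aligned}
```

If the critical word falls in the upper word of the transfer, N_cwfxfer is 7 and the total becomes 61.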
N_clxfer is the minimum number of core clocks to prefetch ahead for an entire cache line. It is the sum of N_lookup and N_linexfer, where:
N_linexfer
This is the number of core clocks required to transfer one complete cache line. The Intel® 80200 processor requires 4 bus cycles to transfer the four 64-bit words of a full cache line. Given the six to one core to bus clock ratio this would be 24 core clock cycles.
N_clxfer works out to be about 78 cycles for the Intel® 80200 processor when using 2 bus cycle wait state SDRAM.
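The 78 cycle figure follows from the values given above, assuming the current SDRAM memory page is selected so that the lookup takes 9 bus clocks:

```latex
\begin{aligned}
N_{lookup}   &= (7 + 2) \times 6 = 54 \text{ core clocks} \\
N_{linexfer} &= 4 \times 6 = 24 \text{ core clocks} \\
N_{clxfer}   &= N_{lookup} + N_{linexfer} = 54 + 24 = 78 \text{ core clocks}
\end{aligned}
```

With a page miss and 3 wait state SDRAM, the same sum gives 84 + 24 = 108 core clocks.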
N_subissue
This is the maximum number of core clocks within which a subsequent bus transfer request must be made to guarantee that the transfer takes place immediately after the previous request has completed its transfer. If a request is not made in this time, then idle bus cycles occur, reducing efficiency. This time is the transfer time of the previous request. If the previous transfer was for a full cache line read or write, then this would take 24 core cycles at a six to one ratio between core and bus clocks. If the previous operation was for a half cache line, then this would be done in 12 core clocks.
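The N_subissue window can be estimated from the size of the previous transfer. The sketch below assumes one 64-bit word per bus cycle, as in the cache line example above; the function name and parameters are hypothetical.

```python
# Illustrative sketch of the N_subissue window; names are hypothetical.
CORE_TO_BUS_RATIO = 6     # core clocks per bus clock, as assumed in the examples
BITS_PER_BUS_CYCLE = 64   # one 64-bit word moves per bus cycle
CACHE_LINE_BITS = 256     # four 64-bit words per cache line


def subissue_core_clocks(previous_transfer_bits, ratio=CORE_TO_BUS_RATIO):
    """Core clocks available to issue the next request with no idle bus cycles.

    This equals the transfer time of the previous request.
    """
    bus_cycles = previous_transfer_bits // BITS_PER_BUS_CYCLE
    return bus_cycles * ratio


full_line = subissue_core_clocks(CACHE_LINE_BITS)        # full cache line
half_line = subissue_core_clocks(CACHE_LINE_BITS // 2)   # half cache line
print(full_line, half_line)  # 24 12
```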
Consider the following code sample:
add r1, r1, #1
; Sequence of instructions that use r2. These instructions leave r3 unchanged.
ldr r2, [r3]
add r3, r3, #4
mov r4, r3
sub r2, r2, #1
The sub instruction above would stall if the data being loaded misses the cache. These stalls can be avoided by using a pld instruction well ahead as shown below. The number of instructions required to ensure a stall does not occur is proportional to N_cwf for a given system.
pld [r3]
add r1, r1, #1
; Sequence of instructions that use r2. These instructions leave r3 unchanged.
ldr r2, [r3]
add r3, r3, #4
mov r4, r3
sub r2, r2, #1
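To get a rough idea of how far ahead of the load the pld should be placed, N_cwf can be divided by an average issue rate. The sketch below is an estimate under stated assumptions, not a rule from this guide: the function name and the `clocks_per_instruction` parameter are hypothetical, and the real figure depends on the instruction mix between the pld and the ldr.

```python
import math


def pld_lead_instructions(n_cwf_core_clocks, clocks_per_instruction=1.0):
    """Estimate how many instructions should separate a pld from its load.

    n_cwf_core_clocks:      core clocks until the critical word arrives (N_cwf).
    clocks_per_instruction: assumed average core clocks per instruction for
                            the code between the pld and the load.
    """
    if clocks_per_instruction <= 0:
        raise ValueError("clocks_per_instruction must be positive")
    return math.ceil(n_cwf_core_clocks / clocks_per_instruction)


# N_cwf of 60 core clocks (2 wait state SDRAM, page selected).
print(pld_lead_instructions(60))        # 60 at one instruction per clock
print(pld_lead_instructions(60, 1.5))   # 40 if instructions average 1.5 clocks
```

In the sample above, a single add between the pld and the ldr is far short of this distance, so the pld would have to be hoisted much earlier in the instruction stream to hide the full latency.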
$N_{clxfer} = N_{lookup} + N_{linexfer}$