Intel® 80200 Processor based on Intel® XScale™ Microarchitecture Optimization Guide
Developer’s Manual, March 2003
B.4 Cache and Prefetch Optimizations
This chapter considers how to use the various cache memories in all their modes and then examines
when and how to use prefetch to improve execution efficiencies.
B.4.1 Instruction Cache
The Intel® 80200 processor has separate instruction and data caches. Only fetched instructions are
held in the instruction cache, even though instructions and data may reside in the same memory
space. Functionally, the instruction cache is either enabled or disabled. There is no performance
benefit in leaving the instruction cache disabled. The one exception is that code which locks other
code into the instruction cache must itself execute from non-cached memory.
B.4.1.1. Cache Miss Cost
The Intel® 80200 processor’s performance is highly dependent on reducing the cache miss rate.
When an instruction cache miss occurs, the timing to retrieve the next instruction is the same as
that for retrieving data for the data cache. Section B.4.4.1., “Prefetch Distances in the Intel® 80200
Processor” explains the required time in more detail. Using the same assumptions
as those used for the data caches, it takes about 60 to 90 core cycles to retrieve the first
instruction. Once the first 8-byte word is read, it takes another six core cycles to read the next
two instructions, for a total of 78 to 108 clocks to fill a cache line. If the new instructions each
execute in one core cycle, the processor stalls for 4 cycles waiting for the next pair of
instructions; if that pair also executes in one cycle each, the processor stalls for another
4 cycles. Clearly, executing non-cached instructions severely curtails the processor’s
performance, so it is very important to do everything possible to minimize cache misses.
B.4.1.2. Round Robin Replacement Cache Policy
Both the data and the instruction caches use a round robin replacement policy to evict a cache line.
The simple consequence is that, in any non-trivial program, every line is eventually evicted. The
less obvious consequence is that predicting when, and over which cache lines, evictions take place
is very difficult. This information must be gained by experimentation using performance profiling.
B.4.1.3. Code Placement to Reduce Cache Misses
Code placement can greatly affect cache misses. One way to view the cache is as 32 sets of
32 bytes each, spanning an address range of 1024 bytes. When the code runs, it maps into these
32 blocks modulo 1024 of cache space, and any set that is overused thrashes the cache. The ideal
situation is for the software tools to distribute the code evenly over this space with respect to
time. This is very difficult, if not impossible, for a compiler to do alone. Most of the input
needed to estimate how best to distribute the code comes from profiling, followed by
compiler-based two-pass optimizations.