Intel 80200 Computer Hardware User Manual

Developer’s Manual March, 2003 B-17

Intel® 80200 Processor based on Intel® XScale™ Microarchitecture

Optimization Guide

B.4 Cache and Prefetch Optimizations

This chapter considers how to use the various cache memories in all their modes and then examines when and how to use prefetch to improve execution efficiency.

B.4.1 Instruction Cache

The Intel® 80200 processor has separate instruction and data caches. Only fetched instructions are held in the instruction cache, even though data and instructions may reside in the same memory space. Functionally, the instruction cache is either enabled or disabled. There is no performance benefit in not using the instruction cache. The one exception is that code which locks code into the instruction cache must itself execute from non-cached memory.

B.4.1.1. Cache Miss Cost

The performance of the Intel® 80200 processor is highly dependent on reducing the cache miss rate.

When an instruction cache miss occurs, the time to retrieve the next instruction is the same as that for retrieving data for the data cache. Section B.4.4.1., “Prefetch Distances in the Intel® 80200 Processor” provides a more detailed explanation of the required time. Using the same assumptions as those used for the data caches, it takes about 60 to 90 core cycles to retrieve the first instruction. Once the first 8-byte word is read, it takes another six core cycles to read in each subsequent pair of instructions, for a total of 78 to 108 clocks to fill a cache line (the three remaining 8-byte words add 18 cycles). If the new instructions each execute in one core cycle, the processor is stalled for 4 cycles waiting for the next pair of instructions, because each six-cycle read delivers only two cycles of work. If the next pair of instructions also execute in one cycle each, the processor is again stalled for 4 more cycles. From this it is clear that executing non-cached instructions severely curtails the processor's performance. It is very important to do everything possible to minimize cache misses.
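The arithmetic above can be reproduced with a short sketch. This is only an illustrative model, not a timing specification: the function names and default parameters are hypothetical, and the 32-byte line size is taken from the description of the cache in B.4.1.3.

```python
def line_fill_cycles(first_word_cycles, line_bytes=32, word_bytes=8,
                     next_word_cycles=6):
    """Core cycles to fill one cache line after a miss.

    first_word_cycles is the latency of the first 8-byte word
    (about 60 to 90 cycles under the manual's assumptions); each
    remaining word is assumed to arrive next_word_cycles later.
    """
    remaining_words = line_bytes // word_bytes - 1  # 3 for a 32-byte line
    return first_word_cycles + remaining_words * next_word_cycles


def stall_cycles_per_pair(next_word_cycles=6, instructions_per_word=2,
                          cycles_per_instruction=1):
    """Cycles the core waits for each further pair of instructions."""
    busy = instructions_per_word * cycles_per_instruction
    return max(0, next_word_cycles - busy)


if __name__ == "__main__":
    for first in (60, 90):
        print(first, "->", line_fill_cycles(first), "cycles per line fill")
    print("stall per instruction pair:", stall_cycles_per_pair())
```

With the manual's figures, the model gives 78 and 108 cycles for the two ends of the range and a 4-cycle stall per pair of single-cycle instructions.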

B.4.1.2. Round Robin Replacement Cache Policy

Both the data and the instruction caches use a round robin replacement policy to evict a cache line. The simple consequence of this is that at some time every line is evicted, assuming a non-trivial program. The less obvious consequence is that it is very difficult to predict when evictions take place and which cache lines they affect. This information must be gained by experimentation using performance profiling.
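To make the behavior of round robin replacement concrete, the following is a minimal simulation of a single cache set. It is a sketch under stated assumptions: the class and function names are hypothetical, the number of ways is a parameter rather than a value taken from this section, and the model assumes one victim pointer per set that advances on every miss.

```python
class RoundRobinSet:
    """One cache set with round robin replacement (illustrative model)."""

    def __init__(self, ways):
        self.lines = [None] * ways  # tags currently held in the set
        self.victim = 0             # index of the next line to replace

    def access(self, tag):
        """Return True on a hit; on a miss, replace the victim line."""
        if tag in self.lines:
            return True
        self.lines[self.victim] = tag
        self.victim = (self.victim + 1) % len(self.lines)
        return False


def count_misses(ways, tags):
    """Count misses for a sequence of line tags mapping to one set."""
    cache_set = RoundRobinSet(ways)
    return sum(1 for tag in tags if not cache_set.access(tag))


if __name__ == "__main__":
    # A loop touching exactly as many lines as there are ways fits.
    print(count_misses(4, [0, 1, 2, 3] * 3))
    # One extra line in the loop makes every access miss.
    print(count_misses(4, [0, 1, 2, 3, 4] * 3))
```

In this model, a loop that touches one more line than the set has ways misses on every access, because each line is evicted shortly before it is needed again. Real programs have far less regular access patterns, which is why the manual recommends profiling.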

B.4.1.3. Code Placement to Reduce Cache Misses

Code placement can greatly affect cache misses. One way to view the cache is to think of it as 32 sets of 32 bytes, which span an address range of 1024 bytes. When running, the code maps into 32 blocks, modulo 1024, of cache space. Any set that is overused thrashes the cache. The ideal situation is for the software tools to distribute the code evenly over this space with respect to time. This is very difficult, if not impossible, for a compiler to do. Most of the input needed to best estimate how to distribute the code comes from profiling followed by compiler-based two-pass optimizations.
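The mapping described above can be sketched as a small helper that reports how many cache lines of frequently executed code land in each set. This is an illustrative aid only: the function names and the (start address, size) region format are assumptions, and in practice the list of hot regions would come from profiling data.

```python
from collections import Counter

LINE_BYTES = 32  # bytes per cache line, as described above
NUM_SETS = 32    # sets spanning a 1024-byte address range


def set_index(address):
    """Set that a byte address maps to (address modulo 1024, in lines)."""
    return (address // LINE_BYTES) % NUM_SETS


def set_pressure(regions):
    """Count cache lines per set for (start_address, size_bytes) regions."""
    pressure = Counter()
    for start, size in regions:
        first_line = start // LINE_BYTES
        last_line = (start + size - 1) // LINE_BYTES
        for line in range(first_line, last_line + 1):
            pressure[line % NUM_SETS] += 1
    return pressure


if __name__ == "__main__":
    # Two hypothetical hot routines placed 1024 bytes apart share set 0.
    hot = [(0x0000, 64), (0x0400, 32)]
    for index, lines in sorted(set_pressure(hot).items()):
        print("set", index, "holds", lines, "hot line(s)")
```

A set whose count is much higher than the others is a candidate for thrashing; moving one of the routines so that it starts at a different offset modulo 1024 spreads the load.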
