Optimizing Cache Usage 6
The memory copy algorithm can be optimized using the Streaming
SIMD Extensions with these considerations:
• alignment of data
• proper layout of pages in memory
• cache size
• interaction of the translation lookaside buffer (TLB) with memory
accesses
• combining prefetch and streaming-store instructions.
The guidelines discussed in this chapter come into play in this simple
example. TLB priming is required for the Pentium 4 processor just as it
is for the Pentium III processor, since software prefetch instructions will
not initiate page table walks on either processor.
TLB Priming
The TLB is a fast memory buffer that is used to improve performance of
the translation of a virtual memory address to a physical memory
address by providing fast access to page table entries. If a memory page
is accessed and its page table entry is not resident in the TLB, a TLB
miss results and the page table entry must be read from memory.
The TLB miss results in a performance degradation since another
memory access must be performed (assuming that the translation is not
already present in the processor caches) to update the TLB. The TLB
can be preloaded with the page table entry for the next desired page by
accessing (or touching) an address in that page. This is similar to
prefetch, but instead of a data cache line the page table entry is being
loaded in advance of its use. This helps to ensure that the page table
entry is resident in the TLB, so that a subsequent prefetch to that page
is carried out as requested.
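
The touch described above can be sketched in portable C. This is a
minimal illustration under stated assumptions, not the optimized copy
routine itself: the function name copy_with_tlb_priming and the constant
PAGE_SIZE_ASSUMED (4 KB) are hypothetical, and the software prefetch and
streaming-store instructions are deliberately left out so that only the
priming step is shown. Before each page-sized block is copied, one byte
located one page ahead in the source is read, so that the page table
entry for the next page is loaded before that page is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: 4 KB pages. Real code should query the operating system. */
#define PAGE_SIZE_ASSUMED 4096u

/*
 * Hypothetical helper: copies n bytes from src to dst in page-sized
 * blocks. Before copying each block, it reads one byte located one page
 * ahead in the source so that the page table entry for that page is
 * brought into the TLB in advance. Returns the number of priming
 * touches performed.
 */
size_t copy_with_tlb_priming(void *dst, const void *src, size_t n)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
    size_t touches = 0;

    assert((dst != NULL && src != NULL) || n == 0);

    for (size_t off = 0; off < n; off += PAGE_SIZE_ASSUMED) {
        size_t next = off + PAGE_SIZE_ASSUMED;

        if (next < n) {
            /* The volatile read keeps the compiler from removing the
               otherwise unused load; the value itself is discarded. */
            (void)*(const volatile unsigned char *)(s + next);
            touches++;
        }

        /* Copy the current block; the last block may be partial. */
        size_t remaining = n - off;
        size_t chunk = remaining < PAGE_SIZE_ASSUMED ? remaining
                                                     : PAGE_SIZE_ASSUMED;
        memcpy(d + off, s + off, chunk);
    }
    return touches;
}
```

In a tuned version, the memcpy call for each block would be replaced by
a loop that prefetches the source and writes the destination with
streaming stores. The touch stays ahead of that loop so that the
prefetches to the next page find its translation already in the TLB.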