Optimizing for SIMD Integer Applications 4
Increasing Bandwidth of Memory Fills and Video Fills
It is beneficial to understand how memory is accessed and filled. A
memory-to-memory fill (for example, a memory-to-video fill) is defined
as a 64-byte (cache line) load from memory which is immediately stored
back to memory (such as a video frame buffer). The following are
guidelines for obtaining higher bandwidth and shorter latencies for
sequential memory fills (video fills). These recommendations are
relevant for all Intel architecture processors with MMX technology and
refer to cases in which the loads and stores do not hit in the first- or
second-level cache.
Increasing Memory Bandwidth Using the MOVDQ
Instruction
Loading any size data operand will cause an entire cache line to be
loaded into the cache hierarchy. Thus any size load looks more or less
the same from a memory bandwidth perspective. However, using many
smaller loads consumes more microarchitectural resources than fewer
larger loads. Consuming too many of these resources can cause the
processor to stall and reduce the bandwidth that the processor can
request of the memory subsystem.
Using movdq to store the data back to UC memory (or WC memory in some
cases) instead of using 32-bit stores (for example, movd) will reduce
by three-quarters the number of stores per memory fill cycle. As a
result, using the movdq instruction in memory fill cycles can achieve
significantly higher effective bandwidth than using the movd
instruction.
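The store-count argument above can be sketched in C. This is a minimal
illustration, not a tuned fill routine. The function names fill_32bit
and fill_128bit and the constant CACHE_LINE_BYTES are hypothetical. The
sketch assumes that "movdq" corresponds to an aligned 128-bit move,
expressed here with the SSE2 intrinsics _mm_load_si128 and
_mm_store_si128, and it uses ordinary cacheable buffers as a stand-in
for UC or WC memory, which user code cannot normally map. The returned
counts are source-level stores; a compiler is free to merge or
vectorize them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_store_si128 */
#endif

enum { CACHE_LINE_BYTES = 64 };  /* cache line size stated in the text */

/* Fill with 32-bit stores (the movd-sized case).
 * Returns the number of source-level stores issued. */
size_t fill_32bit(void *dst, const void *src, size_t bytes)
{
    assert(bytes % CACHE_LINE_BYTES == 0);
    size_t stores = 0;
    for (size_t off = 0; off < bytes; off += sizeof(uint32_t)) {
        uint32_t v;
        memcpy(&v, (const char *)src + off, sizeof v);   /* 4-byte load  */
        memcpy((char *)dst + off, &v, sizeof v);         /* 4-byte store */
        ++stores;
    }
    return stores;
}

/* Fill with 128-bit stores (assumed equivalent of the movdq case).
 * Both pointers must be 16-byte aligned.
 * Returns the number of source-level stores issued. */
size_t fill_128bit(void *dst, const void *src, size_t bytes)
{
    assert(bytes % CACHE_LINE_BYTES == 0);
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    size_t stores = 0;
    for (size_t off = 0; off < bytes; off += 16) {
#if defined(__SSE2__)
        __m128i v = _mm_load_si128((const __m128i *)((const char *)src + off));
        _mm_store_si128((__m128i *)((char *)dst + off), v);
#else
        /* Portable fallback so the sketch builds without SSE2. */
        memcpy((char *)dst + off, (const char *)src + off, 16);
#endif
        ++stores;
    }
    return stores;
}
```

For one 64-byte cache line, fill_32bit issues 16 stores and fill_128bit
issues 4, which is the three-quarters reduction described above.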
Increasing Memory Bandwidth by Loading and Storing to
and from the Same DRAM Page
DRAM is divided into pages, which are not the same as operating
system (OS) pages. The size of a DRAM page is a function of the total
size of the DRAM and the organization of the DRAM. Page sizes of
several kilobytes are common. Like OS pages, DRAM pages are
constructed of sequential addresses. Sequential memory accesses to the