IA-32 Intel® Architecture Optimization
Frequently, multiple partial writes to WC memory can be combined into
full-sized writes using a software write-combining technique, which keeps
WC store operations from competing with WB store traffic. To
implement software write-combining, data destined for memory
with the WC attribute is first written to a small, temporary buffer (WB
type) that fits in the first-level data cache. When the temporary buffer is
full, the application copies the content of the temporary buffer to the
final WC destination.
When partial writes are transacted on the bus, the effective data rate to
system memory is reduced to only 1/8 of the system bus bandwidth.
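
The software write-combining scheme described above can be sketched in C
as follows. This is a minimal illustration, not code from this manual: the
names swwc_writer, swwc_init, swwc_write and swwc_flush, and the 4096-byte
staging size, are assumptions chosen for the example. The destination
pointer is assumed to refer to a region that the system has mapped with
the WC attribute; the sketch itself does not set memory types, and memcpy
stands in for whatever copy routine suits the WC destination.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed staging size; pick one that fits the first-level data cache. */
enum { SWWC_BUF_BYTES = 4096 };

typedef struct {
    unsigned char  staging[SWWC_BUF_BYTES]; /* ordinary (WB) memory      */
    size_t         used;                    /* bytes currently staged    */
    unsigned char *wc_dst;                  /* next byte in the WC region */
} swwc_writer;

/* Bind the writer to a destination assumed to be WC-mapped memory. */
static void swwc_init(swwc_writer *w, void *wc_dst)
{
    assert(w != NULL && wc_dst != NULL);
    w->used = 0;
    w->wc_dst = wc_dst;
}

/* Copy the staged bytes to the WC destination in one contiguous pass. */
static void swwc_flush(swwc_writer *w)
{
    if (w->used > 0) {
        memcpy(w->wc_dst, w->staging, w->used);
        w->wc_dst += w->used;
        w->used = 0;
    }
}

/* Accumulate small writes in the WB staging buffer; flush when it fills. */
static void swwc_write(swwc_writer *w, const void *src, size_t n)
{
    const unsigned char *p = src;
    while (n > 0) {
        size_t room  = SWWC_BUF_BYTES - w->used;
        size_t chunk = n < room ? n : room;
        memcpy(w->staging + w->used, p, chunk);
        w->used += chunk;
        p += chunk;
        n -= chunk;
        if (w->used == SWWC_BUF_BYTES)
            swwc_flush(w);
    }
}
```

A caller would invoke swwc_write for each small piece of output and call
swwc_flush once at the end to drain any remainder, so that the WC region
only ever receives large, contiguous copies.
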
Memory Optimization
Efficient operation of caches is a critical aspect of memory
optimization. Efficient cache operation needs to address the following:
• cache blocking
• shared memory optimization
• eliminating 64-K-Aliased data accesses
• preventing excessive evictions in first-level cache
Cache Blocking Technique
Loop blocking is useful for reducing cache misses and improving
memory access performance. The selection of a suitable block size is
critical when applying the loop blocking technique. Loop blocking is
applicable to single-threaded applications as well as to multithreaded
applications running on processors with or without Hyper-Threading
Technology. The technique transforms the memory access pattern into
blocks that efficiently fit in the target cache size.
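
As an illustration of the transformation, the sketch below applies loop
blocking to an out-of-place matrix transpose. It is a hedged example, not
code from this manual: the function name transpose_blocked and the tile
edge BLOCK = 32 are assumptions. With 4-byte floats, one source tile plus
one destination tile occupy 2 * 32 * 32 * 4 = 8192 bytes; the tile edge
should be tuned so that this working set fits the target cache.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed tile edge; tune so two tiles fit in the target cache. */
enum { BLOCK = 32 };

/*
 * dst (cols x rows) receives the transpose of src (rows x cols).
 * Both matrices are stored in row-major order in separate buffers.
 */
static void transpose_blocked(const float *src, float *dst,
                              size_t rows, size_t cols)
{
    assert(src != dst); /* out-of-place only */

    for (size_t ib = 0; ib < rows; ib += BLOCK) {
        size_t i_end = ib + BLOCK < rows ? ib + BLOCK : rows;
        for (size_t jb = 0; jb < cols; jb += BLOCK) {
            size_t j_end = jb + BLOCK < cols ? jb + BLOCK : cols;
            /* Finish one BLOCK x BLOCK tile before moving on, so the
               strided accesses to dst stay within a small working set. */
            for (size_t i = ib; i < i_end; i++)
                for (size_t j = jb; j < j_end; j++)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

Without blocking, the inner loop walks dst with a stride of rows elements,
so for large matrices each destination cache line may be evicted before
its remaining elements are written. Processing one tile at a time keeps
those lines resident until the tile is complete.
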
When targeting IA-32 processors supporting Hyper-Threading
Technology, the loop blocking technique for a unified cache can select a
block size that is no more than one half of the target cache size, if there
are two logical processors sharing that cache. The upper limit of the