Intel IA-32 Computer Accessories User Manual


 
General Optimization Guidelines 2
write misses; only four write-combining buffers are guaranteed to be
available for simultaneous use. Write combining applies to memory
type WC; it does not apply to memory type UC.
Assembly/Compiler Coding Rule 28. (H impact, L generality) If an inner
loop writes to more than four arrays (four distinct cache lines), apply loop
fission to break up the body of the loop so that only four arrays are
written to in each iteration of each of the resulting loops.
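
As an illustration of Rule 28, the following is a minimal sketch of loop
fission in C. The function names, the six output arrays, and the scale
factors are hypothetical and chosen only for the example; the sketch also
assumes that the output arrays do not overlap each other or the source
array, so that splitting the loop does not change the results.

```c
#include <stddef.h>

/* Before fission: one loop body stores to six arrays, so six distinct
   cache lines are being written during each pass through the loop. */
void scale_fused(size_t n, const float *src,
                 float *a, float *b, float *c,
                 float *d, float *e, float *f)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = src[i] * 1.0f;
        b[i] = src[i] * 2.0f;
        c[i] = src[i] * 3.0f;
        d[i] = src[i] * 4.0f;
        e[i] = src[i] * 5.0f;
        f[i] = src[i] * 6.0f;
    }
}

/* After fission: the body is split into two loops, and neither loop
   stores to more than four arrays per iteration. */
void scale_fissioned(size_t n, const float *src,
                     float *a, float *b, float *c,
                     float *d, float *e, float *f)
{
    /* First loop: four store streams (a, b, c, d). */
    for (size_t i = 0; i < n; i++) {
        a[i] = src[i] * 1.0f;
        b[i] = src[i] * 2.0f;
        c[i] = src[i] * 3.0f;
        d[i] = src[i] * 4.0f;
    }
    /* Second loop: the remaining two store streams (e, f). */
    for (size_t i = 0; i < n; i++) {
        e[i] = src[i] * 5.0f;
        f[i] = src[i] * 6.0f;
    }
}
```

The fissioned version reads src twice, so whether it is faster in practice
depends on the workload and should be measured.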
The write combining buffers are used for stores of all memory types.
They are particularly important for writes to uncached memory: writes
to different parts of the same cache line can be grouped into a single,
full-cache-line bus transaction instead of going across the bus (since
they are not cached) as several partial writes. Avoiding partial writes can
have a significant impact on bus-bandwidth-bound graphics applications,
where graphics buffers are in uncached memory. Separating writes to
uncached memory and writes to writeback memory into separate phases
can ensure that the write-combining buffers fill before being evicted
by other write traffic. Eliminating partial write transactions has been
found to have a performance impact on the order of 20% for some
applications. Because the cache lines are 64 bytes, a write to
the bus for 63 bytes will result in 8 partial bus transactions.
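
The idea of separating the two kinds of writes into phases can be sketched
in C as follows. This is only an illustration under stated assumptions:
the function names are hypothetical, fb is assumed to point at a graphics
buffer that the platform has mapped as uncached memory, and alpha is
assumed to be an ordinary writeback array. Standard C cannot express or
check the memory type of either pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Interleaved: every iteration stores to the (assumed uncached) frame
   buffer and to the writeback array, so the two kinds of write traffic
   are mixed. */
void copy_interleaved(size_t n, const uint32_t *pixels,
                      uint32_t *fb, uint8_t *alpha)
{
    for (size_t i = 0; i < n; i++) {
        fb[i] = pixels[i];
        alpha[i] = (uint8_t)(pixels[i] >> 24);
    }
}

/* Phased: all frame-buffer stores are issued first, in ascending address
   order, followed by all stores to the writeback array. */
void copy_phased(size_t n, const uint32_t *pixels,
                 uint32_t *fb, uint8_t *alpha)
{
    /* Phase 1: sequential stores to the frame buffer only. */
    for (size_t i = 0; i < n; i++)
        fb[i] = pixels[i];

    /* Phase 2: stores to writeback memory only. */
    for (size_t i = 0; i < n; i++)
        alpha[i] = (uint8_t)(pixels[i] >> 24);
}
```

Both functions produce the same results; only the order of the stores
differs.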
When coding functions that execute simultaneously on two threads,
reducing the number of writes that are allowed in an inner loop will
help take full advantage of write-combining store buffers. For
write-combining buffer recommendations for Hyper-Threading
Technology, see Chapter 7.
Store ordering and visibility are also important issues for write combining.
When a write to a write-combining buffer for a previously unwritten
cache line occurs, there will be a read-for-ownership (RFO). If a
subsequent write happens to another write-combining buffer, a separate
RFO may be caused for that cache line. Subsequent writes to the first
cache line and write-combining buffer will be delayed until the second
RFO has been serviced to guarantee properly ordered visibility of the
writes. If the memory type for the writes is write-combining, there will