Appendix B Implementation of Write-Combining 267
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
better throughput since bus efficiency is increased. This is because bus arbitration overhead is lower:
only one address/attribute phase is issued per burst in the PCI-X case, and one address/command
phase is issued for the AGP Fast Writes case. An illustration of address phase overhead on AGP Fast
Writes is provided in Figure 10 on page 346 in Appendix D, AGP Considerations.
For the reasons cited in the preceding paragraph, software that wants to use hardware write chaining
efficiently should flush the CPU write-combining buffers in sequential linear address order whenever
the target hardware device is capable of receiving large bursts of CPU write data.
Software should be aware that on AMD64 processors that have multiple write-combining buffers (i.e.,
Rev. D and E processors), events that flush the write-combining buffers (see Appendix B, Table 8)
send out the 64-byte WC buffers in the order in which the streams were opened. This means that if the
CPU first writes to the higher-addressed 64-byte buffer in WC space (for example, address 40h) and
then writes to a lower-addressed 64-byte buffer (for example, address 00h), then when the CPU sends
those buffers (over HyperTransport to the tunnel), the higher-addressed 64-byte buffer is sent first,
followed by the lower-addressed 64-byte buffer. Because the addresses are not sequential, the
tunnel device cannot "chain" the two 64-byte WC buffers and must issue two separate transactions on
the target bus.
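The ordering effect described above can be sketched with a toy model. This is an illustration only, not a description of any real tunnel device: the function name count_bus_transactions and the rule "a buffer chains only if it starts exactly where the previous one ended" are assumptions made for the example, and real hardware may impose further limits (such as a maximum burst length).

```c
#include <stddef.h>
#include <stdint.h>

#define WC_BUFFER_BYTES 64u  /* size of one write-combining buffer */

/*
 * Toy model (hypothetical): addrs[i] is the base address of the i-th
 * 64-byte WC buffer sent by the CPU, in the order it was sent.
 * A buffer is chained onto the previous one only when its base address
 * immediately follows the previous buffer; otherwise a new target-bus
 * transaction is counted.
 */
size_t count_bus_transactions(const uint64_t *addrs, size_t n)
{
    size_t transactions = 0;

    for (size_t i = 0; i < n; i++) {
        if (i == 0 || addrs[i] != addrs[i - 1] + WC_BUFFER_BYTES)
            transactions++;
    }
    return transactions;
}
```

In this model, buffers sent in the order 40h, 00h count as two transactions, while the same buffers sent in the order 00h, 40h count as one.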
If the above example were targeted at AGP Fast Writes, issuing two Fast Write transactions (rather
than one Fast Write transaction) would reduce the bandwidth (data throughput) by 1/3. See
Figure 10 on page 346 in Appendix D.
Optimizations
Adhere to the following guidelines to ensure that Revision D and E AMD Athlon 64 and AMD
Opteron processors issue WC buffers in sequential address order:
• When practical, shadow the data structure in memory (rather than writing directly to the WC buffer
in MMI/O space) before copying the structure to WC MMI/O space. This also ensures that the
write-combining buffers are not emptied prematurely by external events (such as a UC read,
perhaps issued by another device driver thread or a hardware interrupt). Shadowing also
ensures that writes to different cache lines in the structure do not force the WC buffers out
early, which matters because the number of WC buffers that can be open at one time is CPU
implementation dependent.
• When ready to update the actual WC MMI/O address space, copy the shadowed structure from
memory to MMI/O, from the lowest-addressed 64-byte block upward. For up to 64 bytes of data, use
discrete loads and stores. For up to 4 KB of data, use a loop of discrete loads and stores. For up
to 32 KB, use REP MOVS instructions. Code the copy in assembly language or, if available, use
compiler intrinsic functions (for example, __movsb(), __movsw(), and __movsd(), which generate
REP MOVS).
• In general, these copy methods incur less overhead in a data movement function than a call to
the LIBC memcpy() function, which is usually optimized for copying larger blocks of memory.
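The guidelines above can be sketched as a single copy routine. This is a minimal sketch under stated assumptions, not the guide's own code: the names wc_copy_ascending and copy_qwords_rep are hypothetical, the sizes are assumed to be multiples of 8 bytes with 8-byte-aligned pointers, and the REP MOVSQ path assumes GCC or Clang inline assembly on x86-64 (other compilers fall back to the plain loop). One loop serves both the "up to 64 bytes" and "up to 4 KB" cases here; a tuned version would likely use straight-line stores for the smallest sizes. Sizes above 32 KB are outside what the guidelines cover.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WC_LOOP_MAX_BYTES 4096u  /* above this size, switch to REP MOVS */

/* Hypothetical helper: copy 8-byte words with REP MOVSQ when GCC/Clang
 * inline assembly for x86-64 is available; otherwise use a plain loop. */
static void copy_qwords_rep(volatile uint64_t *dst, const uint64_t *src,
                            size_t qwords)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* Relies on the direction flag being clear, so addresses ascend. */
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords)
                     :
                     : "memory");
#else
    for (size_t i = 0; i < qwords; i++)
        dst[i] = src[i];
#endif
}

/* Copy a shadowed structure to its WC destination, lowest address first.
 * Assumes 8-byte alignment and a byte count that is a multiple of 8. */
void wc_copy_ascending(volatile void *wc_dst, const void *shadow, size_t bytes)
{
    volatile uint64_t *dst = (volatile uint64_t *)wc_dst;
    const uint64_t *src = (const uint64_t *)shadow;

    assert(bytes % 8 == 0);
    assert((uintptr_t)wc_dst % 8 == 0 && (uintptr_t)shadow % 8 == 0);

    if (bytes <= WC_LOOP_MAX_BYTES) {
        /* Discrete loads and stores; volatile keeps the stores in order. */
        for (size_t i = 0; i < bytes / 8; i++)
            dst[i] = src[i];
    } else {
        copy_qwords_rep(dst, src, bytes / 8);
    }
}
```

A driver would build the structure in ordinary cacheable memory, then call wc_copy_ascending() once with the mapped WC address as the destination, so that the WC buffers are opened and filled in ascending address order.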