348 AGP Considerations Appendix D
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
If there are any “empty” doublewords between the last parameter and the top of the cache line, use the
SFENCE instruction to flush the write-combining buffer. The data is issued in ascending order.
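The check for "empty" doublewords can be sketched in C as follows. This is a minimal illustration, not code from this guide: it assumes a 64-byte cache line and a command block that starts on a cache-line boundary, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64u  /* assumption: 64-byte processor cache line */
#define DWORD_BYTES       4u  /* one doubleword */

/* Hypothetical helper: number of unwritten ("empty") doublewords between the
   last parameter and the end of its cache line, for a block of bytes_written
   bytes that starts on a cache-line boundary. */
static unsigned empty_dwords_in_last_line(size_t bytes_written)
{
    assert(bytes_written % DWORD_BYTES == 0);  /* whole doublewords only */
    size_t used = bytes_written % CACHE_LINE_BYTES;
    return used == 0 ? 0u
                     : (unsigned)((CACHE_LINE_BYTES - used) / DWORD_BYTES);
}

/* A nonzero count means the last write-combining buffer is only partially
   filled, so an SFENCE is needed to flush it. */
static int sfence_required(size_t bytes_written)
{
    return empty_dwords_in_last_line(bytes_written) != 0;
}
```

For example, under these assumptions a 48-byte command leaves four empty doublewords in its cache line, whereas a 128-byte command fills two lines exactly and leaves none.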
SFENCE is needed to flush the processor’s write-combining buffer whenever the buffer is only partially
filled. In general, use SFENCE when all parameters needed for rendering have been copied to the
memory-mapped I/O (MMIO) FIFO. This ensures that write data does not remain in the processor’s
write-combining buffer, where it would leave the graphics engine with an incomplete command until
the buffer is eventually flushed.
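The copy-then-flush pattern can also be sketched in C with SSE2 intrinsics (x86-64 only). This is an illustration under stated assumptions rather than driver code from this guide: the function name is hypothetical, both pointers are assumed to be 16-byte aligned, and a real driver would obtain the FIFO pointer from a write-combining MMIO mapping. The intrinsics `_mm_load_si128` and `_mm_store_si128` typically compile to MOVDQA, and `_mm_sfence` to SFENCE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_sfence */
#include <emmintrin.h>  /* _mm_load_si128, _mm_store_si128 */

/* Hypothetical helper: copy `bytes` (a multiple of 16) from a cached shadow
   structure to the command FIFO, 16 bytes at a time, then issue SFENCE so
   that no data remains in a partially filled write-combining buffer.
   Both pointers must be 16-byte aligned. */
static void fifo_write_and_flush(void *fifo, const void *shadow, size_t bytes)
{
    assert(bytes % 16 == 0);
    assert((uintptr_t)fifo % 16 == 0 && (uintptr_t)shadow % 16 == 0);

    __m128i *dst = (__m128i *)fifo;
    const __m128i *src = (const __m128i *)shadow;

    /* Stores are issued in ascending address order. */
    for (size_t i = 0; i < bytes / 16; ++i)
        _mm_store_si128(dst + i, _mm_load_si128(src + i));

    _mm_sfence();  /* flush after all parameters have been copied */
}
```

Issuing the fence once, after the last parameter, rather than after every store, lets the earlier stores combine into full cache-line writes.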
The AGP 3.0 specification requires that accelerators be able to buffer at least 128 bytes for the
initial data block transferred. Whenever possible, use a transfer size of 64–128 bytes (one to two
processor cache lines), and map as many commands as will fit into this 64–128-byte structure.
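One way to apply this sizing guideline is to count how many queued commands fit in a single 128-byte block before sending it. The sketch below is hypothetical: the helper name and the representation of commands as an array of doubleword counts are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCK_BYTES 128u  /* two 64-byte cache lines */
#define DWORD_BYTES       4u

/* Hypothetical helper: cmd_dwords[i] is the size of command i in doublewords.
   Returns how many leading commands fit in one 128-byte block and stores the
   number of bytes they occupy in *bytes_used. */
static size_t commands_per_block(const unsigned *cmd_dwords, size_t n,
                                 size_t *bytes_used)
{
    assert(bytes_used != NULL);
    size_t used = 0, count = 0;
    while (count < n &&
           used + (size_t)cmd_dwords[count] * DWORD_BYTES <= MAX_BLOCK_BYTES) {
        used += (size_t)cmd_dwords[count] * DWORD_BYTES;
        ++count;
    }
    *bytes_used = used;
    return count;
}
```

With four 10-doubleword commands, for instance, three fit (120 bytes) and the fourth starts the next block; the 8 unused bytes are the kind of partial fill that calls for SFENCE.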
Listing 31. Sending Write-Combined Data to the Graphics-Engine Command FIFO
/* Send commands to a graphic accelerator 2D engine. */
/* The shadowed structure contains 32 DWORDs worth of */
/* rendering commands and data parameters. */
/* Send out 128 (80h) bytes to FIFO in WC MMIO space. */
/* First load 64-bit pointer to a cached command structure. */
mov rdi, OFFSET ShadowRegs_Structure
/* We now have a pointer to the shadowed engine structure. */
/* Grab 16 bytes at a time. */
movdqa xmm0, [rdi]
movdqa xmm1, [rdi + 16]
movdqa xmm2, [rdi + 32]
movdqa xmm3, [rdi + 48]
movdqa xmm4, [rdi + 64]
movdqa xmm5, [rdi + 80]
movdqa xmm6, [rdi + 96]
movdqa xmm7, [rdi + 112]
/* Now get linear pointer to graphic engine mapped in */
/* WC address space. */
mov rax, QWORD PTR [Linear2Dengine_Ptr]
/* Now copy register data to processor’s WC buffer. */
/* It is slightly more efficient if the command FIFO */
/* is at a cache-line-aligned address. */
/* Write 16 bytes at a time. */
movdqa [rax], xmm0
movdqa [rax + 16], xmm1
movdqa [rax + 32], xmm2
/* The first WC buffer will be sent after the next write */
/* (assuming FIFO is cache-line aligned) since we are crossing */
/* a cache-line boundary. */