112 Cache and Memory Optimizations Chapter 5
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
5.7 Streaming-Store/Non-Temporal Instructions
Optimization
Use streaming-store instructions such as MOVNTPS and MOVNTQ when writing arrays or buffers
that do not need to reside in cache. These instructions allow the processor to perform a write
without first reading the data from memory or from other processors' caches. This saves the time
needed to read the cache line and avoids evicting cached data that may still be needed, which can
be a significant performance advantage. Most compilers expose these instructions through inline
assembly or intrinsics. Routines 5 and 6 in Section 5.13, “Appropriate Memory Copying Routines,”
illustrate combining streaming-store instructions with the PREFETCHNTA instruction
to optimize memory copy routines.
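As a rough illustration of the intrinsics route, the following C sketch fills a buffer of floats
with the _mm_stream_ps intrinsic, which compilers normally emit as MOVNTPS. The helper name
stream_fill_floats and its requirement that the destination be 16-byte aligned are assumptions
made for this example; they are not taken from the routines in Section 5.13.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 header; also provides _mm_stream_ps and _mm_set1_ps */

/* Hypothetical helper: fill dst[0..n) with value using streaming stores.
 * dst must be 16-byte aligned, because MOVNTPS requires an aligned address. */
void stream_fill_floats(float *dst, size_t n, float value)
{
    assert(((uintptr_t)dst & 15u) == 0);

    const __m128 v = _mm_set1_ps(value);
    size_t i = 0;

    /* Four floats (16 bytes) per streaming store; the destination lines are
     * written without first being read into the cache. */
    for (; i + 4 <= n; i += 4)
        _mm_stream_ps(dst + i, v);

    /* Fewer than four elements remain: finish with ordinary stores. */
    for (; i < n; ++i)
        dst[i] = value;

    /* MFENCE after the last streaming store, as this section advises. */
    _mm_mfence();
}
```

A routine like this is only likely to help when the buffer will not be read again soon; data that
is about to be reused is usually better written with ordinary stores so that it stays in cache.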
Application
This optimization applies to:
• 32-bit software
• 64-bit software
Rationale
Streaming-store instructions are also sometimes called write-combining instructions. To
improve system performance, the AMD Athlon 64 and AMD Opteron processors aggressively
combine multiple memory-write cycles of any data size that address locations within a 64-byte,
cache-line-aligned write buffer when a streaming-store instruction is used. This combining is
accomplished with write-combine buffers. The number of write-combine buffers is
processor-implementation dependent. Appendix B describes write-combining in much more detail.
Follow the last streaming-store instruction in a block of code with the MFENCE instruction
to ensure that all of the write-combine buffers are written to memory.
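The sketch below shows one way the two points above might look in C: each 64-byte line is written
with four back-to-back streaming stores so that the writes can combine in one buffer, and a single
MFENCE follows the last streaming store of the whole block. The helper name stream_copy_lines, the
alignment requirements, and the choice of the _mm_stream_si128 intrinsic (MOVNTDQ, a streaming
store not named in this section) are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128, _mm_mfence */

/* Hypothetical helper: copy `lines` 64-byte cache lines from src to dst.
 * Assumes dst is 64-byte aligned, src is 16-byte aligned, and the two
 * buffers do not overlap. */
void stream_copy_lines(void *dst, const void *src, size_t lines)
{
    assert(((uintptr_t)dst & 63u) == 0);
    assert(((uintptr_t)src & 15u) == 0);

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < lines; ++i, d += 4, s += 4) {
        /* Read one full source line with ordinary aligned loads. */
        __m128i a = _mm_load_si128(s + 0);
        __m128i b = _mm_load_si128(s + 1);
        __m128i c = _mm_load_si128(s + 2);
        __m128i e = _mm_load_si128(s + 3);

        /* Write the whole destination line with consecutive streaming
         * stores, all addressing the same 64-byte aligned region. */
        _mm_stream_si128(d + 0, a);
        _mm_stream_si128(d + 1, b);
        _mm_stream_si128(d + 2, c);
        _mm_stream_si128(d + 3, e);
    }

    /* One MFENCE after the last streaming store in the block, rather than
     * one per loop iteration. */
    _mm_mfence();
}
```

Placing the fence outside the loop keeps its cost to one per call; a caller that issues several
such copies in a row could move the fence out further still, to just after the final copy.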
Streaming-store instructions are also discussed in “Write-Combining Usage” on page 106. Also see
Appendix B, “Implementation of Write-Combining.” For more information on write-combining, see
“Write-Combining” in the AMD64 Architecture Programmer's Manual Volume 2: System
Programming (order# 24593).