124 Cache and Memory Optimizations Chapter 5
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
5.16 Interleave Loads and Stores
When loading and storing data, as in a copy routine, the order in which the loads and stores are
issued can affect performance.
Application
This optimization applies to:
• 32-bit software
• 64-bit software
Rationale
When using SSE and SSE2 instructions to perform loads and stores, it is best to interleave them in the
following pattern: Load, Store, Load, Store, Load, Store, etc. This enables the processor to maximize
the load/store bandwidth.
If using MMX loads and stores in 32-bit mode, the loads and stores should be arranged in the
following pattern: Load, Load, Store, Store, Load, Load, Store, Store, etc.
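The paired MMX ordering can be sketched in C with MMX intrinsics. This is a minimal illustration and is not taken from the guide: the function name copy_mmx_paired and its parameters are hypothetical, the use of non-temporal stores (_mm_stream_pi) is an assumption made to mirror a streaming copy, and the compiler is free to reorder the memory operations, so the generated code should be inspected if the exact ordering matters. Some 64-bit compilers do not provide MMX intrinsics, in keeping with the 32-bit scope of this recommendation.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_stream_pi, _mm_sfence; also provides __m64 and _mm_empty */

/* Hypothetical helper (not from the guide): copies `bytes` bytes from src to dst
   using 64-bit MMX moves. `bytes` must be a multiple of 16. */
static void copy_mmx_paired(__m64 *dst, const __m64 *src, size_t bytes)
{
    assert(bytes % 16 == 0);

    for (size_t i = 0; i < bytes / 8; i += 2) {
        __m64 a = src[i];              /* Load  */
        __m64 b = src[i + 1];          /* Load  */
        _mm_stream_pi(&dst[i], a);     /* Store (non-temporal, an assumption) */
        _mm_stream_pi(&dst[i + 1], b); /* Store */
    }
    _mm_sfence();  /* order the non-temporal stores ahead of any later stores */
    _mm_empty();   /* clear MMX state so x87 floating-point code can run afterward */
}
```

Each loop iteration handles 16 bytes as two 8-byte loads followed by two 8-byte stores, which is the Load, Load, Store, Store grouping described above.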
Example
The following example illustrates a sequence of 128-bit loads and stores:
movdqa xmm0,[rdx+r8*8] ; Load
movntdq [rcx+r8*8],xmm0 ; Store
movdqa xmm1,[rdx+r8*8+16] ; Load
movntdq [rcx+r8*8+16],xmm1 ; Store
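For code written in C rather than assembly, the same Load, Store, Load, Store ordering might be expressed with SSE2 intrinsics, as in the sketch below. The function name copy_interleaved and its parameters are hypothetical and do not appear in the guide. The sketch assumes both buffers are 16-byte aligned and that the byte count is a multiple of 32. A compiler may schedule the loads and stores differently from the source order, so the generated instructions should be checked when the exact sequence matters.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_load_si128, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the guide): copies `bytes` bytes from src to dst.
   Both pointers must be 16-byte aligned and `bytes` must be a multiple of 32. */
static void copy_interleaved(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst % 16) == 0 && ((uintptr_t)src % 16) == 0);
    assert(bytes % 32 == 0);

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < bytes / 16; i += 2) {
        __m128i x0 = _mm_load_si128(&s[i]);      /* Load  (aligned, as with movdqa) */
        _mm_stream_si128(&d[i], x0);             /* Store (non-temporal, as with movntdq) */
        __m128i x1 = _mm_load_si128(&s[i + 1]);  /* Load  */
        _mm_stream_si128(&d[i + 1], x1);         /* Store */
    }
    _mm_sfence();  /* order the non-temporal stores ahead of any later stores */
}
```

Each iteration moves 32 bytes as two load/store pairs, matching the two pairs in the assembly sequence above.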