120 Cache and Memory Optimizations Chapter 5
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
5.13 Memory Copy
Optimization
❖ For a very fast general purpose memory copy routine, call the libc memcpy() function included
with the Microsoft or gcc tools. This function features optimizations for all block sizes and
alignments.
Application
This optimization applies to:
• 32-bit software
• 64-bit software
Rationale
The memcpy() routines included with recent compilers from Microsoft and gcc are optimized for all
block sizes and alignments on AMD Athlon 64 and AMD Opteron processors.
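As a minimal sketch of this recommendation, the following C fragment copies an array of records by calling the library memcpy() directly instead of using a hand-written loop. The sample_t type and the copy_samples() helper are hypothetical names introduced only for illustration. The source and destination buffers are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record type, used only for this illustration. */
typedef struct {
    double   x, y, z;
    unsigned id;
} sample_t;

/* Copy 'count' records from src to dst by deferring to the library memcpy().
   Assumption: the two buffers do not overlap (memcpy() requires this). */
void copy_samples(sample_t *dst, const sample_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, count * sizeof *src);
}
```

Passing the whole block to memcpy() in one call lets the library select its own strategy for the given size and alignment.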
Copying Small Data Structures
To copy a small data structure that resides in cache, use inline assembly code with an unrolled
series of MOV instructions. Interleave the loads and stores (load/store/load/store) or, for even
better performance, group them in pairs (load/load/store/store). Align the destination (and the
source) if possible.
Example 1
The following 64-bit example copies 18 bytes of data:
; rsi = source
; rdi = destination
mov r8, [rsi]        ; Read 8 bytes of source.
mov r9, [rsi+8]      ; Read next 8 bytes of source.
mov [rdi], r8        ; Write 8 bytes.
mov [rdi+8], r9      ; Write next 8 bytes.
mov r8w, [rsi+16]    ; Read last 2 bytes (r8w is the low word of r8).
mov [rdi+16], r8w    ; Write last 2 bytes.
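Where inline assembly is not available, the access pattern of Example 1 can be approximated in C. The sketch below is not part of the original guide; copy18() is a hypothetical helper name. It uses fixed-size memcpy() calls into temporaries to express the load/load/store/store ordering. Optimizing compilers commonly lower such constant-size calls to plain MOV instructions, but this is not guaranteed, so inspect the generated code before relying on it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy exactly 18 bytes from src to dst (hypothetical helper).
   Assumption: the two buffers do not overlap. */
void copy18(void *dst, const void *src)
{
    const unsigned char *s = src;
    unsigned char *d = dst;
    uint64_t lo, hi;   /* play the role of r8 and r9 */
    uint16_t tail;     /* plays the role of r8w */

    assert(dst != NULL && src != NULL);

    memcpy(&lo, s, 8);          /* load bytes 0-7 */
    memcpy(&hi, s + 8, 8);      /* load bytes 8-15 */
    memcpy(d, &lo, 8);          /* store bytes 0-7 */
    memcpy(d + 8, &hi, 8);      /* store bytes 8-15 */
    memcpy(&tail, s + 16, 2);   /* load the last 2 bytes */
    memcpy(d + 16, &tail, 2);   /* store the last 2 bytes */
}
```

Going through memcpy() rather than casting the pointers to wider integer types keeps the code valid for unaligned addresses and avoids aliasing problems.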
Example 2
The following example illustrates how to copy blocks of 32 bytes and larger, in cache. This code
performs best when the source and destination addresses are 8-byte aligned. Align the destination