168 Integer Optimizations Chapter 8
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
Use the Largest Possible Operand Size
Always move data using the largest operand size possible. For example, use REP MOVSD rather than
REP MOVSW, and REP MOVSW rather than REP MOVSB. Use REP STOSD rather than REP STOSW, and
REP STOSW rather than REP STOSB.
In 64-bit mode, a quadword data size is available and offers better performance (for example,
REP MOVSQ and REP STOSQ).
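As an illustration only, the following C sketch moves the bulk of a buffer with REP MOVSQ and the remaining 0 to 7 bytes with REP MOVSB. It assumes a GCC- or Clang-compatible compiler targeting x86-64 and falls back to memcpy elsewhere. The helper name copy_largest is hypothetical and does not come from this guide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes, widest string operand first. */
static void copy_largest(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__GNUC__) && defined(__x86_64__)
    size_t qwords = n / 8;  /* bulk: 8 bytes per iteration */
    size_t tail   = n % 8;  /* leftover bytes after the quadword part */

    /* The x86-64 calling conventions leave DF = 0 on function entry,
       so both string instructions below increment RSI and RDI. */
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords)
                     :
                     : "memory");
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(tail)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);    /* portable fallback (assumption) */
#endif
}
```

Calling copy_largest(d, s, 29) would move three quadwords and then five single bytes.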
Ensure DF = 0 (Increment)
Always make sure that DF = 0 (increment) when executing REP MOVS and REP STOS; executing CLD
clears DF. DF = 1 (decrement) is needed only for certain overlapping REP MOVS operations (for
example, when the source and destination regions overlap and the destination starts at a higher
address than the source).
While string instructions with DF = 1 (decrement) are slower, only the overhead part of the cycle
equation is larger and not the throughput part. See Table 6 on page 167 for additional latency
numbers.
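The direction choice can be sketched in portable C. This is a minimal illustration, not code from this guide: the helper name move_bytes is hypothetical, the forward loop stands in for a DF = 0 string move, and the backward loop stands in for a DF = 1 string move.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: overlap-safe byte move that copies backward only
   when the destination starts inside the source region. */
static void move_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    if (d - s >= n) {
        /* dst is below src, or at least n bytes above it:
           a forward copy (the DF = 0 case) is safe. */
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    } else {
        /* dst lies within [src, src + n): copy from the end
           (the DF = 1 case) so source bytes are read before
           they are overwritten. */
        for (size_t i = n; i > 0; i--)
            dst[i - 1] = src[i - 1];
    }
}
```

The unsigned subtraction d - s wraps to a large value when dst is below src, so a single comparison covers both of the forward-safe cases.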
Align Source and Destination with Operand Size
For REP MOVS, make sure that both the source and destination are aligned to the operand size.
Handle the end case separately, if necessary. If the source and destination cannot both be aligned,
align the destination and leave the source misaligned. For REP STOS, align the destination.
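One way to picture the destination-first rule is the following portable C sketch, which assumes an 8-byte operand size. The helper name copy_dst_aligned is hypothetical. A byte prologue brings the destination to an 8-byte boundary, the bulk loop stores aligned quadwords while tolerating a misaligned source, and the end case is handled separately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes with 8-byte-aligned stores. */
static void copy_dst_aligned(unsigned char *dst, const unsigned char *src,
                             size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    /* Prologue: single bytes until dst reaches an 8-byte boundary. */
    while (n > 0 && ((uintptr_t)dst & 7u) != 0) {
        *dst++ = *src++;
        n--;
    }

    /* Bulk: one quadword per iteration; dst is aligned, src may not be. */
    while (n >= 8) {
        uint64_t q;
        memcpy(&q, src, 8);  /* possibly misaligned load */
        memcpy(dst, &q, 8);  /* aligned store */
        dst += 8;
        src += 8;
        n -= 8;
    }

    /* End case: the remaining 0 to 7 bytes. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}
```

The fixed-size memcpy calls are used only to express the 8-byte load and store without violating C aliasing rules; compilers typically lower them to single move instructions, though that is not guaranteed.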
Inline REP String with Low Counts
For repeat counts of less than 4k, expand REP string instructions into equivalent sequences of simple
AMD64 instructions. Use an inline sequence of loads and stores to accomplish the move. Use a
sequence of stores to emulate REP STOS. This technique eliminates the setup overhead of REP
instructions and increases instruction throughput.
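As a rough illustration for a constant 16-byte block, the sketch below replaces a REP MOVS with two loads and two stores, and a REP STOS with two stores. The helper names copy16 and fill16 are hypothetical, and the 16-byte size is chosen only to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fixed 16-byte move as two loads and two stores. */
static void copy16(void *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
    uint64_t a, b;
    memcpy(&a, src, 8);                             /* load bytes 0..7   */
    memcpy(&b, (const unsigned char *)src + 8, 8);  /* load bytes 8..15  */
    memcpy(dst, &a, 8);                             /* store bytes 0..7  */
    memcpy((unsigned char *)dst + 8, &b, 8);        /* store bytes 8..15 */
}

/* Hypothetical helper: fixed 16-byte fill as two stores. */
static void fill16(void *dst, unsigned char value)
{
    assert(dst != NULL);
    /* Replicate the byte into all eight lanes of a quadword. */
    uint64_t pattern = 0x0101010101010101ULL * value;
    memcpy(dst, &pattern, 8);
    memcpy((unsigned char *)dst + 8, &pattern, 8);
}
```

Both loads in copy16 complete before either store, so the result is also correct when the two 16-byte regions overlap.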
Use Loop for REP String with Low Variable Counts
If the repeat count is variable but is likely to be less than eight, use a simple loop to move or store the data.
This technique avoids the overhead of REP MOVS and REP STOS.
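A minimal C sketch of such loops follows, assuming doubleword elements. The helper names copy_small and store_small are hypothetical. Note that an optimizing compiler may replace loops like these with a library call or a string instruction, so the generated code should be inspected if the distinction matters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: move a small, variable number of doublewords. */
static void copy_small(uint32_t *dst, const uint32_t *src, size_t count)
{
    assert(count == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < count; i++)
        dst[i] = src[i];
}

/* Hypothetical helper: store a small, variable number of doublewords. */
static void store_small(uint32_t *dst, uint32_t value, size_t count)
{
    assert(count == 0 || dst != NULL);
    for (size_t i = 0; i < count; i++)
        dst[i] = value;
}
```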