Chapter 8 Integer Optimizations 167
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
8.3 Repeated String Instructions
Optimization
Avoid using the REP prefix when performing string operations, especially when copying blocks of
memory.
Rationale
In general, using the REP prefix to repeatedly perform string instructions is slower than other
methods, especially when copying blocks of memory. For a discussion of alternative memory-copy
methods, see “Memory Copy” on page 120.
Latency of Repeated String Instructions
Table 6 shows the latency of repeated string instructions on the AMD Athlon 64 and AMD Opteron
processors.
Table 6 lists the latencies with the direction flag (DF) = 0 (increment) and DF = 1 (decrement).
These latencies assume aligned memory operands. Note that for MOVS and STOS, the overhead portion
of the latency increases significantly when DF = 1; however, string operations with DF = 1 are
less common. To determine the latency, evaluate the formula from Table 6 and round the result up
to the nearest integer.
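The latency calculation described above can be sketched in C as follows. This is a minimal illustration, not part of the guide: the names rep_insn, rep_cost, table6, and rep_latency are hypothetical, and the constants are copied from Table 6. The per-iteration cost is stored as a fraction so that the round-up can be done in integer arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; the constants come from Table 6. */
enum rep_insn { REP_MOVS, REP_STOS, REP_LODS, REP_SCAS, REP_CMPS, REP_INSN_COUNT };

struct rep_cost {
    uint32_t overhead;  /* fixed portion of the latency, in cycles */
    uint32_t num, den;  /* per-iteration cost as the fraction num/den */
};

/* Indexed as [instruction][DF]; valid for c > 0 and aligned operands. */
static const struct rep_cost table6[REP_INSN_COUNT][2] = {
    [REP_MOVS] = { {15,  1, 1}, {25,  4, 3} },
    [REP_STOS] = { {14,  1, 1}, {24,  1, 1} },
    [REP_LODS] = { {15,  2, 1}, {15,  2, 1} },
    [REP_SCAS] = { {15,  5, 2}, {15,  5, 2} },
    [REP_CMPS] = { {16, 10, 3}, {16, 10, 3} },
};

/* Estimated latency in cycles for a repeat count c (the value in ECX). */
static uint64_t rep_latency(enum rep_insn insn, int df, uint32_t c)
{
    assert((int)insn >= 0 && insn < REP_INSN_COUNT);
    assert(df == 0 || df == 1);

    if (c == 0)
        return 11;  /* "When ECX = 0" column, same for every instruction */

    const struct rep_cost *k = &table6[insn][df];

    /* Integer ceiling of (num * c) / den. Because the overhead is a whole
       number, rounding this term up equals rounding the full formula up. */
    uint64_t per_iter = ((uint64_t)k->num * c + k->den - 1) / k->den;
    return k->overhead + per_iter;
}
```

For example, rep_latency(REP_MOVS, 1, 100) evaluates 25 + (4/3 * 100) = 158.33 and rounds up to 159 cycles, compared with 115 cycles for the same count with DF = 0.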
Guidelines for Repeated String Instructions
To help achieve good performance, the following sections contain guidelines for the careful
scheduling of VectorPath repeated string instructions.
Table 6. Latency of Repeated String Instructions

              Number of Cycles
Instruction   When ECX = 0   When ECX = c (note 1), DF = 0   When ECX = c (note 1), DF = 1
rep movs      11             15 + (1 * c)                    25 + (4/3 * c)
rep stos      11             14 + (1 * c)                    24 + (1 * c)
rep lods      11             15 + (2 * c)                    15 + (2 * c)
rep scas      11             15 + (5/2 * c)                  15 + (5/2 * c)
rep cmps      11             16 + (10/3 * c)                 16 + (10/3 * c)

Note:
1. c > 0