IA-32 Intel® Architecture Optimization
String move/store instructions have multiple data granularities. For
efficient data movement, larger data granularities are preferable. This
means better efficiency can be achieved by decomposing an arbitrary
counter value into a number of doublewords plus single byte moves
with a count value less than or equal to 3.
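
As a rough illustration of the decomposition described above, the following C sketch splits a byte count into doubleword moves plus at most three single-byte moves. The function name copy_dwords_then_bytes is hypothetical and does not come from this manual; the fixed-size 4-byte memcpy() stands in for a doubleword move, and a tuned implementation would more likely use REP MOVSD followed by REP MOVSB.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the manual): copy n bytes as
 * (n / 4) doubleword moves followed by at most 3 byte moves. */
static void copy_dwords_then_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t dwords = n >> 2;  /* number of 4-byte moves */
    size_t tail = n & 3;     /* leftover single bytes, always 0..3 */

    assert(dwords * 4 + tail == n);

    while (dwords--) {
        memcpy(d, s, 4);     /* fixed-size copy standing in for one doubleword move */
        d += 4;
        s += 4;
    }
    while (tail--)
        *d++ = *s++;         /* at most three single-byte moves */
}
```

For example, a count of 11 decomposes into 2 doubleword moves and 3 byte moves.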
Because software can use SIMD data movement instructions to move 16
bytes at a time, the following paragraphs discuss general guidelines for
designing and implementing high-performance library functions such as
memcpy(), memset(), and memmove(). There are four factors to be
considered:
• Throughput per iteration:
If two pieces of code have approximately identical path lengths,
efficiency favors choosing the instruction that moves larger pieces of
data per iteration. Also, smaller code size per iteration will in
general reduce overhead and improve throughput. Sometimes, this
may involve a comparison of the relative overhead of an iterative
loop structure versus using REP prefix for iteration.
• Address alignment:
Data movement instructions with the highest throughput usually have
alignment restrictions, or they operate more efficiently if the
destination address is aligned to its natural data size. Specifically,
code using 16-byte moves needs to ensure that the destination address
is aligned to a 16-byte boundary, and 8-byte moves perform better if
the destination address is aligned to an 8-byte boundary. Frequently,
moving at doubleword granularity performs better with addresses that
are 8-byte aligned.
• REP string move vs. SIMD move:
Implementing general-purpose memory functions using SIMD
extensions usually requires adding some prolog code to ensure the
availability of SIMD instructions at runtime. A throughput
comparison must also account for the overhead of this prolog when
weighing a REP string implementation against a SIMD approach.
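
The address-alignment factor above can be sketched in portable C as follows. This is only an illustration under stated assumptions: the names bytes_to_boundary and copy_dst_aligned16 are hypothetical, the fixed-size 16-byte memcpy() stands in for an aligned 16-byte SIMD store, and the runtime check for SIMD availability is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: number of bytes from dst up to the next
 * align-byte boundary (0 if dst is already aligned).
 * align must be a power of two. */
static size_t bytes_to_boundary(const void *dst, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size_t)(-(uintptr_t)dst & (align - 1));
}

/* Hypothetical sketch: copy n bytes so that every 16-byte block
 * is written to a 16-byte-aligned destination address. */
static void copy_dst_aligned16(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: byte moves until the destination is 16-byte aligned. */
    size_t head = bytes_to_boundary(d, 16);
    if (head > n)
        head = n;
    n -= head;
    while (head--)
        *d++ = *s++;

    /* Body: 16-byte blocks; the destination is aligned, the source may not be. */
    for (; n >= 16; n -= 16) {
        memcpy(d, s, 16);
        d += 16;
        s += 16;
    }

    /* Tail: the remaining 0..15 bytes. */
    while (n--)
        *d++ = *s++;
}
```

The head loop is part of the fixed overhead that a throughput comparison needs to account for: with short copies it can cost more than the aligned body saves.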