Intel IA-32 Computer Accessories User Manual


 
Optimizing for SIMD Integer Applications 4
aligned versions; this can reduce the performance gains when using
the 128-bit SIMD integer extensions. The general guidelines on the
alignment of memory operands are:
- The greatest performance gains can be achieved when all
  memory streams are 16-byte aligned.
- Reasonable performance gains are possible if roughly half of all
  memory streams are 16-byte aligned and the other half are not.
- Little or no performance gain may result if none of the memory
  streams is 16-byte aligned; in this case, use of the 64-bit SIMD
  integer instructions may be preferable.
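
The alignment guidelines above can be illustrated with a minimal C sketch using SSE2 intrinsics. It is an illustration only, not code from this manual: the helper names is_aligned16 and add_i32 are hypothetical, and the sketch assumes a compiler that provides <emmintrin.h>. The routine takes the aligned-load path (movdqa) only when every stream is 16-byte aligned and falls back to unaligned accesses (movdqu) otherwise.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p lies on a 16-byte boundary. */
static int is_aligned16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical routine: dst[i] = a[i] + b[i] for n 32-bit integers.
   n must be a multiple of 4 (one 128-bit register holds 4 dwords). */
static void add_i32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n) {
    assert(n % 4 == 0);
    if (is_aligned16(dst) && is_aligned16(a) && is_aligned16(b)) {
        /* All memory streams aligned: aligned loads and stores (movdqa). */
        for (size_t i = 0; i < n; i += 4) {
            __m128i va = _mm_load_si128((const __m128i *)(a + i));
            __m128i vb = _mm_load_si128((const __m128i *)(b + i));
            _mm_store_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
        }
    } else {
        /* At least one stream is misaligned: unaligned accesses (movdqu). */
        for (size_t i = 0; i < n; i += 4) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
        }
    }
}
```

Checking alignment once, outside the loop, keeps the test out of the inner loop; whether the unaligned path still beats a 64-bit implementation has to be measured on the target processor.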
Loop counters need to be updated because each 128-bit integer
instruction operates on twice as much data as its 64-bit integer
counterpart.
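
As a hedged illustration of the loop-counter change, the following C sketch processes 16-bit words with SSE2 intrinsics. The function name add_sat_i16 and the constants WORDS_PER_MMX and WORDS_PER_XMM are hypothetical. A 64-bit implementation would advance the index by 4 words per iteration; the 128-bit version advances by 8, and a scalar tail handles any leftover elements.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Words (16-bit elements) per register; the 64-bit value is shown for contrast. */
enum { WORDS_PER_MMX = 4, WORDS_PER_XMM = 8 };

/* Hypothetical routine: saturating add of n signed 16-bit words. */
static void add_sat_i16(int16_t *dst, const int16_t *a, const int16_t *b, size_t n) {
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
    /* The index advances by 8 words per iteration, not 4 as in a 64-bit loop. */
    for (; i + WORDS_PER_XMM <= n; i += WORDS_PER_XMM) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epi16(va, vb)); /* paddsw */
    }
    /* Scalar tail for the remaining n % 8 words. */
    for (; i < n; ++i) {
        int32_t s = (int32_t)a[i] + (int32_t)b[i];
        if (s > INT16_MAX) s = INT16_MAX;
        if (s < INT16_MIN) s = INT16_MIN;
        dst[i] = (int16_t)s;
    }
}
```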
Extension of the pshufw instruction (shuffle word across 64-bit
integer operand) across a full 128-bit operand is emulated by a
combination of the following instructions: pshufhw, pshuflw, pshufd.
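
One possible combination is sketched below in C with SSE2 intrinsics; it reverses the order of all eight words in a 128-bit operand. The function names reverse_words and reverse_words_array are hypothetical, and the chosen permutation is only an example. pshuflw and pshufhw each permute words within one 64-bit half, and pshufd then moves doublewords between the halves.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: reverse the eight 16-bit words of v. */
static __m128i reverse_words(__m128i v) {
    /* pshuflw: reverse words 0..3; the high 64 bits pass through unchanged. */
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));
    /* pshufhw: reverse words 4..7; the low 64 bits pass through unchanged. */
    v = _mm_shufflehi_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));
    /* pshufd: swap the two 64-bit halves (dwords 2,3 move below dwords 0,1). */
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
}

/* Hypothetical array wrapper around reverse_words (no alignment required). */
static void reverse_words_array(uint16_t out[8], const uint16_t in[8]) {
    assert(out != NULL && in != NULL);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, reverse_words(v));
}
```

Permutations that stay within one 64-bit half need only pshuflw or pshufhw; the pshufd step is required only when words must cross between halves.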
The 64-bit shift-by-bit instructions (psrlq, psllq) are extended to
128 bits in one of these ways:
- psrlq and psllq are used along with masking logic operations, or
- the code sequence is rewritten to use the psrldq and pslldq
  instructions (shift double quad-word operand by bytes).
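
Both approaches can be sketched in C with SSE2 intrinsics. This is an assumption-laden illustration, not the manual's own sequence: the function names shl128_bits and shl128_2bytes are hypothetical, and the bit-shift variant uses pslldq plus por to carry bits across the 64-bit boundary, which is one of several possible logic sequences.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: shift the full 128-bit value left by n bits, 0 < n < 64. */
static __m128i shl128_bits(__m128i v, int n) {
    assert(n > 0 && n < 64);
    __m128i cnt = _mm_cvtsi32_si128(n);
    __m128i inv = _mm_cvtsi32_si128(64 - n);
    /* psllq: shifts each quadword separately; bits leaving the low qword are lost. */
    __m128i shifted = _mm_sll_epi64(v, cnt);
    /* psrlq: isolates the top n bits of each quadword. */
    __m128i carry = _mm_srl_epi64(v, inv);
    /* pslldq by 8 bytes: moves the low qword's carry into the high qword
       and discards the carry out of the high qword. */
    carry = _mm_slli_si128(carry, 8);
    /* por: merge the carried bits into the shifted result. */
    return _mm_or_si128(shifted, carry);
}

/* Hypothetical helper: when the shift count is a whole number of bytes,
   a single pslldq shifts the entire 128-bit operand (here by 2 bytes). */
static __m128i shl128_2bytes(__m128i v) {
    return _mm_slli_si128(v, 2);
}
```

The byte-shift form is simpler but applies only to shift counts that are multiples of 8 bits, and the count must be a compile-time constant.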
SIMD Optimizations and Microarchitectures
Pentium M, Intel Core Solo and Intel Core Duo processors have a
different microarchitecture than the Intel NetBurst® microarchitecture.
The following sections discuss optimizing SIMD code that targets Intel
Core Solo and Intel Core Duo processors.
On Intel Core Solo and Intel Core Duo processors, lddqu behaves
identically to movdqu by loading 16 bytes of data irrespective of
address alignment.