IA-32 Intel® Architecture Optimization
Note that this technique can be applied to both SIMD integer and SIMD
floating-point code.
If there are multiple consumers of an instance of a register, group the
consumers together as closely as possible. However, the consumers
should not be scheduled near the producer.
SIMD Optimizations and Microarchitectures
Pentium M, Intel Core Solo and Intel Core Duo processors have a
different microarchitecture than Intel NetBurst® microarchitecture. The
following sub-section discusses optimizing SIMD code targeting Intel
Core Solo and Intel Core Duo processors.
The register-register variants of the following instructions have improved
performance on Intel Core Solo and Intel Core Duo processors relative to
Pentium M processors. This is because, on Intel Core Solo and Intel Core
Duo processors, these instructions consist of two micro-ops instead of
three. Relevant instructions are: unpcklps, unpckhps, packsswb, packuswb,
packssdw, pshufd, shuffps and shuffpd.
top_of_loop:
movq mm0, [A + eax]
pcmpgtw mm0, [B + eax] ; Create compare mask (all ones where A>B)
movq mm1, [D + eax]
pand mm1, mm0 ; Drop elements where A<=B
pandn mm0, [E + eax] ; Drop elements where A>B
por mm0, mm1 ; Create single word
movq [C + eax], mm0
add eax, 8
cmp eax, MAX_ELEMENT*2
jl top_of_loop ; Loop while the byte offset is still inside the arrays
Example 3-21 Emulation of Conditional Moves (continued)
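
The mask-and-merge sequence in the loop above can also be sketched in C. The following is a minimal illustration, not code from this manual: the function name select_gt_i16 is hypothetical, the element type is assumed to be a signed 16-bit word (matching pcmpgtw), and the sketch uses 128-bit SSE2 intrinsics rather than the 64-bit MMX registers of the assembly listing. When SSE2 is not available at compile time, only the scalar loop runs, applying the same mask logic one element at a time.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Computes c[i] = (a[i] > b[i]) ? d[i] : e[i] without a branch per element.
 * select_gt_i16 is a hypothetical name used only for this sketch. */
void select_gt_i16(const int16_t *a, const int16_t *b,
                   const int16_t *d, const int16_t *e,
                   int16_t *c, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Eight 16-bit elements per iteration; unaligned loads keep the
     * sketch free of alignment requirements. */
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i vd = _mm_loadu_si128((const __m128i *)(d + i));
        __m128i ve = _mm_loadu_si128((const __m128i *)(e + i));
        __m128i mask   = _mm_cmpgt_epi16(va, vb);     /* 0xFFFF where a > b  */
        __m128i keep_d = _mm_and_si128(mask, vd);     /* d where a > b       */
        __m128i keep_e = _mm_andnot_si128(mask, ve);  /* e where a <= b      */
        _mm_storeu_si128((__m128i *)(c + i), _mm_or_si128(keep_d, keep_e));
    }
#endif
    /* Scalar tail (and portable fallback) using the same mask logic. */
    for (; i < n; i++) {
        uint16_t mask = (uint16_t)-(a[i] > b[i]);     /* 0xFFFF or 0x0000 */
        c[i] = (int16_t)(((uint16_t)d[i] & mask) |
                         ((uint16_t)e[i] & (uint16_t)~mask));
    }
}
```

A caller passes five arrays of equal length n. The loop bounds are written so that no element beyond index n - 1 is read or written, whatever the value of n.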