Intel IA-32 Computer Accessories User Manual


 
Optimizing for SIMD Floating-point Applications 5
SIMD Optimizations and Microarchitectures
Pentium M, Intel Core Solo and Intel Core Duo processors have a
different microarchitecture from the Intel NetBurst® microarchitecture. The
following sub-section discusses optimizing SIMD code that targets Intel
Core Solo and Intel Core Duo processors.
Packed Floating-Point Performance
Most packed SIMD floating-point code will speed up on Intel Core Solo
processors relative to Pentium M processors. This is due to
improvements in the decoding of packed SIMD instructions.
The improvement in packed floating-point performance on the Intel
Core Solo processor over the Pentium M processor depends on several
factors. Generally, code that is decoder-bound and/or has a mixture of
integer and packed floating-point instructions can expect a significant
gain. Code that is limited by execution latency and has a “cycles per
instruction” ratio greater than one will not benefit from the decoder
improvement.
Example 5-13 Calculating Dot Products from AOS (continued)
    movaps  xmm0, Vector1   ; the destination has a3, a2, a1, a0
    movaps  xmm1, Vector2   ; the destination has b3, b2, b1, b0
    movaps  xmm2, Vector3   ; the destination has c3, c2, c1, c0
    movaps  xmm3, Vector4   ; the destination has d3, d2, d1, d0
    mulps   xmm0, xmm1      ; a3b3, a2b2, a1b1, a0b0
    mulps   xmm2, xmm3      ; c3d3, c2d2, c1d1, c0d0
    haddps  xmm0, xmm2      ; the destination has c3d3+c2d2,
                            ; c1d1+c0d0, a3b3+a2b2, a1b1+a0b0
    haddps  xmm0, xmm0      ; the destination has
                            ; c3d3+c2d2+c1d1+c0d0, a3b3+a2b2+a1b1+a0b0,
                            ; c3d3+c2d2+c1d1+c0d0, a3b3+a2b2+a1b1+a0b0
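
The lane arithmetic in the Example 5-13 listing can be checked with a scalar model. The following C sketch is not SSE code and is not part of the original example: the type vec4 and the functions mulps, haddps and two_dot_products are hypothetical names chosen for illustration. They imitate the per-lane behavior of the MULPS and HADDPS instructions, assuming lane[0] is the lowest element of the register (a0 in the comments of the listing).

```c
/* Scalar model of a 128-bit register holding four single-precision
   values. lane[0] is the lowest element, lane[3] the highest.
   All names here are illustrative assumptions, not Intel APIs. */
typedef struct { float lane[4]; } vec4;

/* Models MULPS: element-wise product of the two operands. */
static vec4 mulps(vec4 x, vec4 y)
{
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = x.lane[i] * y.lane[i];
    return r;
}

/* Models HADDPS dst, src: the two low lanes receive the adjacent-pair
   sums of dst, the two high lanes the adjacent-pair sums of src. */
static vec4 haddps(vec4 dst, vec4 src)
{
    vec4 r;
    r.lane[0] = dst.lane[0] + dst.lane[1];
    r.lane[1] = dst.lane[2] + dst.lane[3];
    r.lane[2] = src.lane[0] + src.lane[1];
    r.lane[3] = src.lane[2] + src.lane[3];
    return r;
}

/* Follows the same steps as the listing: two multiplies, then two
   horizontal adds. Lanes 0 and 2 of the result hold a.b, and
   lanes 1 and 3 hold c.d. */
static vec4 two_dot_products(vec4 a, vec4 b, vec4 c, vec4 d)
{
    vec4 ab = mulps(a, b);     /* a3b3, a2b2, a1b1, a0b0 */
    vec4 cd = mulps(c, d);     /* c3d3, c2d2, c1d1, c0d0 */
    vec4 h  = haddps(ab, cd);  /* partial sums of both products */
    return haddps(h, h);       /* full sums, each duplicated */
}
```

Calling two_dot_products with four input vectors returns both dot products in a single value, each appearing twice, which mirrors the final contents of xmm0 described in the listing's last comment.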