Intel IA-32 Computer Accessories User Manual


 
General Optimization Guidelines 2
Recommendation: Use the compiler switch to generate SSE2 scalar
floating-point code rather than x87 code.
When working with scalar SSE/SSE2 code, pay attention to the need for
clearing the content of unused slots in an xmm register and the
associated performance impact. For example, loading data from
memory with movss or movsd causes an extra micro-op for zeroing
the upper part of the xmm register.
On Pentium M, Intel Core Solo and Intel Core Duo processors, this
penalty can be avoided by using movlpd. However, using movlpd
causes a performance penalty on Pentium 4 processors.
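The difference between the two load forms can be observed from C. The following is a minimal sketch, not taken from this manual: the function names are hypothetical, and it assumes that _mm_load_sd and _mm_loadl_pd correspond to movsd and movlpd (a compiler is free to select other instructions). Where SSE2 is not available, the fallback branches only mirror the expected results.

```c
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Upper 64-bit lane after a movsd-style load from memory.
   The load clears the upper lane, so the result is always 0.0. */
double upper_after_movsd(const double *p)
{
#if HAVE_SSE2
    __m128d v = _mm_load_sd(p);                   /* low = *p, high = 0.0 */
    return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));  /* read the high lane */
#else
    (void)p;
    return 0.0;                                   /* fallback: mirrors the result */
#endif
}

/* Upper 64-bit lane after a movlpd-style load into a register whose
   high lane previously held old_hi. Only the low lane is written. */
double upper_after_movlpd(double old_hi, const double *p)
{
#if HAVE_SSE2
    __m128d old = _mm_set_pd(old_hi, 0.0);        /* high = old_hi, low = 0.0 */
    __m128d v = _mm_loadl_pd(old, p);             /* low = *p, high unchanged */
    return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));  /* read the high lane */
#else
    (void)p;
    return old_hi;                                /* fallback: mirrors the result */
#endif
}
```

With x = 3.25, upper_after_movsd(&x) yields 0.0, while upper_after_movlpd(7.5, &x) yields 7.5. The zeroing that movsd performs is the extra work described above; movlpd skips it but leaves the result dependent on the old register contents.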
Another situation occurs when mixing single-precision and
double-precision code. On Pentium 4 processors, using cvtss2sd incurs
a performance penalty relative to the alternative sequence:
xorps xmm1, xmm1
movss xmm1, xmm2
cvtps2pd xmm1, xmm1
On Intel Core Solo and Intel Core Duo processors, using cvtss2sd is
preferable to the alternative sequence.
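The two conversion paths produce the same double-precision value; they differ only in cost on a given processor. The following sketch, with hypothetical function names, writes both paths using SSE2 intrinsics. It assumes the intrinsics map to the instructions named in the comments, which the compiler does not guarantee, and it falls back to a plain cast where SSE2 is not available.

```c
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Direct conversion, nominally cvtss2sd. */
double widen_cvtss2sd(float f)
{
#if HAVE_SSE2
    __m128  s = _mm_set_ss(f);
    __m128d d = _mm_cvtss_sd(_mm_setzero_pd(), s);  /* cvtss2sd */
    return _mm_cvtsd_f64(d);
#else
    return (double)f;                               /* portable fallback */
#endif
}

/* Alternative sequence: clear, move the scalar in, convert packed. */
double widen_alternative(float f)
{
#if HAVE_SSE2
    __m128  src = _mm_set_ss(f);
    __m128  t   = _mm_setzero_ps();                 /* xorps    xmm1, xmm1 */
    t           = _mm_move_ss(t, src);              /* movss    xmm1, xmm2 */
    __m128d d   = _mm_cvtps_pd(t);                  /* cvtps2pd xmm1, xmm1 */
    return _mm_cvtsd_f64(d);
#else
    return (double)f;                               /* portable fallback */
#endif
}
```

Both functions return (double)f for any finite input, so the choice between them can be made per target processor without affecting results.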
Memory Operands
Double-precision floating-point operands that are eight-byte aligned
have better performance than operands that are not eight-byte aligned,
since they are less likely to incur penalties for cache and MOB splits.
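The alignment argument can be checked with simple address arithmetic. The helper names below are hypothetical, and the 64-byte cache line size used in the example is an assumption; the actual line size depends on the processor.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if addr is a multiple of eight. */
int is_aligned8(uintptr_t addr)
{
    return (addr & (uintptr_t)7) == 0;
}

/* Nonzero if an operand of `size` bytes starting at addr crosses a
   boundary between two cache lines of `line_size` bytes.
   line_size is an assumed parameter, for example 64. */
int splits_line(uintptr_t addr, size_t size, size_t line_size)
{
    size_t offset = (size_t)(addr % line_size);  /* position within the line */
    return offset + size > line_size;
}
```

To test a real operand, pass (uintptr_t)&value. With 64-byte lines, an eight-byte operand at offset 56 fits in one line, while the same operand at offset 60 is split across two. Because the line size is a multiple of eight, an eight-byte-aligned double can never be split this way.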
A floating-point operation on a memory operand requires that the operand
be loaded from memory. This incurs an additional µop, which can have
a minor negative impact on front end bandwidth. Additionally, memory
operands may cause a data cache miss, which incurs a penalty.