196 Optimizing with SIMD Instructions Chapter 9
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
9.2 Improving Scalar SSE and SSE2 Floating-Point Performance with MOVLPD and MOVLPS When Loading Data from Memory
Optimization
Use the MOVLPS and MOVLPD instructions to move scalar floating-point data into the XMM
registers prior to addition, multiplication, or other scalar instructions.
Application
This optimization applies to:
• 32-bit software
• 64-bit software
Rationale—Single Precision
The MOVSS instruction is used to move scalar single-precision floating-point data into the XMM
registers prior to addition (ADDSS) and multiplication (MULSS) or other scalar instructions. In
addition to loading a 32-bit floating-point value into the XMM register, the MOVSS instruction clears
the upper 96 bits of the register. Clearing part of the XMM register is an inefficiency that you can
bypass by using the MOVLPS instruction. MOVLPS loads two single-precision floating-point values
(64 bits) from memory into the lower half of the XMM register without clearing or otherwise
modifying the upper 64 bits of the register.
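
The difference in register contents can be observed from C with the SSE intrinsics that normally map to these instructions (_mm_load_ss for MOVSS, _mm_loadl_pi for MOVLPS). The following is a minimal sketch, not code from this guide: the function names load_like_movss and load_like_movlps are hypothetical, and the compiler is free to select a different instruction than the one named.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ss, _mm_loadl_pi */

/* MOVSS-style load: lane 0 receives *p, lanes 1..3 (the upper 96 bits)
   are cleared to zero. The resulting register is copied to out[0..3]. */
static void load_like_movss(const float *p, float out[4])
{
    __m128 r = _mm_load_ss(p);
    _mm_storeu_ps(out, r);
}

/* MOVLPS-style load: lanes 0..1 receive p[0] and p[1]; lanes 2..3
   (the upper 64 bits) keep the values they had in `prev`.
   Note that p[1] must be readable memory. */
static void load_like_movlps(const float prev[4], const float *p, float out[4])
{
    __m128 r = _mm_loadu_ps(prev);           /* previous register contents */
    r = _mm_loadl_pi(r, (const __m64 *)p);   /* overwrite the low 64 bits only */
    _mm_storeu_ps(out, r);
}
```

Calling load_like_movlps with prev holding four copies of 9.0 and p pointing at {1.0, 2.0} leaves 9.0 in the two upper lanes, whereas load_like_movss on the same source leaves zeros in the three upper lanes.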
The latency of the MOVSS instruction is 3 cycles, whereas the latency of the MOVLPS instruction is
2 cycles. The AMD Athlon™ 64 and AMD Opteron™ processors can perform two 64-bit loads per
clock cycle. Two 64-bit MOVLPS loads can be issued in the same cycle, assuming the data is 8-byte
aligned. Two MOVSS loads can likewise be performed per cycle, but, unlike MOVLPS, each MOVSS load
requires additional operations to clear the upper part of the register, and these operations
interfere with the MULSS and ADDSS instructions. Using MOVLPS rather than MOVSS to load
single-precision scalar data from memory in processor-limited, floating-point-intensive code can
result in significant performance increases.
Consider the following caveats when using the MOVLPS instruction:
• When accessing 4-byte-aligned addresses that are not 8-byte aligned, MOVLPS loads take an
additional cycle.
• Because MOVLPS loads two floating-point values instead of one, using it to access the last
value in a single-precision array also reads the 4 bytes of memory directly after the end of the
array, which may cause an access violation. To avoid an access violation, use MOVSS to access the
last value in the array, or pad the array with a dummy floating-point value.
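
Both caveats can be handled in code along the following lines. This is a sketch under stated assumptions, not code from this guide: the names is_8_byte_aligned and scalar_sum are hypothetical, the intrinsics _mm_loadl_pi and _mm_load_ss are assumed to compile to MOVLPS and MOVSS (the compiler may choose otherwise), and the loop is written for clarity rather than tuned for throughput.

```c
#include <stddef.h>     /* size_t */
#include <stdint.h>     /* uintptr_t */
#include <xmmintrin.h>  /* SSE intrinsics */

/* Returns nonzero when p is 8-byte aligned, that is, when a 64-bit
   MOVLPS-style load from p avoids the extra cycle described above. */
static int is_8_byte_aligned(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}

/* Sums n floats using scalar adds (ADDSS-style).
   Every element except the last is loaded 64 bits at a time, which also
   reads data[i + 1]; that read stays inside the array because i + 1 < n.
   The last element is loaded 32 bits at a time so that nothing past the
   end of the array is touched. */
static float scalar_sum(const float *data, size_t n)
{
    __m128 acc = _mm_setzero_ps();  /* running sum lives in lane 0 */
    __m128 x = _mm_setzero_ps();    /* load target; upper lanes are never used */
    size_t i;

    if (n == 0) {
        return 0.0f;
    }
    for (i = 0; i + 1 < n; ++i) {
        /* For odd i the address is 4-byte but not 8-byte aligned,
           so the load may take the additional cycle. */
        x = _mm_loadl_pi(x, (const __m64 *)(data + i));
        acc = _mm_add_ss(acc, x);   /* adds lane 0 only */
    }
    /* Tail element: 32-bit load, no read beyond data[n - 1]. */
    acc = _mm_add_ss(acc, _mm_load_ss(data + n - 1));
    return _mm_cvtss_f32(acc);
}
```

The alternative named in the second caveat, padding the array with one dummy value, would let the loop run to i < n and drop the separate tail load, at the cost of one extra element of storage per array.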