IA-32 Intel® Architecture Optimization
When using scalar floating-point instructions, it is not necessary to
ensure that the data appears in vector form. However, the
optimizations for alignment, scheduling, and instruction selection,
along with the other optimizations covered in Chapter 2 and Chapter 3,
should still be observed.
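As a minimal sketch of the scalar case, the C fragment below uses the
SSE scalar intrinsics on fields that are not laid out as a vector. The
struct sample type and the scale_sample function are hypothetical names
introduced only for this illustration; a compiler would typically map
the intrinsics to movss, mulss, and addss.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical record: the float fields are separated by integers,
   so they cannot be fetched with a single packed load. */
struct sample {
    int   id;
    float gain;
    int   flags;
    float value;
};

/* Computes gain * value + bias using scalar SSE operations only.
   Each _mm_load_ss reads one float into the low lane and zeroes the
   upper three lanes; no vector arrangement of the data is needed. */
float scale_sample(const struct sample *s, float bias)
{
    __m128 g = _mm_load_ss(&s->gain);
    __m128 v = _mm_load_ss(&s->value);
    __m128 b = _mm_set_ss(bias);
    __m128 r = _mm_add_ss(_mm_mul_ss(g, v), b);  /* low lane only */
    return _mm_cvtss_f32(r);                     /* extract low lane */
}
```

The data layout is left untouched here; only the instruction selection
changes.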
Data Alignment
SIMD floating-point data should be 16-byte aligned. Referencing
unaligned 128-bit SIMD floating-point data results in an exception
unless the movups or movupd (move unaligned packed single or unaligned
packed double) instruction is used. The unaligned instructions, whether
used on aligned or unaligned data, also suffer a performance penalty
relative to aligned accesses.
Refer to section “Stack and Data Alignment” in Chapter 3 for more
information.
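The difference between the aligned and unaligned forms can be sketched
with C intrinsics. The function names is_aligned16, add4_aligned, and
add4_unaligned are assumptions made for this example;
_mm_load_ps/_mm_store_ps correspond to aligned (movaps) accesses and
_mm_loadu_ps/_mm_storeu_ps to unaligned (movups) accesses.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Returns nonzero if p lies on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* dst[0..3] = a[0..3] + b[0..3] using aligned accesses.
   All three pointers must be 16-byte aligned; otherwise the aligned
   load or store would fault, so the precondition is asserted first. */
void add4_aligned(float *dst, const float *a, const float *b)
{
    assert(is_aligned16(dst) && is_aligned16(a) && is_aligned16(b));
    _mm_store_ps(dst, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
}

/* Same computation for pointers of any alignment. This never faults
   on alignment, but is expected to be slower than the aligned form. */
void add4_unaligned(float *dst, const float *a, const float *b)
{
    _mm_storeu_ps(dst, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

A caller can obtain suitably aligned storage in C11 with a declaration
such as _Alignas(16) float buf[8]; older compilers offer
vendor-specific alignment attributes instead.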
Data Arrangement
Because SSE and SSE2 incorporate a SIMD architecture, arranging the
data to fully use the SIMD registers produces optimum performance.
This implies contiguous data for processing, which leads to fewer cache
misses and can potentially quadruple the data throughput when using
SSE, or double it when using SSE2. These performance gains can occur
because four data elements can be loaded with a 128-bit load
instruction into an XMM register using SSE (movaps, move aligned packed
single precision). Similarly, two data elements can be loaded with a
128-bit load instruction into an XMM register using SSE2 (movapd, move
aligned packed double precision).
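The following sketch shows how contiguous, aligned arrays let each
128-bit load fill a whole XMM register: four floats per iteration with
SSE, two doubles per iteration with SSE2. The function names mul_ps and
mul_pd are hypothetical, and the code assumes that every pointer is
16-byte aligned and that n is a multiple of the vector width; a
production routine would also handle the remainder.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE:  __m128,  packed single precision */
#include <emmintrin.h>  /* SSE2: __m128d, packed double precision */

/* dst[i] = a[i] * b[i] for n floats, four elements per iteration.
   Assumes 16-byte aligned pointers and n divisible by 4. */
void mul_ps(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);   /* aligned load of 4 floats */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(dst + i, _mm_mul_ps(va, vb));
    }
}

/* dst[i] = a[i] * b[i] for n doubles, two elements per iteration.
   Assumes 16-byte aligned pointers and n divisible by 2. */
void mul_pd(double *dst, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        __m128d va = _mm_load_pd(a + i);  /* aligned load of 2 doubles */
        __m128d vb = _mm_load_pd(b + i);
        _mm_store_pd(dst + i, _mm_mul_pd(va, vb));
    }
}
```

For the same element count, the single-precision loop runs a quarter as
many iterations as a scalar loop and the double-precision loop half as
many, which is where the potential fourfold and twofold throughput
figures come from.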
Refer to section “Stack and Data Alignment” in Chapter 3 for data
arrangement recommendations. Duplicating and padding techniques
overcome the misalignment problem that can occur in some data
structures and arrangements. These techniques increase the data space
but avoid the expensive penalty for misaligned data access.
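A minimal sketch of the padding technique follows (duplication is not
shown). A three-component vector occupies 12 bytes, so consecutive
array elements drift off 16-byte boundaries; adding one unused float
keeps every element aligned at the cost of a third more storage. The
types vec3 and vec3_padded and the function scale_all are hypothetical
names, and the C11 _Alignas specifier is assumed to be available.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Unpadded layout: 12 bytes per element, so in an array most elements
   do not start on a 16-byte boundary. */
struct vec3 {
    float x, y, z;
};

/* Padded layout: c[0..2] hold x, y, z and c[3] is unused padding.
   Each element is 16 bytes and starts on a 16-byte boundary. */
struct vec3_padded {
    _Alignas(16) float c[4];
};

/* Multiplies every vector in v by k using one aligned load and one
   aligned store per element. The padding lane is scaled as well,
   which is harmless because it carries no data. */
void scale_all(struct vec3_padded *v, size_t count, float k)
{
    __m128 vk = _mm_set1_ps(k);
    for (size_t i = 0; i < count; ++i) {
        __m128 p = _mm_load_ps(v[i].c);
        _mm_store_ps(v[i].c, _mm_mul_ps(p, vk));
    }
}
```

With the unpadded struct vec3, the same loop would need unaligned loads
or per-element shuffling, so the extra four bytes per element trade
memory for simpler and faster access.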