IA-32 Intel® Architecture Optimization
Another way to improve data alignment is to copy the data into
locations that are aligned on 64-bit boundaries. When the data is
accessed frequently, this can provide a significant performance
improvement.
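As a minimal sketch of this copy-to-aligned-storage idea, the C++ fragment below copies four 16-bit values from an arbitrary address into a staging buffer declared on a 64-bit (8-byte) boundary. The helper names is_aligned_64bit and sum_via_aligned_copy are hypothetical and chosen only for illustration. The copy is worthwhile only when the data is then read many times.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true when p lies on a 64-bit (8-byte) boundary.
bool is_aligned_64bit(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 8 == 0;
}

// Hypothetical helper: sums four 16-bit values that start at an arbitrary,
// possibly misaligned, address by first copying them into aligned storage.
int sum_via_aligned_copy(const unsigned char* src) {
    alignas(8) std::int16_t aligned[4];         // 64-bit-aligned staging buffer
    std::memcpy(aligned, src, sizeof aligned);  // one-time copy cost
    assert(is_aligned_64bit(aligned));          // guaranteed by alignas(8)

    int sum = 0;
    for (std::int16_t v : aligned) {
        sum += v;                               // every access is now aligned
    }
    return sum;
}
```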
Data Alignment for 128-bit Data
Data must be 16-byte aligned when it is loaded into or stored from the
128-bit XMM registers used by SSE and SSE2. Misaligned data causes
severe performance penalties at best and execution faults at worst.
Move instructions (and intrinsics) exist that copy unaligned data into
and out of the XMM registers, but they are much slower than their
aligned counterparts. If the data is not 16-byte aligned and neither the
programmer nor the compiler detects this, using the aligned
instructions causes a fault. So, the rule is: keep the data 16-byte
aligned. Such alignment also works for MMX technology code, even
though MMX technology only requires 8-byte alignment. The following
discussion and examples describe alignment techniques for the
Pentium 4 processor as implemented with the Intel C++ Compiler.
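The difference between the aligned and unaligned moves can be sketched with the SSE intrinsics _mm_load_ps (aligned) and _mm_loadu_ps (unaligned) from xmmintrin.h. The helper name sum4 and the run-time address check are assumptions made for illustration. The rule above still applies: keeping the data 16-byte aligned avoids needing the slower path at all.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: sums four consecutive floats starting at p.
float sum4(const float* p) {
    assert(p != nullptr);

    // Pick the load based on the address. The aligned load may fault if it
    // is ever given an address that is not a multiple of 16.
    __m128 v = (reinterpret_cast<std::uintptr_t>(p) % 16 == 0)
                   ? _mm_load_ps(p)    // aligned 128-bit load
                   : _mm_loadu_ps(p);  // unaligned 128-bit load, any address

    alignas(16) float out[4];          // aligned destination for the store
    _mm_store_ps(out, v);              // aligned 128-bit store
    return out[0] + out[1] + out[2] + out[3];
}
```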
Compiler-Supported Alignment
The Intel C++ Compiler provides the following methods to ensure that
the data is aligned.
Alignment by F32vec4 or __m128 Data Types. When the compiler detects
F32vec4 or __m128 data declarations or parameters, it forces
alignment of the object to a 16-byte boundary for both global and local
data, as well as parameters. If the declaration is within a function, the
compiler will also align the function's stack frame to ensure that local
data and parameters are 16-byte-aligned. For details on the stack frame
layout that the compiler generates for both debug and optimized
(“release”-mode) compilations, please refer to the relevant Intel
application notes in the Intel Architecture Performance Training Center
provided with the SDK.
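A short check along these lines can confirm the behavior for __m128 objects. This is a sketch only: it uses the __m128 type from xmmintrin.h rather than the Intel-specific F32vec4 class, and the names g_vec, is_aligned_16, and local_and_global_aligned are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // defines __m128

// The type itself carries a 16-byte alignment requirement.
static_assert(alignof(__m128) == 16, "__m128 should be 16-byte aligned");

__m128 g_vec;  // global __m128 object (hypothetical name)

// Hypothetical helper: true when p lies on a 16-byte boundary.
bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Checks that both a global and a stack-allocated __m128 are 16-byte aligned.
bool local_and_global_aligned() {
    __m128 local = _mm_set1_ps(1.0f);  // local object on the stack frame
    return is_aligned_16(&local) && is_aligned_16(&g_vec);
}
```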