IA-32 Intel® Architecture Optimization
Assuming you have a 64-bit aligned data vector and a 64-bit aligned
coefficients vector, the filter operation on the first data element will be fully
aligned. For the second data element, however, access to the data vector
will be misaligned. For an example of how to avoid the misalignment
problem in the FIR filter, please refer to the application notes on Streaming
SIMD Extensions and filters. The application notes are available at
http://developer.intel.com/IDS.
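The misalignment described above can be seen without any SIMD code at
all. The following is a minimal sketch, not the method from the
application notes. It assumes 16-bit samples and a 4-tap filter, so that
one filter window fills exactly one 64-bit word; the names TAPS,
is_aligned64, fir_output, and fir_window_aligned are hypothetical and
exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes for illustration: 16-bit samples, 4 taps (one 64-bit word). */
enum { TAPS = 4 };

/* Returns nonzero if p lies on an 8-byte (64-bit) boundary. */
static int is_aligned64(const void *p)
{
    return ((uintptr_t)p % 8u) == 0;
}

/* Scalar reference FIR: y[n] = sum over k of c[k] * x[n + k].
   The window for output n starts at &x[n], so its byte offset from the
   start of x grows by sizeof(int16_t) = 2 with every output element. */
static int32_t fir_output(const int16_t *x, const int16_t *c, size_t n)
{
    int32_t acc = 0;
    for (size_t k = 0; k < TAPS; ++k)
        acc += (int32_t)c[k] * x[n + k];
    return acc;
}

/* Returns nonzero if a 64-bit load of the window for output n would be
   aligned, given that x itself is 64-bit aligned. */
static int fir_window_aligned(const int16_t *x, size_t n)
{
    assert(is_aligned64(x));
    return is_aligned64(&x[n]);
}
```

Under these assumptions, the window for output 0 starts on a 64-bit
boundary, while the windows for outputs 1, 2, and 3 start 2, 4, and 6
bytes past it. Only every fourth window is aligned.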
Duplication and padding of data structures can be used to avoid the
problem of data accesses in algorithms which are inherently misaligned.
The “Data Structure Layout” section discusses further trade-offs for
how data structures are organized.
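One way to picture duplication and padding is sketched below. This is an
illustrative scalar model under stated assumptions, not the
implementation from the application notes: it assumes 16-bit samples, a
4-tap filter, and 4 samples per 64-bit word, and the names LANES, PADDED,
build_shifted_coeffs, and fir_output_aligned are hypothetical. The
coefficients are stored in four shifted, zero-padded copies, so every
block of data is read starting at an aligned index and the shift is
absorbed by the choice of coefficient copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed sizes: 4 taps, 4 samples per 64-bit word, 8-entry padded copies. */
enum { TAPS = 4, LANES = 4, PADDED = TAPS + LANES };

/* Duplication: build LANES shifted, zero-padded copies of the coefficients.
   dup[s][j] = c[j - s] for s <= j < s + TAPS, and 0 elsewhere. */
static void build_shifted_coeffs(const int16_t c[TAPS],
                                 int16_t dup[LANES][PADDED])
{
    memset(dup, 0, sizeof(int16_t) * LANES * PADDED);
    for (size_t s = 0; s < LANES; ++s)
        for (size_t k = 0; k < TAPS; ++k)
            dup[s][s + k] = c[k];
}

/* Computes y[n] = sum over k of c[k] * x[n + k], but reads x only in
   blocks that start at a multiple of LANES elements.
   Padding: x must be readable through x[base + PADDED - 1]. */
static int32_t fir_output_aligned(const int16_t *x,
                                  int16_t dup[LANES][PADDED], size_t n)
{
    size_t s = n % LANES;   /* which shifted coefficient copy to use */
    size_t base = n - s;    /* aligned start of the data block */
    int32_t acc = 0;
    for (size_t j = 0; j < PADDED; ++j)
        acc += (int32_t)dup[s][j] * x[base + j];
    return acc;
}
```

In this sketch the coefficient storage grows from 4 entries to 4 copies
of 8 entries, and each output multiplies 8 pairs instead of 4. That is
the data-size cost paid for keeping every access aligned.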
CAUTION. The duplication and padding technique
overcomes the misalignment problem, thus avoiding
the expensive penalty for misaligned data access, at
the cost of increasing the data size. When developing
your code, you should consider this trade-off and use
the option which gives the best performance.
Stack Alignment For 128-bit SIMD Technologies
For best performance, the Streaming SIMD Extensions and Streaming
SIMD Extensions 2 require their memory operands to be aligned to
16-byte (16B) boundaries. Unaligned data can cause significant
performance penalties compared to aligned data. However, the existing
software conventions for IA-32 (stdcall, cdecl, fastcall), as
implemented in most compilers, do not provide any mechanism for
ensuring that certain local data and certain parameters are 16-byte
aligned. Therefore, Intel has defined a new set of IA-32 software
conventions for alignment to support the new __m128* datatypes
(__m128, __m128d, and __m128i) that meet the following conditions: