IA-32 Intel® Architecture Optimization
The examples that follow illustrate coding adjustments that enable an
algorithm to benefit from the Streaming SIMD Extensions (SSE). The same
techniques may be used for single-precision floating-point,
double-precision floating-point, and integer data under SSE2, SSE, and
MMX technology.
As a basis for the usage model discussed in this section, consider the
simple loop shown in Example 3-8.
Note that the loop runs for only four iterations. Four single-precision
values fill one 128-bit SSE register exactly, so the loop body can be
replaced directly with Streaming SIMD Extensions code.
Because optimal use of the Streaming SIMD Extensions requires data
aligned on a 16-byte boundary, all examples in this chapter assume that
the arrays passed to the routine (a, b, and c) are aligned to 16-byte
boundaries by the calling routine. For methods to ensure this
alignment, refer to the application notes for the Pentium 4 processor.
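The alignment assumption can be made concrete on the caller's side. The
sketch below is not taken from the application notes; it assumes a C11
compiler and uses the standard alignas specifier. The helper names
is_aligned16, require_aligned16, and caller_arrays_are_aligned are
hypothetical and serve only to illustrate one way a calling routine
might declare and check 16-byte-aligned arrays.

```c
#include <assert.h>
#include <stdalign.h>   /* C11: alignas */
#include <stdint.h>     /* uintptr_t */

/* Hypothetical helper: nonzero if p lies on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical debug-build guard a routine could call on entry. */
static void require_aligned16(const void *p)
{
    assert(is_aligned16(p));
}

/* Hypothetical caller: declares 16-byte-aligned arrays of the kind the
   examples assume, and reports whether all three are aligned. */
static int caller_arrays_are_aligned(void)
{
    alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    alignas(16) float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    alignas(16) float c[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    return is_aligned16(a) && is_aligned16(b) && is_aligned16(c);
}
```

Compilers that predate C11 typically offer their own alignment
attributes instead; the check itself is independent of how the
alignment was obtained.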
The sections that follow provide details on the coding methodologies:
inlined assembly, intrinsics, C++ vector classes, and automatic
vectorization.
Example 3-8 Simple Four-Iteration Loop
void add(float *a, float *b, float *c)
{
    int i;
    for (i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }
}
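To illustrate why four iterations make the replacement simple, the
following is a minimal sketch of the same computation written with SSE
intrinsics from <xmmintrin.h>. It is an illustration under stated
assumptions, not the manual's own listing: the name add_sse is
hypothetical, an x86 compiler with SSE support is assumed, and all three
pointers are assumed to be 16-byte aligned, which the aligned load and
store forms require.

```c
#include <assert.h>
#include <stdint.h>     /* uintptr_t */
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Hypothetical SSE counterpart of the four-iteration loop.
   Assumption: a, b, and c each point to four floats on a 16-byte boundary. */
void add_sse(float *a, float *b, float *c)
{
    /* Debug-build guard for the alignment assumption. */
    assert(((uintptr_t)a & 15u) == 0);
    assert(((uintptr_t)b & 15u) == 0);
    assert(((uintptr_t)c & 15u) == 0);

    __m128 va = _mm_load_ps(a);           /* load a[0..3] with one aligned load */
    __m128 vb = _mm_load_ps(b);           /* load b[0..3] with one aligned load */
    _mm_store_ps(c, _mm_add_ps(va, vb));  /* four additions, one aligned store */
}
```

The four scalar additions collapse into a single packed addition because
the loop's trip count matches the number of single-precision elements in
one 128-bit register.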