IA-32 Intel® Architecture Optimization
Data elements in parallel. The number of elements that can be operated
on in parallel ranges from two double-precision floating-point data
elements in Streaming SIMD Extensions 2, through four single-precision
floating-point data elements in Streaming SIMD Extensions, to sixteen
byte operations in a 128-bit register in Streaming SIMD Extensions 2.
Thus the vector length ranges from 2 to 16, depending on the
instruction extensions used and on the data type.
The Intel C++ Compiler supports vectorization in three ways:
• The compiler may be able to generate SIMD code without
intervention from the user.
• The user inserts pragmas to help the compiler realize that it can
vectorize the code.
• The user may write SIMD code explicitly using intrinsics and C++
classes.
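As an illustration of the third approach, the following minimal sketch adds two float arrays with Streaming SIMD Extensions intrinsics. The function name add_arrays is hypothetical. The sketch assumes a compiler that provides xmmintrin.h and defines __SSE__ when SSE is enabled; otherwise it falls back to the scalar loop.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* add_arrays is a hypothetical helper: c[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Explicit SIMD: four single-precision elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar remainder, or the whole loop when SSE is not available. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The unaligned load and store forms are used so that the sketch makes no assumption about the alignment of the arrays; aligned data would allow the aligned forms instead.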
To help the compiler generate SIMD code:
• avoid global pointers
• avoid global variables
Global pointers and variables may be less of a problem if all modules are
compiled simultaneously and whole-program optimization is used.
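The two guidelines above can be sketched as follows. All names (g_src, g_dst, g_n, scale_copy_global, scale_copy_local) are hypothetical, and the C99 restrict qualifier is an addition not mentioned in the text. Whether a particular compiler vectorizes either loop depends on the compiler and its options.

```c
#include <stddef.h>

/* Hypothetical global state, as it might appear in existing code. */
float *g_src;
float *g_dst;
size_t g_n;

/* Harder case: the loop works through global pointers, so the compiler
 * has to reason about whether g_dst and g_src can overlap. */
void scale_copy_global(float k)
{
    for (size_t i = 0; i < g_n; i++)
        g_dst[i] = k * g_src[i];
}

/* Easier case: the same loop over parameters. The restrict qualifiers
 * are the caller's promise that dst and src do not overlap. */
void scale_copy_local(float *restrict dst, const float *restrict src,
                      size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```

Existing code that must keep its globals can still call the second form, for example scale_copy_local(g_dst, g_src, g_n, k), provided the two buffers really are distinct.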
User/Source Coding Rule 17. (H impact, M generality) Use the smallest
possible floating-point or SIMD data type, to enable more parallelism with the
use of a (longer) SIMD vector. For example, use single precision instead of
double precision where possible.
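The arithmetic behind this rule can be sketched as follows, assuming a 128-bit register as in Streaming SIMD Extensions and Streaming SIMD Extensions 2. The helper lanes_per_register is hypothetical and is not part of any Intel API.

```c
#include <stddef.h>

/* Width of one SSE/SSE2 register in bytes (128 bits). */
#define SIMD_REGISTER_BYTES 16

/* Hypothetical helper: how many elements of the given size fit in one
 * 128-bit register, that is, the vector length for that data type. */
size_t lanes_per_register(size_t element_bytes)
{
    return SIMD_REGISTER_BYTES / element_bytes;
}
```

With a 4-byte float and an 8-byte double, as on IA-32, lanes_per_register(sizeof(float)) is 4 and lanes_per_register(sizeof(double)) is 2, so a loop that can use single precision processes twice as many elements per SIMD instruction.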
User/Source Coding Rule 18. (M impact, ML generality) Arrange the
nesting of loops so that the innermost nesting level is free of inter-iteration
dependencies. In particular, avoid the case where the store of data in an
earlier iteration happens lexically after the load of that data in a future
iteration, which is called a lexically backward dependence.
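A minimal sketch of a lexically backward dependence and one possible way to remove it by reordering statements. The function names are hypothetical, and the arrays a, b, and c are assumed not to overlap. Both loops compute the same result; only the order of the two statements differs.

```c
#include <stddef.h>

/* Lexically backward dependence: a[i] is stored in iteration i and
 * loaded (as a[i - 1]) in iteration i + 1, but the store statement
 * appears after the load statement in the loop body. */
void backward_dep(float *a, float *b, const float *c, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        b[i] = a[i - 1] + 1.0f;  /* load of a value stored one iteration earlier */
        a[i] = c[i];             /* store appears lexically after that load */
    }
}

/* Same computation with the statements swapped, so the store precedes
 * the load lexically. The two statements touch different elements
 * within one iteration, so the swap does not change the result. */
void forward_dep(float *a, float *b, const float *c, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        a[i] = c[i];
        b[i] = a[i - 1] + 1.0f;
    }
}
```

In the second form a vectorizer can typically execute the store statement for a block of iterations and then the load statement for the same block, which preserves the dependence. This kind of reordering is only valid when the statements within one iteration are independent of each other, as they are here.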
The integer part of the SIMD instruction set extensions is primarily
targeted at 16-bit operands. Not all of the operators are supported for
32 bits, meaning that some source code cannot be vectorized at all
unless smaller operands are used.
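As a hedged illustration, consider an element-wise integer multiply. Streaming SIMD Extensions 2 provides a packed 16-bit multiply (PMULLW) but no packed 32-bit low multiply, so the 16-bit version below is the better candidate for vectorization. The function names are hypothetical, and the 16-bit version assumes every product fits in 16 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit operands: the multiply has no direct packed equivalent in
 * the integer SIMD extensions discussed here. */
void scale_i32(int32_t *v, size_t n, int32_t k)
{
    for (size_t i = 0; i < n; i++)
        v[i] = v[i] * k;
}

/* 16-bit operands: eight elements fit in one 128-bit register and
 * the multiply maps onto a packed 16-bit instruction.
 * Assumption: each product v[i] * k fits in an int16_t. */
void scale_i16(int16_t *v, size_t n, int16_t k)
{
    for (size_t i = 0; i < n; i++)
        v[i] = (int16_t)(v[i] * k);
}
```

Switching to the smaller type is only appropriate when the value range of the data is known to fit; otherwise the results differ from the 32-bit version.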