IA-32 Intel® Architecture Optimization
simultaneously referred to as an xyz data representation, see the
diagram below) are computed in parallel, and the array is updated one
vertex at a time.
When data structures are organized for the horizontal computation
model, the homogeneous arithmetic operations available in SSE and
SSE2 can cause inefficiency or require additional intermediate data
movement between elements.
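The extra data movement can be seen in a minimal sketch of a horizontal
dot product. The VertexAoS type and the dot_aos function are hypothetical
names introduced for illustration only, and the code assumes an x86
compiler with SSE enabled. The multiply is a single SIMD operation, but
adding the four products together takes two shuffle-style moves within
the register.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 compiler with SSE enabled */

/* Hypothetical AoS vertex: the x, y, z, w of one vector fill one SIMD register. */
typedef struct {
    float x, y, z, w;
} VertexAoS;

/* Horizontal dot product of two xyzw vectors.
   The multiply is one SIMD operation, but summing the four products
   requires moving data between elements of the same register. */
static inline float dot_aos(const VertexAoS *a, const VertexAoS *b)
{
    __m128 va   = _mm_loadu_ps((const float *)a);   /* [ax, ay, az, aw] */
    __m128 vb   = _mm_loadu_ps((const float *)b);   /* [bx, by, bz, bw] */
    __m128 prod = _mm_mul_ps(va, vb);               /* [p0, p1, p2, p3] */

    /* Swap adjacent pairs and add: [p0+p1, p0+p1, p2+p3, p2+p3]. */
    __m128 swap = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sum2 = _mm_add_ps(prod, swap);

    /* Move the upper pair down and add: slot 0 now holds p0+p1+p2+p3. */
    __m128 high = _mm_movehl_ps(sum2, sum2);
    __m128 sum4 = _mm_add_ss(sum2, high);

    return _mm_cvtss_f32(sum4);
}
```

Only slot 0 of the final register carries a useful value; the other three
slots are discarded, which is the inefficiency described above.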
Alternatively, the data structure can be organized in the SoA format.
The SoA data structure enables a vertical computation technique, and is
recommended over horizontal computation for many applications, for
the following reasons:
• When computing on a single vector (xyz), it is common to use only
a subset of the vector components; for example, in 3D graphics the W
component is sometimes ignored. This means that for single-vector
operations, 1 of 4 computation slots is not being utilized. This
typically results in a 25% reduction of peak efficiency.
• It may become difficult to hide long-latency operations. For
instance, another common function in 3D graphics is normalization,
which requires the computation of a reciprocal square root (that is,
1/sqrt). Both the division and the square root are long-latency
operations. With vertical computation (SoA), each of the 4
computation slots in a SIMD operation produces a unique result,
so the net latency per slot is L/4, where L is the overall latency of
the operation. However, for horizontal computation, the 4 computation
slots each produce the same result, so producing 4 separate
results requires a net latency per slot of L.
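The two points above can be illustrated with a minimal sketch of
vertical normalization. The Vec3x4SoA type and the normalize_soa
function are hypothetical names introduced for illustration only, and the
code assumes an x86 compiler with SSE enabled. Four vectors are stored
with like components together, so every SIMD slot does useful work and
one square root plus one division produce four independent results.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 compiler with SSE enabled */

/* Hypothetical SoA block: four vectors, with like components stored together. */
typedef struct {
    float x[4];
    float y[4];
    float z[4];
} Vec3x4SoA;

/* Normalize four vectors at once (vertical computation).
   Each SIMD slot works on a different vector; no slot is wasted on W,
   and no data moves between slots. Zero-length vectors are not handled. */
static inline void normalize_soa(Vec3x4SoA *v)
{
    __m128 x = _mm_loadu_ps(v->x);
    __m128 y = _mm_loadu_ps(v->y);
    __m128 z = _mm_loadu_ps(v->z);

    /* Squared lengths of all four vectors, one per slot. */
    __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)),
                             _mm_mul_ps(z, z));

    /* 1/sqrt for four vectors: the long-latency sqrt and divide are
       issued once and their cost is shared by four results. */
    __m128 inv_len = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(len2));

    _mm_storeu_ps(v->x, _mm_mul_ps(x, inv_len));
    _mm_storeu_ps(v->y, _mm_mul_ps(y, inv_len));
    _mm_storeu_ps(v->z, _mm_mul_ps(z, inv_len));
}
```

The sketch uses a full-precision square root and divide so the results
are easy to check. Where reduced precision is acceptable, the
approximate _mm_rsqrt_ps intrinsic could replace both operations.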
XYZW