Optimizing for SIMD Floating-point Applications 5
Now consider the case when the data is organized as SoA. Example 5-2
demonstrates how 4 results are computed with 5 instructions.
For the most efficient use of the four-component-wide registers,
reorganizing the data into the SoA format yields increased throughput
and hence much better performance for the instructions used.
As can be seen from this simple example, vertical computation yielded
100% use of the available SIMD registers and produced 4 results. (The
results may vary based on the application.) If the data structures must be
in a format that is not “friendly” to vertical computation, they can be
rearranged “on the fly” to achieve full utilization of the SIMD registers.
This operation is referred to as “swizzling,” and the reverse operation
is referred to as “deswizzling.”
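As a minimal illustration of what swizzling and deswizzling mean, the following C sketch converts four vertices between the two layouts with plain scalar code. The type and function names (VertexAoS, Vertex4SoA, swizzle4, deswizzle4) are hypothetical and exist only for this illustration; a SIMD implementation would replace the loops with shuffle instructions.

```c
#include <stddef.h>

/* AoS: one vertex per struct, components adjacent in memory (xyz xyz ...). */
typedef struct { float x, y, z; } VertexAoS;

/* SoA block of four vertices: xxxx, yyyy, zzzz. */
typedef struct { float x[4], y[4], z[4]; } Vertex4SoA;

/* Swizzle: gather four AoS vertices into one SoA block. */
static void swizzle4(const VertexAoS in[4], Vertex4SoA *out)
{
    for (size_t i = 0; i < 4; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

/* Deswizzle: scatter one SoA block back into four AoS vertices. */
static void deswizzle4(const Vertex4SoA *in, VertexAoS out[4])
{
    for (size_t i = 0; i < 4; ++i) {
        out[i].x = in->x[i];
        out[i].y = in->y[i];
        out[i].z = in->z[i];
    }
}
```

Each array in the SoA block maps directly onto one four-wide SIMD register, which is what makes vertical computation possible.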
Data Swizzling
Swizzling data from one format to another may be required in many
algorithms when the available instruction set extension is limited (e.g.,
only SSE is available). An example of this is AoS format, where the
vertices come as xyz adjacent coordinates. Rearranging them into SoA
format, xxxx, yyyy, zzzz, allows more efficient SIMD computations.
For efficient data shuffling and swizzling, use the following instructions:
• movlps, movhps load/store and move data on half sections of the
registers
• shufps, unpckhps, and unpcklps shuffle and unpack data
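The instructions listed above can be combined into an AoS-to-SoA swizzle. The following C sketch uses the SSE intrinsics that correspond to them (_mm_loadl_pi for movlps, _mm_loadh_pi for movhps, _mm_shuffle_ps for shufps, _mm_unpacklo_ps for unpcklps). It is one possible arrangement and not necessarily the sequence the manual has in mind. The Vertex type and the swizzle_xyz function are hypothetical names, and the sketch assumes tightly packed xyz vertices processed four at a time.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* AoS input: x, y, z adjacent, 12 bytes per vertex (assumed layout). */
typedef struct { float x, y, z; } Vertex;

/* Swizzle four AoS vertices into three SoA registers: xxxx, yyyy, zzzz.
   Lane comments list elements from lowest to highest. */
static void swizzle_xyz(const Vertex v[4], __m128 *x, __m128 *y, __m128 *z)
{
    /* movlps/movhps: place the (x, y) pairs of two vertices in one register. */
    __m128 xy01 = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)&v[0].x); /* x0 y0 0  0  */
    xy01 = _mm_loadh_pi(xy01, (const __m64 *)&v[1].x);                    /* x0 y0 x1 y1 */
    __m128 xy23 = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)&v[2].x); /* x2 y2 0  0  */
    xy23 = _mm_loadh_pi(xy23, (const __m64 *)&v[3].x);                    /* x2 y2 x3 y3 */

    /* shufps: pick the even lanes for x and the odd lanes for y. */
    *x = _mm_shuffle_ps(xy01, xy23, _MM_SHUFFLE(2, 0, 2, 0));             /* x0 x1 x2 x3 */
    *y = _mm_shuffle_ps(xy01, xy23, _MM_SHUFFLE(3, 1, 3, 1));             /* y0 y1 y2 y3 */

    /* z: scalar loads (movss), interleaved with unpcklps, merged with shufps. */
    __m128 z01 = _mm_unpacklo_ps(_mm_load_ss(&v[0].z), _mm_load_ss(&v[1].z)); /* z0 z1 0 0 */
    __m128 z23 = _mm_unpacklo_ps(_mm_load_ss(&v[2].z), _mm_load_ss(&v[3].z)); /* z2 z3 0 0 */
    *z = _mm_shuffle_ps(z01, z23, _MM_SHUFFLE(1, 0, 1, 0));                   /* z0 z1 z2 z3 */
}
```

The z components are loaded one at a time because an 8-byte load starting at the last vertex's z would read past the end of a tightly packed array. If the vertices were padded to four floats each, z could be gathered with the same movlps/movhps pattern as x and y.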
Example 5-2 Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation
mulps ; x*x' for all 4 x-components of 4 vertices
mulps ; y*y' for all 4 y-components of 4 vertices
mulps ; z*z' for all 4 z-components of 4 vertices
addps ; x*x' + y*y'
addps ; x*x' + y*y' + z*z'
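The pseudocode in Example 5-2 omits operands. A minimal C sketch of the same five-instruction sequence, written with SSE intrinsics, might look as follows. The function name dot4_soa and the use of unaligned loads and stores are assumptions made for this illustration; the arrays x, y, z and xp, yp, zp each hold one component of four vertices in SoA format.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Compute four dot products at once from SoA data:
   out[i] = x[i]*xp[i] + y[i]*yp[i] + z[i]*zp[i], for i = 0..3. */
static void dot4_soa(const float *x, const float *y, const float *z,
                     const float *xp, const float *yp, const float *zp,
                     float *out)
{
    /* Three multiplies (mulps), one per component, across all 4 vertices. */
    __m128 xx = _mm_mul_ps(_mm_loadu_ps(x), _mm_loadu_ps(xp)); /* x*x' */
    __m128 yy = _mm_mul_ps(_mm_loadu_ps(y), _mm_loadu_ps(yp)); /* y*y' */
    __m128 zz = _mm_mul_ps(_mm_loadu_ps(z), _mm_loadu_ps(zp)); /* z*z' */

    /* Two adds (addps) accumulate the products lane by lane. */
    __m128 sum = _mm_add_ps(xx, yy);   /* x*x' + y*y'        */
    sum = _mm_add_ps(sum, zz);         /* x*x' + y*y' + z*z' */

    _mm_storeu_ps(out, sum);           /* four results, one per vertex */
}
```

Every lane of every register carries useful data throughout, which is the 100% utilization that the text attributes to vertical computation.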