Chapter 9 Optimizing with SIMD Instructions 229
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
instructions that are not dependent on previous or presently executing operations so that the processor
can mask the execution latency by keeping itself busy, as illustrated below:
Instruction 0 2 4 6 8 10 12 14 16 18
MOVQ xxxxxx
MOVQ xxxxxx
MOVQ xxxxxx
MOVQ xxxxxx
PSWAPD xxxxxx
PSWAPD xxxxxx
PSWAPD xxxxxx
PSWAPD xxxxxx
PFMUL xxxxxxxxxxxxxxxxxx
PFMUL xxxxxxxxxxxxxxxxxx
PFMUL xxxxxxxxxxxxxxxxxx
PFMUL xxxxxxxxxxxxxxxxxx
PFMUL xxxxxxxxxxxxxxxxxx
PFMUL xxxxxxxxxxxxxxxxxx
PFMUL xxxxxxxxxxxxxxxxxx
PFMUL xxxxxxxxxxxxxxxxxx
PFPNACC xxxxxxxxxxxxxxxxxxx
PFPNACC xxxxxxxxxxxxxxxxxxx
PFPNACC xxxxxxxxxxxxxxxxxxx
PFPNACC xxxxxxxxxxxxxxxxxxx
Multiplying four complex single-precision numbers takes only 17 cycles, as opposed to 14 cycles to
multiply one complex single-precision number. The floating-point pipes are kept busy by feeding new
instructions into the floating-point pipeline each cycle. In the arrangement above, 24 floating-point
operations (six per complex multiplication) are performed in 17 cycles, compared with 6 operations in
14 cycles for a single multiplication: roughly a 3.3x increase in throughput.
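The same interleaving idea can be sketched in portable C. This is an illustrative sketch, not the guide's 3DNow! code: the cfloat type and the cmul4 function are hypothetical names, and whether the four products actually overlap depends on the compiler and the processor. The point is only that the four products in the loop body share no data, so none has to wait for another's result.

```c
#include <stddef.h>

/* Hypothetical type for illustration; the guide's code keeps each
   real/imaginary pair packed in a 64-bit MMX register instead. */
typedef struct { float re, im; } cfloat;

/* Multiply n complex pairs, four per iteration. All eight results are
   computed into independent temporaries before anything is stored, so
   no product depends on another and their latencies can overlap. */
void cmul4(const cfloat *a, const cfloat *b, cfloat *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float r0 = a[i].re     * b[i].re     - a[i].im     * b[i].im;
        float r1 = a[i + 1].re * b[i + 1].re - a[i + 1].im * b[i + 1].im;
        float r2 = a[i + 2].re * b[i + 2].re - a[i + 2].im * b[i + 2].im;
        float r3 = a[i + 3].re * b[i + 3].re - a[i + 3].im * b[i + 3].im;
        float m0 = a[i].re     * b[i].im     + a[i].im     * b[i].re;
        float m1 = a[i + 1].re * b[i + 1].im + a[i + 1].im * b[i + 1].re;
        float m2 = a[i + 2].re * b[i + 2].im + a[i + 2].im * b[i + 2].re;
        float m3 = a[i + 3].re * b[i + 3].im + a[i + 3].im * b[i + 3].re;
        out[i].re     = r0;  out[i].im     = m0;
        out[i + 1].re = r1;  out[i + 1].im = m1;
        out[i + 2].re = r2;  out[i + 2].im = m2;
        out[i + 3].re = r3;  out[i + 3].im = m3;
    }
    /* Remainder: fewer than four pairs left, handled one at a time. */
    for (; i < n; ++i) {
        float r = a[i].re * b[i].re - a[i].im * b[i].im;
        float m = a[i].re * b[i].im + a[i].im * b[i].re;
        out[i].re = r;
        out[i].im = m;
    }
}
```

Each complex product costs four multiplications and two additions or subtractions, which is where the count of six floating-point operations per product comes from.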
The last optimization in both implementations is the use of the MOVNTQ and MOVNTPS
instructions, nontemporal stores that stream data to main memory rather than placing it in the
cache. These instructions increase throughput to memory and make more efficient use of the
bandwidth provided by the processor and memory controller. Nontemporal writes, such as MOVNTQ,
MOVNTPS, and MOVNTDQ, should be used only on data that will not be accessed again in the near
future.
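As a rough illustration of a streaming store from C, the sketch below uses the SSE intrinsic _mm_stream_ps, which compilers translate to MOVNTPS. The function name stream_copy, the HAVE_SSE_STREAM macro, and the requirement that the element count be a multiple of four are assumptions made for this example. On targets without SSE it falls back to an ordinary copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>  /* _mm_load_ps, _mm_stream_ps, _mm_sfence */
#define HAVE_SSE_STREAM 1
#endif

/* Copy n floats to dst using nontemporal stores where available.
   Assumptions for this sketch: n is a multiple of 4, both pointers are
   16-byte aligned, and the caller will not read dst again soon. */
void stream_copy(float *dst, const float *src, size_t n)
{
    assert(n % 4 == 0);
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
#ifdef HAVE_SSE_STREAM
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(src + i);  /* ordinary aligned load */
        _mm_stream_ps(dst + i, v);        /* nontemporal store (MOVNTPS) */
    }
    /* Order the streamed stores before any later stores. */
    _mm_sfence();
#else
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];                  /* plain fallback without SSE */
#endif
}
```

A routine like this suits output buffers that are written once and consumed much later. If the destination is read again shortly afterward, ordinary stores are usually the better choice, since the streamed data will not be waiting in the cache.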