Chapter 9 Optimizing with SIMD Instructions 229

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

instructions that are not dependent on previous or presently executing operations so that the processor can mask the execution latency by keeping itself busy, as illustrated below:

[Figure: execution timeline, cycles 0 to 18 on the horizontal axis. Instructions in issue order: four MOVQ, four PSWAPD, eight PFMUL, four PFPNACC, with their execution overlapped.]

Multiplying four complex single-precision numbers takes only 17 cycles, as opposed to 14 cycles to multiply one complex single-precision number. The floating-point pipes are kept busy by feeding new instructions into the floating-point pipeline each cycle. In the arrangement above, 24 floating-point operations are performed in 17 cycles, roughly a 3.3x increase in throughput over the single multiply (4 x 14 / 17 is about 3.3).
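The data flow behind these counts can be sketched in plain C. This is only an illustration, not the guide's 3DNow! code: the type `cplx` and the function `cmul4` are hypothetical names, the mapping of stages to PFMUL and PFPNACC is approximate, and a C compiler schedules instructions on its own. The point it shows is that the four products are independent, so every stage can work on all four before the next stage begins.

```c
#include <stddef.h>

/* Hypothetical layout: one single-precision complex number. */
typedef struct { float re, im; } cplx;

/* Multiply four independent pairs: out[i] = x[i] * y[i].
 * Each stage touches all four pairs before the next stage starts, so no
 * operation has to wait on the result computed immediately before it. */
void cmul4(const cplx x[4], const cplx y[4], cplx out[4])
{
    float rr[4], ii[4], ri[4], ir[4];

    /* Stage 1: 16 multiplies (the PFMUL step; each PFMUL forms two
     * products, hence eight PFMULs in the 3DNow! version). */
    for (size_t i = 0; i < 4; i++) {
        rr[i] = x[i].re * y[i].re;
        ii[i] = x[i].im * y[i].im;
        ri[i] = x[i].re * y[i].im;
        ir[i] = x[i].im * y[i].re;
    }

    /* Stage 2: one subtract and one add per product (the PFPNACC step).
     * 16 multiplies + 8 add/subtracts = 24 floating-point operations. */
    for (size_t i = 0; i < 4; i++) {
        out[i].re = rr[i] - ii[i];
        out[i].im = ri[i] + ir[i];
    }
}
```

For example, passing (1 + 2i) and (5 + 6i) as the first pair yields -7 + 16i in `out[0]`.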

The last optimization in both implementations is the use of the MOVNTQ and MOVNTPS instructions, nontemporal writes that stream data to main memory. These instructions increase throughput to memory and make more efficient use of the bandwidth provided by the processor and memory controller. Nontemporal writes, such as MOVNTQ, MOVNTPS, and MOVNTDQ, should be used only on data that is not going to be accessed again in the near future.
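From C, a nontemporal store of packed single-precision data is usually reached through the SSE intrinsic `_mm_stream_ps`, which compilers typically emit as MOVNTPS. The sketch below is a minimal example under stated assumptions, not the guide's implementation: the function name `scale_stream` is hypothetical, both buffers are assumed to be 16-byte aligned, and the element count is assumed to be a multiple of four.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_mul_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical helper: dst[i] = src[i] * k, written with nontemporal stores.
 * Assumes dst and src are 16-byte aligned and n is a multiple of 4. */
void scale_stream(float *dst, const float *src, size_t n, float k)
{
    assert(((uintptr_t)dst % 16) == 0 && ((uintptr_t)src % 16) == 0);
    assert(n % 4 == 0);

    const __m128 vk = _mm_set1_ps(k);      /* broadcast k to all four lanes */
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_mul_ps(_mm_load_ps(src + i), vk);
        _mm_stream_ps(dst + i, v);         /* nontemporal store, typically MOVNTPS */
    }

    /* Nontemporal stores are weakly ordered; the fence makes them globally
     * visible before any store that follows. */
    _mm_sfence();
}
```

This form suits output buffers that are written once and read much later, in line with the restriction above; for data that is reused soon afterward, an ordinary store such as `_mm_store_ps` is the more likely choice.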
