AMD 250 Computer Hardware User Manual

228 Optimizing with SIMD Instructions Chapter 9

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Additionally, four complex numbers are concurrently multiplied in the examples using SSE and

3DNow! instructions to break up register dependencies. Loads, multiplications, and additions do not

execute with zero delay, but have a latency associated with them. The following instructions:

movq mm0, QWORD PTR [esi+ecx*8] ; mm0 = [x0i,x0r]

pswapd mm4, QWORD PTR [esi+ecx*8] ; mm4 = [x0r,x0i]

pfmul mm0, QWORD PTR [edi+ecx*8] ; mm0 = [x0i*y0i,x0r*y0r]

pfmul mm4, QWORD PTR [edi+ecx*8] ; mm4 = [x0r*y0i,x0i*y0r]

pfpnacc mm0, mm4 ; mm0 = [x0r*y0i+x0i*y0r,x0r*y0r-x0i*y0i]

are dependent upon one another. The load from memory (MOVQ) requires 2 cycles, PSWAPD also

requires 2 cycles, each of the two PFMUL instructions requires 6 cycles, and PFPNACC requires 6 cycles.
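The dependency chain can be sketched in plain C. This is an illustrative model, not code from the guide: the names mm_t, insn_t, complex_mul_model, and chain_finish_cycle are hypothetical, and the scheduling rule (one instruction issued per cycle, in program order, each starting once its operands are ready) is a simplifying assumption rather than a description of the actual pipeline.

```c
/* Hypothetical model of a 64-bit MMX register holding two floats:
   lo = bits 31:0 (real part), hi = bits 63:32 (imaginary part). */
typedef struct { float lo, hi; } mm_t;

/* PSWAPD: swap the two 32-bit halves. */
static mm_t pswapd(mm_t s) { mm_t r = { s.hi, s.lo }; return r; }

/* PFMUL: element-wise single-precision multiply. */
static mm_t pfmul(mm_t a, mm_t b) { mm_t r = { a.lo * b.lo, a.hi * b.hi }; return r; }

/* PFPNACC: low result = dest.lo - dest.hi, high result = src.lo + src.hi. */
static mm_t pfpnacc(mm_t d, mm_t s) { mm_t r = { d.lo - d.hi, s.lo + s.hi }; return r; }

/* Data flow of the five instructions: returns x*y as {lo = real, hi = imag}. */
static mm_t complex_mul_model(mm_t x, mm_t y) {
    mm_t mm0 = x;             /* movq    mm0 = [xi, xr]               */
    mm_t mm4 = pswapd(x);     /* pswapd  mm4 = [xr, xi]               */
    mm0 = pfmul(mm0, y);      /* pfmul   mm0 = [xi*yi, xr*yr]         */
    mm4 = pfmul(mm4, y);      /* pfmul   mm4 = [xr*yi, xi*yr]         */
    return pfpnacc(mm0, mm4); /* [xr*yi + xi*yr, xr*yr - xi*yi]       */
}

/* One instruction: its latency and up to two producers (index, or -1). */
typedef struct { int latency; int dep_a; int dep_b; } insn_t;

/* Fills finish[i] with the cycle at which instruction i completes and
   returns the cycle at which the whole sequence completes.
   ASSUMPTION: instruction i cannot start before cycle i (one issue per
   cycle, in order) and not before its producers have finished. */
static int chain_finish_cycle(const insn_t *insn, int n, int *finish) {
    int last = 0;
    for (int i = 0; i < n; i++) {
        int start = i;
        if (insn[i].dep_a >= 0 && finish[insn[i].dep_a] > start)
            start = finish[insn[i].dep_a];
        if (insn[i].dep_b >= 0 && finish[insn[i].dep_b] > start)
            start = finish[insn[i].dep_b];
        finish[i] = start + insn[i].latency;
        if (finish[i] > last)
            last = finish[i];
    }
    return last;
}
```

With x = 1 + 2i and y = 3 + 4i, complex_mul_model returns -5 + 10i. Feeding chain_finish_cycle the latencies 2, 2, 6, 6, 6, with each PFMUL depending on one of the loads and PFPNACC depending on both PFMULs, gives a completion time of 15 cycles under the stated assumption, which agrees with the total quoted in this section.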

The instruction flow through the processor is illustrated on a clock-cycle basis, as follows:

Instruction  0     2     4     6     8     10    12    14
MOVQ         xxxxxx
PSWAPD          xxxxxx
PFMUL              xxxxxxxxxxxxxxxxxx
PFMUL                 xxxxxxxxxxxxxxxxxx
PFPNACC                                 xxxxxxxxxxxxxxxxxx

and takes 15 cycles to finish. During these 15 cycles, the processor has the ability to perform 60 single-precision

floating-point operations (four per cycle), of which it performs only six: four multiplications, one addition, and one subtraction.

Most of the time is spent waiting for previous instructions to complete so that the operands of later instructions are available. By

unrolling the multiplication, working with four complex numbers per clock, there are enough