228 Optimizing with SIMD Instructions Chapter 9
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
Additionally, the examples that use SSE and 3DNow! instructions multiply four complex numbers
concurrently to break up register dependencies. Loads, multiplications, and additions do not
complete instantly; each has an associated latency. The following instructions:
movq    mm0, QWORD PTR [esi+ecx*8]  ; mm0 = [x0i,x0r]
pswapd  mm4, QWORD PTR [esi+ecx*8]  ; mm4 = [x0r,x0i]
pfmul   mm0, QWORD PTR [edi+ecx*8]  ; mm0 = [x0i*y0i,x0r*y0r]
pfmul   mm4, QWORD PTR [edi+ecx*8]  ; mm4 = [x0r*y0i,x0i*y0r]
pfpnacc mm0, mm4                    ; mm0 = [x0r*y0i+x0i*y0r,x0r*y0r-x0i*y0i]
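are dependent upon one another: each PFMUL consumes the register written by MOVQ or PSWAPD, and
PFPNACC consumes both products. The move from memory (MOVQ) requires 2 cycles, PSWAPD also
requires 2 cycles, each of the two PFMUL instructions requires 6 cycles, and PFPNACC requires 6 cycles.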
The instruction flow through the processor is illustrated on a clock-cycle basis, as follows:
Instruction 0     2     4     6     8     10    12    14
MOVQ        xxxxxx
PSWAPD         xxxxxx
PFMUL             xxxxxxxxxxxxxxxxxx
PFMUL                xxxxxxxxxxxxxxxxxx
PFPNACC                                xxxxxxxxxxxxxxxxxxx
and takes 15 cycles to finish. During these 15 cycles, the processor can perform 60 single-precision
floating-point operations, but it performs only six (four multiplications, one addition, and one
subtraction). Most of the time is spent waiting for earlier instructions to complete so that their
results are available as operands to later instructions. By
unrolling the multiplication, working with four complex numbers per clock, there are enough