Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
Chapter 7 Scheduling Optimizations
Without loop unrolling, this is the equivalent assembly-language code:
mov ecx, MAX_LENGTH ; Initialize counter.
mov eax, OFFSET a ; Load address of array a into EAX.
mov ebx, OFFSET b ; Load address of array b into EBX.
add_loop:
fld QWORD PTR [eax] ; Push object pointed to by EAX onto the FP stack.
fadd QWORD PTR [ebx] ; Add object pointed to by EBX to ST(0).
fstp QWORD PTR [eax] ; Copy ST(0) to object pointed to by EAX; pop ST(0).
add eax, 8 ; Point to next element of array a.
add ebx, 8 ; Point to next element of array b.
dec ecx ; Decrement counter.
jnz add_loop ; If elements remain, then jump.
The rolled loop consists of seven instructions. AMD Athlon 64 and AMD Opteron processors can
decode and retire as many as three instructions per cycle, so the loop cannot execute faster than three
iterations in seven cycles (3/7 of a floating-point add per cycle). However, the pipelined floating-point
adder can accept one add every cycle, so the rolled loop leaves more than half of the adder's
throughput unused.
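The arithmetic behind this bound can be sketched in C. This is only an illustrative helper, not code from the guide: the function name fadd_rate_limit and its parameter names are assumptions made for this example, and the model considers the decode/retire width alone, ignoring every other pipeline limit.

```c
/* Upper bound on floating-point adds per cycle when the only limit
 * considered is how many instructions the processor can decode and
 * retire per cycle. The function and parameter names are hypothetical,
 * chosen for this sketch. */
double fadd_rate_limit(int instrs_per_cycle,   /* decode/retire width   */
                       int instrs_per_iter,    /* loop body length      */
                       int fadds_per_iter)     /* FADDs in one loop pass */
{
    /* (instructions/cycle) * (FADDs/iteration) / (instructions/iteration) */
    return (double)instrs_per_cycle * fadds_per_iter / instrs_per_iter;
}
```

Calling fadd_rate_limit(3, 7, 1) gives the 3/7 figure for the rolled loop. Applying the same count to a loop body of ten instructions containing two FADDs, as in a loop unrolled by two, gives fadd_rate_limit(3, 10, 2), or 0.6 adds per cycle under this simplified model.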
Partially unrolling the loop with an unroll factor of two creates a potential end case (an odd
number of elements) that must be handled outside the loop:
mov ecx, MAX_LENGTH ; Initialize counter.
mov eax, OFFSET a ; Load address of array a into EAX.
mov ebx, OFFSET b ; Load address of array b into EBX.
shr ecx, 1 ; Divide counter by 2 (the unroll factor).
jnc add_loop ; If original counter was even, then jump.
; Handle the end case.
fld QWORD PTR [eax] ; Push object pointed to by EAX onto the FP stack.
fadd QWORD PTR [ebx] ; Add object pointed to by EBX to ST(0).
fstp QWORD PTR [eax] ; Copy ST(0) to object pointed to by EAX; pop ST(0).
add eax, 8 ; Point to next element of array a.
add ebx, 8 ; Point to next element of array b.
add_loop:
fld QWORD PTR [eax] ; Push object pointed to by EAX onto the FP stack.
fadd QWORD PTR [ebx] ; Add object pointed to by EBX to ST(0).
fstp QWORD PTR [eax] ; Copy ST(0) to object pointed to by EAX; pop ST(0).
fld QWORD PTR [eax+8] ; Repeat for next element.
fadd QWORD PTR [ebx+8]
fstp QWORD PTR [eax+8]
add eax, 16 ; Point to next element of array a.
add ebx, 16 ; Point to next element of array b.
dec ecx ; Decrement counter.
jnz add_loop ; If elements remain, then jump.
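The structure of the unrolled assembly can be mirrored in C as a rough sketch. The function name add_arrays_unrolled2 and its signature are assumptions made for this example. One deliberate difference: the sketch tests the pair count before entering the loop, so it also tolerates counts of zero and one, whereas the assembly listing above enters add_loop unconditionally.

```c
#include <stddef.h>

/* Adds b[i] to a[i] for n elements, unrolled by a factor of two.
 * Hypothetical helper written for illustration only. */
void add_arrays_unrolled2(double *a, const double *b, size_t n)
{
    size_t pairs = n / 2;        /* corresponds to: shr ecx, 1 */

    if (n % 2 != 0) {            /* odd count: handle the end case first */
        a[0] += b[0];
        a += 1;                  /* point to next element of a */
        b += 1;                  /* point to next element of b */
    }

    while (pairs > 0) {          /* two elements per pass */
        a[0] += b[0];
        a[1] += b[1];
        a += 2;
        b += 2;
        pairs--;
    }
}
```

An optimizing compiler may unroll a simple loop like this on its own, so a hand-unrolled version is mainly useful for seeing how the end case and the halved trip count fit together.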
The decode-limited throughput of the rolled loop works out as follows:

$$\frac{3\ \text{instructions}}{\text{cycle}} \times \frac{1\ \text{iteration}}{7\ \text{instructions}} \times \frac{1\ \text{FADD}}{\text{iteration}} = \frac{3\ \text{FADDs}}{7\ \text{cycles}} \approx 0.429\ \text{FADDs/cycle}$$