
148 Scheduling Optimizations Chapter 7

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

The unrolled loop consists of 10 instructions. Because the processor can decode and retire at most three instructions per cycle, this loop can run no faster than three iterations in 10 cycles. With two additions per iteration, that is equivalent to 6/10 of a floating-point add per cycle, or 1.4 times as fast as the original loop.
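The arithmetic behind this bound can be sketched in a few lines of C. This is only an illustration of the calculation above: the function name fadds_per_cycle_bound and its parameters are hypothetical, and the result is an upper bound from decode/retire bandwidth alone, ignoring latencies and memory effects.

```c
#include <assert.h>

/* Hypothetical helper: upper bound on floating-point adds per cycle
   imposed by decode/retire bandwidth alone. */
double fadds_per_cycle_bound(double decode_per_cycle,
                             int instrs_per_iter,
                             int fadds_per_iter)
{
    assert(instrs_per_iter > 0);
    /* Iterations per cycle = decode width / instructions per iteration. */
    double iters_per_cycle = decode_per_cycle / instrs_per_iter;
    /* Each iteration contributes fadds_per_iter additions. */
    return iters_per_cycle * fadds_per_iter;
}
```

Calling fadds_per_cycle_bound(3.0, 10, 2) corresponds to the unrolled loop discussed here: three instructions per cycle, 10 instructions and two additions per iteration.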

Deriving the Loop Control for Partially Unrolled Loops

A frequently used loop construct is a counting loop. In a typical case, the loop count starts at some lower bound (low), increases by some fixed, positive increment (inc) for each iteration of the loop, and may not exceed some upper bound (high):

for (k = low; k <= high; k += inc) {
    x[k] = ...
}
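To reason about how many iterations such a loop performs (and therefore how many are left over after unrolling), it can help to compute the trip count directly. The following is a minimal sketch; the helper name trip_count is hypothetical, and it assumes signed arithmetic that does not overflow and a strictly positive inc.

```c
#include <assert.h>

/* Hypothetical helper: number of iterations executed by
   for (k = low; k <= high; k += inc), assuming inc > 0. */
long trip_count(long low, long high, long inc)
{
    assert(inc > 0);
    if (low > high)
        return 0;               /* The loop body never runs. */
    return (high - low) / inc + 1;
}
```

With an unroll factor of factor, the unrolled loop body runs trip_count / factor times and the remaining trip_count % factor iterations are left for the end-case loop.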

The following code shows how to partially unroll such a loop by an unroll factor (factor) and how to derive the loop control for the partially unrolled version of the loop:

for (k = low; k <= (high - (factor - 1) * inc); k += factor * inc) {
    // Begin the series of unrolled statements.
    x[k + 0 * inc] = ...
    // Continue the series if the unrolling factor is greater than 2.
    x[k + 1 * inc] = ...
    x[k + 2 * inc] = ...
    ...
    // End the series.
    x[k + (factor - 1) * inc] = ...
}

// Handle the end cases. k keeps its value from the unrolled loop.
for (; k <= high; k += inc) {
    x[k] = ...
}
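As a concrete instance of the pattern above, the following sketch unrolls a simple counting loop by a factor of 4 and keeps a rolled version for comparison. The function names, the loop body (adding 1.0 to each element), and the choice of factor are assumptions made for illustration only. The index type is signed so that high - (FACTOR - 1) * inc can go negative without wrapping; with an unsigned index the bound would need extra care.

```c
#include <assert.h>

/* Rolled reference: visits k = low, low + inc, ... while k <= high. */
void add_one_rolled(double *x, long low, long high, long inc)
{
    assert(inc > 0);
    for (long k = low; k <= high; k += inc)
        x[k] += 1.0;
}

/* The same loop, partially unrolled by a factor of 4. */
void add_one_unrolled4(double *x, long low, long high, long inc)
{
    enum { FACTOR = 4 };
    long k;

    assert(inc > 0);
    /* Main loop: runs only while a full group of FACTOR elements remains. */
    for (k = low; k <= high - (FACTOR - 1) * inc; k += FACTOR * inc) {
        x[k + 0 * inc] += 1.0;
        x[k + 1 * inc] += 1.0;
        x[k + 2 * inc] += 1.0;
        x[k + 3 * inc] += 1.0;
    }
    /* End cases: at most FACTOR - 1 leftover iterations. */
    for (; k <= high; k += inc)
        x[k] += 1.0;
}
```

Both functions should touch exactly the same elements for any low, high, and positive inc, which makes the rolled version a convenient check when experimenting with other unroll factors.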

Related Information

For information on loop unrolling at the C-source level, see “Unrolling Small Loops” on page 13.

Decode-limited throughput of the unrolled loop:

$$\frac{3\ \text{instructions}}{\text{cycle}} \times \frac{\text{iteration}}{10\ \text{instructions}} \times \frac{2\ \text{FADDs}}{\text{iteration}} = \frac{6\ \text{FADDs}}{10\ \text{cycles}} = 0.600\ \text{FADDs/cycle}$$
