146 Scheduling Optimizations Chapter 7
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
Complete Loop Unrolling
Complete loop unrolling eliminates loop overhead by replacing the loop with copies of
the loop body.
Because complete loop unrolling removes the loop counter, it also reduces register pressure.
However, completely unrolling very large loops can result in the inefficient use of the L1 instruction
cache.
Example: Complete Loop Unrolling
In the following C code, the number of loop iterations is known at compile time and the loop body is
fewer than 100 instructions:
#define ARRAY_LENGTH 3
int sum, i, a[ARRAY_LENGTH];
...
sum = 0;
for (i = 0; i < ARRAY_LENGTH; i++) {
    sum = sum + a[i];
}
To completely unroll an n-iteration loop, remove the loop control and replicate the loop body n times:
sum = 0;
sum = sum + a[0];
sum = sum + a[1];
sum = sum + a[2];
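The two forms above can be wrapped in functions and compared directly. The following is a minimal sketch, not part of the original example: the function names sum_rolled, sum_unrolled, and check_unrolled_matches_rolled are hypothetical, and the unrolled form is only valid while ARRAY_LENGTH stays at 3.

```c
#include <assert.h>

#define ARRAY_LENGTH 3

#if ARRAY_LENGTH != 3
#error "sum_unrolled is written out by hand for exactly 3 elements"
#endif

/* Rolled form: as written, each iteration updates and tests the counter i. */
int sum_rolled(const int a[ARRAY_LENGTH])
{
    int sum = 0;
    int i;

    for (i = 0; i < ARRAY_LENGTH; i++) {
        sum = sum + a[i];
    }
    return sum;
}

/* Completely unrolled form: the loop control is gone and the body
 * appears once per iteration, with the index written as a constant. */
int sum_unrolled(const int a[ARRAY_LENGTH])
{
    int sum = 0;

    sum = sum + a[0];
    sum = sum + a[1];
    sum = sum + a[2];
    return sum;
}

/* Hypothetical helper: asserts that both forms produce the same sum. */
void check_unrolled_matches_rolled(const int a[ARRAY_LENGTH])
{
    assert(sum_rolled(a) == sum_unrolled(a));
}
```

Calling check_unrolled_matches_rolled on any three-element array should pass silently, since unrolling changes only the loop control and not the arithmetic.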
Partial Loop Unrolling
Partial loop unrolling reduces the loop overhead by duplicating the loop body several times, changing
the increment in the loop, and adding cleanup code to execute any leftover iterations of the loop. The
number of times the loop body is duplicated is known as the unroll factor.
However, partial loop unrolling may increase register pressure, because the duplicated loop bodies can
keep more values live at the same time.
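The three steps named above (duplicate the body, change the increment, add cleanup code) can be sketched on a simple integer sum. This is an illustrative sketch under stated assumptions, not code from this guide: the unroll factor of 4 is chosen arbitrarily, and the names sum_reference, sum_unrolled_by_4, and check_sums_agree are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reference form: one element per iteration. */
int sum_reference(const int *a, size_t n)
{
    int sum = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}

/* Partially unrolled form with an (assumed) unroll factor of 4. */
int sum_unrolled_by_4(const int *a, size_t n)
{
    int sum = 0;
    size_t i = 0;

    /* Main loop: the body is duplicated 4 times and the increment
     * changes from 1 to 4, so the counter update and branch are
     * paid once per 4 elements. */
    for (; i + 4 <= n; i += 4) {
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }

    /* Cleanup loop: runs the 0 to 3 leftover iterations when n is
     * not a multiple of the unroll factor. */
    for (; i < n; i++) {
        sum += a[i];
    }
    return sum;
}

/* Hypothetical helper: asserts that both forms produce the same sum. */
void check_sums_agree(const int *a, size_t n)
{
    assert(sum_unrolled_by_4(a, n) == sum_reference(a, n));
}
```

For a 7-element array, the main loop runs once (elements 0 through 3) and the cleanup loop handles the remaining three elements. Without the cleanup loop, those trailing elements would be skipped whenever the length is not a multiple of the unroll factor.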
Example: Partial Loop Unrolling
In the following C code, each element of one array is added to the corresponding element of another
array:
double a[MAX_LENGTH], b[MAX_LENGTH];
for (i = 0; i < MAX_LENGTH; i++) {
    a[i] = a[i] + b[i];
}