Intel IA-32 Computer Accessories User Manual


 
The Pentium 4 processor can correctly predict the exit branch for an
inner loop that has 16 or fewer iterations, if that number of iterations
is predictable and there are no conditional branches in the loop.
Therefore, if the loop body size is not excessive, and the probable
number of iterations is known, unroll inner loops until they have a
maximum of 16 iterations. With the Pentium M processor, do not
unroll loops that have more than 64 iterations.
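As a minimal sketch of the guideline above, assume a hypothetical inner loop with a fixed, known trip count of 64. Unrolling it by a factor of 4 leaves 16 iterations of the loop branch, which is the count the text describes as predictable on the Pentium 4 processor. The names TRIP_COUNT, sum_rolled, and sum_unrolled4 are illustrative and do not come from the manual. A compiler may already unroll such a loop on its own.

```c
#include <stddef.h>

/* Hypothetical fixed trip count, chosen only for illustration. */
enum { TRIP_COUNT = 64 };

/* Rolled form: the loop branch executes 64 times per call. */
int sum_rolled(const int *a)
{
    int sum = 0;
    for (size_t i = 0; i < TRIP_COUNT; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by 4: the loop branch executes 64 / 4 = 16 times per call.
 * TRIP_COUNT is a multiple of 4, so no remainder loop is needed here. */
int sum_unrolled4(const int *a)
{
    int sum = 0;
    for (size_t i = 0; i < TRIP_COUNT; i += 4) {
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    return sum;
}
```

Both functions return the same result; only the number of times the loop-closing branch executes differs.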
The potential costs of unrolling loops are:
• Excessive unrolling, or unrolling of very large loops, can lead to
  increased code size. This can be harmful if the unrolled loop no
  longer fits in the trace cache (TC).
• Unrolling loops whose bodies contain branches increases demands
  on the BTB capacity. If the number of iterations of the unrolled loop
  is 16 or fewer, the branch predictor should be able to correctly predict
  branches in the loop body that alternate direction.
Assembly/Compiler Coding Rule 13. (H impact, M generality) Unroll small
loops until the overhead of the branch and the induction variable accounts,
generally, for less than about 10% of the execution time of the loop.
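The 10% threshold in Rule 13 can be estimated with a rough model. The sketch below assumes, purely for illustration, that execution time is proportional to a per-iteration cost: body_cost for the useful work of one original iteration and overhead_cost for the induction-variable update, compare, and branch. With an unroll factor u, the overhead fraction is overhead_cost / (u * body_cost + overhead_cost). The function name and the cost model are assumptions, not part of the manual, and measured timings should take precedence over this estimate.

```c
/* Returns the smallest unroll factor u for which
 *   overhead_cost / (u * body_cost + overhead_cost) < 10%
 * under the simple cost model described above.
 * Returns 0 if body_cost is 0, because the model is undefined there. */
unsigned min_unroll_factor(unsigned body_cost, unsigned overhead_cost)
{
    if (body_cost == 0)
        return 0;

    unsigned u = 1;
    /* Overhead is still >= 10% while 10 * overhead >= total cost. */
    while (10u * overhead_cost >= u * body_cost + overhead_cost)
        u++;
    return u;
}
```

For example, a body costing 2 units with 3 units of loop overhead would need an unroll factor of 14 under this model, while a body costing 30 units already meets the threshold without unrolling.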
Assembly/Compiler Coding Rule 14. (H impact, M generality) Avoid
unrolling loops excessively, as this may thrash the trace cache or instruction
cache.
Assembly/Compiler Coding Rule 15. (M impact, M generality) Unroll
loops that are frequently executed and that have a predictable number of
iterations to reduce the number of iterations to 16 or fewer, unless this
increases code size so that the working set no longer fits in the trace cache or
instruction cache. If the loop body contains more than one conditional branch,
then unroll so that the number of iterations is 16/(# conditional branches).
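The arithmetic in Rule 15 can be sketched as follows. The helper names are hypothetical. The sketch assumes integer division for 16/(# conditional branches), clamps the target to at least one iteration, and rounds the unroll factor up so that the resulting iteration count does not exceed the target. The rule does not say how to treat trip counts that are not a multiple of the unroll factor, nor whether the loop-closing branch counts as one of the conditional branches, so both points are left to the caller.

```c
/* Target iteration count from Rule 15: 16 for loops with at most one
 * conditional branch, otherwise 16 / (# conditional branches).
 * Integer division and the clamp to 1 are assumptions of this sketch. */
unsigned target_iterations(unsigned cond_branches)
{
    if (cond_branches <= 1)
        return 16;
    unsigned target = 16 / cond_branches;
    return target == 0 ? 1 : target;
}

/* Unroll factor needed to bring trip_count down to the target,
 * rounded up; returns 1 when the loop is already short enough. */
unsigned unroll_factor(unsigned trip_count, unsigned cond_branches)
{
    unsigned target = target_iterations(cond_branches);
    if (trip_count <= target)
        return 1;
    return (trip_count + target - 1) / target;
}
```

For a 64-iteration loop, this gives an unroll factor of 4 with one conditional branch and 8 with two. Whether the larger factor is worthwhile still depends on the code-size limit stated in the rule.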
Example 2-10 shows how unrolling enables other optimizations.