General Optimization Guidelines 2
Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not
put more than four branches in a 16-byte chunk. 2-22
Assembly/Compiler Coding Rule 11. (M impact, L generality) Do not
put more than two end loop branches in a 16-byte chunk. 2-22
Assembly/Compiler Coding Rule 12. (M impact, MH generality) If the
average number of total iterations is less than or equal to 100, use a
forward branch to exit the loop. 2-23
Assembly/Compiler Coding Rule 13. (H impact, M generality) Unroll
small loops until the overhead of the branch and the induction variable
generally accounts for less than about 10% of the execution time of the
loop. 2-27
Assembly/Compiler Coding Rule 14. (H impact, M generality) Avoid
unrolling loops excessively, as this may thrash the trace cache or
instruction cache. 2-27
Assembly/Compiler Coding Rule 15. (M impact, M generality) Unroll
loops that are frequently executed and that have a predictable number of
iterations to reduce the number of iterations to 16 or fewer, unless this
increases code size so that the working set no longer fits in the trace
cache. If the loop body contains more than one conditional branch, then
unroll so that the number of iterations is 16/(# conditional branches).
2-27
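
The unrolling rules above (13 through 15) can be illustrated with a minimal C sketch. The function names and the unroll factor of 4 are assumptions chosen for illustration; they do not come from the rules themselves, and the right factor for a given loop depends on measured branch overhead and on code size relative to the trace cache.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: one compare-and-branch and one index update per element. */
long sum_simple(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Unrolled by 4 (illustrative factor): one compare-and-branch and one
 * index update per four elements, so the loop runs about n/4 iterations. */
long sum_unrolled4(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long total = 0;
    size_t i = 0;

    /* Main body: handles four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }

    /* Remainder: at most three leftover elements. */
    for (; i < n; i++)
        total += a[i];

    return total;
}
```

Both functions return the same result for any input; only the ratio of loop overhead to useful work changes. An optimizing compiler may already unroll the simple version, so the manual form is worth keeping only if measurement shows a benefit.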
Assembly/Compiler Coding Rule 16. (H impact, H generality) Align
data on natural operand size address boundaries. If the data will be
accessed with vector instruction loads and stores, align the data on
16-byte boundaries. 2-30
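
One way to request 16-byte alignment from C is sketched below, assuming a C11 compiler that provides `alignas` through `<stdalign.h>`. The `vec4f` type and the helper names are hypothetical, introduced only for this example.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical 4-float vector type. alignas(16) on the member forces
 * every vec4f object onto a 16-byte boundary. */
typedef struct {
    alignas(16) float lane[4];
} vec4f;

/* Returns 1 if p lies on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Element-wise add. The assert documents the alignment that aligned
 * vector loads and stores would require of the operands. */
void vec4f_add(vec4f *dst, const vec4f *x, const vec4f *y)
{
    assert(is_aligned16(dst) && is_aligned16(x) && is_aligned16(y));
    for (int i = 0; i < 4; i++)
        dst->lane[i] = x->lane[i] + y->lane[i];
}
```

Putting the alignment requirement in the type means that stack, static, and array instances of `vec4f` all inherit it. Memory obtained from a general-purpose allocator may still need an aligned allocation call.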
Assembly/Compiler Coding Rule 17. (H impact, M generality) Pass
parameters in registers instead of on the stack where possible. Passing
arguments on the stack is a store followed by a reload. While this
sequence is optimized in IA-32 processors by providing the value to the
load directly from the memory order buffer without the need to access the
data cache, floating-point values incur a significant latency in forwarding.
Passing floating-point arguments in (preferably XMM) registers avoids
this long-latency operation. 2-33
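
In C the calling convention is normally set by the ABI, but some compilers let individual functions opt in to register passing. The sketch below assumes GCC targeting 32-bit x86 and uses its `regparm` and `sseregparm` function attributes; the macro and function names are hypothetical. On any other compiler or target the macros expand to nothing and the default convention applies.

```c
#include <assert.h>

/* Assumption: GCC on IA-32. regparm(3) asks for up to three integer
 * arguments in registers; sseregparm asks for floating-point arguments
 * in SSE (XMM) registers and is only enabled here when SSE2 is available. */
#if defined(__GNUC__) && defined(__i386__)
#define INT_REG_ARGS __attribute__((regparm(3)))
#else
#define INT_REG_ARGS
#endif

#if defined(__GNUC__) && defined(__i386__) && defined(__SSE2__)
#define FP_REG_ARGS __attribute__((sseregparm))
#else
#define FP_REG_ARGS
#endif

/* Three integer arguments: candidates for register passing. */
INT_REG_ARGS int clamp_int(int v, int lo, int hi)
{
    assert(lo <= hi);
    if (v < lo)
        return lo;
    if (v > hi)
        return hi;
    return v;
}

/* Three floating-point arguments: with sseregparm these avoid the
 * store-then-reload through the stack described in the rule. */
FP_REG_ARGS double scale_add(double x, double scale, double offset)
{
    return x * scale + offset;
}
```

Small functions like these are often inlined by the compiler, which removes the call and its argument passing entirely; the attributes matter mainly for calls that cross translation units or cannot be inlined.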