Intel IA-32 Computer Accessories User Manual


 
General Optimization Guidelines 2
Avoid longer-latency instructions such as integer multiplies and
divides. Replace them with alternate code sequences (e.g., use shifts
instead of multiplies).
Use the lea instruction and the full range of addressing modes to do
address calculation.
Some types of stores use more µops than others; try to use simpler
store variants and/or reduce the number of stores.
Avoid use of complex instructions that require more than 4 µops.
Avoid instructions that unnecessarily introduce dependence-related
stalls, such as inc and dec instructions and partial register
operations (8/16-bit operands).
Avoid using ah, bh, and the other high 8-bit halves of the 16-bit
registers, because accessing them requires a shift operation
internally.
Use xor and pxor instructions to clear registers and break
dependencies for integer operations; also use xorps and xorpd to
clear XMM registers for floating-point operations.
Use efficient approaches for performing comparisons.
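
The advice above on replacing multiplies and divides with shifts, and
on using lea-style address arithmetic, can be sketched in C. This is an
illustration only, not code from the manual: the function names are
hypothetical, unsigned 32-bit operands are assumed, and an optimizing
compiler will usually make these substitutions on its own.

```c
#include <stdint.h>

/* Hypothetical helpers illustrating strength reduction.
 * All arithmetic is unsigned, so results wrap modulo 2^32 exactly
 * as the corresponding multiply would. */

/* x * 8 expressed as a single left shift. */
static uint32_t times8(uint32_t x)
{
    return x << 3;
}

/* x * 5 expressed as x + x*4. This has the base + index*scale shape
 * that one lea can compute, e.g. lea eax, [eax + eax*4]. */
static uint32_t times5(uint32_t x)
{
    return x + (x << 2);
}

/* x * 10 expressed as x*8 + x*2 (two shifts and an add). */
static uint32_t times10(uint32_t x)
{
    return (x << 3) + (x << 1);
}

/* Unsigned x / 16 expressed as a right shift. Signed division by a
 * power of two needs extra care with rounding and is not shown. */
static uint32_t div16(uint32_t x)
{
    return x >> 4;
}
```

For example, times10(7) yields 70 and div16(100) yields 6, the same
values the multiply and divide would produce.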
Optimize Instruction Scheduling
Consider latencies and resource constraints.
Calculate store addresses as early as possible.
Enable Vectorization
Use the smallest possible data type. This enables more parallelism
because a vector of a given width holds more elements of a smaller
type.
Arrange the nesting of loops so the innermost nesting level is free of
inter-iteration dependencies. It is especially important to avoid the
case where the store of data in an earlier iteration happens lexically
after the load of that data in a later iteration (called a
lexically-backward dependence).
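
A lexically-backward dependence, and one way to remove it, can be
sketched in C as follows. This is an illustrative example rather than
code from the manual: the function and array names are hypothetical,
and the arrays are assumed not to overlap. In the first loop, the store
to a[i + 1] appears after the load of a[i] in the loop body, yet the
next iteration loads the value just stored. Splitting the loop in two
(loop distribution) leaves each loop free of inter-iteration
dependencies.

```c
#include <stddef.h>

/* Loop with a lexically-backward dependence: the store to a[i + 1]
 * in iteration i is read by the load of a[i] in iteration i + 1,
 * and the store is written lexically after that load. */
static void backward_dep(float *a, float *b, const float *c, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        b[i] = a[i] + 1.0f;      /* load of a[i] */
        a[i + 1] = c[i] * 2.0f;  /* store consumed by the next iteration */
    }
}

/* Same result, assuming a, b, and c do not overlap: all stores to a
 * are done first, then all loads. Neither loop carries a dependence
 * from one iteration to the next. */
static void distributed(float *a, float *b, const float *c, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)
        a[i + 1] = c[i] * 2.0f;

    for (size_t i = 0; i + 1 < n; i++)
        b[i] = a[i] + 1.0f;
}
```

Whether a given compiler actually vectorizes either form depends on
the compiler and its options; the sketch only shows the dependence
pattern the guideline describes.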