General Optimization Guidelines 2
Using memory as a destination operand may further reduce register
pressure, at the slight risk of making trace cache packing more difficult.
On the Pentium 4 processor, the sequence of loading a value from
memory into a register and then adding that register to memory is
faster than the alternate sequence of adding a value from memory to a
register and then storing that register to memory. The first
sequence also uses one fewer μop than the second.
Assembly/Compiler Coding Rule 59. (ML impact, M generality) Give
preference to adding a register to memory (memory is the destination) instead
of adding memory to a register. Also, give preference to adding a register to
memory over loading from memory, adding two registers, and storing the result.
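The two forms named in Rule 59 can be sketched at the source level as
follows. This is a minimal illustration, not part of the original rule: the
function names and the counts/keys arrays are hypothetical, and the
instruction shown in the comment is only what a compiler targeting IA-32
might emit. An optimizing compiler is free to select the same instructions
for both functions, so the listing shows the shape of the computation rather
than guaranteed code generation.

```c
#include <stddef.h>

/* Memory-destination form: each update is a single read-modify-write of
   counts[k]. On IA-32 this can map to one instruction with memory as the
   destination, roughly: add dword ptr [ebx+eax*4], ecx */
void accumulate_in_memory(int *counts, const unsigned char *keys,
                          size_t n, int delta)
{
    for (size_t i = 0; i < n; i++) {
        counts[keys[i]] += delta;
    }
}

/* Explicit load / add / store form: the same update written as the
   three-step sequence that Rule 59 recommends against. */
void accumulate_via_register(int *counts, const unsigned char *keys,
                             size_t n, int delta)
{
    for (size_t i = 0; i < n; i++) {
        int tmp = counts[keys[i]];   /* load from memory into a register */
        tmp = tmp + delta;           /* add two registers */
        counts[keys[i]] = tmp;       /* store the result back to memory */
    }
}
```

Both functions compute the same result. When writing assembly by hand, or
when inspecting compiler output, the first shape is the one to look for.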
Assembly/Compiler Coding Rule 60. (M impact, M generality) When the
address of a store is unknown, subsequent loads cannot be scheduled to
execute out of order ahead of the store, which limits the out-of-order
execution of the processor. When the address of a store is computed by a
potentially long-latency operation (such as a load that might miss the data
cache), attempt to reorder subsequent loads ahead of the store.
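A source-level sketch of the reordering described in Rule 60 is given below.
It rests on several assumptions that are not in the original text: the
function and parameter names are hypothetical, and the restrict qualifiers
assert that the four arrays do not overlap, which is what makes it legal to
move the load of src[i] ahead of the store. A compiler may already perform
this reordering itself; in hand-written assembly the order shown is the
order of the instructions.

```c
#include <stddef.h>

/* Baseline: the store address table[index[i]] depends on a load of
   index[i], which might miss the data cache. The load of src[i] follows
   the store in program order. */
void update_baseline(int *restrict table, const int *restrict index,
                     const int *restrict src, int *restrict dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        table[index[i]] = 1;   /* store through a loaded index */
        dst[i] = src[i] * 2;   /* load placed after that store */
    }
}

/* Reordered: the independent load of src[i] is placed ahead of the store,
   so it does not have to wait for the store address to be resolved. */
void update_reordered(int *restrict table, const int *restrict index,
                      const int *restrict src, int *restrict dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int v = src[i];        /* load hoisted ahead of the store */
        table[index[i]] = 1;   /* store whose address comes from a load */
        dst[i] = v * 2;
    }
}
```

The two functions produce identical results only because the arrays are
distinct. If the load could alias the store target, the load must stay
behind the store.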
Instruction Scheduling
Ideally, scheduling or pipelining should be done in a way that optimizes
performance across all processor generations. This section presents
scheduling rules that can improve the performance of your code on the
Pentium 4 processor.
Latencies and Resource Constraints
Assembly/Compiler Coding Rule 61. (M impact, MH generality) Calculate
store addresses as early as possible to avoid having stores block loads.
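Rule 61 can be illustrated with a store whose address takes several
dependent arithmetic steps to compute. The listing below is a sketch under
stated assumptions: slot_of, fill_late, and fill_early are hypothetical
names, the mixing constants in slot_of are arbitrary, nslots is assumed to
be nonzero, and the restrict qualifiers assume the arrays do not overlap. As
with any source-level ordering, a compiler or the out-of-order core may
already schedule the work this way.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address calculation: a short chain of dependent operations
   standing in for any multi-instruction store-address computation. */
static size_t slot_of(uint32_t key, size_t nslots)
{
    key ^= key >> 16;
    key *= 0x45d9f3bu;
    key ^= key >> 16;
    return (size_t)key % nslots;
}

/* Late form: the store address is the last value computed. */
void fill_late(uint32_t *restrict slots, size_t nslots,
               const uint32_t *restrict keys,
               const uint32_t *restrict payload, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t v = payload[i] + 1u;
        slots[slot_of(keys[i], nslots)] = v;
    }
}

/* Early form: the store address is computed first, so it is known before
   the loads that feed the stored value are issued. */
void fill_early(uint32_t *restrict slots, size_t nslots,
                const uint32_t *restrict keys,
                const uint32_t *restrict payload, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t *dst = &slots[slot_of(keys[i], nslots)]; /* address first */
        uint32_t v = payload[i] + 1u;                     /* then the data */
        *dst = v;
    }
}
```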
Example 2-25 Recombining LOAD/OP Code into REG,MEM Form
LOAD reg1, mem1
... code that does not write to reg1 ...
OP reg2, reg1
... code that does not use reg1 ...