General Optimization Guidelines 2
When moving data that is smaller than 64 bits between memory
locations, 64-bit or 128-bit SIMD register moves are more efficient (if
aligned) and can be used to avoid unaligned loads. Although
floating-point registers allow 64 bits to be moved at a time,
floating-point instructions should not be used for this purpose, because
the data may be inadvertently modified.
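The point about inadvertent modification can be sketched in C. This is a minimal illustration, not code from this manual: the function name copy_qword_bits is hypothetical, and the choice of instructions is left to the compiler. The idea is to move the 8 bytes as an integer bit pattern, so that no floating-point load ever interprets them. One known case where an x87 load changes the data is a signaling NaN, which is converted to a quiet NaN when the invalid-operation exception is masked.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy 8 bytes as an opaque bit pattern.
   The value is never treated as a double, so a floating-point
   load cannot alter it (for example, by quieting a signaling NaN). */
void copy_qword_bits(void *dst, const void *src)
{
    uint64_t bits;
    memcpy(&bits, src, sizeof bits);   /* 64-bit integer load  */
    memcpy(dst, &bits, sizeof bits);   /* 64-bit integer store */
}
```

Compilers typically reduce each fixed-size memcpy to a single integer or SIMD move, but that is an assumption about the toolchain, not a guarantee.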
As an additional example, consider the cases in Example 2-16. In the
first case (A), a large load follows a series of small stores to the
same area of memory (beginning at memory address mem). The large
load will stall.
The fld must wait for the stores to write to memory before it can
access all the data it requires. This stall can also occur with other data
types (for example, when bytes or words are stored and then words or
doublewords are read from the same area of memory).
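One way to sidestep the large load stall is to combine the small pieces in a register first, so that memory only ever sees a value of the final size. The C sketch below contrasts the two patterns. It is an illustration under stated assumptions: the function names are hypothetical, a little-endian target with 64-bit IEEE doubles is assumed, and an optimizing compiler is free to emit different instructions than the source suggests.

```c
#include <stdint.h>
#include <string.h>

/* Case A pattern: two dword stores, then one qword load that
   spans both of them. Assumes a little-endian target. */
double load_after_split_stores(uint32_t lo, uint32_t hi)
{
    unsigned char mem[8];
    double d;
    memcpy(mem, &lo, 4);        /* store dword to "mem"     */
    memcpy(mem + 4, &hi, 4);    /* store dword to "mem + 4" */
    memcpy(&d, mem, 8);         /* load qword from "mem"    */
    return d;
}

/* Alternative: merge the halves in a register, so the only
   memory-sized object is the complete 64-bit value. */
double load_after_single_store(uint32_t lo, uint32_t hi)
{
    uint64_t bits = ((uint64_t)hi << 32) | lo;
    double d;
    memcpy(&d, &bits, sizeof d);   /* same-size copy of the qword */
    return d;
}
```

Both functions return the same value on a little-endian machine. The second one simply never creates a large load that depends on several smaller stores.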
In the second case (Example 2-16, B), a series of small loads follows
a large store to the same area of memory (beginning at memory
address mem). The small loads will stall.
The word loads must wait for the quadword store to write to memory
before they can access the data they require. This stall can also occur
with other data types (for example, when doublewords or words are
stored and then words or bytes are read from the same area of memory).
This stall can be avoided by moving the store as far from the loads as
possible.
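When the store cannot be moved away from the loads, another option is to avoid the small loads altogether by extracting the pieces from a register copy of the value. The C sketch below shows both patterns. It is a hedged illustration: the function names are hypothetical, a little-endian target is assumed, and the compiler decides which instructions are actually emitted.

```c
#include <stdint.h>
#include <string.h>

/* Case B pattern: one qword store, then word loads from
   offsets inside the stored data. */
void words_via_memory(double value, uint16_t *w2, uint16_t *w4)
{
    unsigned char mem[8];
    memcpy(mem, &value, 8);     /* store qword to "mem"   */
    memcpy(w2, mem + 2, 2);     /* load word at "mem + 2" */
    memcpy(w4, mem + 4, 2);     /* load word at "mem + 4" */
}

/* Alternative: take one same-size copy of the qword and
   extract the words with shifts instead of small loads.
   Byte offsets assume a little-endian target. */
void words_via_shifts(double value, uint16_t *w2, uint16_t *w4)
{
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);
    *w2 = (uint16_t)(bits >> 16);   /* bytes 2..3 of the qword */
    *w4 = (uint16_t)(bits >> 32);   /* bytes 4..5 of the qword */
}
```

The two functions produce the same words on a little-endian machine. The shift version trades two dependent small loads for register operations.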
Example 2-16 Large and Small Load Stalls
;A. Large load stall
mov mem, eax       ; store dword to address "mem"
mov mem + 4, ebx   ; store dword to address "mem + 4"
fld mem            ; load qword at address "mem", stalls
;B. Small load stall
fstp mem           ; store qword to address "mem"
mov bx, mem + 2    ; load word at address "mem + 2", stalls
mov cx, mem + 4    ; load word at address "mem + 4", stalls