IA-32 Intel® Architecture Optimization
Store Forwarding
The processor’s memory system only sends stores to memory (including
cache) after store retirement. However, store data can be forwarded
from a store to a subsequent load from the same address to give a much
shorter store-load latency.
There are two kinds of requirements for store forwarding. If these
requirements are violated, store forwarding cannot occur and the load
must get its data from the cache (so the store must write its data back to
the cache first). This incurs a penalty that is related to pipeline depth.
The first requirement pertains to the size and alignment of the
store-forwarding data. This restriction is likely to have a high impact on
overall application performance. Typically, the performance penalty due to
violating this restriction can be prevented. Several examples of coding
pitfalls that cause store-forwarding stalls and solutions to these pitfalls
are discussed in detail in the “Store-to-Load-Forwarding Restriction on
Size and Alignment” section. The second requirement is the availability
of data, discussed in the “Store-forwarding Restriction on Data
Availability” section.
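As a rough illustration of the size restriction, the sketch below contrasts a pattern that writes two 16-bit values and then reads them back as one 32-bit value with a version that builds the value in a register. The function names are hypothetical, and the example assumes a little-endian target such as IA-32. A compiler may optimize the memory traffic away entirely, so treat this as a picture of the access pattern rather than a guaranteed reproduction of the stall.

```c
#include <stdint.h>
#include <string.h>

/* Two narrow stores followed by one wider load that spans both of them.
   The load cannot take all of its data from a single earlier store, so
   forwarding is likely to fail if the compiler keeps these accesses. */
uint32_t combine_via_memory(uint16_t lo, uint16_t hi)
{
    uint16_t halves[2];
    uint32_t word;

    halves[0] = lo;                      /* 16-bit store */
    halves[1] = hi;                      /* 16-bit store */
    memcpy(&word, halves, sizeof word);  /* 32-bit load over both stores */
    return word;
}

/* Same result (on a little-endian target) built with shifts and ORs,
   so no store-load pair is needed at all. */
uint32_t combine_in_register(uint16_t lo, uint16_t hi)
{
    return (uint32_t)lo | ((uint32_t)hi << 16);
}
```

The second form avoids the question of forwarding altogether, which is usually preferable when the value never needs to live in memory.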
A good practice is to eliminate redundant load operations; some
guidelines follow.
It may be possible to keep a temporary scalar variable in a register and
never write it to memory. Generally, such a variable must not be
accessible via indirect pointers. Moving a variable to a register
eliminates all loads and stores of that variable and eliminates potential
problems associated with store forwarding. However, it also increases
register pressure.
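A minimal sketch of this guideline in C, using hypothetical function names: the first version accumulates through a pointer, so the compiler may have to store and reload the running total on every iteration because it cannot prove that the output does not alias the input. The second version keeps the total in a local variable that is never exposed through a pointer, which lets the compiler hold it in a register.

```c
#include <stddef.h>

/* The running total lives in memory. Each iteration may perform a load
   and a store of *total, since data[] might alias it. */
void sum_through_pointer(const int *data, size_t n, int *total)
{
    *total = 0;
    for (size_t i = 0; i < n; i++)
        *total += data[i];
}

/* The running total is a local scalar whose address is never taken,
   so it can stay in a register; memory is written once at the end. */
void sum_in_local(const int *data, size_t n, int *total)
{
    int acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += data[i];
    *total = acc;
}
```

Both functions produce the same result when the output does not overlap the input; the difference is only in how many loads and stores the loop may need.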
Load instructions tend to start chains of computation. Since the
out-of-order engine is based on data dependence, load instructions play a
significant role in the engine's capability to execute at a high rate.
Eliminating loads should be given a high priority.
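One common way to remove a redundant load is to read a loop-invariant value once before the loop. The sketch below assumes a hypothetical `scale_params` structure and function names chosen for illustration. In the first version the compiler may reload `p->factor` on each iteration, because a store to `out[i]` could in principle modify it. The second version loads it once into a local.

```c
#include <stddef.h>

/* Hypothetical parameter block, used only for this illustration. */
struct scale_params {
    int factor;
};

/* p->factor may be reloaded every iteration: the store to out[i]
   could alias *p as far as the compiler can tell. */
void scale_reload(int *out, const int *in, size_t n,
                  const struct scale_params *p)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * p->factor;
}

/* The factor is loaded once, so the multiply in each iteration depends
   only on the load of in[i]. */
void scale_hoisted(int *out, const int *in, size_t n,
                   const struct scale_params *p)
{
    const int factor = p->factor;
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * factor;
}
```

The two versions are equivalent only when `out` does not overlap `*p`, which is the same assumption a programmer makes when hoisting the load by hand.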