General Optimization Guidelines 2
Memory Accesses
This section discusses guidelines for optimizing code and data memory
accesses. The most important recommendations are:
• align data, paying attention to data layout and stack alignment
• enable store forwarding
• place code and data on separate pages
• enhance data locality
• use prefetching and cacheability control instructions
• enhance code locality and align branch targets
• take advantage of write combining
Alignment and forwarding problems are among the most common
sources of large delays on the Pentium 4 processor.
Alignment
Alignment of data concerns all kinds of variables:
• dynamically allocated
• members of a data structure
• global or local variables
• parameters passed on the stack
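As a minimal sketch of how alignment might be requested for several of these kinds of variables, the following C11 fragment aligns a structure member, a global, a local, and a heap allocation to 64 bytes. The names CACHE_LINE_SIZE, sample_block, global_buffer, is_aligned, alloc_cache_aligned, and local_buffer_is_aligned are hypothetical and chosen only for illustration. It assumes a C11 toolchain that provides <stdalign.h> and aligned_alloc and that supports 64-byte alignment for objects on the stack, which is implementation-defined. Alignment of parameters passed on the stack is governed by the calling convention and compiler options and is not covered here.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64  /* assumed alignment target, in bytes */

/* Member of a data structure: aligning a member raises the
   alignment requirement of the whole structure. */
struct sample_block {
    alignas(CACHE_LINE_SIZE) float values[16];
};

/* Global variable. */
static alignas(CACHE_LINE_SIZE) unsigned char global_buffer[128];

/* Returns nonzero if p is a multiple of alignment. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Dynamically allocated data: aligned_alloc expects the size to be a
   multiple of the alignment, so round the request up first.
   The caller releases the memory with free(). */
static void *alloc_cache_aligned(size_t size)
{
    size_t lines = (size + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE;
    if (lines == 0) {
        lines = 1;  /* avoid a zero-sized request */
    }
    return aligned_alloc(CACHE_LINE_SIZE, lines * CACHE_LINE_SIZE);
}

/* Local variable: the compiler must realign the stack frame if the
   incoming stack pointer is not already sufficiently aligned. */
static int local_buffer_is_aligned(void)
{
    alignas(CACHE_LINE_SIZE) unsigned char local_buffer[CACHE_LINE_SIZE];
    local_buffer[0] = 0;
    return is_aligned(local_buffer, CACHE_LINE_SIZE);
}
```

Older compilers without C11 support typically offer vendor-specific alignment attributes and allocation routines instead; consult the compiler documentation before relying on them.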
Misaligned data access can incur significant performance penalties. This
is particularly true for cache line splits. The size of a cache line is
64 bytes in the Pentium 4, Intel Xeon, and Pentium M processors.
On the Pentium 4 processor, an access to data that crosses a 64-byte
boundary leads to two memory accesses and requires several µops to be
executed (instead of one). Such accesses are likely to incur a large
performance penalty, because they are executed near retirement and can
cause stalls on the order of the depth of the pipeline.
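To make the cache line split condition concrete, the following C sketch tests whether an access of a given size at a given address touches more than one 64-byte line. The helper name spans_cache_line and the constant CACHE_LINE_SIZE are hypothetical. The check is plain address arithmetic and does not measure the actual penalty on any processor.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* line size stated in the text above */

/* Returns true if an access of `size` bytes starting at `addr`
   touches more than one cache line. */
static bool spans_cache_line(uintptr_t addr, size_t size)
{
    if (size == 0) {
        return false;  /* an empty access touches no line */
    }
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

For example, an 8-byte load at offset 60 within a line covers bytes 60 through 67 and therefore splits, while the same load at offset 56 covers bytes 56 through 63 and stays within one line.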