• new cache line flush instruction
• new memory fencing instructions
For a detailed description of using cacheability instructions, see
Chapter 6.
Code Alignment
Because the trace cache (TC) removes the decoding stage from the
pipeline for frequently executed code, optimizing code alignment for
decoding is not as important for Pentium 4 and Intel Xeon processors.
For the Pentium M processor, code alignment and the alignment of
branch targets affect the throughput of the decoder.
Careful arrangement of code can enhance cache and memory locality.
Likely sequences of basic blocks should be laid out contiguously in
memory. This may involve pulling unlikely code, such as code to handle
error conditions, out of that sequence. See the “Prefetching” section for
how to optimize for the instruction prefetcher.
Assembly/Compiler Coding Rule 29. (M impact, H generality) All branch
targets should be 16-byte aligned.
Assembly/Compiler Coding Rule 30. (M impact, H generality) If the body
of a conditional is not likely to be executed, it should be placed in another part
of the program. If it is highly unlikely to be executed and code locality is an
issue, the body of the conditional should be placed on a different code page.
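As an illustration of Rule 30 in a high-level language, the sketch below
moves an unlikely error path out of a hot loop. It assumes a GCC- or
Clang-compatible compiler; the UNLIKELY and COLD_FN macros and the
functions sum_samples and report_negative are hypothetical names
introduced for this example only. The attributes are hints: the compiler
and linker decide the final code placement.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumption: GCC/Clang extensions. Other compilers get no-op fallbacks. */
#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#define COLD_FN __attribute__((cold, noinline))
#else
#define UNLIKELY(x) (x)
#define COLD_FN
#endif

/* Hypothetical error handler. Keeping it out of line lets the toolchain
   place it away from the frequently executed sequence of basic blocks. */
static COLD_FN long report_negative(size_t index, int value)
{
    fprintf(stderr, "negative sample %d at index %zu\n", value, index);
    return -1;
}

/* Hot path: sums the samples. A negative sample is treated as a rare
   error and handled by the out-of-line function above. */
long sum_samples(const int *samples, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; ++i) {
        if (UNLIKELY(samples[i] < 0))
            return report_negative(i, samples[i]);
        total += samples[i];
    }
    return total;
}
```

For Rule 29, GCC provides the options -falign-functions=16,
-falign-jumps=16 and -falign-loops=16 to request 16-byte alignment of
functions, branch targets and loop heads. Other compilers may offer
different controls or none.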
Improving the Performance of Floating-point Applications
When programming floating-point applications, it is best to start with a
high-level programming language such as C, C++ or Fortran. Many
compilers perform floating-point scheduling and optimization where
possible. However, to produce optimal code, the compiler may
need some assistance.