IA-32 Intel® Architecture Optimization
Guidelines for Optimizing Floating-point Code
User/Source Coding Rule 10. (M impact, M generality) Enable the
compiler’s use of SSE, SSE2 or SSE3 instructions with appropriate switches.
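As an illustration of the kind of code this rule targets, the following is a minimal sketch of a loop that a vectorizing compiler can typically map onto SSE instructions. The function name saxpy and the GCC switches mentioned in the comment are assumptions for illustration and do not come from this manual; consult your compiler's documentation for the switches it actually supports.

```c
#include <stddef.h>

/*
 * Hypothetical kernel: y[i] = a * x[i] + y[i].
 *
 * The restrict qualifiers promise the compiler that x and y do not
 * overlap, which removes one common obstacle to vectorization.
 *
 * Assumed build line (GCC on IA-32, not taken from this manual):
 *     gcc -O3 -msse2 -mfpmath=sse -c saxpy.c
 */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Inspecting the generated assembly is the only reliable way to confirm that packed SSE instructions were actually emitted for the loop.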
Follow this procedure to investigate the performance of your
floating-point application:
• Understand how the compiler handles floating-point code.
• Look at the assembly dump and see what transforms are already
performed on the program.
• Study the loop nests in the application that dominate the execution
time.
• Determine why the compiler is not creating the fastest code.
• See if there is a dependence that can be resolved.
• Determine the problem area: bus bandwidth, cache locality, trace
cache bandwidth or instruction latency. Focus on optimizing the
problem area. For example, adding prefetch instructions will not
help if the bus is already saturated. If trace cache bandwidth is the
problem, added prefetch µops may degrade performance.
For floating-point coding, follow all the general coding
recommendations discussed in this chapter, including:
• blocking the cache
• using prefetch
• enabling vectorization
• unrolling loops
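Of the recommendations above, cache blocking is the one that most changes the shape of the source code. The following is a minimal sketch using a square matrix transpose; the function name transpose_blocked and the tile size BLOCK are hypothetical, and the best tile size depends on the cache sizes of the target processor.

```c
#include <stddef.h>

#define BLOCK 32  /* hypothetical tile edge in elements; tune per target */

/*
 * Transpose an n x n row-major matrix tile by tile, so that the source
 * rows and destination rows touched by one tile stay cache-resident
 * while the tile is being processed.
 */
void transpose_blocked(size_t n, const float *src, float *dst)
{
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            /* Clamp the tile bounds when n is not a multiple of BLOCK. */
            size_t i_end = (ii + BLOCK < n) ? ii + BLOCK : n;
            size_t j_end = (jj + BLOCK < n) ? jj + BLOCK : n;

            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

The two outer loops walk over tiles and the two inner loops walk within one tile; an unblocked version would have only the inner pair running over the full range.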
User/Source Coding Rule 11. (H impact, ML generality) Make sure your
application stays in range to avoid denormal values and underflows.
Out-of-range numbers cause very high overhead.
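One way to check whether an application is affected is to scan intermediate results for denormal values during testing. The following is a minimal sketch that uses the standard C fpclassify macro; the helper name count_subnormals is hypothetical, and the check is intended as a debugging aid rather than something to leave in a performance-critical loop.

```c
#include <math.h>
#include <stddef.h>

/*
 * Count the denormal (subnormal) values in an array of floats.
 * A nonzero result suggests the data has drifted out of the normal
 * range and the computation may be paying the overhead described
 * in the rule above.
 */
size_t count_subnormals(const float *v, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (fpclassify(v[i]) == FP_SUBNORMAL)
            count++;
    }
    return count;
}
```

If the count is nonzero, rescaling the input data or changing the algorithm so that values stay within the normal range are typical remedies.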
User/Source Coding Rule 12. (M impact, ML generality) Do not use double
precision unless necessary. Set the precision control (PC) field in the x87 FPU
control word to “Single Precision”. This allows single precision (32-bit)
computation to complete faster on some operations (for example, divides due