IA-32 Intel® Architecture Optimization
Floating-Point Stalls
Floating-point instructions have a latency of at least two cycles.
However, because Pentium II and subsequent processors execute
instructions out of order, stalls do not necessarily occur on a
per-instruction or per-µop basis. For an instruction with a very long
latency, such as fdiv, scheduling can still improve the throughput of
the overall application.
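The scheduling idea can be sketched in C as follows. This is a minimal
illustration, not code from the manual: the function name scaled_sum and
its parameters are hypothetical, and whether the divide actually
overlaps with the loop depends on the compiler and the processor.

```c
#include <stddef.h>

/* Hypothetical helper: returns (a[0] + ... + a[n-1]) * (num / den).
   The divide is written first so that its long latency can overlap
   with the summation loop, which does not depend on its result. */
double scaled_sum(const double *a, size_t n, double num, double den)
{
    double ratio = num / den;      /* long-latency divide issued early */
    double sum = 0.0;

    for (size_t i = 0; i < n; i++)
        sum += a[i];               /* independent of ratio */

    return sum * ratio;            /* first use of the divide result */
}
```

Placing the first use of the quotient after the independent work gives
an out-of-order core (or the compiler's scheduler) room to hide the
divide latency.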
x87 Floating-point Operations with Integer Operands
For the Pentium 4 processor, splitting floating-point operations
(fiadd, fisub, fimul, and fidiv) that take 16-bit integer operands into
two instructions (fild and a floating-point operation) is more
efficient. However, for floating-point operations with 32-bit integer
operands, using fiadd, fisub, fimul, and fidiv is as efficient as using
separate instructions.
Assembly/Compiler Coding Rule 36. (M impact, L generality) Try to use
32-bit operands rather than 16-bit operands for fild. However, do not do so
at the expense of introducing a store forwarding problem by writing the two
halves of the 32-bit memory operand separately.
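At the source level, the rule can be approximated by widening a 16-bit
value to a single 32-bit object before it is converted to floating
point. The sketch below is an assumption about how this might look in
C; the function name add_int16 is hypothetical, and the instructions a
compiler actually emits depend on the compiler and target options.

```c
#include <stdint.h>

/* Hypothetical helper: adds a 16-bit integer to a double.
   The value is sign-extended into one 32-bit object, so any store
   that precedes the integer-to-float conversion is a single 32-bit
   write, not two separate 16-bit writes of the halves. */
double add_int16(double x, int16_t v)
{
    int32_t wide = v;              /* sign-extended as a whole */
    return x + (double)wide;       /* conversion uses a 32-bit operand */
}
```

The pattern to avoid is assembling the 32-bit operand in memory from two
16-bit stores and then loading it as one 32-bit value, since that is the
store forwarding problem the rule warns about.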
x87 Floating-point Comparison Instructions
On Pentium II and subsequent processors, the fcomi and fcmov
instructions should be used when performing floating-point
comparisons. Using the fcom, fcomp, and fcompp instructions typically
requires an additional instruction such as fstsw. The latter
alternative causes more µops to be decoded and should be avoided.
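A comparison followed by a select is the kind of source pattern that
maps naturally onto fcomi and fcmov. The C sketch below is only an
illustration: the function name select_max is hypothetical, and whether
a compiler emits fcomi/fcmov for it depends on the compiler and on the
target options, which are assumptions here.

```c
/* Hypothetical helper: returns the larger of a and b.
   If either argument is NaN, a > b is false and b is returned.
   A compiler generating x87 code for Pentium II or later may
   implement the compare and select without a branch and without
   going through the x87 status word. */
double select_max(double a, double b)
{
    return (a > b) ? a : b;
}
```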
Transcendental Functions
If an application needs to emulate math functions in software due to
performance or other reasons (see the “Guidelines for Optimizing
Floating-point Code” section), it may be worthwhile to inline math
library calls because the call and the prologue/epilogue involved with
such calls can significantly affect the latency of operations.
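As an illustration of inlining a software math routine, the sketch
below defines a small polynomial approximation of sin(x) as a static
inline function. The name sin_poly and the choice of a 7th-order Taylor
polynomial are assumptions made for this example; the approximation is
reasonable only for small |x| (roughly |x| <= pi/4) and is not a
replacement for a full library implementation.

```c
/* Hypothetical inline approximation of sin(x) for small |x|.
   Evaluates x - x^3/6 + x^5/120 - x^7/5040 in Horner form.
   Declared static inline so the compiler can expand it at the call
   site and avoid the call and prologue/epilogue overhead. */
static inline double sin_poly(double x)
{
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0
                   + x2 * ( 1.0 / 120.0
                   + x2 * (-1.0 / 5040.0))));
}
```

Range reduction and error handling are deliberately left out; an
application would need to add them, or restrict its inputs, before
relying on such a routine.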