IA-32 Intel® Architecture Optimization
Assembly/Compiler Coding Rule 33. (H impact, L generality) Minimize the
number of changes to the precision mode.
Improving Parallelism and the Use of FXCH
The x87 instruction set relies on the floating point stack for one of its
operands. If the dependence graph is a tree, which means each
intermediate result is used only once and code is scheduled carefully, it
is often possible to use only operands that are on the top of the stack or
in memory, and to avoid using operands that are buried under the top of
the stack. When operands need to be pulled from the middle of the
stack, an fxch instruction can be used to swap the operand on the top
of the stack with another entry in the stack.
The fxch instruction can also be used to enhance parallelism.
Dependent chains can be overlapped to expose more independent
instructions to the hardware scheduler. An fxch instruction may be
required to effectively increase the register name space so that more
operands can be simultaneously live.
Note, however, that fxch inhibits issue bandwidth in the trace cache.
It does this not only because it consumes a slot, but also because of
issue slot restrictions imposed on fxch. If the application is not
bound by issue or retirement bandwidth, fxch will have no impact.
The Pentium 4 processor’s effective instruction window size is large
enough to permit instructions that are as far away as the next iteration to
be overlapped. This often obviates the need to use fxch to enhance
parallelism.
The fxch instruction should be used only when it is needed to express
an algorithm or to enhance parallelism. If the size of the register
name space is a problem, the use of XMM registers is recommended (see
the section).
Assembly/Compiler Coding Rule 34. (M impact, M generality) Use fxch
only where necessary to increase the effective name space.