54 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

2.27 Speeding Up Branches Based on Comparisons Between Floats

Optimization

Store operands of type float into a memory location and use integer comparison with the memory location to perform fast branches in cases where compilers do not support fast floating-point comparison instructions or 3DNow! code generation.

Application

This optimization applies to 32-bit software.

Rationale

Branches based on floating-point comparisons are often slow. The AMD Athlon 64 and AMD Opteron processors support the FCOMI, FUCOMI, FCOMIP, and FUCOMIP instructions that allow implementation of fast branches based on comparisons between operands of type double or type float. However, many compilers do not support generating these instructions. Likewise, floating-point comparisons between operands of type float can be accomplished quickly by using the 3DNow! PFCMP instruction if the compiler supports 3DNow! code generation.

Many compilers only implement branches based on floating-point comparisons by using FCOM or FCOMP to compare the floating-point operands, followed by FSTSW AX in order to transfer the x87 condition-code flags into EAX. The subsequent branch is then based on the contents of the EAX register. Although the AMD Athlon 64 and AMD Opteron processors have acceleration hardware to speed up the FSTSW instruction, this process is still fairly slow.

Branches Dependent on Integer Comparisons Are Fast

One alternative for branches that depend on the outcome of a comparison between operands of type float is to store the operand(s) into a memory location and then perform an integer comparison with that memory location. Branches dependent on integer comparisons are very fast. Note that the replacement code uses a load dependent on an immediately prior store. If the store is not doubleword-aligned, no store-to-load forwarding takes place, and the branch is still slow. Also, if there is a lot of activity in the load-store queue, forwarding of the store data may be somewhat delayed, negating some of the advantage of the replacement code. Experiment with the replacement code to test whether it actually provides a performance increase in the code at hand.
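
The idea of comparing the stored bit pattern of a float as an integer can be sketched in C for comparisons against zero. This is an illustrative sketch, not the guide's own replacement code: the helper names (float_bits_u, float_bits_s, is_negative, and so on) are hypothetical, memcpy is used here to read the bits in a way that is safe under C aliasing rules, and the code assumes a 32-bit IEEE-754 float and operands that are never NaN. Whether a given compiler turns this into a store followed by an integer compare has to be checked in the generated code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: copy the 32 bits of a float into an integer.
   Assumes float is a 32-bit IEEE-754 single-precision value. */
static uint32_t float_bits_u(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

static int32_t float_bits_s(float f)
{
    int32_t i;
    memcpy(&i, &f, sizeof i);
    return i;
}

/* The comparisons below treat +0.0f (0x00000000) and -0.0f (0x80000000)
   as equal to zero. NaN operands are NOT handled and give results that
   differ from the floating-point comparison. */

/* f < 0.0f: sign bit set, and not the negative-zero pattern. */
static int is_negative(float f)    { return float_bits_u(f) > 0x80000000u; }

/* f <= 0.0f: sign bit set (including -0.0f), or all bits clear. */
static int is_nonpositive(float f) { return float_bits_s(f) <= 0; }

/* f > 0.0f: sign bit clear and at least one other bit set. */
static int is_positive(float f)    { return float_bits_s(f) > 0; }

/* f >= 0.0f: sign bit clear, or exactly the negative-zero pattern. */
static int is_nonnegative(float f) { return float_bits_u(f) <= 0x80000000u; }
```

A branch such as `if (x < 0.0f)` would then be written as `if (is_negative(x))`. As the text above cautions, measure both versions before adopting the integer form.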

The replacement code works well for comparisons against zero, including correct behavior when encountering a negative zero as allowed by the IEEE-754 standard. It also works well for comparing
