50 C and C++ Source-Level Optimizations Chapter 2
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
2.25 Accelerating Floating-Point Division and Square Root
Optimization
In applications that make heavy use of single-precision division and square-root operations, it is
recommended that you port the code to SSE or 3DNow!™ inline assembly or use a compiler that can
generate SSE or 3DNow! technology code. If neither of these methods is possible, the precision
control (PC) bits of the x87 FPU control word can be set to single precision to improve
performance. (The processor defaults to double-extended precision. See AMD64 Architecture
Programmer's Manual Volume 1: Application Programming (order# 24592) for details on the FPU
control register.)
Application
This optimization applies to 32-bit software.
Rationale
Division and square root have a much longer latency than other floating-point operations, even though
the AMD Athlon 64 and AMD Opteron processors provide significant acceleration of these two
operations. In some application programs, these operations occur so often as to seriously impact
performance. If code has hot spots that use single-precision arithmetic only (that is, all computation
involves data of type float) and for some reason cannot be ported to SSE or 3DNow! code, the following
technique may be used to improve performance.
The x87 FPU has a precision-control field as part of the FPU control word. The precision-control
setting determines the precision to which instruction results are rounded and affects the basic
arithmetic operations, including division and square root. Division and square root on the
AMD Athlon 64 and AMD Opteron processors are only computed to the number of bits necessary for
the currently selected precision. Setting precision control to single precision (versus the Win32
default of double precision) lowers the latency of those operations.
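As a rough illustration of the precision-control field described above, the following C sketch decodes and modifies the PC bits of a control-word value. It assumes the standard x87 layout, in which PC occupies bits 8 and 9 (00b = single, 10b = double, 11b = double-extended, 01b reserved). The macro and function names are hypothetical and do not come from this guide or from any compiler library. The sketch only manipulates integer values; it does not read or write the actual FPU control register.

```c
#include <stdio.h>

/* Assumed x87 control-word layout: precision control (PC) is bits 8-9. */
#define X87_PC_MASK     0x0300u
#define X87_PC_SINGLE   0x0000u  /* 00b: 24-bit significand */
#define X87_PC_DOUBLE   0x0200u  /* 10b: 53-bit significand */
#define X87_PC_EXTENDED 0x0300u  /* 11b: 64-bit significand */

/* Hypothetical helper: significand width in bits selected by the PC field
   of control word cw, or 0 for the reserved encoding 01b. */
unsigned x87_pc_bits(unsigned cw)
{
    switch (cw & X87_PC_MASK) {
    case X87_PC_SINGLE:   return 24;
    case X87_PC_DOUBLE:   return 53;
    case X87_PC_EXTENDED: return 64;
    default:              return 0;
    }
}

/* Hypothetical helper: return cw with only its PC field replaced by pc;
   all other control bits (exception masks, rounding mode) are preserved. */
unsigned x87_set_pc(unsigned cw, unsigned pc)
{
    return (cw & ~X87_PC_MASK) | (pc & X87_PC_MASK);
}

/* Hypothetical helper: print the precision selected by a control word. */
void x87_print_pc(unsigned cw)
{
    printf("control word 0x%04X selects a %u-bit significand\n",
           cw, x87_pc_bits(cw));
}
```

For example, x87_pc_bits(0x037F) yields 64 for the processor's power-up control word, x87_pc_bits(0x027F) yields 53 for the usual Win32 setting, and x87_set_pc(0x027F, X87_PC_SINGLE) yields 0x007F, the same word with single precision selected.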
The Microsoft® Visual C environment provides functions to manipulate the FPU control word and
thus the precision control. Note that these functions are not very fast, so insert changes of precision
control where they create little overhead, such as outside a computation-intensive loop. Otherwise, the
overhead created by the function calls outweighs the benefit of reducing the latencies of divide and
square-root operations. For more information on this topic, see AMD64 Architecture Programmer's
Manual Volume 1: Application Programming (order# 24592).
The following example shows how to set the precision control to single precision and later restore the
original settings in the Microsoft Visual C environment.