164 Integer Optimizations Chapter 8
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
8.2 Alternative Code for Multiplying by a Constant
Optimization
Devise instruction sequences with lower latency to accomplish multiplication by certain constant
multipliers.
Rationale
A 32-bit integer multiplied by a constant has a latency of 3 cycles; a 64-bit integer multiplied by a
constant has a latency of 4 cycles. For certain constant multipliers, instruction sequences can be
devised that accomplish the multiplication with lower latency. Because the AMD Athlon 64 and
AMD Opteron processors contain only one integer multiplier but three integer execution units, the
replacement code can provide better throughput as well.
Most replacement sequences require the use of an additional temporary register, thus increasing
register pressure. If register pressure in a piece of code that performs integer multiplication with a
constant is already high, it could be better for the overall performance of that code to use the IMUL
instruction instead of the replacement code. Similarly, replacement sequences with low latency but
containing many instructions may negatively influence decode bandwidth as compared to the IMUL
instruction. In general, replacement sequences containing more than four instructions are not
recommended.
The following code samples are designed so that the register holding the original source operand
also receives the final result. Other sequences are possible if the result is to be placed in a
different register. Sequences that do not require a temporary register are favored over ones that
do, even if the latency is higher. Arithmetic-logic-unit (ALU) operations are preferred over shifts
to keep code size small. Similarly, both ALU operations and shifts are favored over the LEA
instruction.
The multiplier in the AMD Athlon 64 and AMD Opteron processors is improved over that of previous
x86 processors. For this reason, use an alternative sequence for 32-bit multiplication only if its
latency is 2 cycles or less, and for 64-bit multiplication only if its latency is 3 cycles or less.
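The selection rules above can be summarized as a small predicate. The following C sketch is illustrative only: the function name and parameters are hypothetical, and it encodes just the latency thresholds and the four-instruction guideline stated in this section. It does not model register pressure or decode bandwidth, which still need to be judged case by case.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: returns true when a replacement sequence meets the
   guidelines in this section.
   operand_bits       - 32 or 64 (width of the multiplication)
   seq_latency_cycles - latency of the candidate replacement sequence
   seq_instructions   - number of instructions in the candidate sequence */
static bool prefer_replacement(int operand_bits, int seq_latency_cycles,
                               int seq_instructions)
{
    assert(operand_bits == 32 || operand_bits == 64);

    /* 32-bit: latency of 2 cycles or less; 64-bit: 3 cycles or less. */
    int max_latency = (operand_bits == 64) ? 3 : 2;

    /* Sequences of more than four instructions are not recommended. */
    return seq_instructions <= 4 && seq_latency_cycles <= max_latency;
}
```

For example, a one-instruction, 1-cycle sequence qualifies for both operand sizes, while a 3-cycle sequence qualifies only for 64-bit multiplication.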
Examples
by 2:  add reg1, reg1            ; 1 cycle
by 3:  lea reg1, [reg1+reg1*2]   ; 2 cycles
by 4:  shl reg1, 2               ; 1 cycle
by 5:  lea reg1, [reg1+reg1*4]   ; 2 cycles
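As a sanity check, the arithmetic of the four sequences above can be modeled in C and compared against ordinary multiplication. This is a sketch under stated assumptions: the function names are hypothetical, reg1 is assumed to be a 32-bit register, and uint32_t is used so that results wrap modulo 2^32 as the register does. The code checks correctness of the results only and says nothing about latency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical models of the sequences; each mirrors one line of the listing. */
static uint32_t mul_by_2(uint32_t reg1) { return (uint32_t)(reg1 + reg1); }        /* add reg1, reg1          */
static uint32_t mul_by_3(uint32_t reg1) { return (uint32_t)(reg1 + (reg1 << 1)); } /* lea reg1, [reg1+reg1*2] */
static uint32_t mul_by_4(uint32_t reg1) { return (uint32_t)(reg1 << 2); }          /* shl reg1, 2             */
static uint32_t mul_by_5(uint32_t reg1) { return (uint32_t)(reg1 + (reg1 << 2)); } /* lea reg1, [reg1+reg1*4] */

/* Returns 1 if every sequence agrees with plain multiplication for x. */
static int sequences_match(uint32_t x)
{
    return mul_by_2(x) == (uint32_t)(x * 2u)
        && mul_by_3(x) == (uint32_t)(x * 3u)
        && mul_by_4(x) == (uint32_t)(x * 4u)
        && mul_by_5(x) == (uint32_t)(x * 5u);
}

/* Checks a few sample values, including ones whose products overflow 32 bits. */
static void check_samples(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 7u, 0x7FFFFFFFu, 0x80000001u, 0xFFFFFFFFu
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(sequences_match(samples[i]));
}
```

Calling check_samples() from a test harness exercises all four models. The scaled-index term of LEA is written as a shift here because the hardware scale factors are powers of two.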