126 Branch Optimizations Chapter 6
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
6.1 Density of Branches
Optimization
When possible, align branches such that they do not cross a 16-byte boundary.
Application
This optimization applies to:
• 32-bit software
• 64-bit software
Rationale
The AMD Athlon™ 64 and AMD Opteron™ processors have the capability to cache branch-
prediction history for a maximum of three near branches (CALL, JMP, conditional branches, or
returns) per 16-byte fetch window. A branch instruction that crosses a 16-byte boundary is counted in
the second 16-byte window. Due to architectural restrictions, a branch that is split across a 16-byte
boundary cannot dispatch with any other instructions when it is predicted taken. Perform this
alignment by rearranging code; it is not beneficial to align branches using padding sequences.
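The boundary rule above can be sketched in C as follows. This is a minimal illustration, not part of the guide: the helper names crosses_window and counted_window are hypothetical, and the code assumes that "counted in the second 16-byte window" means the window containing the last byte of the branch instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FETCH_WINDOW 16u  /* fetch-window size in bytes, as stated in the text */

/* Hypothetical helper: true if an instruction of `len` bytes that starts at
   `addr` has its first and last bytes in different 16-byte windows. */
static bool crosses_window(uint64_t addr, unsigned len)
{
    assert(len > 0);
    return (addr / FETCH_WINDOW) != ((addr + len - 1) / FETCH_WINDOW);
}

/* Hypothetical helper: index of the window the branch is counted in.
   Assumption: a split branch counts toward the window holding its last byte. */
static uint64_t counted_window(uint64_t addr, unsigned len)
{
    assert(len > 0);
    return (addr + len - 1) / FETCH_WINDOW;
}
```

For example, a 2-byte jcc rel8 at address 0x100E fits entirely in one window, while the same instruction at 0x100F is split and would be counted in the following window.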
The following branches are limited to three per 16-byte window:
• jcc rel8
• jcc rel32
• jmp rel8
• jmp rel32
• jmp reg
• jmp WORD PTR
• jmp DWORD PTR
• call rel16
• call r/m16
• call rel32
• call r/m32
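The three-per-window limit can be checked mechanically once the address and length of each branch are known, for instance from a disassembly listing. The following C sketch is an illustration under stated assumptions: struct branch and first_overfull_branch are hypothetical names, the input array is assumed to be sorted by address, and a branch is assumed to count toward the window that holds its last byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FETCH_WINDOW            16u  /* bytes per fetch window */
#define MAX_BRANCHES_PER_WINDOW 3u   /* limit stated in the text */

/* Hypothetical record: start address and encoded length of one branch. */
struct branch { uint64_t addr; unsigned len; };

/* Returns the index of the first branch that is the fourth (or later) one
   counted in its window, or -1 if no window holds more than three.
   Assumes b[] is sorted by ascending address. */
static long first_overfull_branch(const struct branch *b, size_t n)
{
    uint64_t window = 0;
    unsigned count = 0;

    for (size_t i = 0; i < n; i++) {
        assert(b[i].len > 0);
        /* Window of the last byte, so a split branch counts in the second window. */
        uint64_t w = (b[i].addr + b[i].len - 1) / FETCH_WINDOW;
        if (i == 0 || w != window) {
            window = w;   /* entered a new window: restart the count */
            count = 0;
        }
        if (++count > MAX_BRANCHES_PER_WINDOW)
            return (long)i;
    }
    return -1;
}
```

Four 2-byte branches at 0x1000, 0x1002, 0x1004, and 0x1006 would be reported at index 3. If the fourth branch instead starts at 0x100F, it is counted in the next window and no violation is reported, although that branch is then split across the boundary.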
Coding more than three branches in the same 16-byte code window may lead to conflicts in the
branch target buffer. To avoid conflicts in the branch target buffer, space out branches such that three