IA-32 Intel® Architecture Optimization
To take advantage of the forward-not-taken and backward-taken static
predictions, arrange code so that the likely path of each forward branch
is the fall-through path, that is, the code that immediately follows the
branch (see also “Branch Prediction” in Chapter 2).
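The layout rule above can be sketched in C. This is a minimal illustration, not code from this manual: the function name sum_samples and the UNLIKELY macro are hypothetical, and __builtin_expect is a GCC/Clang extension rather than standard C. Whether the compiler actually places the rare path out of line depends on the compiler and optimization level, so inspect the generated assembly before relying on it.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: GCC/Clang provide __builtin_expect; elsewhere the hint is a no-op. */
#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

/* Hypothetical example: sum non-negative samples.
   Returns 0 on success, or -1 on the first negative sample,
   in which case *total is left unchanged. */
int sum_samples(const int *samples, size_t n, long *total)
{
    long acc = 0;

    assert(samples != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* Rare case guarded by a forward branch. With the hint, the
           compiler is asked to keep the common case as the fall-through
           path, which matches the forward-not-taken static prediction. */
        if (UNLIKELY(samples[i] < 0)) {
            return -1;
        }
        acc += samples[i];   /* common case: falls through the branch */
    }
    /* The loop's closing branch is a backward branch that is usually
       taken, which matches the backward-taken static prediction. */
    *total = acc;
    return 0;
}
```

Called on an array of three samples 1, 2, 3, the function returns 0 and stores 6 in *total.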
Branch Target Buffer. Once branch history is available, the Pentium 4
processor can predict the branch outcome even before the branch
instruction is decoded. The processor uses a branch history table and a
branch target buffer (collectively called the BTB) to predict the
direction and target of branches based on an instruction’s linear address.
Once the branch is retired, the BTB is updated with the target address.
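The lookup-then-update cycle described above can be modeled with a toy table in C. This is only a conceptual sketch: the table size, the direct-mapped indexing, the 2-bit saturating counter, and all toy_btb names are assumptions made for illustration, and none of them describes the actual Pentium 4 hardware organization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_BTB_ENTRIES 16u   /* assumed size; not a hardware figure */

typedef struct {
    bool     valid;
    uint32_t tag;      /* linear address of the branch instruction */
    uint32_t target;   /* target recorded when the branch last retired taken */
    uint8_t  counter;  /* assumed 2-bit history: 0..3, >= 2 predicts taken */
} toy_btb_entry;

typedef struct {
    toy_btb_entry e[TOY_BTB_ENTRIES];
} toy_btb;

void toy_btb_init(toy_btb *b)
{
    memset(b, 0, sizeof *b);
}

/* Predict from the linear address alone, before the branch is decoded.
   Returns true and writes *target when history predicts "taken";
   returns false when there is no history or it predicts "not taken". */
bool toy_btb_predict(const toy_btb *b, uint32_t addr, uint32_t *target)
{
    const toy_btb_entry *e = &b->e[addr % TOY_BTB_ENTRIES];

    if (e->valid && e->tag == addr && e->counter >= 2) {
        *target = e->target;
        return true;
    }
    return false;
}

/* Update the table once the branch has retired. */
void toy_btb_retire(toy_btb *b, uint32_t addr, bool taken, uint32_t target)
{
    toy_btb_entry *e = &b->e[addr % TOY_BTB_ENTRIES];

    if (!e->valid || e->tag != addr) {
        /* Allocate (or replace) the entry for this branch. */
        e->valid = true;
        e->tag = addr;
        e->counter = taken ? 2 : 1;
        e->target = taken ? target : 0;
        return;
    }
    if (taken) {
        if (e->counter < 3) {
            e->counter++;
        }
        e->target = target;
    } else if (e->counter > 0) {
        e->counter--;
    }
}
```

In this model a branch with no entry yields no dynamic prediction, which corresponds to the case where the static prediction rules apply instead.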
Return Stack. Returns are always taken, but because a procedure may be
invoked from several call sites, a single predicted target does not suffice.
The Pentium 4 processor has a Return Stack that can predict return
addresses for a series of procedure calls. This increases the benefit of
unrolling loops that contain function calls. It also reduces the need to
inline certain procedures, because the return-penalty portion of the
procedure call overhead is reduced.
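The reason a stack succeeds where a single predicted target fails can be shown with a small software model in C. This is a sketch under stated assumptions: the depth of 4, the overwrite-oldest overflow behavior, and the toy_rs names are all hypothetical and do not describe the actual Pentium 4 Return Stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_RS_DEPTH 4u   /* assumed depth; not a hardware figure */

typedef struct {
    uint32_t slot[TOY_RS_DEPTH];
    unsigned top;     /* index of the next free slot */
    unsigned count;   /* number of valid entries, at most TOY_RS_DEPTH */
} toy_return_stack;

void toy_rs_init(toy_return_stack *rs)
{
    rs->top = 0;
    rs->count = 0;
}

/* On a call: push the address of the instruction after the call.
   When the stack is full, the oldest entry is overwritten. */
void toy_rs_call(toy_return_stack *rs, uint32_t return_addr)
{
    rs->slot[rs->top] = return_addr;
    rs->top = (rs->top + 1u) % TOY_RS_DEPTH;
    if (rs->count < TOY_RS_DEPTH) {
        rs->count++;
    }
}

/* On a return: pop the most recent return address as the prediction.
   Returns false when the stack is empty and no prediction is available. */
bool toy_rs_return(toy_return_stack *rs, uint32_t *predicted)
{
    if (rs->count == 0) {
        return false;
    }
    rs->top = (rs->top + TOY_RS_DEPTH - 1u) % TOY_RS_DEPTH;
    *predicted = rs->slot[rs->top];
    rs->count--;
    return true;
}
```

If the same procedure is called first from a site whose return address is 0x100 and then from a site whose return address is 0x200, the model predicts 0x100 for the first return and 0x200 for the second, which a single remembered target could not do.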
Even if the direction and target address of a branch are correctly
predicted, a taken branch may reduce available parallelism in a typical
processor: decode bandwidth is wasted on the instructions that
immediately follow the branch and on those that precede the target,
unless the branch ends its line and the target begins its line. The
branch predictor allows a branch and its target to coexist in a single
trace cache line, maximizing instruction delivery from the front end.
Execution Core Detail
The execution core is designed to optimize overall performance by
handling common cases most efficiently. The hardware is designed to
execute frequent operations in a common context as fast as possible, at
the expense of infrequent operations using rare contexts.