IA-32 Intel® Architecture Optimization
Inlining, Calls and Returns
The return address stack mechanism augments the static and dynamic
predictors to optimize specifically for calls and returns. It holds 16
entries, which is large enough to cover the call depth of most programs.
If a chain of more than 16 nested calls is followed by more than 16
returns in rapid succession, the oldest return addresses are no longer
held in the stack, and performance may be degraded.
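The effect of the 16-entry depth limit can be illustrated with a toy model. The sketch below assumes a circular buffer that overwrites its oldest entry when full; this is a simplification for illustration, not a description of the actual hardware, and the names ras_t, ras_call, ras_return and nested_call_mispredicts are hypothetical.

```c
/* Hypothetical model of a 16-entry circular return address stack (RAS).
   This only illustrates why more than 16 nested calls can cause return
   mispredictions; it is not the real hardware implementation. */
#define RAS_ENTRIES 16

typedef struct {
    unsigned long entry[RAS_ENTRIES];
    int top; /* index of the most recently pushed entry */
} ras_t;

static void ras_init(ras_t *r) {
    for (int i = 0; i < RAS_ENTRIES; i++)
        r->entry[i] = 0;
    r->top = 0;
}

/* On a call: push the return address, overwriting the oldest entry when full. */
static void ras_call(ras_t *r, unsigned long ret_addr) {
    r->top = (r->top + 1) % RAS_ENTRIES;
    r->entry[r->top] = ret_addr;
}

/* On a return: pop the predicted target; returns 1 if it matches the actual one. */
static int ras_return(ras_t *r, unsigned long actual) {
    unsigned long predicted = r->entry[r->top];
    r->top = (r->top + RAS_ENTRIES - 1) % RAS_ENTRIES;
    return predicted == actual;
}

/* Simulate `depth` nested calls followed by `depth` returns;
   return the number of mispredicted returns. */
static int nested_call_mispredicts(int depth) {
    ras_t r;
    ras_init(&r);
    for (int i = 1; i <= depth; i++)
        ras_call(&r, (unsigned long)i);   /* return address of call i */
    int misses = 0;
    for (int i = depth; i >= 1; i--)
        if (!ras_return(&r, (unsigned long)i))
            misses++;
    return misses;
}
```

In this model, nested_call_mispredicts(16) reports no mispredictions, while a depth of 20 mispredicts the 4 outermost returns, because their entries were overwritten by the deeper calls.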
The trace cache maintains branch prediction information for calls and
returns. As long as the trace containing the call or return remains in
the trace cache and the call and return targets remain unchanged, the
depth limit of the return address stack described above will not impede
performance.
To enable the use of the return address stack mechanism, calls and
returns must be matched in pairs. If they are, the likelihood of
exceeding the stack depth in a manner that will impact performance is
very low.
Assembly/Compiler Coding Rule 4. (MH impact, MH generality) Near
calls must be matched with near returns, and far calls must be matched with
far returns. Pushing the return address on the stack and jumping to the routine
to be called is not recommended since it creates a mismatch between calls and
returns.
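Why a push-and-jump sequence hurts can be sketched with the same kind of toy model: the architectural stack receives the return address, but the return address stack does not, so the next return consumes a prediction that belongs to an outer call. The event encoding and the names event, event_kind and count_return_mispredicts are hypothetical, and the model assumes well-formed sequences no deeper than STACK_MAX.

```c
#include <stddef.h>

#define RAS_ENTRIES 16
#define STACK_MAX 64   /* assumption: sequences never nest deeper than this */

/* Hypothetical event types for a toy model of control transfers. */
typedef enum { EV_CALL, EV_PUSH_JMP, EV_RET } event_kind;

typedef struct {
    event_kind kind;
    unsigned long ret_addr; /* used by EV_CALL and EV_PUSH_JMP */
} event;

/* Replays events against the architectural stack and a 16-entry circular
   RAS; returns the number of mispredicted returns. */
static int count_return_mispredicts(const event *ev, size_t n) {
    unsigned long stack[STACK_MAX];
    size_t sp = 0;
    unsigned long ras[RAS_ENTRIES] = {0};
    int top = 0, misses = 0;

    for (size_t i = 0; i < n; i++) {
        switch (ev[i].kind) {
        case EV_CALL:      /* CALL: pushes onto both the stack and the RAS */
            stack[sp++] = ev[i].ret_addr;
            top = (top + 1) % RAS_ENTRIES;
            ras[top] = ev[i].ret_addr;
            break;
        case EV_PUSH_JMP:  /* PUSH + JMP: the RAS never sees this "call" */
            stack[sp++] = ev[i].ret_addr;
            break;
        case EV_RET: {     /* RET: the real target comes from the stack */
            unsigned long actual = stack[--sp];
            unsigned long predicted = ras[top];
            top = (top + RAS_ENTRIES - 1) % RAS_ENTRIES;
            if (predicted != actual)
                misses++;
            break;
        }
        }
    }
    return misses;
}
```

With two matched calls followed by two returns the model reports no mispredictions; replacing the inner call with a push-and-jump causes both returns to mispredict, since the mismatch shifts every later prediction by one entry.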
Calls and returns are expensive. Use inlining for the following reasons:
• Parameter passing overhead can be eliminated.
• In a compiler, inlining a function exposes more opportunity for
optimization.
• If the inlined routine contains branches, the additional context of the
caller may improve branch prediction within the routine.
• A mispredicted branch can lead to larger performance penalties
inside a small function than if that function is inlined.
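As a minimal sketch of these benefits, consider a small, branchy helper called from a hot loop. The functions clamp and sum_clamped are hypothetical examples; inline is only a hint in C, and whether the call is actually inlined is up to the compiler and its optimization settings.

```c
/* Small, branchy helper: a good inlining candidate on a hot path. */
static inline int clamp(int x, int lo, int hi) {
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* Hot loop. If clamp() is inlined, no arguments are set up and no CALL/RET
   pair executes per element, the compiler may fold the constant bounds
   0 and 255 into the comparisons, and clamp's branches get their own copy
   at this call site, so they are predicted in the context of this loop. */
int sum_clamped(const int *v, int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += clamp(v[i], 0, 255);
    return total;
}
```

For example, summing the values -5, 10, 300 and 255 after clamping gives 0 + 10 + 255 + 255 = 520.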
Assembly/Compiler Coding Rule 5. (MH impact, MH generality)
Selectively inline a function where doing so decreases code size, or where the
function is small and the call site is frequently executed.
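One way to express selective inlining in C is with the GCC/Clang function attributes always_inline and noinline, which are compiler extensions rather than standard C. In the sketch below, the small step function on the hot loop is forced inline, while the rarely executed error reporter is kept out of line so its body is not duplicated at call sites. The functions mix, report_null_input and hash_bytes are hypothetical; the hash arithmetic follows the 32-bit FNV-1a scheme.

```c
#include <stdio.h>

/* Small and executed on every loop iteration: request inlining. */
static inline __attribute__((always_inline))
unsigned mix(unsigned h, unsigned x) {
    return (h ^ x) * 16777619u;          /* FNV-1a step (32-bit prime) */
}

/* Rarely executed: keep it out of line so it does not grow the caller. */
static __attribute__((noinline)) void report_null_input(void) {
    fprintf(stderr, "hash_bytes: NULL input\n");
}

unsigned hash_bytes(const unsigned char *p, size_t len) {
    if (p == NULL) {                     /* cold path */
        report_null_input();
        return 0;
    }
    unsigned h = 2166136261u;            /* FNV-1a 32-bit offset basis */
    for (size_t i = 0; i < len; i++)
        h = mix(h, p[i]);                /* hot path: inlined step */
    return h;
}
```

With this code, hash_bytes applied to the one-byte string "a" yields 0xe40c292c. Whether forcing or suppressing inlining actually helps should be confirmed by measurement for a given workload.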