A–2 Alpha Architecture Handbook
In some cases, there are performance advantages to aligning instructions or data to cache-block
boundaries, to placing data whose use is correlated in the same cache block, or to avoiding
cache conflicts by keeping data whose use is correlated away from addresses that are equal
modulo the cache size. Because the Alpha architecture will have many implementations, an
exact cache design cannot be outlined here.
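As a rough illustration of the "equal modulo the cache size" condition, the following Python sketch checks whether two byte addresses would compete for the same slot of a direct-mapped cache. The cache size (8 KB), the block size (32 bytes), the direct-mapped organization, and all function names are assumptions chosen for illustration only; actual cache parameters are implementation specific.

```python
# Assumed, illustrative parameters; real values are implementation specific.
CACHE_SIZE = 8192   # bytes, hypothetical direct-mapped cache
BLOCK_SIZE = 32     # bytes per cache block, hypothetical


def cache_index(addr: int) -> int:
    """Slot that a byte address maps to in the assumed direct-mapped cache."""
    return (addr % CACHE_SIZE) // BLOCK_SIZE


def same_block(a: int, b: int) -> bool:
    """True if both addresses fall inside the same memory block."""
    return a // BLOCK_SIZE == b // BLOCK_SIZE


def conflicts(a: int, b: int) -> bool:
    """True if the addresses are in different blocks that share one cache slot."""
    return not same_block(a, b) and cache_index(a) == cache_index(b)


print(conflicts(0x10000, 0x12000))   # True: 8192 bytes apart, same slot
print(conflicts(0x10000, 0x10010))   # False: both in the same cache block
print(conflicts(0x10000, 0x10020))   # False: adjacent blocks, different slots
```

Under these assumptions, two correlated data items placed 8192 bytes apart would evict each other, while items placed within one 32-byte block would share a single cache fill.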
In each case below, the performance implication is given by an order-of-magnitude number: 1,
3, 10, 30, or 100. A factor of 10 means that the performance difference being discussed will
likely range from 3 to 30 across all Alpha implementations.
A.2 Instruction-Stream Considerations
The following sections describe considerations for the instruction stream.
A.2.1 Instruction Alignment
Code PSECTs should be octaword aligned. Targets of frequently taken branches should be at
least quadword aligned, and octaword aligned for very frequent loops. Compilers could use
execution profiles to identify frequently taken branches.
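The alignment arithmetic can be sketched as follows. In Alpha terminology a quadword is 8 bytes and an octaword is 16 bytes, and instructions are 32 bits wide, so an aligned quadword holds two instructions and an aligned octaword holds four. The function names and the example address below are hypothetical, and how a compiler or assembler actually fills the padding is not specified here.

```python
QUADWORD = 8            # bytes: two 32-bit Alpha instructions
OCTAWORD = 16           # bytes: four 32-bit Alpha instructions
INSTRUCTION_BYTES = 4   # Alpha instructions are 32 bits wide


def is_aligned(addr: int, boundary: int) -> bool:
    """True if addr is a multiple of boundary (boundary is a power of two)."""
    return (addr & (boundary - 1)) == 0


def align_up(addr: int, boundary: int) -> int:
    """Smallest multiple of boundary that is greater than or equal to addr."""
    return (addr + boundary - 1) & ~(boundary - 1)


def pad_instructions(addr: int, boundary: int) -> int:
    """Instruction slots of padding needed to move addr up to the boundary."""
    return (align_up(addr, boundary) - addr) // INSTRUCTION_BYTES


target = 0x1004  # hypothetical address of a frequently taken branch target
print(is_aligned(target, QUADWORD))         # False
print(hex(align_up(target, QUADWORD)))      # 0x1008
print(pad_instructions(target, QUADWORD))   # 1
print(pad_instructions(target, OCTAWORD))   # 3
```

The padding counts show the trade-off: quadword alignment costs at most one instruction slot per target, while octaword alignment can cost up to three, which is one reason to reserve it for very frequent loops.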
Quadword I-fetch implementors should give first priority to executing aligned quadwords
quickly. Octaword-fetch implementors should give first priority to executing aligned octawords
quickly, and second priority to executing aligned quadwords quickly. Dual-issue
implementations should give first priority to issuing both halves of an aligned quadword in one
cycle, and second priority to buffering and issuing other combinations.
A.2.2 Branch Prediction and Minimizing Branch-Taken — Factor of 3
In many Alpha implementations, an unexpected change in I-stream address will result in about
10 lost instruction times. "Unexpected" may mean any branch-taken or may mean a mispredicted
branch. In many implementations, even a correctly predicted branch to a quadword
target address will be slower than straight-line code.
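To see why the distance between unexpected branches matters, consider a deliberately simple model. The model is an assumption of this sketch, not part of the architecture: instructions issue one per instruction time, and every run of n instructions ends in an unexpected branch that loses about 10 instruction times. The function name is hypothetical.

```python
LOST_INSTRUCTION_TIMES = 10  # "about 10" per the text; varies by implementation


def slowdown(instructions_between_branches: float,
             lost: float = LOST_INSTRUCTION_TIMES) -> float:
    """Time with branch penalties divided by time without, in the simple model."""
    return (instructions_between_branches + lost) / instructions_between_branches


for n in (5, 10, 20):
    print(n, slowdown(n))
# 5 3.0
# 10 2.0
# 20 1.5
```

In this model, short runs of about 5 instructions between unexpected branches run roughly 3 times slower than straight-line code, while runs of 20 instructions reduce the overhead to about 50 percent. Real implementations issue more than one instruction per cycle and predict some branches correctly, so these figures are only indicative.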
Compilers should follow these rules to minimize unexpected branches:
1. Branch prediction is implementation specific. Based on execution profiles, compilers
should physically rearrange code so that its actual branch behavior matches what the
implementation predicts.
2. Make basic blocks as big as possible. A good goal is 20 instructions on average
between branch-taken. This requires unrolling loops so that they contain at least 20
instructions, and putting subroutines of less than 20 instructions directly in line. It also
requires using execution profiles to rearrange code so that the frequent case of a conditional
branch falls through. For very high-performance loops, it will be profitable to
move instructions across conditional branches to fill otherwise wasted instruction issue
slots, even if the instructions moved will not always do useful work. Note that using the
Conditional Move instructions can sometimes avoid breaking up basic blocks.
3. In an if-then-else construct whose execution profile is skewed even slightly away from
50%-50% (51-49 is enough), put the infrequent case completely out of line, so that the
frequent case encounters zero branch-takens, and the infrequent case encounters two