IBM PPC440X5 Computer Hardware User Manual


 
User’s Manual
Preliminary PPC440x5 CPU Core
September 12, 2002
Appendix B. PPC440x5 Core Compiler Optimizations
This appendix describes some potential optimizations for compilers.
1. Place target addresses (subroutine entry points) on cache line boundaries (32 bytes).
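The alignment arithmetic behind item 1 can be sketched as follows. This is only an illustration of rounding an address up to a 32-byte boundary; the helper names align_up and padding_bytes are hypothetical and do not come from any PPC440x5 tool chain.

```python
CACHE_LINE_BYTES = 32  # cache line size given in item 1


def align_up(address: int, boundary: int = CACHE_LINE_BYTES) -> int:
    """Round address up to the next multiple of boundary (a power of two)."""
    return (address + boundary - 1) & ~(boundary - 1)


def padding_bytes(address: int, boundary: int = CACHE_LINE_BYTES) -> int:
    """Bytes of padding needed so an entry point at address starts a cache line."""
    return align_up(address, boundary) - address


if __name__ == "__main__":
    # Hypothetical entry point that would otherwise land at 0x1004.
    print(hex(align_up(0x1004)), padding_bytes(0x1004))
```

A compiler or assembler would normally reach the same result with an alignment directive rather than explicit arithmetic.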
2. Schedule up to five instructions between a load and the use of the load result. Assuming a data
cache hit, at most five instructions are needed between a load and the use of its result to avoid
any pipeline bubbles in the PPC440x5 core. The five instructions are:
One dispatched together with the load
Two in the following cycle
Two in the cycle after that
In the next cycle, the use of the load result can dispatch. Therefore, the compiler should try to schedule
as many as five instructions between the load and the use of the load result. However, if some of the
instruction pairs between the load and its use have pipeline dependencies (such that they cannot dispatch
together), there is no benefit in including the extra instructions between the load and its use, and other
scheduling optimizations could be made.
In the worst case of instruction pairings, the maximum performance can be achieved with only two
instructions between the load and the use of the load result. This is the case when the load instruction
pairs with the instruction before it (instead of after it), the next two instructions require the same pipe
(so only one can dispatch during the cycle after the load), and the third instruction after the load needs
the same pipe as the second (so they cannot dispatch together either). In such a case, the third instruction
after the load might as well be the use of the load result. See item 3 for information about which
instruction pairings can dispatch together.
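The two cases described in item 2 can be reproduced with a small model. The sketch below assumes strictly in-order dispatch of at most two instructions per cycle, ignores data dependencies among the filler instructions, and assumes that two instructions dual dispatch whenever they can occupy different pipes. The names can_pair and fillers_before_use are hypothetical.

```python
L, I, IJ = "L", "I", "IJ"  # pipe classes: L-Pipe only, I-Pipe only, I- or J-Pipe

LOAD_USE_CYCLES = 3  # per item 2: the use can dispatch three cycles after the load


def can_pair(a: str, b: str) -> bool:
    """Assumed rule: a pair dispatches together if it can use two different pipes."""
    if a == L or b == L:
        return a != b          # only one L-Pipe instruction per cycle
    return not (a == I and b == I)  # two I-only instructions collide


def fillers_before_use(following, load_pairs_with_previous=False) -> int:
    """Count instructions after the load that dispatch before the use could."""
    pos = 0
    # Cycle of the load: one slot is free unless the load paired with its predecessor.
    if not load_pairs_with_previous and following and can_pair(L, following[0]):
        pos = 1
    # The two cycles after the load: up to two instructions each.
    for _ in range(LOAD_USE_CYCLES - 1):
        if pos >= len(following):
            break
        first = following[pos]
        pos += 1
        if pos < len(following) and can_pair(first, following[pos]):
            pos += 1
    return pos


if __name__ == "__main__":
    print(fillers_before_use([IJ] * 5))                               # best case
    print(fillers_before_use([I, I, I], load_pairs_with_previous=True))  # worst case
```

Under these assumptions the best case yields five useful filler slots and the worst case yields two, matching the text.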
3. Pair instructions for dual dispatch. The rules for instruction dispatch in the PPC440x5 core are as follows:
Loads and stores can only use the L-Pipe. Branches, CR-updates, XER-updates (“o” forms of arithmetic
instructions), multiply, divide, system instructions (such as rfi and sc), and any SPR accesses (mtspr,
mfspr) can only use the I-Pipe. All other instructions (primarily non-CR-updating and non-XER-updating
arithmetic and logic instructions) can use either the J-Pipe or the I-Pipe. Instructions should be ordered
so that they can dispatch as pairs. For example, pair loads and stores with any other instructions, pair
CR-updates with non-CR-updating instructions, and so on.
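The pipe rules in item 3 can be written down as a lookup table plus a pairing check. This is a minimal sketch, assuming that any two adjacent instructions dual dispatch when they can be assigned to two different pipes and that data dependencies are ignored; the category names and the functions can_dual_dispatch and dispatch_cycles are hypothetical.

```python
L_ONLY = frozenset({"L"})
I_ONLY = frozenset({"I"})
I_OR_J = frozenset({"I", "J"})

# Pipes available to each instruction category, following item 3.
PIPES = {
    "load": L_ONLY, "store": L_ONLY,
    "branch": I_ONLY, "cr_update": I_ONLY, "xer_update": I_ONLY,
    "multiply": I_ONLY, "divide": I_ONLY, "system": I_ONLY,
    "spr_access": I_ONLY,
    "simple_alu": I_OR_J,  # non-CR-updating, non-XER-updating arithmetic/logic
}


def can_dual_dispatch(first: str, second: str) -> bool:
    """True if the two categories can be assigned to two different pipes."""
    return any(p != q for p in PIPES[first] for q in PIPES[second])


def dispatch_cycles(sequence) -> int:
    """Cycles needed to dispatch sequence in order, pairing greedily."""
    cycles, i = 0, 0
    while i < len(sequence):
        paired = i + 1 < len(sequence) and can_dual_dispatch(sequence[i], sequence[i + 1])
        i += 2 if paired else 1
        cycles += 1
    return cycles


if __name__ == "__main__":
    print(dispatch_cycles(["load", "load", "simple_alu", "simple_alu"]))
    print(dispatch_cycles(["load", "simple_alu", "load", "simple_alu"]))
```

In this model, interleaving the two loads with the arithmetic instructions reduces the dispatch count from three cycles to two, which is the kind of reordering item 3 asks the compiler to look for.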
4. Do not bother to try to schedule instructions between CR-updates and branches that are conditional on
those CR-updates (with some exceptions).
The exceptions are for CR-updates caused by multiply, divide, multiply-accumulate, mtcrf, tlbsx., and
stwcx. instructions. If a branch depends on the CR result of one of these instructions, one or more
instructions should be scheduled (if possible) between the CR update and the branch. Of course, it is also
the general case (as pointed out in item 3) that the compiler should schedule instructions so they can
issue in pairs, and a CR-update and a branch both issue to the I-Pipe, so they cannot issue together.
(The compiler should try to arrange the code so that a CR-update and a following branch, regardless of
any CR-dependency by the branch, can each dispatch as part of a pair.) For example, the CR-update can be
paired with the instruction before it and the branch with the instruction after it, so that there is dual
issue in both cycles. If this pairing is not possible, an instruction should be inserted between the
CR-update and the branch to allow dual issue, but only when a useful instruction is available; do not
create no-ops for this purpose.
The point of this item is to explain that there is no need to separate the CR-update and the branch simply
for the sake of the CR-dependency. That is, there is no extra cycle penalty associated with the
CR-update/branch CR-dependency, beyond the “standard” penalty of the inability to dual issue, unless the
CR-update is one of the types mentioned above.
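The decision described in item 4 can be summarized as a small check. The sketch below is only an illustration: the category strings, the set name SLOW_CR_PRODUCERS, and the function reasons_to_separate are hypothetical, and the caller is assumed to already know whether the CR-update and the branch can each pair with a neighboring instruction.

```python
# CR-updating instruction types that item 4 lists as exceptions.
SLOW_CR_PRODUCERS = frozenset(
    {"multiply", "divide", "multiply_accumulate", "mtcrf", "tlbsx.", "stwcx."}
)


def reasons_to_separate(producer: str, branch_depends_on_cr: bool,
                        pairs_with_neighbours: bool) -> list:
    """Return the reasons to place an instruction between a CR-update and a branch."""
    reasons = []
    # Extra CR latency applies only to the listed producers, and only when
    # the branch actually tests the CR result.
    if branch_depends_on_cr and producer in SLOW_CR_PRODUCERS:
        reasons.append("cr-latency")
    # Both instructions use the I-Pipe, so each needs a different partner
    # (the instruction before the CR-update, the one after the branch).
    if not pairs_with_neighbours:
        reasons.append("dual-issue")
    return reasons


if __name__ == "__main__":
    print(reasons_to_separate("compare", True, True))
    print(reasons_to_separate("multiply", True, True))
    print(reasons_to_separate("compare", True, False))
```

An empty result corresponds to the main point of item 4: an ordinary CR-update followed by a dependent branch needs no separation when both already dispatch in pairs.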