IBM PPC440X5 Computer Hardware User Manual


 
User’s Manual
PPC440x5 CPU Core Preliminary
September 12, 2002
If the CR-update is a MAC or a 16 × 32 multiply, schedule 1 to 3 instructions between the CR-update
and the branch (0 or 1 instruction, depending on whether the CR-update pairs with the instruction
before or after it, plus 1 to 2 instructions that issue between the issue of the CR-update and the
issue of the branch, depending on whether the instruction(s) scheduled between the CR-update and the
branch have a single-issue or dual-issue opportunity).
Similarly, if the CR-update is a 32 × 32 multiply, divide, tlbsx., or stwcx., schedule 3 to 5 instructions
between the CR-update and the branch (two issue cycles of 2 to 4 instructions between, plus the 0 to 1
issuing with the CR-update).
Finally, if the CR-update is mtcrf, schedule 5 to 7 instructions between the CR-update and the branch
(3 cycles of issue between them).
5. Avoid the use of string/multiple instructions (with some exceptions).
The exceptions involve cache effects (separate loads/stores take more instructions than a single
string/multiple, which can cause more cache misses) and the specialized behavior of a string instruction,
which inserts the bytes into the more-significant portion of the GPR in preparation for a “string compare”
operation that determines which string is “greater” than another. If the string/multiple covers a
relatively small number of registers (or the expansion into discrete loads/stores is known not to have an
overall detrimental cache impact), and if a string is being used only for a copy operation and the size is
known, performance can be improved by using discrete loads/stores. Essentially, due to hazard
determination within the processor, string/multiples impose a couple of cycles of extra, “false” penalty
on both the front-end and the back-end. On the other hand, if this penalty is amortized over a large
number of registers (say 16 or so), the impact of the extra stalls is probably negligible.
6. Insert 10 or so instructions within a bdnz loop (loop unrolling).
7. Put 4 to 8 instructions between mtlr/mtctr and blr/bctr.
8. Put 1 to 3 instructions between 16 × 32 multiply and the use of the result.
9. Put 2 to 5 instructions between 32 × 32 multiply and the use of the result.
10. Use the “without allocate” attribute appropriately on block copy operations, such as calls to the library
memcpy function, or implicit structure copies.
11. Block move operations. If moving a block of memory using a series of load/store operations, perform the
load/store operations in the following order: L1-L2-L3-S1-S2-S3, and repeat. Having the second and third
loads between the first load and the first store fills the two-cycle load-use penalty.
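
The load/store ordering in guideline 11 can be sketched in C as follows. This is only an illustration
under stated assumptions, not code from the manual: the function name copy_words_l3s3 is hypothetical,
the buffers are assumed to be word-aligned and non-overlapping, and a compiler is free to reschedule
these accesses, so the generated assembly should be inspected to confirm that the L1-L2-L3-S1-S2-S3
order survives.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the manual): copy n 32-bit words from src
 * to dst, issuing three loads before the three matching stores.
 * Assumes src and dst do not overlap (hence the restrict qualifiers).
 */
void copy_words_l3s3(uint32_t *restrict dst, const uint32_t *restrict src,
                     size_t n)
{
    size_t i = 0;

    /* Main loop: one L1-L2-L3-S1-S2-S3 group per iteration. */
    for (; i + 3 <= n; i += 3) {
        uint32_t w1 = src[i];       /* L1 */
        uint32_t w2 = src[i + 1];   /* L2 */
        uint32_t w3 = src[i + 2];   /* L3 */
        dst[i]     = w1;            /* S1: two loads separate it from L1 */
        dst[i + 1] = w2;            /* S2 */
        dst[i + 2] = w3;            /* S3 */
    }

    /* Tail: copy the remaining 0 to 2 words one at a time. */
    for (; i < n; i++) {
        dst[i] = src[i];
    }
}
```

Writing the loads into separate temporaries before any store expresses the intended order at the source
level; hand-written assembly may still be needed where the exact instruction sequence matters.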