AMD 250 Computer Hardware User Manual


 
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
Chapter 4 Instruction-Decoding Optimizations
The optimizations in this chapter help maximize the number of instructions that the processor can decode in each cycle.
The instruction fetcher in both the AMD Athlon™ 64 and AMD Opteron™ processors reads 16-byte packets from the L1 instruction cache. These packets are 16-byte aligned. The fetched instruction bytes are then merged into a 32-byte pick window, from which the in-order front-end engine selects up to three AMD64 instructions for decode on each cycle.
This chapter covers the following topics:
DirectPath Instructions
Load-Execute Instructions
    Load-Execute Integer Instructions
    Load-Execute Floating-Point Instructions with Floating-Point Operands
    Load-Execute Floating-Point Instructions with Integer Operands
Branch Targets in Program Hot Spots
32/64-Bit vs. 16-Bit Forms of the LEA Instruction
Short Instruction Encodings
Partial-Register Reads and Writes
Using LEAVE for Function Epilogues
Alternatives to SHLD Instruction
8-Bit Sign-Extended Immediate Values
8-Bit Sign-Extended Displacements
Code Padding with Operand-Size Override and NOP