Intel IA-32 Computer Accessories User Manual


 
IA-32 Intel® Architecture Optimization
The fetch and decode unit includes a hardware instruction prefetcher
and three decoders that operate in parallel. It also provides a 32 KB
instruction cache that stores undecoded binary instructions.
The instruction prefetcher fetches instructions in a linear fashion from
memory if the target instructions are not already in the instruction cache.
The prefetcher is designed to fetch efficiently from an aligned 16-byte
block. If the modulo 16 remainder of a branch target address is 14, only
two useful instruction bytes are fetched in the first cycle. The rest of the
instruction bytes are fetched in subsequent cycles.
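The alignment arithmetic above can be sketched in a few lines of C. This is only an illustration of the modulo-16 reasoning in the text; the helper name first_fetch_useful_bytes and the constant FETCH_BLOCK_BYTES are hypothetical and not part of any Intel interface.

```c
#include <stdint.h>

/* Size of the aligned block the prefetcher fetches from, per the text. */
#define FETCH_BLOCK_BYTES 16u

/* Hypothetical helper: number of useful instruction bytes delivered by the
 * first fetch when execution branches to `target`. Bytes that precede the
 * target inside the same aligned 16-byte block are fetched but not useful. */
static unsigned first_fetch_useful_bytes(uintptr_t target)
{
    unsigned offset = (unsigned)(target % FETCH_BLOCK_BYTES);
    return FETCH_BLOCK_BYTES - offset;
}
```

For a branch target whose address is 14 modulo 16, the helper returns 2, matching the case described above; a 16-byte-aligned target returns the full 16. This is the reason aligning frequently executed branch targets can help the front end.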
The three decoders decode IA-32 instructions and break them down into
micro-ops (µops). In each clock cycle, the first decoder is capable of
decoding an instruction with four or fewer µops. The remaining two
decoders each decode a one-µop instruction in each clock cycle.
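The effect of this decoder arrangement on instruction ordering can be estimated with a small model. The sketch below is a simplification under stated assumptions, not a description of the actual hardware: it assumes instructions are decoded strictly in program order, that the second and third decoders stall whenever the next instruction has more than one µop, and that an instruction with more than four µops (a case the text does not describe) occupies a cycle by itself. The function name decode_cycles is hypothetical.

```c
#include <stddef.h>

/* Simplified model of the three-decoder front end (an assumption-laden
 * sketch). uops[i] is the µop count of the i-th instruction in program
 * order. Returns the estimated number of decode cycles. */
static unsigned decode_cycles(const unsigned *uops, size_t n)
{
    unsigned cycles = 0;
    size_t i = 0;

    while (i < n) {
        cycles++;

        /* Assumption: an instruction with more than four µops is not
         * covered by the text; model it as taking a whole cycle alone. */
        if (uops[i] > 4) {
            i++;
            continue;
        }

        /* First decoder: one instruction with four or fewer µops. */
        i++;

        /* Remaining two decoders: one single-µop instruction each.
         * A multi-µop instruction must wait for the first decoder. */
        for (int d = 1; d <= 2 && i < n && uops[i] == 1; d++)
            i++;
    }
    return cycles;
}
```

Under this model, the sequence of µop counts {4, 1, 1, 4, 1, 1} decodes in two cycles, while {2, 2, 2} needs three, because each two-µop instruction must wait for the first decoder. Ordering instructions so that a multi-µop instruction is followed by two single-µop instructions therefore tends to keep all three decoders busy.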
The front end can issue multiple µops per cycle, in original program
order, to the out-of-order core.
The Intel Pentium M processor incorporates sophisticated branch
prediction hardware to support the out-of-order core. The branch
prediction hardware includes dynamic prediction and branch target
buffers.
The Intel Pentium M processor has enhanced dynamic branch prediction
hardware. Branch target buffers (BTB) predict the direction and target
of branches based on an instruction’s address.
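The general idea of predicting direction and target from an instruction's address can be illustrated with a toy model. The sketch below is a textbook direct-mapped table with 2-bit saturating counters; it is not the Pentium M organization, which the text does not detail. The table size, the type names toy_btb and btb_entry, and the functions btb_predict and btb_update are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 16u   /* hypothetical size, chosen for illustration */

typedef struct {
    bool     valid;
    uint32_t tag;      /* full branch address, used to detect aliasing */
    uint32_t target;   /* most recent taken target */
    uint8_t  counter;  /* 2-bit saturating counter: 0-1 not taken, 2-3 taken */
} btb_entry;

typedef struct {
    btb_entry entries[BTB_ENTRIES];
} toy_btb;

/* Returns true and sets *target if the toy table predicts "taken" for the
 * branch at addr; returns false on a miss or a not-taken prediction. */
static bool btb_predict(const toy_btb *btb, uint32_t addr, uint32_t *target)
{
    const btb_entry *e = &btb->entries[addr % BTB_ENTRIES];
    if (e->valid && e->tag == addr && e->counter >= 2) {
        *target = e->target;
        return true;
    }
    return false;
}

/* Trains the table with the resolved outcome of the branch at addr. */
static void btb_update(toy_btb *btb, uint32_t addr, bool taken, uint32_t target)
{
    btb_entry *e = &btb->entries[addr % BTB_ENTRIES];

    if (!e->valid || e->tag != addr) {
        if (!taken)
            return;            /* allocate entries only for taken branches */
        e->valid = true;
        e->tag = addr;
        e->target = target;
        e->counter = 2;        /* start in the weakly-taken state */
        return;
    }
    if (taken) {
        if (e->counter < 3)
            e->counter++;
        e->target = target;
    } else if (e->counter > 0) {
        e->counter--;
    }
}
```

In this toy model a branch is predicted taken only after it has been seen taken at least once, and a single not-taken outcome from the weakly-taken state flips the prediction back. Real predictors combine several such structures; the sketch only shows how an address can index both a direction history and a stored target.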
The Pentium M processor includes two techniques to reduce the
execution time of certain operations:
ESP Folding. ESP folding eliminates the ESP manipulation
micro-operations in stack-related instructions such as PUSH, POP,
CALL and RET. It increases decode, rename, and retirement
throughput. ESP folding also increases execution bandwidth by
eliminating µops that would otherwise have required execution resources.