Intel IA-32 Computer Accessories User Manual


 
General Optimization Guidelines 2
Because micro-ops are delivered from the trace cache in the common
case, decoding rules are not required.
Scheduling Rules for the Pentium M Processor Decoder
The Pentium M processor has three decoders, but the decoding rules to
supply micro-ops at high bandwidth are less stringent than those of the
Pentium III processor. This provides an opportunity to build a front-end
tracker in the compiler and try to schedule instructions correctly. The
decoder limitations are as follows:
• The first decoder is capable of decoding one macroinstruction made
up of four or fewer micro-ops in each clock cycle. It can handle any
number of bytes up to the maximum of 15. Instructions with multiple
prefixes require additional cycles.
• The two additional decoders can each decode one macroinstruction
per clock cycle (assuming the instruction is one micro-op and up to
seven bytes in length).
• Instructions composed of more than four micro-ops take multiple
cycles to decode.
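The decoder limitations above can be turned into a rough cycle estimator of the kind a compiler front-end tracker might use. The following is a minimal sketch, not the processor's actual decode logic. The type insn_t and the function decode_cycles are hypothetical names introduced for illustration. The sketch assumes instructions are decoded strictly in program order, ignores the extra cycles caused by multiple prefixes, and assumes that an instruction of more than four micro-ops occupies the decoders alone for ceil(uops / 4) cycles, which is a simplification not stated in the text.

```c
#include <stddef.h>

/* Hypothetical description of one macroinstruction for this sketch. */
typedef struct {
    int uops;   /* number of micro-ops the instruction decodes into */
    int bytes;  /* encoded length in bytes (1..15) */
} insn_t;

/* Nonzero if the instruction fits one of the two additional decoders:
   exactly one micro-op and at most seven bytes long. */
static int is_simple(const insn_t *in)
{
    return in->uops == 1 && in->bytes <= 7;
}

/* Estimate the decode cycles for n instructions in program order.
   Prefix penalties are not modeled. */
int decode_cycles(const insn_t *code, size_t n)
{
    int cycles = 0;
    size_t i = 0;

    while (i < n) {
        if (code[i].uops > 4) {
            /* Assumption: a complex instruction decodes alone and
               takes ceil(uops / 4) cycles. */
            cycles += (code[i].uops + 3) / 4;
            i++;
            continue;
        }

        /* The first decoder takes the next instruction
           (up to four micro-ops, any legal length). */
        i++;

        /* The two additional decoders take following instructions
           only while they are simple; otherwise the cycle ends. */
        for (int d = 0; d < 2 && i < n && is_simple(&code[i]); d++)
            i++;

        cycles++;
    }
    return cycles;
}
```

Under this model, six instructions ordered as a two-micro-op instruction followed by two simple ones, repeated twice, decode in 2 cycles. The same six instructions with a simple instruction placed first take 3 cycles, because the two-micro-op instruction must wait for the first decoder. This is the kind of ordering effect a scheduler can exploit.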
Assembly/Compiler Coding Rule 63. (M impact, M generality) Avoid
putting explicit references to ESP in a sequence of stack operations (POP, PUSH,
CALL, RET).
Vectorization
This section provides a brief summary of optimization issues related to
vectorization. Chapters 3, 4 and 5 provide greater detail.
Vectorization is a program transformation which allows special
hardware to perform the same operation on multiple data elements at the
same time. Successive processor generations have provided vector
support through the MMX technology, Streaming SIMD Extensions
technology and Streaming SIMD Extensions 2. Vectorization is a
special case of SIMD, a term defined in Flynn’s architecture taxonomy
to denote a Single Instruction stream capable of operating on Multiple