Intel IA-32 Computer Accessories User Manual


 
IA-32 Instruction Latency and Throughput C
While several items on the above list involve selecting the right
instruction, this appendix focuses on the following issues. These are
listed in an expected priority order, though which item contributes most
to performance will vary by application.
• Maximize the flow of μops into the execution core. IA-32
  instructions that consist of more than four μops require additional
  steps from microcode ROM. Instructions with these longer μop
  flows incur a delay in the front end and reduce the supply of μops to
  the execution core. In Pentium 4 and Intel Xeon processors,
  transfers to microcode ROM often reduce how efficiently μops can
  be packed into the trace cache. Where possible, select
  instructions with four or fewer μops. For example, a 32-bit
  integer multiply with a memory operand fits in the trace cache
  without going to microcode, while a 16-bit integer multiply with a
  memory operand does not.
• Avoid resource conflicts. Interleaving instructions so that they don’t
  compete for the same port or execution unit can increase
  throughput. For example, PADDQ and PMULUDQ each have
  a throughput of one issue per two clock cycles. When alternated,
  they can achieve an effective throughput of one instruction per cycle
  because they use the same port but different execution units.
  Selecting instructions with fast throughput also helps to preserve
  issue port bandwidth and hide latency, which allows for higher
  software performance.
• Minimize the latency of dependency chains that are on the critical
  path. For example, an operation to shift left by two bits executes
  faster when encoded as two adds than when it is encoded as a shift.
  If latency is not an issue, the shift results in a denser byte encoding.
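
The last two guidelines above can be sketched in C. This is an illustrative
sketch under stated assumptions, not code from this manual. The function
names shl2_as_adds, shl2_as_shift and add_mul_interleaved are hypothetical.
The SSE2 intrinsics _mm_add_epi64 and _mm_mul_epu32 are assumed to compile
to PADDQ and PMULUDQ respectively, and a scalar fallback is used when SSE2
is not available. An optimizing compiler is free to choose a different
encoding or instruction order, so hand-written assembly may be needed where
the exact sequence matters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Shift left by two bits written as two dependent adds:
   x + x = 2x, then 2x + 2x = 4x. Unsigned arithmetic wraps
   exactly as the shift does. */
uint32_t shl2_as_adds(uint32_t x)
{
    x += x;
    x += x;
    return x;
}

/* The same result written as a single shift (the denser encoding). */
uint32_t shl2_as_shift(uint32_t x)
{
    return x << 2;
}

/* Alternates a 64-bit add with a 32x32->64-bit unsigned multiply on
   independent data, two 64-bit elements per iteration:
     sums[i]  += a[i]
     prods[i]  = low32(b[i]) * low32(c[i])
   n is the element count and must be even (hypothetical precondition). */
void add_mul_interleaved(uint64_t *sums, const uint64_t *a,
                         uint64_t *prods, const uint64_t *b,
                         const uint64_t *c, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
#if defined(__SSE2__)
        __m128i vs = _mm_loadu_si128((const __m128i *)(sums + i));
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i vc = _mm_loadu_si128((const __m128i *)(c + i));
        vs = _mm_add_epi64(vs, va);           /* assumed to emit PADDQ */
        __m128i vp = _mm_mul_epu32(vb, vc);   /* assumed to emit PMULUDQ;
                                                 does not depend on the add */
        _mm_storeu_si128((__m128i *)(sums + i), vs);
        _mm_storeu_si128((__m128i *)(prods + i), vp);
#else
        /* Scalar fallback with the same per-element semantics. */
        for (size_t k = i; k < i + 2; ++k) {
            sums[k] += a[k];
            prods[k] = (uint64_t)(uint32_t)b[k] * (uint32_t)c[k];
        }
#endif
    }
}
```

For instance, shl2_as_adds(5) and shl2_as_shift(5) both return 20. In
add_mul_interleaved, the add and the multiply work on separate data, so
neither has to wait for the other's result.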
In addition to the general and specific rules, coding guidelines and the
instruction data provided in this manual, you can take advantage of the
software performance analysis and tuning toolset available at
http://developer.intel.com/software/products/index.htm. The tools
include the VTune Performance Analyzer, with its performance-monitoring
capabilities.