General Optimization Guidelines 2
The VTune Performance Analyzer also enables engineers to use these
counters to measure a number of workload characteristics, including:
• retirement throughput of instruction execution as an indication of
the degree of extractable instruction-level parallelism in the
workload,
• data traffic locality as an indication of the stress point of the cache
and memory hierarchy,
• data traffic parallelism as an indication of how effectively data
access latency is amortized.
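As a rough illustration of how such characteristics are derived from raw
counter values, the sketch below computes retirement throughput and one
possible data traffic locality measure (cache misses per thousand retired
instructions). The function names, the choice of locality measure, and the
sample counts are assumptions made for this example; they are not VTune
analyzer event names or measured data.

```python
def retirement_throughput(instructions_retired, clock_cycles):
    """Return instructions retired per clock cycle."""
    if clock_cycles <= 0:
        raise ValueError("clock_cycles must be positive")
    return instructions_retired / clock_cycles


def misses_per_thousand_instructions(cache_misses, instructions_retired):
    """Return cache misses per 1000 retired instructions."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return 1000.0 * cache_misses / instructions_retired


# Hypothetical counter readings for one sampling interval (illustrative only).
sample = {
    "instructions_retired": 3_000_000,
    "clock_cycles": 2_000_000,
    "cache_misses": 6_000,
}

ipc = retirement_throughput(sample["instructions_retired"],
                            sample["clock_cycles"])
mpki = misses_per_thousand_instructions(sample["cache_misses"],
                                        sample["instructions_retired"])

print(f"retirement throughput: {ipc:.2f} instructions per cycle")
print(f"cache misses per 1000 instructions: {mpki:.2f}")
```

A low retirement throughput together with a high miss count per instruction
would suggest looking at the memory hierarchy first, but any such reading
depends on the workload and on which events were actually collected.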
Note that improving performance in one part of the machine does not
necessarily bring significant gains to overall performance. Optimizing
for one particular metric in isolation can even degrade overall
performance.
Where appropriate, coding recommendations in this chapter include
descriptions of the VTune analyzer events that provide measurable data
on the performance gain achieved by following the recommendations.
Refer to the VTune analyzer online help for instructions on how to use
the tool.
VTune analyzer events include the Pentium 4 processor performance
metrics described in Appendix B, “Using Performance Monitoring
Events.”
Processor Perspectives
The majority of the coding recommendations for the Pentium 4 and
Intel Xeon processors also apply to Pentium M, Intel Core Solo, and
Intel Core Duo processors. However, there are situations where a
recommendation may benefit one microarchitecture more than another.
The most important of these are:
• Instruction decode throughput is important for the Pentium M, Intel
Core Solo, and Intel Core Duo processors but less important for the
Pentium 4 and Intel Xeon processors. Generating code with the
4-1-1 template (an instruction with four μops followed by two
instructions with one μop each) helps the Pentium M processor.