IA-32 Intel® Architecture Optimization
Microarchitecture Notes
Trace Cache Events
The trace cache is not directly comparable to an instruction cache. The
two are organized very differently. For example, a trace can span many
lines' worth of instruction-cache data. As with most microarchitectural
elements, trace cache performance is only an issue if something else is
not a bigger bottleneck. If an application is bus-bandwidth bound, the
rate at which the front end delivers uops to the core may be
irrelevant. When front-end bandwidth is an issue, the trace cache, in
deliver mode, can issue uops to the core faster than either the decoder
(build mode) or the microcode store (the MS ROM). Thus the percentage of
time spent in trace cache deliver mode or, similarly, the percentage of
all uops (bogus and non-bogus) that come from the trace cache can be a
useful metric for determining front-end performance.
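As an illustration of the two ratios described above, the following is a minimal sketch in Python. The function and parameter names are hypothetical and do not correspond to actual event names; it assumes raw counts for cycles in deliver mode, total cycles, and uops supplied by each front-end source have already been collected with a sampling tool.

```python
def deliver_mode_percent(deliver_cycles, total_cycles):
    """Percent of clock cycles spent in trace cache deliver mode.

    Both arguments are raw counts from a profiling run (hypothetical names).
    """
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return 100.0 * deliver_cycles / total_cycles


def tc_uop_percent(uops_from_tc, uops_from_decoder, uops_from_ms_rom):
    """Percent of all uops (bogus and non-bogus) supplied by the trace cache.

    The three sources are the trace cache (deliver mode), the decoder
    (build mode), and the microcode store (MS ROM).
    """
    total = uops_from_tc + uops_from_decoder + uops_from_ms_rom
    if total <= 0:
        raise ValueError("no uops were counted")
    return 100.0 * uops_from_tc / total


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(deliver_mode_percent(750, 1000))
    print(tc_uop_percent(800, 150, 50))
```

Either ratio gives a rough indication of how often the front end runs at its highest bandwidth. As noted above, the result matters only when the front end, rather than the bus or another resource, is the limiting factor.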
The metric that is most analogous to an instruction cache miss is a trace
cache miss. An unsuccessful lookup of the trace cache (colloquially, a
miss) is not interesting, per se, if we are in build mode and don't find a
trace available; we just keep building traces. The only “penalty” in that
case is that we continue to have a lower front-end bandwidth. The trace
cache miss metric currently in use does not count every TC miss, only
those incurred while the machine is already in deliver mode, that is,
when a 15-20 cycle penalty is paid. Again, care must be exercised:
a small average number of TC misses per instruction does not indicate
good front-end performance if the percentage of time in deliver mode is
also low.
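The caution above can be made concrete with a small sketch that reports TC misses per instruction together with the fraction of time in deliver mode. All names here are hypothetical, and the 0.5 cutoff for a "low" deliver-mode fraction is an arbitrary assumption chosen for illustration, not a documented threshold. The 15-20 cycle range is the per-miss penalty quoted in the text.

```python
def tc_miss_summary(deliver_mode_misses, instructions_retired,
                    deliver_cycles, total_cycles,
                    min_deliver_fraction=0.5):
    """Summarize deliver-mode TC misses alongside deliver-mode residency.

    min_deliver_fraction is an assumed, illustrative cutoff below which
    the miss rate should not be read as a sign of good front-end behavior.
    """
    if instructions_retired <= 0 or total_cycles <= 0:
        raise ValueError("counts must be positive")
    deliver_fraction = deliver_cycles / total_cycles
    return {
        "misses_per_instruction": deliver_mode_misses / instructions_retired,
        "deliver_fraction": deliver_fraction,
        # Each deliver-mode miss costs roughly 15 to 20 cycles.
        "penalty_cycles_range": (15 * deliver_mode_misses,
                                 20 * deliver_mode_misses),
        # A low miss rate means little if deliver mode is rarely entered.
        "miss_rate_is_meaningful": deliver_fraction >= min_deliver_fraction,
    }


if __name__ == "__main__":
    # Made-up counts: few misses, but only 20% of cycles in deliver mode.
    print(tc_miss_summary(10, 10_000, 200, 1_000))
```

With the sample counts the miss rate looks small, yet the summary flags it as not meaningful because the machine spends most of its time outside deliver mode.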
Bus and Memory Metrics
In order to correctly interpret the observed counts of performance
metrics related to bus events, it is helpful to understand transaction
sizes, when entries are allocated in different queues, and how sectoring
and prefetching affect counts.