IA-32 Intel® Architecture Optimization
duration of read traffic compared to the duration of the workload is
significantly less than unity, it indicates the dominant data locality of the
workload is cache access traffic.
Average Bus Queue Depth: Using the default configuration of the
processor event “Bus Reads Underway from the Processor”², one can
measure the weighted cycles of bus read traffic, where each cycle is
weighted by the depth of the queue of bus reads. Thus, one can derive
the average queue depth from the ratio of weighted cycles to the
effective duration of bus read traffic. Similarly, one can use other bus
events to measure the average bus queue depth for other types of bus
transactions.
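The ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not a tool interface: the function name, its parameters, and the sample counter values are hypothetical, and it assumes the weighted cycle count and the effective duration of bus read traffic (in cycles) have already been collected by a performance monitoring tool.

```python
def average_bus_queue_depth(weighted_cycles: int, active_cycles: int) -> float:
    """Return the average depth of the bus read queue.

    weighted_cycles: count from the bus-reads-underway event in its default
        configuration, where each cycle is weighted by the queue depth.
    active_cycles: effective duration of bus read traffic, in cycles.
    Both names are assumptions made for this sketch.
    """
    if active_cycles <= 0:
        raise ValueError("active_cycles must be positive")
    return weighted_cycles / active_cycles


# Hypothetical counter readings, chosen only to illustrate the arithmetic.
weighted = 1_800_000
active = 1_200_000
print(f"average read queue depth: {average_bus_queue_depth(weighted, active):.2f}")
# prints "average read queue depth: 1.50"
```

The same calculation applies to other bus transaction types, provided the two counts come from the matching bus event.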
Using the average queue depth of read traffic, one can characterize the
degree of bus read traffic that originates from the cache miss pattern of
the workload and is sensitive to memory latency. This can be done by
checking whether the average bus queue depth of read traffic, measured
with hardware prefetch disabled³, is close to unity. When the average
queue depth is very close to unity, it implies the workload has a data
access pattern with very poor data parallelism and will fully expose
memory latency whenever cache misses occur.
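The check described above might be expressed as follows. This is a sketch under stated assumptions: the function name is hypothetical, the 0.1 tolerance is an arbitrary illustrative threshold for "close to unity" rather than a value from this manual, and the input is assumed to be an average read queue depth measured with hardware prefetch disabled.

```python
def exposes_memory_latency(avg_depth_prefetch_off: float,
                           tolerance: float = 0.1) -> bool:
    """Return True when the average read queue depth is close to unity.

    A depth near 1 means bus reads rarely overlap, so each cache miss
    pays the full memory latency. The tolerance is an assumed cutoff.
    """
    if avg_depth_prefetch_off <= 0:
        raise ValueError("average queue depth must be positive")
    return avg_depth_prefetch_off <= 1.0 + tolerance


# Hypothetical measurements for two workloads.
for name, depth in [("pointer chasing", 1.04), ("streaming copy", 2.70)]:
    verdict = "latency exposed" if exposes_memory_latency(depth) else "misses overlap"
    print(f"{name}: depth {depth:.2f} -> {verdict}")
```

A suitable tolerance depends on the platform and on measurement noise, so it is left as a parameter.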
Large Stride Inefficiency: Large-stride data accesses are much less
efficient than smaller-stride data accesses, because large-stride accesses
incur more frequent DTLB misses during address translation. The
penalty of large-stride accesses applies to cache traffic as well as memory
traffic. In terms of the quantitative impact on data access latency, large
2. Note that by default Pentium 4 processor events dealing with bus traffic, such as Bus Reads
Underway from the Processor, implicitly combine the interaction of two aspects: (a) the
cache miss pattern of the last-level cache that results from the data reference pattern of the
workload, where each cache read miss is expected to require a bus read request to fetch data (a
cache line) from the memory sub-system; (b) when hardware prefetch is
enabled, a cache read miss may trigger the hardware prefetch to queue up additional bus
read requests to fetch additional cache lines from the memory sub-system.
3. Hardware prefetch mechanisms can be controlled on demand using the model-specific
register IA32_MISC_ENABLES. Appendix B of the IA-32 Intel® Architecture
Software Developer’s Manual, Volume 3B, describes the specific bit locations of the
IA32_MISC_ENABLES MSR.