Intel IA-32 Computer Accessories User Manual


 
Using Performance Monitoring Events B
Usage Notes on Bus Activities
A number of performance metrics in Table B-1 are based on
IOQ_active_entries and BSQ_active_entries. The next three paragraphs
provide information on the various bus transaction underway metrics. These
metrics nominally measure the end-to-end latency of transactions
entering the BSQ; i.e., the aggregate sum of the allocation-to-
deallocation durations for the BSQ entries used for all individual
transactions in the processor. They can be divided by the corresponding
number-of-transactions metrics (i.e., those that measure allocations) to
approximate an average latency per transaction. However, that
approximation can be significantly higher than the number of cycles it
takes to get the first chunk of data for the demand fetch (e.g., load),
because the entire transaction must be completed before deallocation.
That latency includes deallocation overheads, and the time to get the
other half of the 128-byte line, which is called an adjacent-sector
prefetch. Since adjacent-sector prefetches have lower priority than
demand fetches, there is a high probability on a heavily utilized system
that the adjacent-sector prefetch will have to wait until the next bus
arbitration cycle from that processor. Note also that on current
implementations, the granularities at which BSQ_allocation and
BSQ_active_entries count can differ, leading to a possible 2-times
overcounting of latencies for non-partial programmatic loads.
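
The division described above can be sketched as follows. This is a minimal
illustration, not a tool interface: the function names, parameter names, and
sample counter values are hypothetical, and the lower bound simply applies the
possible 2-times overcounting mentioned above as an assumed worst case.

```python
def average_bsq_latency(underway_cycles, allocations):
    """Approximate average latency, in cycles, per BSQ transaction.

    underway_cycles: value of a bus transaction underway metric
                     (aggregate allocation-to-deallocation cycles).
    allocations:     value of the corresponding number-of-transactions
                     metric (allocations).
    Both names are hypothetical; they stand for raw counter readings.
    """
    if allocations <= 0:
        raise ValueError("allocations must be positive")
    return underway_cycles / allocations


def latency_bounds(underway_cycles, allocations, overcount_factor=2.0):
    """Return (low, high) estimates of the average latency.

    The high value is the plain quotient. The low value assumes the
    underway count was inflated by overcount_factor (2.0 is the assumed
    worst case taken from the note on differing granularities).
    """
    high = average_bsq_latency(underway_cycles, allocations)
    return high / overcount_factor, high


# Hypothetical counter readings, for illustration only.
print(average_bsq_latency(1_200_000, 4_000))  # 300.0
print(latency_bounds(1_200_000, 4_000))       # (150.0, 300.0)
```

Either value is an upper-biased estimate of the time to the first chunk of
data, since it still includes deallocation overhead and any adjacent-sector
prefetch time.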
Users of the bus transaction underway metrics would be best served by
employing them for relative comparisons across BSQ latencies of all
transactions. Users who want to do cycle-by-cycle or type-by-type
analysis should be aware that this event is known to be inaccurate for
the “UC Reads Chunk Underway” and “Write WC partial underway”
metrics. Relative changes to the average of all BSQ latencies should be
viewed as an indication that overall memory performance has changed.
That memory performance change may or may not be reflected in the
measured FSB latencies.
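
A relative comparison of this kind might be computed as in the sketch below.
The function name and the sample averages are hypothetical; the inputs are
assumed to be average BSQ latencies (underway cycles divided by allocations)
taken from two runs of the same workload.

```python
def relative_latency_change(baseline_avg, new_avg):
    """Fractional change in average BSQ latency between two runs.

    A positive result means the average latency grew relative to the
    baseline; a negative result means it shrank. The result is only an
    indicator of a change in overall memory performance, not a measure
    of FSB latency.
    """
    if baseline_avg <= 0:
        raise ValueError("baseline_avg must be positive")
    return (new_avg - baseline_avg) / baseline_avg


# Hypothetical averages in cycles, for illustration only.
change = relative_latency_change(300.0, 330.0)
print(f"{change:+.1%}")  # +10.0%
```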
Also note that for Pentium 4 and Intel Xeon Processor implementations
with an integrated 3rd-level cache, BSQ entries are allocated for all
2nd-level writebacks (replaced lines), not just those that become bus