IA-32 Intel® Architecture Optimization
Minimizing Bus Latency
The system bus on Intel Xeon and Pentium 4 processors provides up to
6.4 GB/sec of throughput at a 200 MHz scalable bus clock
rate. (See the MSR_EBC_FREQUENCY_ID register.) The peak bus
bandwidth is higher still at higher bus clock rates.
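The 6.4 GB/sec figure can be reproduced with simple arithmetic. The sketch below assumes a quad-pumped bus (four data transfers per bus clock) and a 64-bit (8-byte) data path; those two parameters are stated here as assumptions rather than taken from the text above, and the function name peak_bus_bandwidth is hypothetical.

```c
/* Peak bus bandwidth in bytes per second.
 *
 * bus_clock_hz        scalable bus clock rate, e.g. 200e6 for 200 MHz
 * transfers_per_clock data transfers per bus clock (assumed: 4, quad-pumped)
 * bus_width_bytes     data bus width in bytes (assumed: 8, i.e. 64 bits)
 */
double peak_bus_bandwidth(double bus_clock_hz,
                          unsigned transfers_per_clock,
                          unsigned bus_width_bytes)
{
    return bus_clock_hz * (double)transfers_per_clock * (double)bus_width_bytes;
}
```

Under these assumptions, peak_bus_bandwidth(200e6, 4, 8) evaluates to 6.4e9 bytes per second, and the result scales linearly with the bus clock rate.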
Each bus transaction includes the overhead of making requests and
arbitrations. The average latency of bus read and bus write transactions
is longer if reads and writes alternate. Segmenting reads and writes
into phases can reduce the average latency of bus transactions, because
it reduces the number of successive transactions in which a read
follows a write or a write follows a read.
User/Source Coding Rule 7. (M impact, ML generality) If there is a blend of
reads and writes on the bus, changing the code to separate these bus
transactions into read phases and write phases can help performance.
Note, however, that the order of read and write operations on the bus is
not the same as the order in which they appear in the program.
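One way to apply this rule can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function names scale_interleaved and scale_phased and the phase size PHASE_LEN are hypothetical, and the sketch assumes the temporary buffer stays resident in cache. Because bus order differs from program order, any benefit has to be confirmed by measurement on the target system.

```c
#include <stddef.h>

#define PHASE_LEN 256  /* hypothetical phase size in elements; tune by measurement */

/* Baseline: every iteration loads one element and stores one element,
 * so loads and stores alternate in program order. */
void scale_interleaved(float *dst, const float *src, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Phased variant: process the data in blocks, reading a whole block
 * before writing any of it back. */
void scale_phased(float *dst, const float *src, size_t n, float k)
{
    float tmp[PHASE_LEN];  /* small buffer, assumed to stay in cache */

    for (size_t base = 0; base < n; base += PHASE_LEN) {
        size_t len = (n - base < PHASE_LEN) ? n - base : PHASE_LEN;

        /* Read phase: loads come from src; stores go only to tmp. */
        for (size_t i = 0; i < len; i++)
            tmp[i] = src[base + i] * k;

        /* Write phase: stores go to dst; loads come only from tmp. */
        for (size_t i = 0; i < len; i++)
            dst[base + i] = tmp[i];
    }
}
```

Both functions produce the same output. The phased version trades an extra pass through a small buffer for longer runs of same-type memory traffic in program order.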
Bus latency of fetching a cache line of data can vary as a function of the
access stride of data references. In general, bus latency increases as
the stride of successive cache misses increases.
Independently, bus latency also increases with
bus queue depth (the number of outstanding bus requests of a given
transaction type). The combination of these two trends can be highly
non-linear: in large-stride, bandwidth-sensitive
situations, the effective throughput of the bus system for
data-parallel accesses can be significantly less than the effective
throughput in small-stride, bandwidth-sensitive situations.
To minimize the per-access cost of memory traffic or amortize raw
memory latency effectively, software should control its cache miss
pattern to favor a higher concentration of smaller-stride cache misses.
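A common way to shift the miss pattern toward smaller strides is to choose the loop order that matches the memory layout. The sketch below assumes a row-major matrix stored in a flat array; the function names sum_column_order and sum_row_order are hypothetical and serve only to contrast the two access patterns.

```c
#include <stddef.h>

/* Large stride: consecutive accesses are cols * sizeof(int) bytes apart.
 * Once a row is at least one cache line long, each access in the inner
 * loop touches a different cache line. */
long sum_column_order(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];
    return total;
}

/* Small stride: consecutive accesses are sizeof(int) bytes apart, so
 * successive cache misses fall on adjacent cache lines and every
 * fetched line is fully used before the next one is needed. */
long sum_row_order(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}
```

Both functions return the same sum; only the order of memory accesses, and therefore the stride between successive cache misses, differs.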