Intel IA-32 Computer Accessories User Manual


 
Optimizing Cache Usage 6
The choice of single-pass or multi-pass execution has several
performance implications. For instance, in a multi-pass pipeline, stages
that are limited by bandwidth (either input or output) contribute more
of that limitation to overall execution time. In contrast, in
a single-pass approach, bandwidth limitations can be amortized
across the other, computation-intensive stages. The choice of
prefetch hints also depends on whether a single-pass
or multi-pass approach is used (see “Hardware Prefetching of Data”).
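The contrast can be sketched in C. This is an illustration only, not code from this manual. It assumes that single-pass means each element goes through every stage before the next element is touched, and that multi-pass means each stage runs over a strip of elements before the next stage begins. The stage functions, the STRIP size, and the function names are hypothetical.

```c
#include <stddef.h>

#define STRIP 256  /* hypothetical strip size, assumed small enough to stay in cache */

/* Two hypothetical pipeline stages. */
static float stage_a(float x) { return x * 2.0f; }
static float stage_b(float x) { return x + 1.0f; }

/* Single-pass: each element passes through the whole pipeline at once,
 * so the load, both computations, and the store are interleaved. */
void single_pass(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = stage_b(stage_a(in[i]));
}

/* Multi-pass: each stage runs over a whole strip before the next stage
 * starts, so an input-bound or output-bound stage stands on its own. */
void multi_pass(const float *in, float *out, size_t n)
{
    float tmp[STRIP];  /* intermediate results for one strip */
    for (size_t base = 0; base < n; base += STRIP) {
        size_t len = (n - base < STRIP) ? n - base : STRIP;
        for (size_t i = 0; i < len; i++)   /* stage A reads the input stream */
            tmp[i] = stage_a(in[base + i]);
        for (size_t i = 0; i < len; i++)   /* stage B writes the output stream */
            out[base + i] = stage_b(tmp[i]);
    }
}
```

Both functions produce the same results. They differ only in how memory traffic is interleaved with computation, which is the trade-off described above.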
Memory Optimization using Non-Temporal Stores
The non-temporal stores can also be used to manage data retention in
the cache. Uses for the non-temporal stores include:
• To combine many writes without disturbing the cache hierarchy.
• To manage which data structures remain in the cache and which are
  transient.
Detailed implementations of these usage models are covered in the
following sections.
Non-temporal Stores and Software Write-Combining
Use non-temporal stores when the data to be stored is:
• write-once (non-temporal)
• so large that it would cause cache thrashing
Non-temporal stores do not invoke a cache line allocation, which means
they are not write-allocate. As a result, caches are not polluted and no
dirty writeback is generated to compete with useful data bandwidth.
Without non-temporal stores, bus bandwidth suffers once the caches
start to thrash, because of the resulting dirty writebacks.
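As a minimal sketch of a write-once fill using a non-temporal store, the C fragment below uses the SSE intrinsics _mm_stream_ps (MOVNTPS) and _mm_sfence from xmmintrin.h. The function name fill_streaming is hypothetical. The code assumes that the destination is 16-byte aligned and that the element count is a multiple of four. It is an illustration, not a tuned implementation.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_set1_ps, _mm_stream_ps, _mm_sfence */

/* Fill a large, write-once buffer with non-temporal stores.
 * Assumptions: dst is 16-byte aligned; count is a multiple of 4. */
void fill_streaming(float *dst, size_t count, float value)
{
    __m128 v = _mm_set1_ps(value);      /* broadcast value to four lanes */
    for (size_t i = 0; i < count; i += 4)
        _mm_stream_ps(dst + i, v);      /* MOVNTPS: store with non-temporal hint */
    _mm_sfence();  /* order the weakly-ordered stores ahead of any later stores */
}
```

A caller could obtain a suitably aligned buffer with, for example, _mm_malloc(count * sizeof(float), 16), and release it with _mm_free.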
In the Streaming SIMD Extensions implementation, when non-temporal
stores are written into writeback or write-combining memory regions,
these stores are weakly ordered and are combined inside
the processor’s write-combining buffer and written out to memory as