Optimizing Cache Usage 6
The choice of single-pass or multi-pass can have a number of
performance implications. For instance, in a multi-pass pipeline, stages
that are limited by bandwidth (either input or output) will reflect more
of this performance limitation in overall execution time. In contrast, for
a single-pass approach, bandwidth limitations can be distributed or
amortized across other computation-intensive stages. The choice of
prefetch hints is also affected by whether a single-pass or multi-pass
approach is used (see “Hardware Prefetching of Data”).
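The difference between the two approaches can be illustrated with a minimal sketch. It assumes that single-pass means each data element runs through every pipeline stage before the next element is touched, and that multi-pass means one stage runs over a batch of elements before the next stage starts. The two stages, the function names, and the BATCH size are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-stage pipeline: stage 1 scales, stage 2 adds an offset. */
static float stage_scale(float x)  { return x * 2.0f; }
static float stage_offset(float x) { return x + 1.0f; }

/* Single-pass: every element goes through all stages before the next
 * element is loaded, so load/store cost is spread across the stages. */
static void run_single_pass(const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = stage_offset(stage_scale(in[i]));
}

/* Multi-pass: each stage runs over a whole batch before the next stage
 * starts. BATCH is an assumed strip size meant to fit in the cache. */
enum { BATCH = 256 };

static void run_multi_pass(const float *in, float *out, size_t n)
{
    float tmp[BATCH];   /* intermediate results for one batch */

    assert(in != NULL && out != NULL);
    for (size_t base = 0; base < n; base += BATCH) {
        size_t len = (n - base < BATCH) ? n - base : BATCH;

        for (size_t i = 0; i < len; i++)        /* stage 1 over the batch */
            tmp[i] = stage_scale(in[base + i]);
        for (size_t i = 0; i < len; i++)        /* stage 2 over the batch */
            out[base + i] = stage_offset(tmp[i]);
    }
}
```

Both functions produce the same results. In the multi-pass version, a stage that only moves data is exposed as its own loop, whereas the single-pass version overlaps that traffic with the arithmetic of the other stage.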
Memory Optimization using Non-Temporal Stores
The non-temporal stores can also be used to manage data retention in
the cache. Uses for the non-temporal stores include:
• To combine many writes without disturbing the cache hierarchy.
• To manage which data structures remain in the cache and which are
transient.
Detailed implementations of these usage models are covered in the
following sections.
Non-temporal Stores and Software Write-Combining
Use non-temporal stores when the data to be stored is:
• write-once (non-temporal)
• too large to fit in the cache, so that storing it would cause cache
thrashing
Non-temporal stores do not invoke a cache line allocation, which means
they are not write-allocate. As a result, caches are not polluted and no
dirty writeback is generated to compete with useful data bandwidth.
Without non-temporal stores, dirty writebacks consume bus bandwidth
once the caches start to thrash.
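As a minimal sketch of this usage, the following C function fills a write-once buffer with the SSE intrinsic _mm_stream_ps, which issues a non-temporal store. The function name fill_stream, the 16-byte alignment requirement, and the restriction to a multiple of four elements are assumptions of this sketch. The closing _mm_sfence is added here so that the streamed data is ordered before any later stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* SSE: _mm_set1_ps, _mm_stream_ps, _mm_sfence */

/* Fill a write-once buffer using non-temporal stores.
 * Assumptions of this sketch: dst is 16-byte aligned and n is a
 * multiple of 4 (one 128-bit store writes four floats). */
static void fill_stream(float *dst, size_t n, float value)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert(n % 4 == 0);

    __m128 v = _mm_set1_ps(value);

    for (size_t i = 0; i < n; i += 4) {
        /* Non-temporal store: written without allocating a cache line. */
        _mm_stream_ps(dst + i, v);
    }

    /* Order the streamed data before any stores that follow. */
    _mm_sfence();
}
```

A regular store such as _mm_store_ps would be the better choice if the buffer were read again soon, since the streamed data is not kept in the cache.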
In the Streaming SIMD Extensions implementation, when non-temporal
stores are written into writeback or write-combining memory regions,
these stores are weakly-ordered and will be combined internally inside
the processor’s write-combining buffer and be written out to memory as