Multi-Core and Hyper-Threading Technology 7
block size for loop blocking should be determined by dividing the target
cache size by the number of logical processors available in a physical
processor package. Typically, some cache lines are needed to access
data that are not part of the source or destination buffers used in cache
blocking, so the block size can be chosen between one quarter and one
half of the target cache (see also Chapter 3).
Software can use the deterministic cache parameter leaf of CPUID to
discover which subset of logical processors is sharing a given cache.
(See Chapter 6.) Therefore, the guideline above can be extended to allow
all the logical processors serviced by a given cache to use the cache
simultaneously, by placing an upper limit on the block size equal to the
total size of the cache divided by the number of logical processors
serviced by that cache. This technique can also be applied to
single-threaded applications that will be used as part of a multitasking
workload.
User/Source Coding Rule 32. (H impact, H generality) Use cache blocking
to improve locality of data access. Target one quarter to one half of the cache
size when targeting IA-32 processors supporting Hyper-Threading Technology,
or target a block size that allows all the logical processors serviced by a cache
to share that cache simultaneously.
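As an illustration of the rule, the following C sketch applies loop
blocking to a matrix transpose. The function names transpose_blocked and
tile_edge_for are hypothetical, and the choice of a transpose is only an
example workload; the block budget passed to tile_edge_for is assumed to
have been chosen by the caller according to the rule above.

```c
#include <assert.h>
#include <stddef.h>

/* Transpose an n x n row-major matrix one tile at a time, so that the
   source and destination tiles being touched stay cache-resident. */
static void transpose_blocked(const float *src, float *dst,
                              size_t n, size_t tile)
{
    assert(tile > 0);
    for (size_t ii = 0; ii < n; ii += tile) {
        for (size_t jj = 0; jj < n; jj += tile) {
            /* Clamp the tile at the matrix edge. */
            size_t i_end = (ii + tile < n) ? ii + tile : n;
            size_t j_end = (jj + tile < n) ? jj + tile : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}

/* Largest tile edge (in elements) such that one source tile plus one
   destination tile of floats fit in block_bytes. Returns 1 if even a
   single pair of elements does not fit. */
static size_t tile_edge_for(size_t block_bytes)
{
    size_t edge = 1;
    while (2 * (edge + 1) * (edge + 1) * sizeof(float) <= block_bytes)
        edge++;
    return edge;
}
```

For example, with a 512-KByte cache and a budget of one quarter of it
(131072 bytes), tile_edge_for yields a 128 x 128 tile, since two such
tiles of 4-byte floats occupy exactly 131072 bytes.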
Shared-Memory Optimization
Maintaining cache coherency between discrete processors frequently
involves moving data across a bus that operates at a clock rate
substantially slower than the processor frequency.
Minimize Sharing of Data between Physical Processors
When two threads are executing on two physical processors and sharing
data, reading from or writing to shared data usually involves several bus
transactions (including snooping, request for ownership changes, and
sometimes fetching data across the bus). A thread accessing a large
amount of shared memory is likely to have poor processor-scaling
performance.
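One common way to reduce such sharing is to give each thread a private
slice of the data and a private accumulator, and to touch shared memory
only once at the end. The C sketch below assumes this partitioning
approach; the names slice_job and sum_slice are hypothetical. Thread
creation is omitted: the worker has the signature that pthread_create
expects, but any threading API could be used to run it.

```c
#include <assert.h>
#include <stddef.h>

/* Work description for one thread (hypothetical type). */
typedef struct {
    const int *data;   /* this thread's private slice of the input */
    size_t     count;  /* number of elements in the slice */
    long       result; /* written exactly once, when the slice is done */
} slice_job;

/* Worker: accumulate in a local variable so that the inner loop performs
   no writes to memory visible to other processors. */
static void *sum_slice(void *arg)
{
    slice_job *job = arg;
    long local = 0;

    assert(job != NULL);
    for (size_t i = 0; i < job->count; i++)
        local += job->data[i];

    job->result = local;  /* single store to shared memory */
    return NULL;
}
```

After all workers have finished, the creating thread adds up the result
fields. Compared with having every thread update one shared total inside
the loop, this keeps ownership of the accumulator's cache line from
moving between processors on each iteration.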