General Optimization Guidelines 2
• Cache eviction:
If the amount of data to be processed by a memory routine
approaches half the size of the last level on-die cache, temporal
locality of the cache may suffer. Using streaming store instructions
(for example: movntq, movntdq) can minimize the effect of flushing
the cache. The threshold to start using a streaming store depends on
the size of the last level cache. Determine the size using the
deterministic cache parameter leaf of the CPUID instruction.
Techniques for using streaming stores to implement a
memset()-type library routine must also consider that the application
benefits from this technique only if it has no immediate need to
reference the target addresses. This assumption is easily upheld
when testing a streaming-store implementation in a
micro-benchmark configuration, but can be violated in a full-scale
application.
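
The cache-eviction guideline above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the manual's own code: the names last_level_cache_bytes, memset_with_threshold and FALLBACK_LLC_BYTES are hypothetical, the CPUID query relies on the GCC/Clang <cpuid.h> helper __get_cpuid_count, and the streaming store is issued through the SSE2 intrinsic _mm_stream_si128 (which maps to movntdq). On other compilers or targets the sketch simply falls back to memset().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) && (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>      /* __get_cpuid_count (GCC/Clang helper) */
#include <emmintrin.h>  /* _mm_stream_si128, _mm_sfence, _mm_set1_epi8 */
#define HAVE_X86_STREAMING 1
#else
#define HAVE_X86_STREAMING 0
#endif

/* Assumed value used when CPUID leaf 4 reports nothing (hypothetical). */
#define FALLBACK_LLC_BYTES ((size_t)8 << 20)

/* Size in bytes of the highest-level data or unified cache, read from
 * the deterministic cache parameter leaf (CPUID leaf 4). */
static size_t last_level_cache_bytes(void)
{
    size_t best = 0;
#if HAVE_X86_STREAMING
    unsigned best_level = 0;
    for (unsigned idx = 0; idx < 32; ++idx) {
        unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
        if (!__get_cpuid_count(4, idx, &eax, &ebx, &ecx, &edx))
            break;                        /* leaf 4 not supported */
        unsigned type = eax & 0x1f;       /* 0 means no more caches */
        if (type == 0)
            break;
        if (type == 2)                    /* skip instruction caches */
            continue;
        unsigned level = (eax >> 5) & 0x7;
        size_t ways  = ((ebx >> 22) & 0x3ff) + 1;
        size_t parts = ((ebx >> 12) & 0x3ff) + 1;
        size_t line  = (ebx & 0xfff) + 1;
        size_t sets  = (size_t)ecx + 1;
        if (level >= best_level) {
            best_level = level;
            best = ways * parts * line * sets;
        }
    }
#endif
    return best ? best : FALLBACK_LLC_BYTES;
}

/* Fill n bytes at dst with value. Requests of at least `threshold`
 * bytes use non-temporal stores; smaller ones use plain memset(). */
static void *memset_with_threshold(void *dst, int value, size_t n,
                                   size_t threshold)
{
#if HAVE_X86_STREAMING
    if (n >= threshold) {
        unsigned char *p = (unsigned char *)dst;
        /* Bytes needed to reach 16-byte alignment (movntdq requires it). */
        size_t head = (size_t)(-(uintptr_t)p & 15);
        if (head > n)
            head = n;
        memset(p, value, head);
        p += head;
        n -= head;
        __m128i v = _mm_set1_epi8((char)value);
        for (; n >= 16; p += 16, n -= 16)
            _mm_stream_si128((__m128i *)p, v);
        _mm_sfence();          /* order the streaming stores */
        memset(p, value, n);   /* remaining tail, fewer than 16 bytes */
        return dst;
    }
#else
    (void)threshold;
#endif
    return memset(dst, value, n);
}
```

A caller following the guideline might pass last_level_cache_bytes() / 2 as the threshold. Whether that value pays off depends on the application not re-reading the filled buffer soon afterwards, so it is best treated as a starting point for measurement.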
When applying general heuristics to the design of general-purpose,
high-performance library routines, the following guidelines are
useful when optimizing for an arbitrary counter value N and arbitrary
address alignment. Different techniques may be necessary for optimal
performance, depending on the magnitude of N:
• For N less than a small count, where the small-count threshold
varies between microarchitectures (empirically, 8 may be a good
value when optimizing for Intel NetBurst microarchitecture), each
case can be coded directly without the overhead of a looping
structure. For example, 11 bytes can be processed using two explicit
movsd instructions and a movsb with a REP count of 3.
• For N not so small but less than some threshold value (this
intermediate threshold may vary between
microarchitectures, but can be determined empirically), a SIMD
implementation using a run-time CPUID prolog will likely deliver
less throughput due to the overhead of the prolog. A REP string
implementation should favor using a REP string of doublewords. To