Intel IA-32 Computer Accessories User Manual


 
IA-32 Intel® Architecture Optimization
improve address alignment, a small piece of prolog code using
movsb/stosb with a count of less than 4 can be used to peel off the
non-aligned data moves before starting to use movsd/stosd.
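The peeling step can be sketched in C as follows. This is an illustrative sketch, not code from this manual: the names prolog_bytes and fill_peeled are hypothetical, and the byte loops merely stand in for stosb while the 4-byte stores stand in for stosd. It assumes a flat address space in which the low two bits of a pointer give its offset from a doubleword boundary.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: number of leading bytes (0..3) to store one at a
   time so that the address that follows is doubleword (4-byte) aligned.
   Clamped to n so short fills never overrun. */
static size_t prolog_bytes(const void *dst, size_t n)
{
    size_t peel = (size_t)(-(uintptr_t)dst & 3u);
    return peel < n ? peel : n;
}

/* Hypothetical fill routine: byte prolog, doubleword body, byte epilog. */
static void *fill_peeled(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    unsigned char b = (unsigned char)c;
    uint32_t pattern = b * 0x01010101u;   /* replicate the byte 4 times */
    size_t peel = prolog_bytes(p, n);
    size_t i;

    n -= peel;
    for (i = 0; i < peel; i++)            /* prolog: fewer than 4 byte stores */
        *p++ = b;

    for (i = n / 4; i > 0; i--) {         /* body: aligned 4-byte stores */
        memcpy(p, &pattern, 4);           /* typically one doubleword store */
        p += 4;
    }

    for (i = n % 4; i > 0; i--)           /* epilog: remaining 0..3 bytes */
        *p++ = b;

    return dst;
}
```

Calling fill_peeled(buf + 1, 0, 100) would, under these assumptions, store 3 single bytes, then 24 doublewords, then 1 trailing byte.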
For cases where N is less than half the size of the last-level cache,
throughput considerations may favor either of two approaches:
(a) a REP string instruction with the largest data granularity, because
REP string has little loop-iteration overhead and the branch
misprediction overhead of the prolog/epilog code that handles address
alignment is amortized over many iterations; or (b) an iterative
approach using the instruction with the largest data granularity, where
the overhead of SIMD feature detection, loop iteration, and the
prolog/epilog for alignment control can be minimized. The trade-off
between these approaches may depend on the microarchitecture.
An example of memset() implemented using stosd for an arbitrary
count value, with the destination address aligned to a doubleword
boundary in 32-bit mode, is shown in Table 2-5.
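For readers without the table at hand, the general shape of a stosd-based fill can be sketched in C with GCC-style inline assembly. This is an assumption-laden sketch, not the Table 2-5 listing: the name memset_stosd is hypothetical, the inline assembly path is taken only on GCC-compatible compilers targeting x86, and a plain C loop is used elsewhere. The destination is assumed to be doubleword aligned and the direction flag is assumed to be clear, as the common x86 calling conventions require.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill n bytes at a doubleword-aligned dst with c. */
static void *memset_stosd(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    uint32_t pattern = (unsigned char)c * 0x01010101u; /* byte in all 4 lanes */
    size_t dwords = n / 4;   /* count for the doubleword stores */
    size_t tail = n % 4;     /* 0..3 leftover bytes */

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* "rep stosl" is the AT&T spelling of rep stosd: store EAX at the
       address in (E/R)DI, advance it by 4, repeat (E/R)CX times. */
    __asm__ volatile("rep stosl"
                     : "+D"(p), "+c"(dwords)
                     : "a"(pattern)
                     : "memory");
    /* rep stosb stores AL, the low byte of the pattern, for the tail. */
    __asm__ volatile("rep stosb"
                     : "+D"(p), "+c"(tail)
                     : "a"(pattern)
                     : "memory");
#else
    /* Portable stand-in with the same store sequence. */
    for (; dwords > 0; dwords--, p += 4)
        memcpy(p, &pattern, 4);
    for (; tail > 0; tail--)
        *p++ = (unsigned char)c;
#endif
    return dst;
}
```

The fill value is widened to a 32-bit pattern once, so that each doubleword store writes four copies of the byte; only the final 0 to 3 bytes fall back to byte granularity.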
For cases where N is greater than half the size of the last-level
cache, using 16-byte granularity streaming stores with a prolog/epilog
for address alignment will likely be more efficient, provided the
destination addresses will not be referenced immediately afterwards.
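A minimal sketch of the streaming-store approach, using the SSE2 intrinsic _mm_stream_si128 (which maps to movntdq), is given below. The function name fill_streaming is hypothetical, and the sketch assumes a compiler that defines __SSE2__ when SSE2 code generation is enabled; otherwise it simply falls back to memset(). Choosing the size threshold at which to call it is left to the caller, since that depends on the cache size of the target processor.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: fill n bytes at dst with c using 16-byte
   non-temporal stores for the aligned middle of the buffer. */
static void *fill_streaming(void *dst, int c, size_t n)
{
    unsigned char *p = dst;

#if defined(__SSE2__)
    /* Prolog: bytes needed to reach a 16-byte boundary (0..15). */
    size_t head = (size_t)(-(uintptr_t)p & 15u);
    if (head > n)
        head = n;
    memset(p, c, head);
    p += head;
    n -= head;

    /* Body: 16-byte streaming stores; these require aligned addresses. */
    __m128i v = _mm_set1_epi8((char)c);
    for (; n >= 16; n -= 16, p += 16)
        _mm_stream_si128((__m128i *)p, v);

    /* Order the non-temporal stores before any later stores. */
    _mm_sfence();
#endif

    /* Epilog: remaining bytes, or the whole fill when SSE2 is unavailable. */
    memset(p, c, n);
    return dst;
}
```

Because streaming stores are intended to bypass the cache hierarchy, this path is a poor fit when the filled buffer is read again right away, which is the condition stated above.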