IA-32 Intel® Architecture Optimization
Write-combining buffers are another example of execution resources
shared between two logical processors. With two threads running
simultaneously on a processor supporting Hyper-Threading Technology,
the writes of both threads count toward the limit of four
write-combining buffers. For example, if an inner loop that writes to
three separate areas of memory per iteration is run by two threads
simultaneously, the total number of cache lines written could be six.
Because six exceeds the four available buffers, the code loses the
benefits of write-combining.
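Loop fission applied to this situation creates two loops, neither of which
is allowed to write to more than two cache lines per iteration. Two threads
running the same loop then write to at most four cache lines in total,
which fits within the four write-combining buffers.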
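
The loop fission described above can be sketched in C as follows. This is
a minimal illustration, not code from this manual: the function names
fill_fused and fill_fissioned, the arrays a, b, c and src, and the
arithmetic in the loop bodies are all hypothetical. The sketch assumes the
three output arrays lie in separate areas of memory, so each one is a
distinct write stream.

```c
#include <assert.h>
#include <stddef.h>

/* Before fission: every iteration writes to three separate output
 * streams. Two threads running this loop at the same time could
 * write to six cache lines, more than the four buffers available. */
void fill_fused(float *a, float *b, float *c, const float *src, size_t n)
{
    assert(a && b && c && src);
    for (size_t i = 0; i < n; i++) {
        a[i] = src[i] + 1.0f;   /* write stream 1 */
        b[i] = src[i] * 2.0f;   /* write stream 2 */
        c[i] = src[i] - 3.0f;   /* write stream 3 */
    }
}

/* After fission: the work is split into two loops, and neither loop
 * writes to more than two streams per iteration. Two threads running
 * the same loop together write to at most four cache lines. */
void fill_fissioned(float *a, float *b, float *c, const float *src, size_t n)
{
    assert(a && b && c && src);

    /* First loop: two write streams. */
    for (size_t i = 0; i < n; i++) {
        a[i] = src[i] + 1.0f;
        b[i] = src[i] * 2.0f;
    }

    /* Second loop: one write stream. */
    for (size_t i = 0; i < n; i++) {
        c[i] = src[i] - 3.0f;
    }
}
```

Both functions produce the same results. The fissioned version reads src
twice, so whether it is faster in practice depends on the workload and
should be confirmed by measurement.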