Intel IA-32 Computer Accessories User Manual

Optimizing Cache Usage 6

In Example 6-11, each iteration of the copy loop issues eight _mm_load_ps and _mm_stream_ps pairs. Each pair moves 16 bytes, so together they write out all of the data prefetched for one 128-byte cache line. The prefetches and the streaming stores are executed in separate loops to minimize the number of transitions between reading and writing data, which significantly improves the bandwidth of the memory accesses.

    // copy 128 bytes per iteration (eight 16-byte load/stream pairs)
    for (j=kk; j<kk+NUMPERPAGE; j+=16) {
        _mm_stream_ps((float*)&b[j],    _mm_load_ps((float*)&a[j]));
        _mm_stream_ps((float*)&b[j+2],  _mm_load_ps((float*)&a[j+2]));
        _mm_stream_ps((float*)&b[j+4],  _mm_load_ps((float*)&a[j+4]));
        _mm_stream_ps((float*)&b[j+6],  _mm_load_ps((float*)&a[j+6]));
        _mm_stream_ps((float*)&b[j+8],  _mm_load_ps((float*)&a[j+8]));
        _mm_stream_ps((float*)&b[j+10], _mm_load_ps((float*)&a[j+10]));
        _mm_stream_ps((float*)&b[j+12], _mm_load_ps((float*)&a[j+12]));
        _mm_stream_ps((float*)&b[j+14], _mm_load_ps((float*)&a[j+14]));
    } // finished copying one block
} // finished copying N elements
_mm_sfence(); // order the streaming stores before any later stores

Example 6-11 A Memory Copy Routine Using Software Prefetch
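
The listing above shows only the second loop of the routine and the end of the enclosing block loop. The following is a minimal, self-contained sketch of how the two separate loops might fit together, not a reproduction of the full example. Several details are assumptions: the function name copy_with_prefetch is hypothetical, the arrays are taken to hold 8-byte doubles (suggested by the index step of 2 per 16-byte load), NUMPERPAGE is assumed to be 512, and the prefetch pass with the _MM_HINT_NTA hint is a guess at the part of the example that is not visible here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* _mm_prefetch, _mm_load_ps, _mm_stream_ps, _mm_sfence */

#define CACHE_LINE 128                                 /* bytes, as stated in the text */
#define DOUBLES_PER_LINE (CACHE_LINE / sizeof(double)) /* 16 elements per line */
#define NUMPERPAGE 512                                 /* assumed: doubles per block */

/* Hypothetical helper: copies n doubles from a to b.
 * Assumes a and b are 16-byte aligned, do not overlap,
 * and n is a multiple of NUMPERPAGE. */
void copy_with_prefetch(double *b, const double *a, size_t n)
{
    assert((uintptr_t)a % 16 == 0 && (uintptr_t)b % 16 == 0);
    assert(n % NUMPERPAGE == 0);

    for (size_t kk = 0; kk < n; kk += NUMPERPAGE) {
        /* Pass 1 (reads only): request one prefetch per cache line of the block. */
        for (size_t j = kk; j < kk + NUMPERPAGE; j += DOUBLES_PER_LINE)
            _mm_prefetch((const char *)&a[j], _MM_HINT_NTA);

        /* Pass 2 (writes only): each load/stream pair moves 16 bytes (2 doubles),
         * so the inner loop runs eight times per 128-byte line. */
        for (size_t j = kk; j < kk + NUMPERPAGE; j += DOUBLES_PER_LINE)
            for (size_t i = 0; i < DOUBLES_PER_LINE; i += 2)
                _mm_stream_ps((float *)&b[j + i],
                              _mm_load_ps((const float *)&a[j + i]));
    }

    /* Order the streaming stores before any stores that follow the copy. */
    _mm_sfence();
}
```

The inner loop over i is the rolled-up form of the eight explicit pairs in the listing; a compiler may or may not unroll it, so the hand-unrolled version can still be preferable when the exact instruction sequence matters. Both _mm_load_ps and _mm_stream_ps require 16-byte-aligned addresses, which is why the sketch checks alignment on entry.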
