Intel 80200 Computer Hardware User Manual


 
Developers Manual March, 2003 B-31
Intel
®
80200 Processor based on Intel
®
XScale
Microarchitecture
Optimization Guide
B.4.4.8. Cache Blocking
Cache blocking techniques, such as strip-mining, are used to improve temporal locality of the data.
Given a large data set that can be reused across multiple passes of a loop, data blocking divides the
data into smaller chunks which can be loaded into the cache during the first loop and then be
available for processing on subsequence loops thus minimizing cache misses and reducing bus
traffic.
As an example of cache blocking consider the following code:
for(i=0; i<10000; i++)
for(j=0; j<10000; j++)
for(k=0; k<10000; k++)
C[j][k] += A[i][k] * B[j][i];
The variable A[i][k] is completely reused. However, accessing C[j][k] in the j and k loops can
displace A[i][j] from the cache. Using blocking the code becomes:
for(i=0; i<10000; i++)
for(j1=0; j<100; j++)
for(k1=0; k<100; k++)
for(j2=0; j<100; j++)
for(k2=0; k<100; k++)
{
j = j1 * 100 + j2;
k = k1 * 100 + k2;
C[j][k] += A[i][k] * B[j][i];
}
B.4.4.9. Prefetch Unrolling
When iterating through a loop, data transfer latency can be hidden by prefetching ahead one or
more iterations. The solution incurs an unwanted side affect that the final interactions of a loop
loads useless data into the cache, polluting the cache, increasing bus traffic and possibly evicting
valuable temporal data. This problem can be resolved by prefetch unrolling. For example consider:
for(i=0; i<NMAX; i++)
{
prefetch(data[i+2]);
sum += data[i];
}
Interactions i-1 and i, prefetches superfluous data. The problem can be avoid by unrolling the end
of the loop.
for(i=0; i<NMAX-2; i++)
{
prefetch(data[i+2]);
sum += data[i];
}
sum += data[NMAX-2];
sum += data[NMAX-1];
Unfortunately, prefetch loop unrolling does not work on loops with indeterminate iterations.