Intel IA-32 Computer Accessories User Manual


 
IA-32 Intel® Architecture Optimization
3-36
This situation can be avoided if the loop is blocked with respect to the
cache size. In Figure 3-3, a
block_size is selected as the loop blocking
factor. Suppose that
block_size is 8, then the blocked chunk of each
array will be eight cache lines (32 bytes each). In the first iteration of the
inner loop,
A[0, 0:7] and B[0, 0:7] will be brought into the cache.
B[0, 0:7] will be completely consumed by the first iteration of the
outer loop. Consequently,
B[0, 0:7] will only experience one cache
miss after applying loop blocking optimization in lieu of eight misses
for the original algorithm. As illustrated in Figure 3-3, arrays
A and B are
blocked into smaller rectangular chunks so that the total size of two
blocked
A and B chunks is smaller than the cache size. This allows
maximum data reuse.
Figure 3-3 Loop Blocking Access Pattern
OM15158
A (i, j) access pattern
j
i
A(i, j) access pattern
after blocking
B(i, j) access pattern
after blocking
+
< cache size
Blocking