Support User Manuals

Intel IA-32 Computer Accessories User Manual

Open as PDF

of 568

IA-32 Intel® Architecture Optimization

3-36

This situation can be avoided if the loop is blocked with respect to the

cache size. In Figure 3-3, a

block_size is selected as the loop blocking

factor. Suppose that

block_size is 8, then the blocked chunk of each

array will be eight cache lines (32 bytes each). In the first iteration of the

inner loop,

A[0, 0:7] and B[0, 0:7] will be brought into the cache.

B[0, 0:7] will be completely consumed by the first iteration of the

outer loop. Consequently,

B[0, 0:7] will only experience one cache

miss after applying loop blocking optimization in lieu of eight misses

for the original algorithm. As illustrated in Figure 3-3, arrays

A and B are

blocked into smaller rectangular chunks so that the total size of two

blocked

A and B chunks is smaller than the cache size. This allows

maximum data reuse.

Figure 3-3 Loop Blocking Access Pattern

OM15158

A (i, j) access pattern

j

i

A(i, j) access pattern

after blocking

B(i, j) access pattern

after blocking

+

< cache size

Blocking

previous next