IA-32 Intel® Architecture Optimization
In Example 3-19, the computation has been strip-mined to a size of
strip_size. The value of strip_size is chosen so that strip_size
elements of array v[Num] fit into the cache hierarchy. As a result, a
given element v[i] brought into the cache by Transform(v[i]) is still
in the cache when Lighting(v[i]) is performed, which improves
performance over the non-strip-mined code.
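As a minimal sketch of the idea, assuming a hypothetical Vertex type and
simplified stand-ins for Transform and Lighting (neither is defined in
this section), the C code below derives a strip length from an assumed
cache budget and then runs both stages strip by strip. The names
choose_strip_size and process_strip_mined and the cache_bytes parameter
are illustrative and are not taken from Example 3-19.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vertex type; the real layout of v[] is not given here. */
typedef struct { float x, y, z, intensity; } Vertex;

/* Simplified stand-ins for the Transform and Lighting stages. */
void transform(Vertex *p) { p->x *= 2.0f; p->y *= 2.0f; p->z *= 2.0f; }
void lighting(Vertex *p)  { p->intensity = p->x + p->y + p->z; }

/* Pick a strip length so that one strip of elements fits into an
   assumed cache budget of cache_bytes. Always returns at least 1. */
size_t choose_strip_size(size_t cache_bytes, size_t elem_bytes)
{
    assert(elem_bytes > 0);
    size_t n = cache_bytes / elem_bytes;
    return n > 0 ? n : 1;
}

/* Strip-mined traversal: each strip is transformed and then lit while
   its elements are still expected to be resident in the cache. */
void process_strip_mined(Vertex *v, size_t num, size_t strip_size)
{
    assert(strip_size > 0);
    size_t i = 0;
    while (i < num) {
        /* Clamp the last strip to the end of the array. */
        size_t end = (num - i < strip_size) ? num : i + strip_size;
        for (size_t j = i; j < end; j++)
            transform(&v[j]);
        for (size_t j = i; j < end; j++)
            lighting(&v[j]);
        i = end;
    }
}
```

In practice the budget passed as cache_bytes would likely be smaller
than the full cache size, leaving room for other data the loop touches.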
Loop Blocking
Loop blocking is another useful technique for memory performance
optimization. Like strip mining, its main purpose is to eliminate as
many cache misses as possible. This technique transforms the memory
domain of a given problem into smaller chunks rather than sequentially
traversing the entire memory domain. Each chunk should be small enough
to fit all the data for a given computation into the cache, thereby
maximizing data reuse. In fact, one can treat loop blocking as strip
mining in two or more dimensions. Consider the code in Example 3-20 and
the access pattern in Figure 3-3. The two-dimensional array A is
referenced in the j (column) direction and then in the i (row)
direction (column-major order), whereas array B is referenced in the
opposite manner (row-major order). Assume the memory layout is in
column-major order; the access strides of arrays A and B for the code
in Example 3-20 are then 1 and MAX, respectively.
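The blocking transformation described above can be sketched in C as
follows. This is an illustration under stated assumptions, not the code
of Example 3-20: N is a hypothetical stand-in for MAX, the function
names are invented for this sketch, and because C stores arrays in
row-major order, a[i][j] is the unit-stride access here and b[j][i] is
the access with stride N.

```c
#include <assert.h>
#include <stddef.h>

#define N 8   /* hypothetical array dimension, standing in for MAX */

/* Unblocked reference loop: a is walked with unit stride, b with
   stride N, so b can miss the cache on almost every access when N
   is large. */
void add_transpose_plain(float a[N][N], float b[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] += b[j][i];
}

/* End index of a block starting at start, clamped to limit. */
size_t block_end(size_t start, size_t block_size, size_t limit)
{
    return (limit - start < block_size) ? limit : start + block_size;
}

/* Blocked version: the i/j iteration space is cut into tiles of at
   most block_size x block_size, so the parts of a and b touched by
   one tile can stay in the cache while the tile is processed. */
void add_transpose_blocked(float a[N][N], float b[N][N], size_t block_size)
{
    assert(block_size > 0);
    for (size_t i = 0; i < N; ) {
        size_t i_end = block_end(i, block_size, N);
        for (size_t j = 0; j < N; ) {
            size_t j_end = block_end(j, block_size, N);
            for (size_t ii = i; ii < i_end; ii++)
                for (size_t jj = j; jj < j_end; jj++)
                    a[ii][jj] += b[jj][ii];
            j = j_end;
        }
        i = i_end;
    }
}
```

Both functions compute the same result; only the order in which the
elements are visited differs. A suitable block_size depends on the
cache size and element size of the target processor, and the clamping
in block_end lets it be any positive value, including one that does not
divide N evenly.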