IA-32 Intel® Architecture Optimization
Later, the processor re-reads the data using prefetchnta, the
non-temporal (NTA) version of prefetch, which ensures maximum
bandwidth yet minimizes disturbance of other cached temporal data.
Conclusions from Video Encoder and Decoder Implementation
These two examples indicate that by using an appropriate combination
of non-temporal prefetches and non-temporal stores, an application can
be designed to lessen the overhead of memory transactions by
preventing second-level cache pollution, keeping useful data in the
second-level cache and reducing costly write-back transactions. Even if
an application does not gain significant performance from having data
ready from prefetches, it can still benefit from more efficient use of the
second-level cache and memory. Such a design reduces the encoder’s
demand for a critical resource such as the memory bus. This makes the
system more balanced, resulting in higher performance.
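The read side of this idea can be illustrated with a minimal sketch, assuming data that is consumed once and not reused soon. The function name sum_read_once, the 64-byte line size, and the prefetch distance are assumptions made for illustration and would need tuning on a real system; only _mm_prefetch with _MM_HINT_NTA is the actual compiler intrinsic for prefetchnta.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */

#define LINE_BYTES  64   /* assumed cache-line size */
#define AHEAD_LINES 4    /* hypothetical prefetch distance, in lines */

/* Sum a buffer that is read once and not needed again soon.
 * The NTA hint asks the processor to fetch the data while
 * limiting how much it displaces other cached data. */
uint64_t sum_read_once(const uint8_t *buf, size_t len)
{
    uint64_t total = 0;

    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        /* Once every LINE_BYTES bytes, hint a location a few
         * lines ahead, staying inside the buffer. */
        if (i % LINE_BYTES == 0 && i + AHEAD_LINES * LINE_BYTES < len) {
            _mm_prefetch((const char *)(buf + i + AHEAD_LINES * LINE_BYTES),
                         _MM_HINT_NTA);
        }
        total += buf[i];
    }
    return total;
}
```

The prefetch is only a hint, so the result is the same with or without it; whether it helps depends on the access pattern and the processor.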
Optimizing Memory Copy Routines
Creating memory copy routines for large amounts of data is a common
task in software optimization.
Example 6-10 presents a basic algorithm for a simple memory copy.
This task can be optimized using various coding techniques. One
technique uses software prefetch and streaming store instructions. It is
discussed in the following paragraph and a code example is shown in
Example 6-11.
Example 6-10 Basic Algorithm of a Simple Memory Copy
#define N 512000
double a[N], b[N];
int i;

/* Copy a to b one element at a time. */
for (i = 0; i < N; i++) {
    b[i] = a[i];
}
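One way software prefetch and streaming stores might be combined for this kind of copy is sketched below using SSE2 compiler intrinsics. This is a minimal sketch under stated assumptions, not a tuned routine: the function name copy_stream, the prefetch distance, and the requirements that both buffers be 16-byte aligned and that the element count be a multiple of 8 are all choices made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_prefetch, _mm_sfence */
#include <emmintrin.h>   /* _mm_load_pd, _mm_stream_pd */

#define DOUBLES_PER_LINE 8    /* 64-byte line / 8-byte double (assumed) */
#define PREFETCH_AHEAD   64   /* hypothetical distance, in doubles */

/* Copy n doubles from src to dst.
 * Assumes 16-byte aligned pointers and n a multiple of 8. */
void copy_stream(double *dst, const double *src, size_t n)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(n % DOUBLES_PER_LINE == 0);

    for (size_t i = 0; i < n; i += DOUBLES_PER_LINE) {
        /* Hint the source line a fixed distance ahead, inside the array. */
        if (i + PREFETCH_AHEAD < n) {
            _mm_prefetch((const char *)(src + i + PREFETCH_AHEAD),
                         _MM_HINT_NTA);
        }

        /* Load one line's worth of source data (aligned loads). */
        __m128d v0 = _mm_load_pd(src + i);
        __m128d v1 = _mm_load_pd(src + i + 2);
        __m128d v2 = _mm_load_pd(src + i + 4);
        __m128d v3 = _mm_load_pd(src + i + 6);

        /* Write it with non-temporal (streaming) stores. */
        _mm_stream_pd(dst + i,     v0);
        _mm_stream_pd(dst + i + 2, v1);
        _mm_stream_pd(dst + i + 4, v2);
        _mm_stream_pd(dst + i + 6, v3);
    }

    /* Order the streaming stores before any later stores. */
    _mm_sfence();
}
```

With the arrays of Example 6-10 declared 16-byte aligned, the call would be copy_stream(b, a, N), since N = 512000 is a multiple of 8. A general-purpose routine would also need a scalar path for unaligned buffers and leftover elements.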