Intel IA-32 Computer Accessories User Manual


 
Optimizing Cache Usage 6
The baseline for performance comparison is the throughput (bytes/sec)
of an 8-MByte region memory copy on a first-generation Pentium M
processor (CPUID signature 0x69n) with a 400-MHz system bus, using a
byte-sequential technique similar to that shown in Example 6-10. The
degree of improvement relative to this baseline is compared for newer
IA-32 processors and platforms with higher system bus speeds, using
different coding techniques.
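The baseline measurement described above can be sketched in C as follows. This is a minimal illustration and not the code of Example 6-10: the names copy_bytes, copy_throughput, and relative_improvement are hypothetical, and the timing uses the standard clock() function, which is coarse. An optimizing compiler may also replace the byte loop with a faster copy, so a real measurement would need to control for that.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Signature shared by the copy routines being compared (hypothetical). */
typedef void (*copy_fn)(unsigned char *dst, const unsigned char *src, size_t n);

/* Byte-sequential copy: moves one byte per loop iteration. */
static void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Throughput in bytes/sec for `reps` copies of an n-byte region.
   Returns 0.0 if the run was too short for clock() to resolve. */
static double copy_throughput(copy_fn copy, unsigned char *dst,
                              const unsigned char *src, size_t n, int reps)
{
    assert(reps > 0);
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        copy(dst, src, n);
    clock_t stop = clock();

    double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    return seconds > 0.0 ? ((double)n * (double)reps) / seconds : 0.0;
}

/* Improvement of a technique expressed relative to the baseline throughput. */
static double relative_improvement(double throughput, double baseline)
{
    return throughput / baseline;
}
```

To follow the setup in the text, the region size n would be 8 MBytes (8 * 1024 * 1024 bytes), and the value returned for copy_bytes on the reference platform would serve as the baseline argument to relative_improvement.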
The second coding technique moves data at 4-Byte granularity using the
REP string instruction. The third column compares the performance of
the coding technique listed in Example 6-11. The fourth column
compares the throughput of fetching 4 KBytes of data at a time (using
hardware prefetch to aggregate bus read transactions) and writing to
memory via 16-Byte streaming stores.
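The fourth technique can be approximated with SSE2 intrinsics as shown below. This is a sketch under stated assumptions, not a reproduction of Example 6-12: the function name block_stream_copy is hypothetical, a 64-byte cache line is assumed, both buffers must be 16-byte aligned, and the length must be a multiple of 4 KBytes. Whether the read pass actually lets the hardware prefetcher aggregate bus reads depends on the processor. On targets without SSE2 the sketch falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

#define BLOCK_BYTES 4096u /* block fetched into cache before it is written out */
#define LINE_BYTES  64u   /* assumed cache line size */

/* Copies n bytes from src to dst in 4-KByte blocks.
   Returns 0 on success, -1 if the size or alignment preconditions fail. */
static int block_stream_copy(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (n % BLOCK_BYTES != 0 ||
        (uintptr_t)dst % 16 != 0 || (uintptr_t)src % 16 != 0)
        return -1;

#if HAVE_SSE2
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
    volatile unsigned char sink = 0; /* keeps the read pass from being removed */

    for (size_t off = 0; off < n; off += BLOCK_BYTES) {
        /* Pass 1: touch one byte per cache line to bring the block into cache. */
        for (size_t i = 0; i < BLOCK_BYTES; i += LINE_BYTES)
            sink = s[off + i];

        /* Pass 2: write the block with 16-byte non-temporal (streaming) stores. */
        for (size_t i = 0; i < BLOCK_BYTES; i += 16) {
            __m128i v = _mm_load_si128((const __m128i *)(s + off + i));
            _mm_stream_si128((__m128i *)(d + off + i), v);
        }
    }
    _mm_sfence(); /* make the streaming stores globally visible */
    (void)sink;
#else
    memcpy(dst, src, n); /* portable fallback, no streaming stores */
#endif
    return 0;
}
```

The streaming stores bypass the cache on the write side, so the destination data does not displace the source block that the read pass has just brought in.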
An increase in bus speed is the primary contributor to throughput
improvements. The technique shown in Example 6-12 will likely take
advantage of the faster bus speed in the platform more efficiently.
Additionally, increasing the block size to multiples of 4-KBytes while
keeping the total working set within the second-level cache can improve
the throughput slightly.
The relative performance figures shown in Table 6-2 are representative of
clean microarchitectural conditions within a processor (e.g. looping a
simple sequence of code many times). The net benefit of integrating a
specific memory copy routine into an application (full-featured
applications tend to create many complicated micro-architectural
conditions) will vary for each application.
Deterministic Cache Parameters
If CPUID supports the function leaf with input EAX = 4, that leaf is
referred to as the deterministic cache parameter leaf of CPUID (see CPUID
instruction in IA-32 Intel® Architecture Software Developer’s Manual,
Volume 2A). Software can use the deterministic cache parameter leaf to