Intel IA-32 Computer Accessories User Manual


 
IA-32 Intel® Architecture Optimization
same DRAM page have shorter latencies than sequential accesses to
different DRAM pages. In many systems the latency for a page miss
(that is, an access to a different page instead of the page previously
accessed) can be twice as large as the latency of a memory page hit
(access to the same page as the previous access). Therefore, if the loads
and stores of the memory fill cycle are to the same DRAM page, a
significant increase in the bandwidth of the memory fill cycles can be
achieved.
Increasing UC and WC Store Bandwidth by Using Aligned Stores
Using aligned stores to fill UC or WC memory yields higher
bandwidth than using unaligned stores. If a UC store or some WC stores
cross a cache line boundary, a single store results in two transactions
on the bus, reducing the efficiency of the bus transactions. Aligning
each store to its own size eliminates the possibility of
crossing a cache line boundary, so the stores are not split into
separate transactions.
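The alignment rule above can be checked with simple address arithmetic. The following C sketch is illustrative only and is not taken from this manual: the helper names are hypothetical, and the 64-byte cache line size is an assumption that should be adjusted for the target processor. It tests whether a store touches two cache lines and whether a store address is aligned to the store size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines; adjust for the target processor. */
#define CACHE_LINE_SIZE 64u

/* Hypothetical helper: true if a store of `size` bytes (size >= 1)
   starting at `addr` touches two different cache lines. */
static bool store_crosses_line(uintptr_t addr, size_t size)
{
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}

/* Hypothetical helper: true if `addr` is a multiple of `size`.
   `size` must be a nonzero power of two (for example 4, 8, or 16). */
static bool store_is_aligned(uintptr_t addr, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (size - 1)) == 0;
}
```

For example, an 8-byte store at offset 60 covers bytes 60 through 67 and crosses the boundary at byte 64, whereas an 8-byte store at offset 56 covers bytes 56 through 63 and stays within one line. Any store whose size is a power of two no larger than the line size, and whose address is a multiple of that size, cannot cross a line boundary.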
Converting from 64-bit to 128-bit SIMD Integer
SSE2 defines a 128-bit superset of the integer instructions currently
available in MMX technology; the extended instructions operate in the
same way, but on data that is twice as
wide. This simplifies porting of current 64-bit integer applications.
However, there are a few additional considerations:
Computation instructions which use a memory operand that may not
be aligned to a 16-byte boundary must be replaced with an
unaligned 128-bit load (movdqu) followed by the same computation
operation using register operands instead. Use of 128-bit integer
computation instructions with memory operands that are not 16-byte
aligned will result in a general-protection fault. The unaligned
128-bit load and store are not as efficient as the corresponding