IA-32 Intel® Architecture Optimization

Now consider a series of small loads that follow a large store to the same area of memory (beginning at memory address mem), as shown in Example 4-26. Most of the small loads will stall because they are not aligned with the store; see “Store Forwarding” in Chapter 2 for more details.

The word loads must wait for the quadword store to write to memory before they can access the data they require. This stall can also occur with other data types (for example, when doublewords or words are stored and then words or bytes are read from the same area of memory). When you change the code sequence as shown in Example 4-27, the processor can access the data without delay.

Example 4-26 A Series of Small Loads after a Large Store

movq mem, mm0      ; store qword to address "mem"
  :
  :
mov bx, mem + 2    ; load word at "mem + 2" stalls
mov cx, mem + 4    ; load word at "mem + 4" stalls

Example 4-27 Eliminating Delay for a Series of Small Loads after a Large Store

movq mem, mm0      ; store qword to address "mem"
  :
  :
movq mm1, mem      ; load qword at address "mem"
movd eax, mm1      ; transfer "mem + 2" to eax from
                   ; MMX register, not memory
psrlq mm1, 32      ; move bytes "mem + 4" to "mem + 7" into the low dword
shr eax, 16        ; ax now holds the word at "mem + 2"
movd ebx, mm1      ; transfer "mem + 4" to bx from
                   ; MMX register, not memory
and ebx, 0ffffh    ; bx now holds the word at "mem + 4"
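
The register steps in Example 4-27 can be checked with a small C model. This is only a sketch of the shift-and-mask arithmetic, not a performance test: it assumes the little-endian byte order used by IA-32, and the helper names store_qword_le, load_word_le and extract_words are hypothetical names chosen for illustration. It is meant to show that the values left in eax and ebx match the words that Example 4-26 loads from "mem + 2" and "mem + 4".

```c
#include <stdint.h>

/* Hypothetical helper: write a 64-bit value to memory in little-endian
   order, modeling the effect of "movq mem, mm0" on IA-32. */
static void store_qword_le(uint8_t *mem, uint64_t value)
{
    for (int i = 0; i < 8; i++)
        mem[i] = (uint8_t)(value >> (8 * i));
}

/* Hypothetical helper: read a 16-bit little-endian word at a byte offset,
   modeling "mov bx, mem + offset" from Example 4-26. */
static uint16_t load_word_le(const uint8_t *mem, int offset)
{
    return (uint16_t)(mem[offset] | (mem[offset + 1] << 8));
}

/* Models the register-only sequence of Example 4-27. mm1 stands for the
   qword already reloaded into an MMX register. */
static void extract_words(uint64_t mm1, uint32_t *eax, uint32_t *ebx)
{
    *eax = (uint32_t)mm1;   /* movd eax, mm1   : low dword, bytes 0..3   */
    mm1 >>= 32;             /* psrlq mm1, 32   : bytes 4..7 to low dword */
    *eax >>= 16;            /* shr eax, 16     : word at offset 2        */
    *ebx = (uint32_t)mm1;   /* movd ebx, mm1   : bytes 4..7              */
    *ebx &= 0xffffu;        /* and ebx, 0ffffh : word at offset 4        */
}
```

Calling extract_words on the same value passed to store_qword_le should give results equal to load_word_le(mem, 2) and load_word_le(mem, 4). The point of the rewrite is that a single full-width load matches the preceding store, and the narrower values are then derived in registers.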
