5–8 Alpha Architecture Handbook
• Both conditional branches are forward branches, so they are properly predicted not to
be taken (to match the common case of no contention for the lock).
• The OR writes its result to a second register; this allows the OR and the BLBS to be
interchanged if that would give a faster instruction schedule.
• Other operate instructions (from the critical section) may be scheduled into the
LDQ_L..STQ_C sequence, so long as they do not fault or trap and they give correct
results if repeated; other memory or operate instructions may be scheduled between the
STQ_C and BEQ.
• The memory barrier instructions are discussed in Section 5.5.4. It is correct to substitute
WMB for the second MB only if:
– All data locations that are read or written in the critical section are accessed only
after acquiring a software lock by using lock_variable (and before releasing the
software lock).
– For each read u of shared data in the critical section, there is a write v such that:
1. v is BEFORE the WMB
2. v follows u in processor issue sequence (see Section 5.6.1.1)
3. v either depends on u (see Section 5.6.1.7) or overlaps u (see Section 5.6.1), or
both.
– Both lock_variable and all the shared data are in memory-like regions (or
lock_variable and all the shared data are in non-memory-like regions). If the
lock_variable is in a non-memory-like region, the atomic lock protocol must use
some implementation-specific hardware support.
Generally, the substitution of a WMB for the second MB increases performance.
• An ordinary STQ instruction is used to clear the lock_variable.
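The shape of the acquire and release steps described in these notes can be loosely sketched in portable C11. This is an illustration under stated assumptions, not the handbook's instruction sequence: the names sketch_lock, sketch_try_acquire, and sketch_release are hypothetical, a compare-and-exchange stands in for the LDQ_L..STQ_C pair, and a sequentially consistent fence stands in for each MB. The WMB substitution is not modeled, since C11 has no direct equivalent of a store-only barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical lock word: bit 0 set means "locked". */
typedef struct { _Atomic uint64_t word; } sketch_lock;

/* Try once to take the lock. Returns 1 on success, 0 otherwise. */
int sketch_try_acquire(sketch_lock *l)
{
    /* Read the lock word (the role LDQ_L plays in the handbook sequence). */
    uint64_t old = atomic_load_explicit(&l->word, memory_order_relaxed);

    /* Already locked: leave without storing (the role of BLBS). */
    if (old & 1u)
        return 0;

    /* Compute the locked value separately from the value read, then store
       it only if the word is unchanged (the role of OR plus STQ_C/BEQ). */
    uint64_t locked = old | 1u;
    if (!atomic_compare_exchange_strong_explicit(&l->word, &old, locked,
            memory_order_relaxed, memory_order_relaxed))
        return 0;

    /* Full fence before the critical section (stands in for the first MB). */
    atomic_thread_fence(memory_order_seq_cst);
    return 1;
}

/* Release the lock; the caller must currently hold it. */
void sketch_release(sketch_lock *l)
{
    assert(atomic_load_explicit(&l->word, memory_order_relaxed) & 1u);

    /* Full fence after the critical section (stands in for the second MB). */
    atomic_thread_fence(memory_order_seq_cst);

    /* An ordinary store clears the lock word; no conditional store needed. */
    atomic_store_explicit(&l->word, 0, memory_order_relaxed);
}
```

A caller would wrap the critical section between a successful sketch_try_acquire and a matching sketch_release. A failed attempt performs no store to the lock word.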
It would be a performance mistake to spin-wait by repeating the full LDQ_L..STQ_C sequence
(that is, by moving the BLBS after the BEQ), because that sequence may repeatedly rewrite the
software lock_variable from "locked" to "locked," with each write causing extra access delays in
all other caches that contain the lock_variable. In the extreme, spin-waits that contain writes
may deadlock as follows:
If, when one processor spins with writes, another processor is modifying (not changing)
the lock_variable, then the writes on the first processor may cause the STx_C of the
modify on the second processor always to fail.
This deadlock situation is avoided by:
• Having only one processor execute a store (no STx_C), or
• Having no write in the spin loop, or
• Doing a write only if the shared variable actually changes state (1 → 1 does not change
state).
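A spin-wait that follows the second and third rules above might look like the following C11 sketch. It is a minimal illustration, not the handbook's code: spin_sketch_lock, spin_sketch_acquire, and spin_sketch_release are hypothetical names, and the compare-and-exchange is only an analog of an LDQ_L..STQ_C pair. While the lock is held the loop issues loads only, and a store is attempted only for the unlocked-to-locked transition.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical lock word: bit 0 set means "locked". */
typedef struct { _Atomic uint64_t word; } spin_sketch_lock;

/* Spin until the lock is acquired.
   Returns how many times the lock was observed held (read-only polls). */
unsigned long spin_sketch_acquire(spin_sketch_lock *l)
{
    unsigned long polls = 0;

    for (;;) {
        uint64_t seen = atomic_load_explicit(&l->word, memory_order_relaxed);

        if (seen & 1u) {
            /* Locked: keep polling with loads only, so the spin loop
               never rewrites "locked" with "locked". */
            polls++;
            continue;
        }

        /* Unlocked: the only store attempted is the real state change. */
        if (atomic_compare_exchange_weak_explicit(&l->word, &seen, seen | 1u,
                memory_order_acquire, memory_order_relaxed))
            return polls;
        /* The exchange failed (contention or spurious failure): poll again. */
    }
}

/* Release the lock with an ordinary store; the caller must hold it. */
void spin_sketch_release(spin_sketch_lock *l)
{
    assert(atomic_load_explicit(&l->word, memory_order_relaxed) & 1u);
    atomic_store_explicit(&l->word, 0, memory_order_release);
}
```

On an uncontended lock, spin_sketch_acquire returns after a single load and one successful exchange. Under contention, waiters generate read traffic only until the holder releases.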