IA-32 Intel® Architecture Optimization
The best practice to reduce the overhead of thread synchronization is to
start by reducing the application’s requirements for synchronization.
Intel Thread Profiler can be used to profile the execution timeline of
each thread and detect situations where performance is impacted by
frequent occurrences of synchronization overhead.
Several coding techniques and operating system (OS) calls are
frequently used for thread synchronization. These include spin-wait
loops, spin-locks, and critical sections, to name a few. Choosing the optimal
OS calls for the circumstance and implementing synchronization code
with parallelism in mind are critical in minimizing the cost of handling
thread synchronization.
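As an illustration of one of the techniques named above, the following is a minimal sketch of a spin-lock built on a spin-wait loop. It is not taken from this manual: it assumes a C11 compiler with <stdatomic.h>, and the names demo_spinlock, demo_spin_init, demo_spin_trylock, demo_spin_lock, demo_spin_unlock and cpu_relax are hypothetical. While waiting, the loop only reads the lock variable and issues the PAUSE hint (through the _mm_pause intrinsic on x86 targets), and retries the atomic exchange only after the lock appears free.

```c
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* PAUSE hint inside the spin-wait loop */
#else
#define cpu_relax() ((void)0)     /* assumption: no-op on non-x86 targets */
#endif

/* Hypothetical type for illustration only. */
typedef struct {
    atomic_bool locked;           /* true while some thread owns the lock */
} demo_spinlock;

static inline void demo_spin_init(demo_spinlock *l)
{
    atomic_init(&l->locked, false);
}

/* Single attempt: returns true if the caller acquired the lock. */
static inline bool demo_spin_trylock(demo_spinlock *l)
{
    /* exchange returns the previous value; false means it was free */
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

static inline void demo_spin_lock(demo_spinlock *l)
{
    while (!demo_spin_trylock(l)) {
        /* Spin-wait on plain loads until the lock looks free,
           then go back and retry the atomic exchange. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            cpu_relax();
    }
}

static inline void demo_spin_unlock(demo_spinlock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

A thread would bracket a short update of shared data with demo_spin_lock and demo_spin_unlock. A lock of this kind is reasonable only when the protected region is brief; for longer waits, an OS synchronization object that blocks the waiting thread is usually the better choice.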
SSE3 provides two instructions (MONITOR/MWAIT) to help
multithreaded software improve synchronization between multiple
agents. In the first implementation of MONITOR and MWAIT, these
instructions are available to the operating system so that it
can optimize thread synchronization in different areas. For example, an
operating system can use MONITOR and MWAIT in its system idle
loop (known as C0 loop) to reduce power consumption. An operating
system can also use MONITOR and MWAIT to implement its C1 loop to
improve the responsiveness of the C1 loop (see Chapter 7 in the IA-32
Intel® Architecture Software Developer’s Manual, Volume 3A).
Choice of Synchronization Primitives
Thread synchronization often involves modifying some shared data
while protecting the operation using synchronization primitives. There
are many primitives to choose from. The following guidelines are useful
when selecting synchronization primitives:
• Favor compiler intrinsics or an OS-provided interlocked API for
atomic updates of simple data operations, such as increment and
compare/exchange. These are more efficient than more
complicated synchronization primitives, which carry higher overhead. For
more information on using different synchronization primitives, see