Intel IA-32 Computer Accessories User Manual


 
Multi-Core and Hyper-Threading Technology 7
Using a functional decomposition threading model, a multithreaded
application can pair a thread that depends critically on a
low-throughput resource with other threads that do not share that
dependency.
User/Source Coding Rule 40. (M impact, L generality) If a single thread
consumes half of the peak bandwidth of a specific execution unit (e.g., fdiv),
consider adding a thread that rarely relies on that execution unit
when tuning for Hyper-Threading Technology.
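The pairing described in this rule can be sketched in C as follows. This is
a minimal illustration under stated assumptions, not code from this manual:
the names divide_worker, integer_worker, run_paired and N_ITEMS are
hypothetical, POSIX threads are assumed to be available, and the sketch
leaves placement of the two threads on the logical processors of one
physical package to the operating system. Whether the divide loop actually
issues fdiv depends on the compiler and its options.

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical workload size; not taken from the manual. */
enum { N_ITEMS = 1 << 16 };

typedef struct {
    double result;      /* output of the divide-bound worker */
} divide_job;

typedef struct {
    uint32_t checksum;  /* output of the integer-only worker */
} integer_job;

/* Worker A: one floating-point divide per iteration, so its
 * throughput is tied to the divide unit. */
static void *divide_worker(void *arg)
{
    divide_job *job = arg;
    double acc = 0.0;
    for (int i = 1; i <= N_ITEMS; i++)
        acc += 1.0 / (double)i;
    job->result = acc;
    return NULL;
}

/* Worker B: only integer xor, shift and add; it never divides,
 * so it does not compete with worker A for the divide unit. */
static void *integer_worker(void *arg)
{
    integer_job *job = arg;
    uint32_t h = 2166136261u;
    for (uint32_t i = 0; i < (uint32_t)N_ITEMS; i++) {
        h ^= i;
        h = (h << 5) + h;
    }
    job->checksum = h;
    return NULL;
}

/* Runs the two workers concurrently on separate threads.
 * Returns 0 on success, -1 if a thread could not be created. */
int run_paired(divide_job *d, integer_job *n)
{
    pthread_t td, tn;
    if (pthread_create(&td, NULL, divide_worker, d) != 0)
        return -1;
    if (pthread_create(&tn, NULL, integer_worker, n) != 0) {
        pthread_join(td, NULL);
        return -1;
    }
    pthread_join(td, NULL);
    pthread_join(tn, NULL);
    return 0;
}
```

The design point is the contrast between the two loop bodies: pairing two
copies of divide_worker would have both threads contend for the same
low-throughput unit, whereas the integer worker can make progress while
the divide unit is busy.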
To ensure execution resources are shared cooperatively and efficiently
between two logical processors, it is important to reduce stall
conditions, especially those conditions causing the machine to flush its
pipeline.
The primary indicator of a Pentium 4 processor pipeline stall condition
is called Machine Clear. The metric is available from the VTune
Analyzer’s event sampling capability. When a machine clear
condition occurs, all instructions in flight (at various stages of
processing in the pipeline) must be resolved: each is either
retired or cancelled. While the pipeline is being cleared, no new
instructions can be fed into it for execution, and execution resources
remain idle until the machine clear condition is de-asserted.
Reducing machine clear conditions benefits single-thread
performance because it improves how well each thread's performance
scales with clock frequency. The impact is even greater on processors
supporting Hyper-Threading Technology, because a machine clear
condition caused by one thread can affect other threads executing
simultaneously.
Several performance metrics can be used to detect situations that may
cause a pipeline to be cleared. The primary metric is the Machine Clear
Count: it indicates the total number of times a machine clear condition is
asserted due to any cause. Possible causes include memory order
violations and self-modifying code. Assists while executing x87 or SSE
instructions have a similar effect on the processor’s pipeline and should
be reduced to a minimum.
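As an illustration of reducing one such cause, the sketch below keeps
per-thread data on separate cache lines. It rests on an assumption not
stated in this section: that two threads reading and writing the same
cache line is a common trigger of memory order machine clears. The line
size ASSUMED_LINE_BYTES, the type names and the helper counter_separation
are hypothetical; the appropriate spacing depends on the target processor
and may be larger than the value used here.

```c
#include <stddef.h>

/* Assumed cache line size in bytes; adjust for the target processor. */
#define ASSUMED_LINE_BYTES 64

/* A counter aligned to the assumed line size. The alignment also
 * rounds the struct size up to a multiple of ASSUMED_LINE_BYTES,
 * so adjacent array elements never share a line. */
typedef struct {
    _Alignas(ASSUMED_LINE_BYTES) unsigned long count;
} padded_counter;

/* One slot per thread; a hypothetical two-thread layout. */
typedef struct {
    padded_counter slot[2];
} per_thread_counters;

/* Returns the distance in bytes between the two threads' counters. */
size_t counter_separation(const per_thread_counters *c)
{
    return (size_t)((const char *)&c->slot[1] - (const char *)&c->slot[0]);
}
```

With this layout each thread updates only its own slot, so a store by
one thread does not touch the line the other thread is loading from.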