In general, unrolling loops improves performance by giving the processor opportunities to work on data for the next loop iteration while it waits for the result of an operation from the previous iteration. The reciprocal_sqrt_1xloop loop computes the reciprocal square root of the remaining elements that do not form a full segment of 16 floating-point values. In this chapter, the previous function is the only example that handles a vector stream of arbitrary length num_points. This is done to conserve space, but all of the examples in this chapter can be modified in a similar manner and used universally.
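
The following C sketch, written with SSE intrinsics rather than the assembly listings used elsewhere in this guide, is offered only to illustrate the structure just described: a four-times-unrolled main loop that consumes 16 floating-point values (64 bytes) per pass, followed by a 1x loop that handles the remaining elements. The function name, the 16-byte alignment of the buffers, and the assumption that num_points is a multiple of 4 are illustrative assumptions, not the guide's actual listing.

#include <xmmintrin.h>   /* SSE intrinsics */

/* Illustrative sketch only; assumes in and out are 16-byte aligned and
   num_points is a multiple of 4. */
void reciprocal_sqrt(const float *in, float *out, int num_points)
{
    const __m128 ones = _mm_set1_ps(1.0f);
    int i = 0;

    /* 4x-unrolled loop: four XMM registers = 16 floats = 64 bytes per pass. */
    for (; i + 16 <= num_points; i += 16) {
        _mm_store_ps(&out[i],      _mm_div_ps(ones, _mm_sqrt_ps(_mm_load_ps(&in[i]))));
        _mm_store_ps(&out[i + 4],  _mm_div_ps(ones, _mm_sqrt_ps(_mm_load_ps(&in[i + 4]))));
        _mm_store_ps(&out[i + 8],  _mm_div_ps(ones, _mm_sqrt_ps(_mm_load_ps(&in[i + 8]))));
        _mm_store_ps(&out[i + 12], _mm_div_ps(ones, _mm_sqrt_ps(_mm_load_ps(&in[i + 12]))));
    }

    /* 1x loop: the remaining elements that do not form a full segment of
       16 floating-point values, handled one XMM register (4 floats) at a time. */
    for (; i < num_points; i += 4) {
        _mm_store_ps(&out[i], _mm_div_ps(ones, _mm_sqrt_ps(_mm_load_ps(&in[i]))));
    }
}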
Additionally, the previous SSE function makes use of the PREFETCHNTA instruction to hide the latency of fetching data from memory. The unrolled loop reciprocal_sqrt_4xloop was chosen to work with 64 bytes of data per iteration, which happens to be the size of one cache line (the term for the quantum of data brought into the processor's cache by a memory access when the data does not already reside there). The prefetch causes the processor to begin loading the floating-point operands of the reciprocal square-root operations four loop iterations before they are used. While the processor works on the next three iterations, the data for the fourth iteration is brought into the cache, so the processor does not have to wait for that data to arrive from memory when the aligned SSE move instruction MOVAPS loads it. This type of memory optimization can be very useful in gaming and high-performance computing, in which data sets are unlikely to reside in the processor's cache. For example, in a simulation involving a million vertices or atoms whose coordinates occupy 12 bytes each (three single-precision floating-point values), the data alone requires roughly 12 Mbytes, well beyond the capacity of the processor's caches.
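
To make the prefetch usage concrete, the sketch above can be extended with the _mm_prefetch intrinsic, which emits PREFETCHNTA when given the _MM_HINT_NTA hint. The prefetch distance of four unrolled iterations (4 x 64 bytes = 256 bytes ahead) follows the description above; the distance used in the guide's actual listing, the function name, and the alignment assumptions are illustrative here, not authoritative.

#include <xmmintrin.h>

/* Illustrative sketch only; assumes 16-byte-aligned buffers and num_points
   a multiple of 16. */
void reciprocal_sqrt_prefetch(const float *in, float *out, int num_points)
{
    const __m128 ones = _mm_set1_ps(1.0f);

    for (int i = 0; i < num_points; i += 16) {
        /* PREFETCHNTA: request the cache line whose operands are used four
           iterations from now (64 floats = 256 bytes ahead). Prefetches that
           run past the end of the buffer are harmless; prefetch instructions
           do not fault. */
        _mm_prefetch((const char *)&in[i + 64], _MM_HINT_NTA);

        _mm_store_ps(&out[i],      _mm_div_ps(ones, _mm_sqrt_ps(_mm_load_ps(&in[i]))));
        _mm_store_ps(&out[i + 4],  _mm_div_ps(ones, _mm_sqrt_ps(_mm_load_ps(&in[i + 4]))));
        _mm_store_ps(&out[i + 8],  _mm_div_ps(ones, _mm_sqrt_ps(_mm_load_ps(&in[i + 8]))));
        _mm_store_ps(&out[i + 12], _mm_div_ps(ones, _mm_sqrt_ps(_mm_load_ps(&in[i + 12]))));
    }
}

The NTA (non-temporal) hint asks the processor to minimize cache pollution for data that will be read once, which suits streaming workloads such as the large vertex and atom data sets described above.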