B-8 March, 2003 Developer’s Manual
Intel® 80200 Processor based on Intel® XScale™ Microarchitecture
Optimization Guide
B.2.4 Memory Pipeline
The memory pipeline consists of two stages, D1 and D2. The data cache unit, or DCU, consists of
the data-cache array, mini-data cache, fill buffers, and write buffers. The memory pipeline handles
load/store instructions.
B.2.4.1. D1 and D2 Pipestage
Operation begins in D1 after the X1 pipestage has calculated the effective address for loads and stores.
The data cache and mini-data cache return the destination data in the D2 pipestage. Before data is
returned in the D2 pipestage, sign extension and byte alignment occur for byte and half-word
loads.
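The byte alignment and sign extension described above can be sketched in C. This is only an illustrative model of what a signed byte or half-word load observes, not a description of the hardware datapath. The function names and the little-endian byte numbering are assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical helper: select the byte at byte_offset (0..3) of a 32-bit
   word and sign-extend it to 32 bits. Little-endian byte numbering is an
   assumption of this sketch. */
int32_t load_signed_byte(uint32_t word, unsigned byte_offset)
{
    /* Byte alignment: move the addressed byte down to bits 7:0. */
    uint32_t b = (word >> (8u * (byte_offset & 3u))) & 0xFFu;
    /* Sign extension: flip bit 7, then subtract its weight. */
    return (int32_t)(b ^ 0x80u) - 0x80;
}

/* Hypothetical helper: select the half-word at byte_offset 0 or 2 and
   sign-extend it to 32 bits. */
int32_t load_signed_halfword(uint32_t word, unsigned byte_offset)
{
    uint32_t h = (word >> (8u * (byte_offset & 2u))) & 0xFFFFu;
    return (int32_t)(h ^ 0x8000u) - 0x8000;
}
```

For the word 0x80FF7F01, byte offset 3 holds 0x80 and is returned as -128, while byte offset 1 holds 0x7F and is returned as 127.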
B.2.5 Multiply/Multiply Accumulate (MAC) Pipeline
The Multiply-Accumulate (MAC) unit executes the multiply and multiply-accumulate instructions
supported by the Intel® 80200 processor core. The MAC implements the 40-bit Intel® 80200
processor accumulator register acc0 and handles the instructions that transfer its value to and
from general-purpose ARM registers.
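As a rough software model of a 40-bit accumulator, the sketch below keeps the value in a 64-bit integer and wraps it to 40 bits after each update. The function names are hypothetical. The split into a low 32-bit register and a sign-extended upper 8-bit field is an assumption of this example about how the value maps onto two 32-bit general-purpose registers; consult the instruction descriptions for the exact transfer semantics.

```c
#include <stdint.h>

/* Wrap a value to 40 bits and sign-extend from bit 39. */
int64_t acc_wrap40(int64_t v)
{
    uint64_t u = (uint64_t)v & ((1ULL << 40) - 1);
    return (int64_t)(u ^ (1ULL << 39)) - (int64_t)(1ULL << 39);
}

/* Multiply-accumulate two signed 32-bit operands into the 40-bit model. */
int64_t acc_mac(int64_t acc, int32_t a, int32_t b)
{
    return acc_wrap40(acc + (int64_t)a * (int64_t)b);
}

/* Assumed register mapping: lo = bits 31:0, hi = bits 39:32 sign-extended. */
void acc_to_regs(int64_t acc, uint32_t *lo, int32_t *hi)
{
    uint64_t u = (uint64_t)acc;
    uint32_t top = (uint32_t)((u >> 32) & 0xFFu);
    *lo = (uint32_t)(u & 0xFFFFFFFFu);
    *hi = (int32_t)(top ^ 0x80u) - 0x80;
}

/* Inverse of acc_to_regs under the same assumed mapping. */
int64_t acc_from_regs(uint32_t lo, int32_t hi)
{
    uint64_t u = ((uint64_t)((uint32_t)hi & 0xFFu) << 32) | lo;
    return acc_wrap40((int64_t)u);
}
```

In this model a sum that exceeds the signed 40-bit range wraps around rather than saturating, which is a modeling choice of the sketch.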
The following are important characteristics of the MAC:
• The MAC is not truly pipelined, as the processing of a single instruction may require use of the
same datapath resources for several cycles before a new instruction can be accepted. The type
of instruction and source arguments determine the number of cycles required.
• No more than two instructions can occupy the MAC pipeline concurrently.
• When the MAC is processing an instruction, another instruction may not enter M1 unless the
original instruction completes in the next cycle.
• The MAC unit can operate on 16-bit packed signed data. This reduces register pressure and
memory traffic. Two 16-bit data items can be loaded into a register with one LDR.
• The MAC can achieve throughput of one multiply per cycle when performing a 16-by-32-bit
multiply.
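To illustrate the packed 16-bit data point in the list above, the following C sketch computes a dot product over arrays in which each 32-bit word holds two signed 16-bit items, so one word load supplies two operands. The function names and the choice of which half is "low" are assumptions of the example. It models the arithmetic only and says nothing about which instructions a compiler would emit.

```c
#include <stdint.h>
#include <stddef.h>

/* Sign-extend the low 16 bits of a packed word. */
static int32_t low_half(uint32_t w)
{
    return (int32_t)((w & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* Sign-extend the high 16 bits of a packed word. */
static int32_t high_half(uint32_t w)
{
    return (int32_t)((w >> 16) ^ 0x8000u) - 0x8000;
}

/* Dot product over n_words packed words (2 * n_words 16-bit items).
   The sum is kept in 64 bits here; it is not wrapped to 40 bits. */
int64_t packed_dot(const uint32_t *a, const uint32_t *b, size_t n_words)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n_words; i++) {
        acc += (int64_t)low_half(a[i]) * low_half(b[i]);
        acc += (int64_t)high_half(a[i]) * high_half(b[i]);
    }
    return acc;
}
```

Packing halves the number of word loads needed for a given number of 16-bit samples, which is the reduction in memory traffic and register pressure the list refers to.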
B.2.5.1. Behavioral Description
The execution of the MAC unit starts at the beginning of the M1 pipestage, where it receives two
32-bit source operands. Results are completed N cycles later (where N is dependent on the operand
size) and returned to the register file. For more information on MAC instruction latencies, refer to
Section 14.4, “Instruction Latencies”.
An instruction that occupies the M1 or M2 pipestage also occupies the X1 or X2 pipestage,
respectively. Each cycle, a MAC operation progresses from M1 to M5. A MAC operation may
complete anywhere from M2 to M5. If a MAC operation enters M3-M5, it is considered committed
because it modifies architectural state regardless of subsequent events.
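The stage rules in this paragraph can be restated as a small C model. It is a sketch for reasoning about the text, with hypothetical names, and it assumes an operation advances exactly one pipestage per cycle with no stalls.

```c
#include <stdbool.h>

typedef enum { MAC_M1 = 1, MAC_M2, MAC_M3, MAC_M4, MAC_M5 } mac_stage;

/* Stage occupied 'cycles' cycles after entering M1, for an operation whose
   final stage is 'last' (M2..M5). Returns 0 once it has left the MAC.
   Assumes one stage per cycle and no stalls. */
int mac_stage_after(int cycles, mac_stage last)
{
    int stage = MAC_M1 + cycles;
    return (cycles >= 0 && stage <= (int)last) ? stage : 0;
}

/* An operation in M3, M4, or M5 is considered committed. */
bool mac_is_committed(int stage)
{
    return stage >= MAC_M3 && stage <= MAC_M5;
}

/* M1 and M2 overlap the X1 and X2 pipestages, respectively. */
bool mac_shares_x_stage(int stage)
{
    return stage == MAC_M1 || stage == MAC_M2;
}
```

Under these assumptions, an operation that completes in M2 never reaches a committed stage inside the MAC, while one that runs to M5 is committed from its third cycle onward.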