6-6 SA-1100
Developer’s Manual
Caches, Write Buffer, and Read Buffer
6.3.2.2 Writes to a Bufferable and Noncacheable Location (B=1,C=0)
If the write buffer is enabled and the processor performs a write to a bufferable but noncacheable
location and misses in the Dcaches, the data is placed in the write buffer and the CPU continues
execution. As with the cacheable case, merging is allowed only on store multiples. The write buffer
performs the external write sometime later.
6.3.2.3 Unbufferable Writes (B=0)
If the write buffer is disabled or the CPU performs a write to an unbufferable area, the processor is
stalled until the write buffer empties and the write completes externally. This requires several
external clock cycles.
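The distinction between Sections 6.3.2.2 and 6.3.2.3 can be summarized as a small decision function. The following is only an illustrative host-side model, not SA-1100 code: the names write_outcome and classify_noncacheable_write are hypothetical, and the model covers only noncacheable writes (C=0) that miss in the Dcaches.

```c
#include <stdbool.h>

/* Outcome of a write as seen by the CPU (illustrative names). */
typedef enum {
    WRITE_BUFFERED, /* data is placed in the write buffer; CPU continues */
    WRITE_STALLS    /* CPU stalls until the buffer empties and the
                       write completes externally */
} write_outcome;

/*
 * Classify a noncacheable (C=0) write that misses in the Dcaches.
 * wb_enabled: write buffer enable (control register bit 3)
 * b_bit:      B attribute of the location being written
 */
write_outcome classify_noncacheable_write(bool wb_enabled, bool b_bit)
{
    if (wb_enabled && b_bit)
        return WRITE_BUFFERED;   /* Section 6.3.2.2 */
    return WRITE_STALLS;         /* Section 6.3.2.3 */
}
```

In this model, a disabled write buffer and an unbufferable location lead to the same stalling behavior, which mirrors the wording of Section 6.3.2.3.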
6.3.3 Enabling the Write Buffer
To enable the write buffer, ensure that the MMU is enabled by setting bit 0 in the control register,
then enable the write buffer by setting bit 3 in the control register. The MMU and write buffer can
be enabled simultaneously with a single write to the control register.
6.3.3.1 Disabling the Write Buffer
To disable the write buffer, clear bit 3 in the control register. Any writes already in the write buffer
complete normally; to force all of them out to memory, perform a drain write buffer operation.
Note: The write buffer is used for copy-backs from the Dcaches even when they are disabled.
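The control register updates described in Sections 6.3.3 and 6.3.3.1 amount to setting or clearing two bits. The sketch below computes only the new register value; the macro and function names are hypothetical, and the actual read and write of the control register through the coprocessor interface (see Chapter 5, “Coprocessors”) is deliberately left out.

```c
#include <stdint.h>

#define CTRL_MMU_ENABLE (1u << 0) /* bit 0: MMU enable */
#define CTRL_WB_ENABLE  (1u << 3) /* bit 3: write buffer enable */

/* Value to write so that the MMU and the write buffer are both
   enabled by a single write to the control register. */
uint32_t ctrl_enable_write_buffer(uint32_t ctrl)
{
    return ctrl | CTRL_MMU_ENABLE | CTRL_WB_ENABLE;
}

/* Value to write to disable the write buffer; all other bits,
   including the MMU enable, are left unchanged. */
uint32_t ctrl_disable_write_buffer(uint32_t ctrl)
{
    return ctrl & ~CTRL_WB_ENABLE;
}
```

A caller would read the current control register value, pass it through one of these helpers, and write the result back. After disabling, a drain write buffer operation would still be needed to force pending writes out to memory.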
6.4 Read Buffer (RB)
The SA-1100 contains a software-programmable read buffer that can increase the performance of
critical loop code by prefetching data. The RB enables the preallocation of read-only data into one
of four 32-byte buffers without stalling the pipe. For subsequent loads that hit in the RB, data is
sourced from the buffer instead of the Dcaches at a rate of 1 word per core clock. Also, because
the programmer specifies which entry of the RB is used, critical data can be “locked” in to
eliminate bus latency.
The RB is controlled using coprocessor 15, register 9, and provides the capability to allocate 1
word, a half-line (4 words), or a full line (8 words) into one of four entries of the RB. (See
Chapter 5, “Coprocessors” for a detailed RB coprocessor description.) Half-line loads are
automatically aligned onto half-block boundaries (the lower four address bits are ignored).
Full-line loads are automatically aligned onto line boundaries (the lower five address bits are
ignored). For partial cache line RB loads, only the words actually fetched are marked valid and can
be sourced from the buffer. A small queue is used to ensure that subsequent RB load instructions go
out in order.
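The alignment rules above can be illustrated with a short host-side calculation. This is a sketch under stated assumptions, not a description of the hardware: the type and function names are hypothetical, single-word loads are assumed to be word aligned, and the valid mask assumes that word i of an RB entry corresponds to word i of the aligned 32-byte line.

```c
#include <stdint.h>

typedef enum { RB_WORD, RB_HALF_LINE, RB_FULL_LINE } rb_size;

/* Address of the first byte fetched for an RB load of the given size. */
uint32_t rb_aligned_address(uint32_t addr, rb_size size)
{
    switch (size) {
    case RB_HALF_LINE: return addr & ~UINT32_C(0xF);  /* lower 4 bits ignored */
    case RB_FULL_LINE: return addr & ~UINT32_C(0x1F); /* lower 5 bits ignored */
    default:           return addr & ~UINT32_C(0x3);  /* assumption: word aligned */
    }
}

/* Bit i set means word i of the 32-byte entry is marked valid. */
uint8_t rb_valid_words(uint32_t addr, rb_size size)
{
    unsigned word = (addr >> 2) & 7u; /* word index within the line */

    switch (size) {
    case RB_HALF_LINE: return (word & 4u) ? 0xF0 : 0x0F; /* upper or lower half */
    case RB_FULL_LINE: return 0xFF;                      /* all 8 words */
    default:           return (uint8_t)(1u << word);     /* single word */
    }
}
```

For example, a half-line load issued for address 0x1234567B is aligned to 0x12345670 and, in this model, marks only the upper four words of the entry as valid.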
When an RB allocate instruction is executed, the virtual address is looked up in the TB to check for
a translation hit and possible access violations. If the access misses in the TB, the pipe is stalled
until the page is fetched through the normal hardware tablewalk mechanism. If an access violation
occurs, the RB load is NOP’d; that is, an RB allocate instruction cannot generate a data abort.
Once the RB allocate has hit in the TB with no access violations, a bus access is requested that
fills the appropriate buffer without stalling the core pipeline. Subsequent loads from this
virtual address hit in the RB, and the data is sourced to the core from the appropriate entry.
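The allocate-then-hit sequence can be mimicked with a small software model. Everything here is hypothetical and simplified: the structure and function names are invented for illustration, only full-line allocations are modeled, the TB lookup is reduced to a single access_ok flag supplied by the caller, and a TB miss is treated as already resolved by the tablewalk.

```c
#include <stdbool.h>
#include <stdint.h>

#define RB_ENTRIES 4

typedef struct {
    bool     valid;     /* entry holds fetched data */
    uint32_t line_addr; /* virtual address of the 32-byte line held */
} rb_entry;

typedef enum { RB_FILLED, RB_NOPPED } rb_alloc_result;

/* Model a full-line RB allocate into the entry chosen by the programmer.
   access_ok is the outcome of the TB permission check. */
rb_alloc_result rb_allocate(rb_entry rb[], unsigned entry,
                            uint32_t vaddr, bool access_ok)
{
    if (!access_ok)
        return RB_NOPPED; /* access violation: the load is dropped */

    rb[entry % RB_ENTRIES].valid = true;
    rb[entry % RB_ENTRIES].line_addr = vaddr & ~UINT32_C(0x1F);
    return RB_FILLED;
}

/* A later load hits if some valid entry holds the line containing vaddr. */
bool rb_hit(const rb_entry rb[], uint32_t vaddr)
{
    for (unsigned i = 0; i < RB_ENTRIES; i++) {
        if (rb[i].valid && rb[i].line_addr == (vaddr & ~UINT32_C(0x1F)))
            return true;
    }
    return false;
}
```

In this model, a rejected allocate leaves the buffer untouched, so a later load from that address simply misses in the RB.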