NEC 5800/1000 Personal Computer User Manual


 
VLC Architecture
High-speed / low latency Intra-Cell cache-to-cache data transfer
The Express5800/1000 series server implements the Very Large Cache (VLC)
architecture, which allows low-latency cache-to-cache data transfer
between multiple CPUs within a Cell.
In a split BUS architecture, data must pass through a chipset for a
cache-to-cache data transfer to take place. In the VLC architecture,
however, each CPU can access data in another CPU's cache memory directly,
bypassing the chipset. This lowers the latency between cache memories,
which results in faster data transfers.
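
The difference between the two paths can be sketched as a simple hop model. This is only an illustration, not NEC's implementation: the function name `transfer_path`, the component labels, and the assumption that a split BUS system places CPUs on two separate FSBs joined by the chipset are all hypothetical and chosen to match the description above.

```python
# Hypothetical hop model. The path shapes are assumptions taken from the
# prose above, not measurements or an actual hardware description.

def transfer_path(src, dst, fsb_of):
    """Return the components a cache line passes through from src to dst."""
    if fsb_of[src] == fsb_of[dst]:
        # Both CPUs sit on the same bus: direct cache-to-cache transfer.
        return [f"cache{src}", f"FSB{fsb_of[src]}", f"cache{dst}"]
    # CPUs on different buses: the data must cross the chipset.
    return [f"cache{src}", f"FSB{fsb_of[src]}", "chipset",
            f"FSB{fsb_of[dst]}", f"cache{dst}"]

# Assumed layouts: VLC puts all four CPUs on one bus, split BUS uses two.
vlc_layout = {0: 0, 1: 0, 2: 0, 3: 0}
split_layout = {0: 0, 1: 0, 2: 1, 3: 1}

vlc_path = transfer_path(0, 2, vlc_layout)
split_path = transfer_path(0, 2, split_layout)
print(" -> ".join(vlc_path))
print(" -> ".join(split_path))
```

In this model every VLC transfer has the same short path, while a split BUS transfer is longer whenever the two CPUs are on different buses, which is one way to picture why its cache-to-cache latency is non-uniform.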
Dedicated Cache Coherency Interface (CCI)
High-speed / low latency Inter-Cell cache-to-cache data transfer
Another technology implemented in the Express5800/1000 series
server to improve cache-to-cache data transfer is the Cache
Coherency Interface (CCI). CCI, the inter-Cell counterpart of the
VLC architecture, allows lower-latency cache-to-cache data
transfer between Cells.
Information containing the location and state of cached data is
required for the CPU to access the specific data stored in cache
memory. By accessing the cache memory according to this
information, the CPU is able to retrieve the desired data.
Two main mechanisms exist for cache-to-cache data transfer
between Cells: directory based and TAG based cache coherency.
The cache information described above is stored in external
memory (DIR memory) in the directory based mechanism, and within
the chipset in the TAG based mechanism.
In a directory based system, the requestor CPU first accesses the
external memory to confirm the location of the cached data, and
then accesses the appropriate cache memory. In a TAG based system,
on the other hand, the requestor CPU broadcasts a request to
all other caches simultaneously via the TAG.
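
The two lookup styles described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual protocol: the names `directory_lookup`, `tag_broadcast`, `OWNER`, and `CPUS`, the eight-CPU system, and the addresses are all hypothetical.

```python
# Hypothetical state: which CPU's cache holds each cache line.
OWNER = {0x100: "cpu2", 0x140: "cpu5"}
CPUS = [f"cpu{i}" for i in range(8)]

def directory_lookup(addr, directory):
    """Directory based: consult DIR memory first, then one targeted cache."""
    steps = [("DIR", addr)]              # step 1: read the directory
    owner = directory.get(addr)
    if owner is not None:
        steps.append((owner, addr))      # step 2: access the owning cache
    return owner, steps

def tag_broadcast(requestor, addr, owners):
    """TAG based: the request goes to every other cache at the same time."""
    targets = [cpu for cpu in CPUS if cpu != requestor]
    responders = [cpu for cpu in targets if owners.get(addr) == cpu]
    owner = responders[0] if responders else None
    return owner, targets

dir_owner, dir_steps = directory_lookup(0x100, OWNER)
tag_owner, tag_targets = tag_broadcast("cpu0", 0x100, OWNER)
print(dir_owner, dir_steps)
print(tag_owner, len(tag_targets))
```

Both functions find the same owner. The directory version does so in two dependent steps, while the broadcast version issues a single round of parallel requests to all other caches.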
Crossbar-less configuration
Improved data transfer latency through direct attached Cell configuration
Within the Express5800/1000 series server lineup, the 1080Rf
lowers data transfer latency by removing the crossbar and
connecting Cell to Cell, and Cell to PCI box, directly.
Even in the crossbar-less configuration, virtualization of the Cell
card and I/O box is retained so as not to diminish computing
and I/O resources.
[Figure: Very Large Cache (VLC) Architecture compared with Split BUS Architecture.
VLC architecture: direct CPU-to-CPU transfers over the FSB; high-speed
cache-to-cache transfers; increased enterprise application performance
through reduced cache memory access latency.
Split BUS architecture: data passes between FSBs through the chipset (data
transfer controller); overhead from transferring data through the chipset;
higher cache memory access latency; non-uniform cache-to-cache data
transfer; inconsistent performance.
Plots of latency against data size are shown for the Intel® Itanium® 2
processor (Madison: L3 9MB) and the Dual-Core Intel® Itanium® processor
(Montvale: L3 24MB), covering the CPU's own L3, the L3 of another CPU on the
same FSB, and the L3 of another CPU on a different FSB. In the split BUS
plots, latency is approximately 3x higher, and the affected area increases
with the increase in cache size. This image does not depict actual numbers.]
The benefit of the TAG based mechanism, which is the one implemented
in the Express5800/1000 series server, is that accessing the TAG
filters out unnecessary inquiries to the cache memory, giving a
smoother transfer of data. Furthermore, the Express5800/1000
series server includes a dedicated high-speed cache coherency
interface (CCI), which connects the Cells directly to one another
without using a crossbar. This interface carries broadcasts and
other cache coherency transactions, allowing even faster
cache-to-cache data transfer.
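
The filtering effect of the TAG can be sketched as below. This is a simplified model under stated assumptions, not the chipset's actual logic: the `CellTag` class, its methods, and the two-Cell layout are hypothetical, and real TAG memory also tracks cache line state, which is omitted here.

```python
class CellTag:
    """Hypothetical per-Cell TAG: maps a cache line address to the local
    CPUs that currently cache it."""

    def __init__(self, cpus):
        self.cpus = list(cpus)
        self.lines = {}

    def record_fill(self, cpu, addr):
        # Note that a local CPU has loaded this line into its cache.
        self.lines.setdefault(addr, set()).add(cpu)

    def snoop(self, addr):
        # Only CPUs listed in the TAG need their cache probed;
        # every other inquiry is filtered out.
        return sorted(self.lines.get(addr, set()))

def broadcast(cells, addr):
    """Send one request to every Cell and collect the caches to probe."""
    return {name: tag.snoop(addr) for name, tag in cells.items()}

cells = {
    "cell0": CellTag(["cpu0", "cpu1"]),
    "cell1": CellTag(["cpu2", "cpu3"]),
}
cells["cell1"].record_fill("cpu3", 0x200)

probes = broadcast(cells, 0x200)
print(probes)
```

Although the request reaches both Cells, only one of the four CPU caches is actually probed in this example; the TAG answers for the rest.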
[Figure: Tag Based Cache Coherency compared with Directory Based Cache Coherency.
Legend: CPU requesting the information; CPU storing the newest information;
memory storing location information; TAG memory (manages cache line
information for all of the CPUs loaded on a Cell card); DIR memory (manages
cache line information for all of the memory loaded on a Cell card).
Tag based: the request is broadcast to all CPUs simultaneously. The
Express5800/1000 series server implements a dedicated connection (CCI) for
snooping.
Directory based: access the directory to confirm the location of the data
first, then access the appropriate cache memory.
Caption: Performance increase with the A³ chipset.]