only if the system comes up through a fast boot reload. The BGP route selection algorithm selects only
one best path to each destination and delays the installation of additional ECMP paths until at least 30
seconds have elapsed from the time the first BGP peer is established. Once this time has elapsed, all routes
in the BGP RIB are processed for additional paths.
While the above change ensures that at least one path to each destination is installed in the FIB as quickly
as possible, it temporarily prevents additional paths from being used even when they are available. This
trade-off is considered acceptable.
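The delayed ECMP installation itself requires no additional configuration; it applies whenever BGP multipath is enabled and the system comes up through a fast boot reload. The following is a minimal sketch of a BGP multipath configuration in IOS XR syntax; the autonomous system number and path counts are illustrative assumptions, not values taken from this guide.

 router bgp 65000
  address-family ipv4 unicast
   maximum-paths ebgp 8
   maximum-paths ibgp 8
  !
 !

With a configuration such as this, the first best path for each prefix is installed immediately after a fast boot reload, and the remaining equal-cost paths are installed once the 30-second window after the first BGP peer establishment has elapsed.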
RDMA Over Converged Ethernet (RoCE) Overview
This functionality is supported on the platform.
RDMA is a technology that a virtual machine (VM) uses to transfer data directly into the memory of
another VM, thus enabling VMs to be connected to storage networks. With RoCE, RDMA data is
forwarded without passing through the CPU and the TCP/IP path in main memory. In a deployment
where the RoCE network and the normal IP network are two separate networks, RRoCE combines them
by sending the RoCE frames over the IP network. This method of transmission, called RRoCE,
encapsulates RoCE packets into IP packets; in effect, RRoCE carries InfiniBand (IB) packets over IP.
IB provides input and output connectivity for the internet infrastructure. InfiniBand enables network
topologies to span large geographical boundaries and supports the creation of next-generation I/O
interconnect standards in servers.
When a storage area network (SAN) is connected over an IP network, the following conditions must be
satisfied:
• Faster connectivity: QoS for RRoCE enables fast and lossless disk input and output (I/O) services.
• Lossless connectivity: VMs require connectivity to the storage network to be lossless at all times.
During a planned upgrade of network nodes, especially top-of-rack (ToR) nodes that are a single
point of failure for the VMs, disk I/O operations are expected to resume within 20 seconds. If the
disk is not accessible within 20 seconds, the VMs exhibit unexpected and undefined behavior. You
can optimize the boot time of ToR nodes that are a single point of failure to reduce the outage in
traffic-handling operations.
RRoCE traffic is bursty and can consume the entire bandwidth of a 10-Gigabit Ethernet interface. Although
RRoCE and normal data traffic are usually carried in separate portions of the network, certain topologies
require both types of traffic to share a single network structure. RRoCE traffic is marked with dot1p
priorities 3 and 4 (code points 011 and 100, respectively), and these queues are strict-priority and lossless.
RRoCE packets are not marked with DSCP code points. Both ECN and PFC are enabled for RRoCE traffic.
Normal IP or data traffic that is not RRoCE consists of TCP and UDP packets, which can be marked with
DSCP code points. Multicast is not supported in this network.
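The following is a minimal sketch of how such a classification and queuing policy might look in IOS XR MQC syntax, assuming the RRoCE traffic is classified on its dot1p (CoS) markings of 3 and 4. The class, policy, and interface names are illustrative assumptions; the exact PFC and ECN commands supported vary by platform and release and are enabled separately.

 class-map match-any RROCE
  match cos 3
  match cos 4
  end-class-map
 !
 policy-map RROCE-EGRESS
  class RROCE
   priority level 1
  !
  class class-default
  !
  end-policy-map
 !
 interface TenGigE0/0/0/0
  service-policy output RROCE-EGRESS
 !

This sketch places the RRoCE classes into a strict-priority queue; the lossless behavior described above additionally depends on PFC being enabled on the interface and ECN marking being enabled in the queuing policy.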
RRoCE packets are received and transmitted on specific interfaces called lite subinterfaces. These
interfaces are similar to normal Layer 3 physical interfaces, except for the extra provisioning they offer to
enable the VLAN ID for encapsulation.
You can configure a physical interface or a Layer 3 port channel interface as a lite subinterface, as shown
in the example that follows. When you configure a lite subinterface, only tagged IP packets with VLAN
encapsulation are processed and routed. All other data packets are discarded.
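The following is a minimal sketch of a lite subinterface configuration in IOS XR syntax; the interface, VLAN ID, and IP address are illustrative assumptions.

 interface TenGigE0/0/0/6.1
  ipv4 address 192.0.2.1 255.255.255.0
  encapsulation dot1q 5
 !

With this configuration, only IP packets arriving with dot1q VLAN ID 5 are processed and routed on the subinterface; untagged packets and packets with other VLAN IDs are discarded.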