Maintaining virtual network connectivity from infrastructure servers to testbed nodes across multiple disparate underlay networks is a matter of managing three types of network settings: static settings, virtual network settings, and virtual network member settings.
In the testbed environment, static settings are set by the testbed configuration management system when an infrastructure node is commissioned. Virtual network settings are created in response to a new experiment network being created, and are subsequently destroyed when the network is removed. Virtual network member settings are created in response to a new member joining a virtual network, and removed when the member leaves.
Static ARP entry for peer router
In order to route packets we need to know the next hop. Our infrastructure server is on the overlay network, and it knows its next hop on that network. However, we also need to route packets into the underlay network. Since the infrastructure server itself is not on this network, we create a surrogate next hop that points to the interface connected to the next router in the overlay. In the testbed, all of our routers are directly connected, so this surrogate next hop is just a pointer to an interface (as opposed to an IP).
The following line creates the next-hop entry as a static ARP entry. The address 169.254.0.1 is not real; it's just something to point at. The <ifx> parameter is the physical interface that goes to the next hop, and the <mac> parameter specifies the MAC address of the next-hop peer router.
ip neighbor add lladdr <mac> 169.254.0.1 dev <ifx>
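As a concrete sketch, with a hypothetical peer MAC and interface name (both values below are illustrative, not actual testbed values):

```shell
# Create the surrogate next hop: a permanent neighbor entry binding the
# link-local address 169.254.0.1 to the peer router's MAC on eth1.
# The MAC (04:f8:f8:aa:bb:cc) and interface (eth1) are hypothetical.
ip neighbor add 169.254.0.1 lladdr 04:f8:f8:aa:bb:cc dev eth1

# Verify the entry is installed as PERMANENT.
ip neighbor show dev eth1
```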
Routes to leaves
Leaf routers connect directly to experiment nodes; as such they are the translation point between the overlay and underlay networks. When experiment nodes send packets to a leaf, it encapsulates the packets from the underlay and sends them across the routed overlay to their destination. Infrastructure servers are another type of leaf, but instead of connecting to experiment nodes, they connect to testbed services. Note also that virtual machine hosts are another type of leaf, providing connectivity to a set of virtual machines.
In order for a leaf to send packets over the routed overlay to a destination leaf, it has to know what the destination address is and how to reach it. All the routers in the testbed form a BGP network. It's conventional for BGP routers to have /32 addresses and let the routing daemons take care of figuring out the appropriate next hop. Routing daemons are split into an upper and a lower half. The upper half implements the routing protocol (BGP + EVPN extensions in our case) and the lower half takes care of actually updating the underlying platform's forwarding tables. On testbed infrastructure servers, our upper half is GoBGP. The lower half is implemented by us.
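For example, a leaf's /32 BGP router address is typically assigned to a loopback-style interface (the address below is illustrative):

```shell
# Assign a hypothetical /32 BGP router address to the loopback interface.
# Peers reach this address via routes learned over BGP, not via any
# connected subnet, which is why a /32 suffices.
ip address add 10.99.0.1/32 dev lo
```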
Why implement the lower half? GoBGP does have support for Zebra integration (the Quagga / FRR lower half), but not for the latest versions of Zebra that have EVPN support. Moreover, creating a daemon that observes GoBGP to implement the lower half is relatively simple. We only have to maintain
- Leaf routes
- VTEP interfaces
- VTEP forwarding rules
We describe the leaf routes now and present the other two in later sections. The leaf routes take the following form.
ip route add <leaf-bgp-ip> via 169.254.0.1 dev <ifx> proto bgp metric 20 onlink
What this means is that we let the BGP network figure out how to route packets over the mesh (overlay). Each leaf just has to know the addresses of the other leaf routers. The spines and fabric switches (which also run BGP daemons) take care of routing in between.
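As an illustrative instance of the route form above (the leaf address and interface name are hypothetical):

```shell
# Route to a remote leaf's /32 BGP address via the surrogate next hop.
# 'onlink' tells the kernel to treat 169.254.0.1 as directly reachable
# on eth1 even though no connected subnet covers that address.
ip route add 10.99.0.7/32 via 169.254.0.1 dev eth1 proto bgp metric 20 onlink
```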
Virtual network settings
Because each leaf acts as the translation point between the overlay and underlay networks, it has the responsibility of encapsulating and de-encapsulating packets as they transit between the two domains. A special type of virtual interface called a VXLAN tunnel endpoint (VTEP) does this. When virtual networks are created and destroyed, all leaf nodes attached to devices in the network must provide a VTEP for the network. Each VTEP is tied to a 24-bit integer called the virtual network identifier (VNI) that uniquely identifies the virtual network to which it belongs.
Creating a VTEP device is accomplished as follows.
ip link add vtep-<VNI> type vxlan id <VNI> dev <ifx> dstport 4789 local <local-vtep-ip> nolearning
VNI is the virtual network identifier. ifx is the parent interface of the VTEP; this can be either a physical interface or a bridge. In most cases for testbed infrastructure servers, the parent interface will be a bridge, and the service or set of services provided by the infrastructure node will communicate over that bridge. The local-vtep-ip parameter gets assigned the /32 BGP address of the leaf. The nolearning parameter is required for BUM entries to work (more on that later).
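Putting this together for a hypothetical infrastructure server (VNI 100, bridge br0, and leaf address 10.99.0.1 are all illustrative values):

```shell
# Create a bridge for local services to attach to.
ip link add br0 type bridge
ip link set br0 up

# Create the VTEP for VNI 100 with br0 as its parent, sourced from the
# leaf's /32 BGP address. 'nolearning' disables dynamic source-MAC
# learning so forwarding is driven purely by the FDB entries we install.
ip link add vtep-100 type vxlan id 100 dev br0 dstport 4789 \
    local 10.99.0.1 nolearning

# Attach the VTEP to the bridge so services on br0 can reach the overlay.
ip link set vtep-100 master br0
ip link set vtep-100 up
```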
Virtual network member settings
As members come and go from testbed underlay networks - from dynamically adding or removing physical or virtual machines from an experiment, for example - we need to keep track of forwarding information on each leaf that provides network access. This comes in the form of forwarding database (FDB) entries.
Forwarding database entries
There are two kinds of FDB entries we need to maintain.
The first is for broadcast, unknown-unicast, and multicast (BUM) traffic. These packets either need to go to all leaves, many leaves, or we don't know which leaf they need to go to, so we send them to all leaves. BUM FDB entries depend on the VNI and the remote leaf IP address. Every leaf contains a BUM entry for each VNI it provides service for and every other leaf that provides service for that VNI. This way every BUM packet goes to every leaf providing service to the virtual network.
bridge fdb append 00:00:00:00:00:00 dev <vtep-XXXX> dst <remote-vtep-ip>
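For instance, if a leaf serves VNI 100 and two other leaves (at the illustrative addresses 10.99.0.7 and 10.99.0.9) also serve that VNI, it would hold one all-zeros entry per remote leaf:

```shell
# One BUM (all-zeros MAC) entry per remote leaf serving VNI 100.
# Broadcast/unknown-unicast/multicast frames are replicated to every
# destination listed for the all-zeros MAC.
bridge fdb append 00:00:00:00:00:00 dev vtep-100 dst 10.99.0.7
bridge fdb append 00:00:00:00:00:00 dev vtep-100 dst 10.99.0.9
```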
Packets that do not fall into the BUM category have a known destination. EVPN distributes this information throughout the network. When a new machine comes online inside a virtual network, all the BGP daemons running in the testbed get a notification containing the MAC address of the new machine, the virtual network it belongs to, and the leaf that provides access to it. This information is then used to formulate the following FDB entry.
bridge fdb append <mac> dev <vtep-XXXX> dst <leaf-ip>
Forwarding engines work by searching the FDB for a matching MAC entry. If one is found, it's used for forwarding. If one is not found, all BUM entries are used. So this scheme delivers packets (somewhat) optimally in both cases.
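Continuing the illustrative example, when a machine with the hypothetical MAC 04:f8:f8:00:00:0a comes up behind the leaf at 10.99.0.7 on VNI 100, the resulting entry is:

```shell
# Unicast entry learned via EVPN: frames for this MAC are tunneled
# straight to the leaf at 10.99.0.7 instead of being replicated to all
# BUM destinations.
bridge fdb append 04:f8:f8:00:00:0a dev vtep-100 dst 10.99.0.7

# Inspect the FDB state for the VTEP.
bridge fdb show dev vtep-100
```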
Testbed automation: who does what?
Rex and Canopy are responsible for establishing VTEPs. Because Rex is the one that sets up the virtual networks, this is a natural fit. Rex knows what devices belong to a particular virtual network and what leaf they are attached to. Rex uses Canopy to set up the VTEP on leaf devices.
Maintaining Routes and FDBs
For this we have developed a separate daemon called the GoBGP base layer engine, or simply Gobble.
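A very rough sketch of what such a lower half does (this is not Gobble itself; the gobgp CLI invocation, jq field names, and interface name are all assumptions for illustration):

```shell
# Watch GoBGP's global RIB and install a kernel route for each new /32
# leaf prefix, using the surrogate next hop described earlier.
# The jq filter and flags are illustrative, not Gobble's actual logic.
gobgp monitor global rib -j | while read -r update; do
  prefix=$(echo "$update" | jq -r '.[0].nlri.prefix')
  ip route replace "$prefix" via 169.254.0.1 dev eth1 \
      proto bgp metric 20 onlink
done
```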
Link plumbing has two basic cases
Basic links are a straight-up connection between testbed devices. Emulated links have an emulation box transparently placed in between.
Basic links are just standard, run-of-the-mill VXLAN segments. The situation is depicted below. This is a simple spine-leaf system, but there can be any sort of network in the middle, including multi-tier and folded Clos topologies.
A VTEP is created on the same 'green' VNI on each leaf. Routes are propagated on the first BUM packets between X and Z.
Emulated links transparently forward packets through an emulator. The situation is depicted below.
For a point-to-point link, two independent VXLAN segments are set up with different VNIs. The MAC addresses for X and Z are depicted in the figure; Z's is ::AA. We require that when X sends packets to Z they go through the emulation node. This is accomplished by advertising Z's MAC as being located behind the emulation node on the green virtual network. Remember, Z is actually on the blue virtual network. So when X, being on the green virtual network, wants to send packets to Z, the overlay forwards the packets to the emulation node, because that is where Z's MAC address is on the green network. The emulation node performs its dark art on the packets and then kicks them back out on the network, still destined for Z's MAC, but this time on the blue network. Now the overlay will forward the packets to Z.
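The FDB state that realizes this redirection can be sketched as follows (the VNIs, VTEP names, leaf addresses, and Z's full MAC are all hypothetical):

```shell
# On X's leaf, green segment (VNI 100): Z's MAC is advertised as living
# behind the emulation node, so X's traffic for Z is tunneled there first.
bridge fdb append 04:f8:f8:00:00:aa dev vtep-100 dst 10.99.0.5   # emu node

# On the emulation node, blue segment (VNI 200): Z's MAC points at the
# leaf that actually serves Z, so post-emulation traffic reaches Z.
bridge fdb append 04:f8:f8:00:00:aa dev vtep-200 dst 10.99.0.7   # Z's leaf
```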