18   Mininet

Sometimes simulations are not possible or not practical, and network experiments must be run on actual machines. One can always use a set of interconnected virtual machines, but even pared-down virtual machines consume sufficient resources that it is hard to create a network of more than a handful of nodes. Mininet is a system that supports the creation of lightweight logical nodes that can be connected into networks. These nodes are sometimes called containers, or, more accurately, network namespaces. Virtual-machine technology is not used. These containers consume sufficiently few resources that networks of over a thousand nodes have been created, running on a single laptop. While Mininet was originally developed as a testbed for software-defined networking (2.8   Software-Defined Networking), it works just as well for demonstrations and experiments involving traditional networking.

A Mininet container is a process (or group of processes) that no longer has access to all the host system’s “native” network interfaces, much as a process that has executed the chroot() system call no longer has access to the full filesystem. Mininet containers then are assigned virtual Ethernet interfaces (see the ip-link man page entries for veth), which are connected to other containers through virtual Ethernet links. The use of veth links ensures that the virtual links behave like Ethernet, though it may be necessary to disable TSO (12.5   TCP Offloading) to view Ethernet packets in WireShark as they would appear on the (virtual) wire. Any process started within a Mininet container inherits the container’s view of network interfaces.

For efficiency, Mininet containers all share the same filesystem by default. This makes setup simple, but sometimes causes problems with applications that expect individualized configuration files in specified locations. Mininet containers can be configured with different filesystem views, though we will not do this here.

Mininet is a form of network emulation, as opposed to simulation. An important advantage of emulation is that all network software, at any layer, is simply run “as is”. In a simulator environment, on the other hand, applications and protocol implementations need to be ported to run within the simulator before they can be used. A drawback of emulation is that as the network gets large and complex the emulation may slow down. In particular, it is not possible to emulate link speeds faster than the underlying hardware can support. (It is also not possible to emulate non-Linux network software.)

The Mininet group maintains extensive documentation; three useful starting places are the Overview, the Introduction and the FAQ.

The goal of this chapter is to present a series of Mininet examples. Most examples are in the form of a self-contained Python2 file (Mininet does not at this time support Python3). Each Mininet Python2 file configures the network and then starts up the Mininet command-line interface (which is necessary to start commands on the various node containers). The use of self-contained Python files arguably makes the configurations easier to edit, and avoids the complex command-line arguments of many standard Mininet examples. The Python code uses what the Mininet documentation calls the mid-level API.

The Mininet distribution comes with its own set of examples, in the directory of that name. A few of particular interest are listed below; with the exception of linuxrouter.py, the examples presented here do not use any of these techniques.

  • bind.py: demonstrates how to give each Mininet node its own private directory (otherwise all nodes share a common filesystem)
  • controllers.py: demonstrates how to arrange for multiple SDN controllers, with different switches connecting to different controllers
  • limit.py: demonstrates how to set CPU utilization limits (and link bandwidths)
  • linuxrouter.py: creates a node that acts as a router. Any host node can act as a router, though, provided we enable forwarding with sysctl net.ipv4.ip_forward=1
  • miniedit.py: a graphical editor for Mininet networks
  • mobility.py: demonstrates how to move a host from one switch to another
  • nat.py: demonstrates how to connect hosts to the Internet
  • tree1024.py: creates a network with 1024 nodes

We will occasionally need supplemental programs as well, eg for sending, monitoring or receiving traffic. These are meant to be modified as necessary to meet circumstances; they contain few command-line option settings. Most of these supplemental programs are written, perhaps confusingly, in Python3. Python2 files are run with the python command, while Python3’s command is python3. Alternatively, given that all these programs are running under Linux, one can make all Python files executable and be sure that the first line is either #!/usr/bin/python or #!/usr/bin/python3 as appropriate.

18.1   Installing Mininet

Mininet runs only under the Linux operating system. Windows and Mac users can, however, easily run Mininet in a single Linux virtual machine. Even Linux users may wish to do this, as running Mininet has a nontrivial potential to affect normal operation (a virtual-switch process started by Mininet has, for example, interfered with the suspend feature on the author’s laptop).

The Mininet group maintains a virtual machine with a current Mininet installation at their downloads site. The download file is actually a .zip file, which unzips to a modest .ovf file defining the specifications of the virtual machine and a much larger (~2 GB) .vmdk file representing the virtual disk image. (Some unzip versions have trouble with unzipping very large files; if that happens, search online for an alternative unzipper.)

There are several choices for virtual-machine software; two options that are well supported and free (as of 2017) for personal use are VirtualBox and VMware Workstation Player. The .ovf file should open in either (in VirtualBox with the “import appliance” option). However, it may be easier simply to create a new Linux virtual machine and specify that it is to use an existing virtual disk; then select the downloaded .vmdk file as that disk.

Both the login name and the password for the virtual machine are “mininet”. Once logged in, the sudo command can be used to obtain root privileges, which are needed to run Mininet. It is safest to do this on a command-by-command basis; eg sudo python switchline.py. It is also possible to keep a terminal window open that is permanently logged in as root, eg via sudo bash.

Another option is to set up a Linux virtual machine from scratch (eg via the Ubuntu distribution) and then install Mininet on it, although the preinstalled version also comes with other useful software, such as the Pox controller for OpenFlow switches.

The preinstalled version does not, however, come with any graphical-interface desktop. One can install the full Ubuntu desktop with the command (as root) apt-get install ubuntu-desktop. This will, however, add more than 4 GB to the virtual disk. A lighter-weight option, recommended by the Mininet site, is to install the alternative desktop environment lxde; it is half the size of Ubuntu. Install it with

apt-get install xinit lxde

The standard graphical text editor included with lxde is leafpad, though of course others (eg gedit or emacs) can be installed as well.

After desktop installation, the command startx will be necessary after login to start the graphical environment (though one can automate this). A standard recommendation for new Debian-based Linux systems, before installing anything else, is

apt-get update
apt-get upgrade

Most virtual-machine software offers a special package to improve compatibility with the host system. One of the most annoying incompatibilities is the tendency of the virtual machine to grab the mouse and not allow it to be dragged outside the virtual-machine window. (Usually a special keypress releases the mouse; on VirtualBox it is the right-hand Control key and on VMware Player it is Control-Alt.) Installation of the compatibility package (in VirtualBox called Guest Additions) usually requires mounting a CD image, with the command

mount /dev/cdrom /media/cdrom

The Mininet installation itself can be upgraded as follows:

cd /home/mininet/mininet
git fetch
git checkout master   # Or a specific version like 2.2.1
git pull
make install

The simplest environment for beginners is to install a graphical desktop (eg lxde) and then work within it. This allows seamless opening of xterm and WireShark as necessary. Enabling copy/paste between the virtual system and the host is also convenient.

However, it is also possible to work entirely without the desktop, by using multiple ssh logins with X-windows forwarding enabled:

ssh -X -l username mininet

This does require an X-server on the host system, but these are available even for Windows (see, for example, Cygwin/X). At this point one can open a graphical program on the ssh command line, eg wireshark & or gedit mininet-demo.py &, and have the program window display properly (or close to properly).

Finally, it is possible to access the Mininet virtual machine solely via ssh terminal sessions, without X-windows, though one then cannot launch xterm or WireShark.

18.2   A Simple Mininet Example

Starting Mininet via the mn command (as root!), with no command-line arguments, creates a simple network of two hosts and one switch, h1–s1–h2, and starts up the Mininet command-line interface (CLI). By convention, Mininet host names begin with ‘h’ and switch names begin with ‘s’; numbering begins with 1.

At this point one can issue various Mininet-CLI commands. The command nodes, for example, yields the following output:

available nodes are:
c0 h1 h2 s1

The node c0 is the controller for the switch s1. The default controller action here makes s1 behave like an Ethernet learning switch (2.4.1   Ethernet Learning Algorithm). The command intfs lists the interfaces for each of the nodes, and links lists the connections, but the most useful command is net, which shows the nodes, the interfaces and the connections:

h1 h1-eth0:s1-eth1
h2 h2-eth0:s1-eth2
s1 lo:  s1-eth1:h1-eth0 s1-eth2:h2-eth0

From the above, we can see that the network looks like this:

_images/simple.svg

18.2.1   Running Commands on Nodes

The next step is to run commands on individual nodes. To do this, we use the Mininet CLI and prefix the command name with the node name:

h1 ifconfig
h1 ping h2

The first command here shows that h1 (or, more properly, h1-eth0) has IP address 10.0.0.1. Note that the name ‘h2’ in the second is recognized. The ifconfig command also shows the MAC address of h1-eth0, which may vary but might be something like 62:91:68:bf:97:a0. We will see in the following section how to get more human-readable MAC addresses.

There is a special Mininet command pingall that generates pings between each pair of hosts.

We can open a full shell window on node h1 using the Mininet command below; this works for both host nodes and switch nodes.

xterm h1

Note that the xterm runs with root privileges. From within the xterm, the command ping h2 now fails, as hostname h2 is not recognized. We can switch to ping 10.0.0.2, or else add entries to /etc/hosts for the IP addresses of h1 and h2:

10.0.0.1        h1
10.0.0.2        h2

As the Mininet system shares its filesystem with h1 and h2, this means that the names h1 and h2 are now defined everywhere within Mininet (though be forewarned that when a different Mininet configuration assigns different addresses to h1 or h2, chaos will ensue).

From within the xterm on h1 we might try logging into h2 via ssh: ssh h2 (if h2 is defined in /etc/hosts as above). But the connection is refused: the ssh server is not running on node h2. We will return to this in the following example.

We can also start up WireShark, and have it listen on interface h1-eth0, and see the progress of our pings. (We can also usually start WireShark from the mininet> prompt using h1 wireshark &.)

Similarly, we can start an xterm on the switch and start WireShark there. However, there is another option, as switches by default share all their network systems with the Mininet host system. (In terms of the container model, switches do not by default get their own network namespace; they share the “root” namespace with the host.) We can see this by running the following from the Mininet command line

s1 ifconfig

and comparing the output with that of ifconfig run on the Mininet host, while Mininet is running but outside of the Mininet process itself. We see these interfaces:

eth0
lo
s1
s1-eth1
s1-eth2

We see the same interfaces on the controller node c0, even though the net and intfs commands above showed no interfaces for c0.

Running WireShark on, say, s1-eth1 is an excellent way to observe traffic on a nearly idle network; by default, the Mininet nodes are not connected to the outside world. As an example, suppose we start up xterm windows on h1 and h2, and run netcat -l 5432 on h2 and then netcat 10.0.0.2 5432 on h1. We can then watch the ARP exchange, the TCP three-way handshake, the content delivery and the connection teardown, with most likely no other traffic at all. Wireshark filtering is not needed.

18.3   Multiple Switches in a Line

The next example creates the topology below. All hosts are on the same subnet.

_images/switchline.svg

The Mininet-CLI command links can be used to determine which switch interface is connected to which neighboring switch interface.

The full Python2 program is switchline.py; to run it use

python switchline.py

This configures the network and starts the Mininet CLI. The default number of host/switch pairs is 4, but this can be changed with the -N command-line parameter, for example python switchline.py -N 5.

We next describe selected parts of switchline.py. The program starts by building the network topology object, LineTopo, extending the built-in Mininet class Topo, and then calls Topo.addHost() to create the host nodes. (We here override __init__(), but overriding build() is actually more common.)

class LineTopo( Topo ):
   def __init__( self , **kwargs):
       "Create linear topology"
       super(LineTopo, self).__init__(**kwargs)
       h = []          # list of hosts; h[0] will be h1, etc
       s = []          # list of switches

       for key in kwargs:
          if key == 'N': N=kwargs[key]

       # add N hosts  h1..hN
       for i in range(1,N+1):
          h.append(self.addHost('h' + str(i)))

Method Topo.addHost() takes a string, such as “h2”, and builds a host object of that name. We immediately append the new host object to the list h[]. Next we do the same to switches, using Topo.addSwitch():

# add N switches s1..sN
for i in range(1,N+1):
   s.append(self.addSwitch('s' + str(i)))

Now we build the links, with Topo.addLink. Note that h[0]..h[N-1] represent h1..hN. First we build the host-switch links, and then the switch-switch links.

for i in range(N):               # Add links from hi to si
   self.addLink(h[i], s[i])

for i in range(N-1):            # link switches
   self.addLink(s[i],s[i+1])
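
The indexing in these loops can be checked without Mininet. The following is a minimal sketch in plain Python, not part of switchline.py; the function name line_topology is hypothetical. It builds the same node names and the same list of links that the loops above pass to addHost(), addSwitch() and addLink().

```python
def line_topology(N):
    """Return (hosts, switches, links) for N host/switch pairs in a line."""
    hosts = ['h' + str(i) for i in range(1, N + 1)]       # h1..hN
    switches = ['s' + str(i) for i in range(1, N + 1)]    # s1..sN
    # host-switch links: hi to si (hosts[0] is h1, etc)
    links = [(hosts[i], switches[i]) for i in range(N)]
    # switch-switch links: si to s(i+1)
    links += [(switches[i], switches[i + 1]) for i in range(N - 1)]
    return hosts, switches, links


hosts, switches, links = line_topology(4)
for a, b in links:
    print(a, '--', b)
```

For N host/switch pairs there are N host-switch links and N-1 switch-switch links, so the default N=4 yields 7 links in all.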

Now we get to the main program. We use argparse to support the -N command-line argument.

def main(**kwargs):
    parser = argparse.ArgumentParser()
    parser.add_argument('-N', '--N', type=int)
    args = parser.parse_args()
    if args.N is None:
        N = 4
    else:
        N = args.N
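
As an aside, argparse can supply the default itself, which removes the need for the test against None. The sketch below is an alternative to the code above, not what switchline.py does; the helper name parse_n is hypothetical.

```python
import argparse


def parse_n(argv=None):
    """Return the value given with -N, or 4 if the option is absent."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-N', '--N', type=int, default=4)
    args = parser.parse_args(argv)    # argv=None means "use sys.argv[1:]"
    return args.N


print(parse_n([]))             # no -N on the command line
print(parse_n(['-N', '5']))    # as in: python switchline.py -N 5
```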

Next we create a LineTopo object, defined above. We also set the log-level to ‘info’; if we were having problems we would set it to ‘debug’.

ltopo = LineTopo(N=N)
setLogLevel('info')

Finally we’re ready to create the Mininet net object, and start it. We’ve specified the type of switch here, though at this point that does not really matter. It does matter that we’re using the DefaultController, as otherwise the switches will not behave automatically as Ethernet learning switches. The autoSetMacs option sets the host MAC addresses to 00:00:00:00:00:01 through 00:00:00:00:00:04 (for N=4), which can be a great convenience when manually examining Ethernet addresses.

net = Mininet(topo = ltopo, switch = OVSKernelSwitch,
            controller = DefaultController,
            autoSetMacs = True
            )
net.start()

The next bit starts /usr/sbin/sshd on each node. This command automatically puts itself in the background; otherwise we would need to add an ‘&’ to the string to run the command in the background.

for i in range(1, N+1):
   hi = net['h' + str(i)]
   hi.cmd('/usr/sbin/sshd')

Finally we start the Mininet CLI, and, when that exits, we stop the emulation.

CLI( net)
net.stop()

Using sshd requires a small bit of configuration, if ssh for the root user has not been set up already. We must first run ssh-keygen, which creates the directory /root/.ssh and then the public and private key files, id_rsa.pub and id_rsa respectively. There is no need, in this setting, to protect the keys with a password. The second step is to go to the .ssh directory and copy id_rsa.pub to the (new) file authorized_keys (if the latter file already exists, append id_rsa.pub to it). This will allow passwordless ssh connections between the different Mininet hosts.

Because we started sshd on each host, the command ssh 10.0.0.4 on h1 should successfully connect to h4. The first time a connection is made from h1 to h4 (as root), ssh will ask for confirmation, and then store h4’s key in /root/.ssh/known_hosts. As this is the same file for all Mininet nodes, due to the common filesystem, a subsequent request to connect from h2 to h4 will succeed immediately; h4 has already been authenticated for all nodes.

18.3.1   Running a webserver

Now let’s run a web server on, say, host 10.0.0.4 of the switchline.py example above. Python includes a simple implementation that serves up the files in the directory in which it is started. After switchline.py is running, start an xterm on host h4, and then change directory to /usr/share/doc (where there are some html files). Then run the following command (the 8000 is the server port number):

python -m SimpleHTTPServer 8000

If this is run in the background somewhere, output should be redirected to /dev/null or else the server will eventually hang.
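
The SimpleHTTPServer module is specific to Python2; in Python3 the same functionality lives in the http.server module, and the corresponding command is python3 -m http.server 8000. The sketch below, an illustration rather than anything the examples here depend on, starts such a server from within a Python3 program and fetches the directory listing from it. The function names are hypothetical, the loopback address stands in for 10.0.0.4, and a temporary directory stands in for /usr/share/doc.

```python
import functools
import http.client
import http.server
import os
import tempfile
import threading


def serve_directory(directory, port=0):
    """Serve directory over HTTP on 127.0.0.1 from a background thread.

    port=0 asks the kernel for any free port; the text uses 8000.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory)
    server = http.server.ThreadingHTTPServer(('127.0.0.1', port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_listing(server):
    """Request '/' from the server and return (status, body text)."""
    host, port = server.server_address[:2]
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request('GET', '/')
        response = conn.getresponse()
        return response.status, response.read().decode('utf-8', 'replace')
    finally:
        conn.close()


def demo():
    """Serve a temporary directory holding one file; return its listing."""
    directory = tempfile.mkdtemp()
    open(os.path.join(directory, 'hello.html'), 'w').close()
    server = serve_directory(directory)
    try:
        return fetch_listing(server)
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() returns the HTTP status and the generated HTML listing, which names each file in the served directory.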

The next step is to start a browser. If the lxde environment has been installed (18.1   Installing Mininet), then the chromium browser should be available. Start an xterm on host h1, and on h1 run the following (the --no-sandbox option is necessary to run chromium as root):

chromium-browser --no-sandbox

Assuming chromium opens successfully, enter the following URL: 10.0.0.4:8000. If chromium does not start, try wget 10.0.0.4:8000, which stores what it receives as the file index.html. Either way, you should see a listing of the /usr/share/doc directory. It is possible to browse subdirectories, but only browser-recognized filetypes (eg .html) will open directly. A few directories with subdirectories named html are iperf, iptables and xarchiver; try navigating to these.

18.4   IP Routers in a Line

In the next example we build a Mininet example involving a router rather than a switch. A router here is simply a multi-interface Mininet host that has IP forwarding enabled in its Linux kernel. Mininet support for multi-interface hosts is somewhat fragile; interfaces may need to be initialized in a specific order, and IP addresses often cannot be assigned at the point when the link is created. In the code presented below we assign IP addresses using calls to Node.cmd() used to invoke the Linux command ifconfig (Mininet containers do not fully support the use of the alternative ip addr command).

Our first router topology has only two hosts, one at each end, and N routers in between; below is the diagram with N=3. All subnets are /24. The program to set this up is routerline.py, here invoked as python routerline.py -N 3. We will use N=3 in most of the examples below. A somewhat simpler version of the program, which sets up the topology specifically for N=3, is routerline3.py.

_images/routerline.svg

In both versions of the program, routing entries are created to route traffic from h1 to h2, but not back again. That is, every router has a route to 10.0.3.0/24, but only r1 knows how to reach 10.0.0.0/24 (to which r1 is directly connected). We can verify the “one-way” connectedness by running WireShark or tcpdump on h2 (perhaps first starting an xterm on h2), and then running ping 10.0.3.10 on h1 (perhaps using the Mininet command h1 ping h2). WireShark or tcpdump should show the arriving ICMP ping packets from h1, and also the arriving ICMP Destination Network Unreachable packets from r3 as h2 tries to reply (see 7.11   Internet Control Message Protocol).

It turns out that one-way routing is considered to be suspicious; one interpretation is that the packets involved have a source address that shouldn’t be possible, perhaps spoofed. Linux provides the interface configuration option rp_filter – reverse-path filter – to block the forwarding of packets for which the router does not have a route back to the packet’s source. This must be disabled for the one-way example to work; see the notes on the code below.

Despite the lack of connectivity, we can reach h2 from h1 via a hop-by-hop sequence of ssh connections (the program enables sshd on each host and router):

h1: slogin 10.0.0.2
r1: slogin 10.0.1.2
r2: slogin 10.0.2.2
r3: slogin 10.0.3.10 (that is, h2)

To get the one-way routing to work from h1 to h2, we needed to tell r1 and r2 how to reach destination 10.0.3.0/24. This can be done with the following commands (which are executed automatically if we set ENABLE_LEFT_TO_RIGHT_ROUTING = True in the program):

r1: ip route add to 10.0.3.0/24 via 10.0.1.2
r2: ip route add to 10.0.3.0/24 via 10.0.2.2

To get full, bidirectional connectivity, we can create the following routes to 10.0.0.0/24:

r2: ip route add to 10.0.0.0/24 via 10.0.1.1
r3: ip route add to 10.0.0.0/24 via 10.0.2.1
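
For general N the same routes follow a simple pattern. The sketch below generates them as (router, command) pairs; it assumes that the addressing seen for N=3 continues in the obvious way, with ri having 10.0.(i-1).2 on its left interface and 10.0.i.1 on its right. That pattern is an assumption here, and the function names are hypothetical rather than taken from routerline.py.

```python
def router_addresses(i):
    """Return the (left, right) IPv4 addresses of router ri, assuming the
    pattern r1: 10.0.0.2 and 10.0.1.1, r2: 10.0.1.2 and 10.0.2.1, etc."""
    return ('10.0.%d.2' % (i - 1), '10.0.%d.1' % i)


def left_to_right_routes(N):
    """Routes by which r1..r(N-1) reach the rightmost subnet 10.0.N.0/24."""
    routes = []
    for i in range(1, N):
        next_hop = router_addresses(i + 1)[0]    # left side of right neighbor
        routes.append(('r%d' % i,
                       'ip route add to 10.0.%d.0/24 via %s' % (N, next_hop)))
    return routes


def right_to_left_routes(N):
    """Routes by which r2..rN reach the leftmost subnet 10.0.0.0/24."""
    routes = []
    for i in range(2, N + 1):
        next_hop = router_addresses(i - 1)[1]    # right side of left neighbor
        routes.append(('r%d' % i,
                       'ip route add to 10.0.0.0/24 via %s' % next_hop))
    return routes


for router, command in left_to_right_routes(3) + right_to_left_routes(3):
    print(router + ':', command)
```

With N=3 this prints the four commands listed above. In a Mininet program each command could then be handed to the corresponding router with Node.cmd().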

When building the network topology, the single-interface hosts can have all their attributes set at once (the code below is from routerline3.py):

h1 = self.addHost( 'h1', ip='10.0.0.10/24', defaultRoute='via 10.0.0.2' )
h2 = self.addHost( 'h2', ip='10.0.3.10/24', defaultRoute='via 10.0.3.1' )

The routers are also created with addHost(), but with separate steps:

r1 = self.addHost( 'r1' )
r2 = self.addHost( 'r2' )
...

self.addLink( h1, r1, intfName1 = 'h1-eth0', intfName2 = 'r1-eth0')
self.addLink( r1, r2, intfName1 = 'r1-eth1', intfName2 = 'r2-eth0')

Later on the routers get their IPv4 addresses:

r1 = net['r1']
r1.cmd('ifconfig r1-eth0 10.0.0.2/24')
r1.cmd('ifconfig r1-eth1 10.0.1.1/24')
r1.cmd('sysctl net.ipv4.ip_forward=1')
rp_disable(r1)

The sysctl command here enables forwarding in r1. The rp_disable(r1) call disables Linux’s default refusal to forward packets if the router does not have a route back to the packet’s source; this is often what is wanted in the real world but not necessarily in routing demonstrations. It too is ultimately implemented via sysctl commands.
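
The following is a sketch of how a function along the lines of rp_disable() might be written; it is not the code from the program, and here the interface names are passed in explicitly, which is an assumption. Linux applies the larger of the "all" setting and the per-interface setting, so both need to be set to 0. The RecordingNode class is a hypothetical stand-in for a Mininet node, used so the sketch can be run without Mininet.

```python
def rp_disable_commands(interfaces):
    """Return the sysctl commands that turn off reverse-path filtering."""
    names = ['all', 'default'] + list(interfaces)
    return ['sysctl net.ipv4.conf.%s.rp_filter=0' % name for name in names]


def rp_disable(node, interfaces):
    """Run the commands on node, which must provide a cmd() method."""
    for command in rp_disable_commands(interfaces):
        node.cmd(command)


class RecordingNode(object):
    """Stand-in for a Mininet node; it only records the commands given."""
    def __init__(self):
        self.commands = []

    def cmd(self, command):
        self.commands.append(command)


r1 = RecordingNode()
rp_disable(r1, ['r1-eth0', 'r1-eth1'])
for command in r1.commands:
    print(command)
```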

18.5   IP Routers With Simple Distance-Vector Implementation

The next step is to automate the discovery of the route from h1 to h2 (and back) by using a simple distance-vector routing-update protocol. We present a partial implementation of the Routing Information Protocol, RIP, as defined in RFC 2453.

The distance-vector algorithm is described in 9.1   Distance-Vector Routing-Update Algorithm. In brief, the idea is to add a cost attribute to the forwarding table, so entries have the form (destination,next_hop,cost). Routers then send (destination,cost) lists to their neighbors; these lists are referred to in the RIP specification as update messages. Routers receiving these messages then process them to figure out the lowest-cost route to each destination. The format of the update messages is diagrammed below:

_images/rip_update_message.svg
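
In RFC 2453 an update message consists of a 4-byte header (command, version and two zero bytes) followed by a series of 20-byte entries, each holding an address-family identifier, a route tag, the IPv4 address, the subnet mask, the next hop and the metric. A minimal sketch of packing and unpacking this format with Python3's struct module follows; the names Entry, pack_update and unpack_update are hypothetical and are not taken from rip.py.

```python
import socket
import struct
from collections import namedtuple

Entry = namedtuple('Entry', ['addr', 'netmask', 'next_hop', 'metric'])

RIP_RESPONSE = 2      # command value for update ("response") messages
RIP_VERSION = 2
AF_INET_ID = 2        # address-family identifier for IPv4
HEADER = struct.Struct('!BBH')         # command, version, must-be-zero
ENTRY = struct.Struct('!HH4s4s4sI')    # family, tag, addr, mask, next hop, metric


def pack_update(entries):
    """Build the bytes of an update message from a list of Entry tuples."""
    parts = [HEADER.pack(RIP_RESPONSE, RIP_VERSION, 0)]
    for e in entries:
        parts.append(ENTRY.pack(AF_INET_ID, 0,
                                socket.inet_aton(e.addr),
                                socket.inet_aton(e.netmask),
                                socket.inet_aton(e.next_hop),
                                e.metric))
    return b''.join(parts)


def unpack_update(msg):
    """Return (command, version, entries); a trailing partial entry is ignored."""
    command, version, _zero = HEADER.unpack_from(msg, 0)
    entries = []
    for offset in range(HEADER.size, len(msg) - ENTRY.size + 1, ENTRY.size):
        family, _tag, addr, mask, nhop, metric = ENTRY.unpack_from(msg, offset)
        if family != AF_INET_ID:
            continue                   # skip non-IPv4 entries
        entries.append(Entry(socket.inet_ntoa(addr), socket.inet_ntoa(mask),
                             socket.inet_ntoa(nhop), metric))
    return command, version, entries
```

The '!' in the format strings selects network byte order with no padding, so the header is exactly 4 bytes and each entry exactly 20.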

The full RIP specification also includes request messages, but the implementation here omits these. The full specification also includes split horizon, poison reverse and triggered updates (9.2.1.1   Split Horizon and 9.2.1.2   Triggered Updates); we omit these as well. Finally, while we include code for the third next_hop increase case of 9.1.1   Distance-Vector Update Rules, we do not include any test for whether a link is down, so this case is never triggered.

The implementation is in the Python3 file rip.py. Most of the time, the program is waiting to read update messages from other routers. Every UPDATE_INTERVAL seconds the program sends out its own update messages. All communication is via UDP packets sent using IP multicast, to the official RIP multicast address 224.0.0.9. Port 520 is used for both sending and receiving.

Rather than creating separate threads for receiving and sending, we configure a short (1 second) recv() timeout, and then after each timeout we check whether it is time to send the next update. An update can be up to 1 second late with this approach, but this does not matter.
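
The timeout-based structure can be sketched as follows. This is an illustration of the idea only, not the loop from rip.py: the function name and parameters are hypothetical, it reads from a single socket, and it stops after a fixed duration so that it can be tried out in isolation.

```python
import socket
import time


def receive_and_update(sock, send_update, update_interval, duration,
                       recv_timeout=1.0):
    """Read packets from sock for duration seconds, sending periodic updates.

    Returns the list of (data, source) pairs received.
    """
    sock.settimeout(recv_timeout)
    received = []
    now = time.monotonic()
    stop_at = now + duration
    next_update = now + update_interval
    while time.monotonic() < stop_at:
        try:
            received.append(sock.recvfrom(4096))
        except socket.timeout:
            # Only after a timeout do we check whether an update is due,
            # so an update can be late by up to recv_timeout seconds.
            if time.monotonic() >= next_update:
                send_update()
                next_update += update_interval
    return received
```

One consequence of checking only after a timeout is that a steady stream of incoming packets could postpone updates further; with neighbors that send one message per update interval this is not a concern.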

The program maintains a “shadow” copy RTable of the real system forwarding table, with an added cost column. The real table is updated whenever a route in the shadow table changes. In the program, RTable is a dictionary mapping TableKey values (consisting of the IP address and mask) to TableValue objects containing the interface name, the cost, and the next_hop.

To run the program, a “production” approach would be to use Mininet’s Node.cmd() to start up rip.py on each router, eg via r.cmd('python3 rip.py &') (assuming the file rip.py is located in the same directory in which Mininet was started). For demonstrations, the program output can be observed if the program is started in an xterm on each router.

18.5.1   Multicast Programming

Sending IP multicast involves special considerations that do not arise with TCP or UDP connections. The first issue is that we are sending to a multicast group – 224.0.0.9 – but don’t have any multicast routes (multicast trees, 20.5   Global IP Multicast) configured. What we would like is to have, at each router, traffic to 224.0.0.9 forwarded to each of its neighboring routers.

However, we do not actually want to configure multicast routes; all we want is to reach the immediate neighbors. Setting up a multicast tree presumes we know something about the network topology, and, at the point where RIP comes into play, we do not. The multicast packets we send should in fact not be forwarded by the neighbors (we will enforce this below by setting TTL); the multicast model here is very local. Even if we did want to configure multicast routes, Linux does not provide a standard utility for manual multicast-routing configuration; see the ip-mroute.8 man page.

So what we do instead is to create a socket for each separate router interface, and configure the socket so that it forwards its traffic only out its associated interface. This introduces a complication: we need to get the list of all interfaces, and then, for each interface, get its associated IPv4 addresses with netmasks. (To simplify life a little, we will assume that each interface has only a single IPv4 address.) The function getifaddrdict() returns a dictionary with interface names (strings) as keys and pairs (ipaddr,netmask) as values. If ifaddrs is this dictionary, for example, then ifaddrs['r1-eth0'] might be ('10.0.0.2','255.255.255.0'). We could implement getifaddrdict() straightforwardly using the Python module netifaces, though for demonstration purposes we do it here via low-level system calls.

We get the list of interfaces using myInterfaces = os.listdir('/sys/class/net/'). For each interface, we then get its IP address and netmask (in get_ip_info(intf)) with the following:

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
SIOCGIFADDR    = 0x8915     # from /usr/include/linux/sockios.h
SIOCGIFNETMASK = 0x891b
intfpack = struct.pack('256s', bytes(intf, 'ascii'))
# ifreq, below, is like struct ifreq in /usr/include/linux/if.h
ifreq    = fcntl.ioctl(s.fileno(), SIOCGIFADDR, intfpack)
ipaddrn  = ifreq[20:24]     # 20 is the offset of the IP addr in ifreq
ipaddr   = socket.inet_ntoa(ipaddrn)
netmaskn = fcntl.ioctl(s.fileno(), SIOCGIFNETMASK, intfpack)[20:24]
netmask  = socket.inet_ntoa(netmaskn)
return (ipaddr, netmask)

We need to create the socket here (never connected) in order to call ioctl(). The SIOCGIFADDR and SIOCGIFNETMASK values come from the C language include file; the Python3 libraries do not make these constants available but the Python3 fcntl.ioctl() call does pass the values we provide directly to the underlying C ioctl() call. This call returns its result in a C struct ifreq; the ifreq above is a Python version of this. The binary-format IPv4 address (or netmask) is at offset 20.
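
The offset of 20 can be accounted for as follows: struct ifreq begins with the 16-byte interface name, followed, for these two ioctl() calls, by a socket-address structure consisting of a 2-byte address family, a 2-byte port and then the 4-byte IPv4 address. The sketch below builds a buffer with that layout by hand and extracts the address from it. It makes no ioctl() call, and the helper names are hypothetical.

```python
import socket
import struct

IFNAMSIZ = 16                      # size of the name field in struct ifreq
ADDR_OFFSET = IFNAMSIZ + 2 + 2     # name + address family + port = 20


def fake_ifreq(intf, ipaddr):
    """Build bytes laid out like the struct ifreq returned by SIOCGIFADDR."""
    name = intf.encode('ascii').ljust(IFNAMSIZ, b'\0')
    # family (2 bytes), port (2 bytes), address (4 bytes), 8 bytes of padding
    sockaddr = struct.pack('=HH4s8x', socket.AF_INET, 0,
                           socket.inet_aton(ipaddr))
    return name + sockaddr


def addr_from_ifreq(ifreq):
    """Extract the dotted-decimal IPv4 address, as in ifreq[20:24] above."""
    return socket.inet_ntoa(ifreq[ADDR_OFFSET:ADDR_OFFSET + 4])


print(addr_from_ifreq(fake_ifreq('r1-eth0', '10.0.0.2')))
```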

18.5.1.1   createMcastSockets()

We are now in a position, for each interface, to create a UDP socket to be used to send and receive on that interface. Much of the information here comes from the Linux socket.7 and ip.7 man pages. The function createMcastSockets(ifaddrs) takes the dictionary above mapping interface names to (ipaddr,netmask) pairs and, for each interface intf, configures it as follows. The list of all the newly configured sockets is then returned.

The first step is to obtain the interface’s address and mask, and then convert these to 32-bit integer format as ipaddrn and netmaskn. We then enter the subnet corresponding to the interface into the shadow routing table RTable with a cost of 1 (and with a next_hop of None), via

RTable[TableKey(subnetn, netmaskn)] = TableValue(intf, None, 1)
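
The conversion to integer form and the calculation of the subnet address might look like the sketch below. The TableKey and TableValue definitions here are assumptions: they are written as named tuples with field orders inferred from the call above, and the helper names are hypothetical.

```python
import socket
import struct
from collections import namedtuple

TableKey = namedtuple('TableKey', ['subnet', 'netmask'])
TableValue = namedtuple('TableValue', ['interface', 'next_hop', 'cost'])


def ip_to_int(addr):
    """Convert a dotted-decimal IPv4 address to a 32-bit integer."""
    return struct.unpack('!I', socket.inet_aton(addr))[0]


def int_to_ip(n):
    """Convert a 32-bit integer back to dotted-decimal form."""
    return socket.inet_ntoa(struct.pack('!I', n))


def add_connected_subnet(rtable, intf, ipaddr, netmask):
    """Enter the subnet attached to intf into rtable with a cost of 1."""
    ipaddrn = ip_to_int(ipaddr)
    netmaskn = ip_to_int(netmask)
    subnetn = ipaddrn & netmaskn      # zero out the host bits
    rtable[TableKey(subnetn, netmaskn)] = TableValue(intf, None, 1)
    return subnetn


RTable = {}
subnetn = add_connected_subnet(RTable, 'r1-eth0', '10.0.0.2', '255.255.255.0')
print(int_to_ip(subnetn))
```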

Next we create the socket and begin configuring it, first by setting its read timeout to a short value. We then set the TTL value used by outbound packets to 1. This goes in the IPv4 header Time To Live field (7.1   The IPv4 Header); this means that no downstream routers will ever forward the packet. This is exactly what we want; RIP uses multicast only to send to immediate neighbors.

sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)

We also want to be able to bind the same socket source address, 224.0.0.9 and port 520, to all the sockets we are creating here (the actual bind() call is below):

sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

The next call makes the socket receive only packets arriving on the specified interface:

sock.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, bytes(intf, 'ascii'))

We add the following to prevent packets sent on the interface from being delivered back to the sender; otherwise multicast delivery may do just that:

sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, False)

The next call makes the socket send on the specified interface. Multicast packets do have IPv4 destination addresses, and normally the kernel chooses the sending interface based on the IP forwarding table. This call overrides that, in effect telling the kernel how to route packets sent via this socket. (The kernel may also be able to figure out how to route the packet from the subsequent call joining the socket to the multicast group.)

sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF, socket.inet_aton(ipaddr))

Finally we can join the socket to the multicast group represented by 224.0.0.9. We also need the interface’s IP address, ipaddr.

addrpair = socket.inet_aton('224.0.0.9')+ socket.inet_aton(ipaddr)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, addrpair)

The last step is to bind the socket to the desired address and port, with sock.bind(('224.0.0.9', 520)). This specifies the source address of outbound packets; it would fail (given that we are using the same socket address for multiple interfaces) without the SO_REUSEADDR configuration above.

18.5.2   The RIP Main Loop

The rest of the implementation is relatively nontechnical. One nicety is the use of select() to wait for arriving packets on any of the sockets created by createMcastSockets() above; the alternatives might be to poll each socket in turn with a short read timeout or else to create a separate thread for each socket. The select() call takes the list of sockets (and a timeout value) and returns a sublist consisting of those sockets that have data ready to read. Almost always, this will be just one of the sockets. We then read the data with s.recvfrom(), recording the source address src which will be used when we, next, call update_tables(). When a socket closes, it must be removed from the select() list, but the sockets here do not close; for more on this, see 18.6.1.2   dualreceive.py.
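
The use of select() can be tried out in isolation. In the sketch below, which is an illustration and not the rip.py main loop, two ordinary UDP sockets on the loopback interface stand in for the per-interface multicast sockets; the function names are hypothetical.

```python
import select
import socket


def ready_sockets(socks, timeout):
    """Return the sublist of socks that have data ready to read."""
    readable, _, _ = select.select(socks, [], [], timeout)
    return readable


def demo():
    """Send one packet to the second of two sockets; report what select() finds."""
    a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        a.bind(('127.0.0.1', 0))      # port 0: let the kernel pick a port
        b.bind(('127.0.0.1', 0))
        sender.sendto(b'update', b.getsockname())
        results = []
        for s in ready_sockets([a, b], 1.0):
            data, src = s.recvfrom(4096)     # src is the sender's address
            results.append((s is b, data))
        return results
    finally:
        for s in (a, b, sender):
            s.close()
```

Here demo() returns one pair for each socket that select() reported as readable; only the socket that was sent a packet should appear.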

The update_tables() function takes the incoming message (parsed into a list of RipEntry objects via parse_msg()) and the IP address from which it arrives, and runs the distance-vector algorithm of 9.1.1   Distance-Vector Update Rules. TK is the TableKey object representing the new destination (as an (addr,netmask) pair). The new destination rule from 9.1.1   Distance-Vector Update Rules is applied when TK is not present in the existing RTable. The lower cost rule is applied when newcost < currentcost, and the third next_hop increase rule is applied when newcost > currentcost but currentnexthop == update_sender.
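
The three rules can be sketched in isolation as follows. This is a minimal sketch, not the update_tables() of the RIP implementation itself: the table here is a plain dictionary mapping a destination key to a (cost, next_hop) pair, and the names update_table and link_cost, along with the tuple layout, are assumptions made for illustration.

```python
def update_table(rtable, entries, update_sender, link_cost=1):
    """Apply the distance-vector rules to rtable, in place.

    rtable maps a destination key, e.g. an (addr, netmask) pair,
    to a (cost, next_hop) tuple.  entries is a list of (key, cost)
    pairs reported by update_sender.  Returns True if rtable changed.
    """
    changed = False
    for key, reported_cost in entries:
        newcost = reported_cost + link_cost
        if key not in rtable:
            # new destination rule
            rtable[key] = (newcost, update_sender)
            changed = True
            continue
        currentcost, currentnexthop = rtable[key]
        if newcost < currentcost:
            # lower cost rule
            rtable[key] = (newcost, update_sender)
            changed = True
        elif newcost > currentcost and currentnexthop == update_sender:
            # next_hop increase rule: our current next hop reports a worse route
            rtable[key] = (newcost, update_sender)
            changed = True
    return changed
```

A report of a higher cost from a neighbor that is not the current next hop falls through all three branches and leaves the table unchanged.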

18.6   TCP Competition: Reno vs Vegas

The next routing example uses the following topology in order to emulate competition between two TCP connections h1→h3 and h2→h3. We introduce Mininet features to set, on the links, an emulated bandwidth and delay, and to set on the router an emulated queue size. Our first application will be to arrange a competition between TCP Reno (13   TCP Reno and Congestion Management) and TCP Vegas (15.6   TCP Vegas). The Python2 file for running this Mininet configuration is competition.py.

_images/tcp_competition.svg

To create links with bandwidth/delay support, we simply set link=TCLink in the Mininet() call in main(). The TCLink class represents a Traffic Controlled Link. Next, in the calls to addLink() in the topology section, we add keyword parameters such as bw=BottleneckBW and delay=DELAY. To implement the bandwidth limit, Mininet then takes care of creating the virtual-Ethernet links with a rate constraint.

To implement the delay, Mininet uses a queuing hierarchy (19.7   Hierarchical Queuing). The hierarchy is managed by the tc (traffic control) command, part of the LARTC system. In the topology above, Mininet sets up h3’s queue as an htb queue (19.13.2   Linux HTB, 18.8   Linux Traffic Control (tc)) with a netem queue below it (see the man page for tc-netem.8). The latter has a delay parameter set as requested, to 110 ms in our example here. Note that this means that the delay from h3 to r will be 110 ms, and the delay from r to h3 will be 0 ms.

The queue configuration is also handled via the tc command. Again Mininet configures r’s r-eth3 interface to have an htb queue with a netem queue below it. Using the tc qdisc show command we can see that the “handle” of the netem queue is 10:; we can now set the maximum queue size to, for example, 25 with the following command on r:

tc qdisc change dev r-eth3 handle 10: netem limit 25

18.6.1   Running A TCP Competition

In order to arrange a TCP competition, we need the following tools:

  • sender.py, to open the TCP connection and send bulk data, after requesting a specific TCP congestion-control mechanism (Reno or Vegas)
  • dualreceive.py, to receive data from two connections and track the results
  • randomtelnet.py, to send random additional data to break TCP phase effects
  • wintracker.py, to monitor the number of packets a connection has in flight (a good estimator of cwnd)

18.6.1.1   sender.py

The Python3 program sender.py is similar to tcp_stalkc.py, except that it allows specification of the TCP congestion algorithm. This is done with the following setsockopt() call:

s.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, cong)

where cong is “reno” or “cubic” or some other available TCP flavor. The list is at /proc/sys/net/ipv4/tcp_allowed_congestion_control.

See also 15.1   Choosing a TCP on Linux.

18.6.1.2   dualreceive.py

The receiver for sender.py’s data is dualreceive.py. It listens on two ports, by default 5430 and 5431, and, when both connections have been made, begins reading. The main loop starts with a call to select(), where sset is the list of all (both) connected sockets:

sl,_,_ = select(sset, [], [])

The value sl is a sublist of sset consisting of the sockets with data ready to read. It will normally be a list consisting of a single socket, though with so much data arriving it may sometimes contain both. We then call s.recv() for s in sl, and record in either count1 or count2 the running total of bytes received.

If a sender closes a socket, this results in a read of 0 bytes. At this point dualreceive.py must close the socket, at which point it must be removed from sset as it will otherwise always appear in the sl list.
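
The select() loop, including the removal of closed sockets, can be sketched as follows. This is a simplified illustration rather than the code of dualreceive.py: the function name drain, the 4096-byte read size and the use of a dictionary of per-socket counts (instead of count1 and count2) are assumptions.

```python
import socket
from select import select

def drain(sset):
    """Read from every socket in sset until all peers have closed.

    Returns a dict mapping each socket to the total bytes received.
    """
    sset = list(sset)                  # private copy; sockets are removed below
    counts = {s: 0 for s in sset}
    while sset:
        sl, _, _ = select(sset, [], [])
        for s in sl:
            data = s.recv(4096)
            if len(data) == 0:         # read of 0 bytes: the sender closed
                s.close()
                sset.remove(s)         # otherwise s would appear in sl forever
            else:
                counts[s] += len(data)
    return counts
```

The loop can be tried without a network by creating connected pairs with socket.socketpair(), writing to one end of each pair, closing it, and passing the other ends to drain().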

We repeatedly set a timer (in printstats()) to print the values of count1 and count2 at 0.1 second intervals, reflecting the cumulative amounts of data received by the connections. (If the variable PRINT_CUMULATIVE is set to False, then the values printed are the amounts of data received in the last 0.1 seconds.) If the TCP competition is fair, count1 and count2 should stay approximately equal. When printstats() detects no change in count1 and count2, it exits.

In Python, calling exit() only exits the current thread; the other threads keep running.

18.6.1.3   randomtelnet.py

In 16.3.4   Phase Effects we show that, with completely deterministic travel times, two competing TCP connections can have throughputs differing by a factor of as much as 10 simply because of unfortunate synchronizations of transmission times. We must introduce at least some degree of packet-arrival-time randomization in order to obtain meaningful results.

In 16.3.6   Phase Effects and overhead we used the ns2 overhead attribute for this. This is not available in real networks, however. The next-best thing is to introduce some random telnet-like traffic, as in 16.3.7   Phase Effects and telnet traffic. This is the purpose of randomtelnet.py.

This program sends packets at random intervals; the lengths of the intervals are exponentially distributed, meaning that to find the length of the next interval we choose X randomly between 0 and 1 (with a uniform distribution), and then set the length of the wait interval to a constant times -log(X). The packet sizes are 210 bytes (a very atypical value for real telnet traffic). Crucially, the average rate of sending is held to a small fraction (by default 1%) of the available bottleneck bandwidth, which is supplied as a constant BottleneckBW. This means the randomtelnet traffic should not interfere significantly with the competing TCP connections (which, of course, have no additional interval whatsoever between packet transmissions, beyond what is dictated by sliding windows). The randomtelnet traffic appears to be quite effective at eliminating TCP phase effects.
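
The interval calculation can be sketched as follows. The function names and the default arguments are assumptions made for illustration and are not taken from randomtelnet.py; the bottleneck bandwidth is taken to be in bits per second.

```python
import math
import random

def mean_interval(bottleneck_bps, packet_bytes=210, fraction=0.01):
    """Mean gap between packets, in seconds, such that the average
    sending rate is the given fraction of the bottleneck bandwidth."""
    return (packet_bytes * 8) / (fraction * bottleneck_bps)

def next_interval(mean, rng=random):
    """Draw one exponentially distributed wait time with the given mean."""
    x = 1.0 - rng.random()     # uniform on (0, 1], so that log(x) is defined
    return -math.log(x) * mean
```

The constant multiplying -log(X) is the mean interval. Because random() can return exactly 0, the sketch uses 1 - random() to avoid taking the logarithm of 0.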

Randomtelnet.py sends to port 5433 by default. We will usually use netcat (12.6.2   netcat again) as the receiver, as we are not interested in measuring throughput for this traffic.

18.6.1.4   Monitoring cwnd with wintracker.py

At the end of the competition, we can look at the dualreceive.py output and determine the overall throughput of each connection, as of the time when the first connection to send all its data has just finished. We can also plot throughput at intervals by plotting successive differences of the cumulative-throughput values.

However, this does not give us a view of each connection’s cwnd, which is readily available when modeling competition in a simulator. Indeed, getting direct access to a connection’s cwnd is nearly impossible, as it is a state variable in the sender’s kernel.

However, we can do the next best thing: monitor the number of packets (or bytes) a connection has in flight; this is the difference between the highest byte sent and the highest byte acknowledged. The highest byte ACKed is one less than the value of the ACK field in the most recent ACK packet, and the highest byte sent is one less than the value of the SEQ field, plus the packet length, in the most recent DATA packet.

To get these ACK and SEQ numbers, however, requires eavesdropping on the network connections. We can do this using a packet-capture library such as libpcap. The Pcapy Python2 (not Python3) module is a wrapper for libpcap.

The program wintracker.py uses Pcapy to monitor packets on the interfaces r-eth1 and r-eth2 of router r. It would be slightly more accurate to monitor on h1-eth0 and h2-eth0, but that entails separate monitoring of two different nodes, and the difference is small as the h1–r and h2–r links have negligible delay and no queuing. Wintracker.py must be configured to monitor only the two TCP connections that are competing.

The way libpcap, and thus Pcapy, works is that we first create a packet filter to identify the packets we want to capture. The filter for both connections is

host 10.0.3.10 and tcp and portrange 5430-5431

The host is, of course, h3; packets are captured if either source host or destination host is h3. Similarly, packets are captured if either the source port or the destination port is either 5430 or 5431. The connection from h1 to h3 is to port 5430 on h3, and the connection from h2 to h3 is to port 5431 on h3.

For the h1–h3 connection, each time a packet arrives heading from h1 to h3 (in the code below we determine this because the destination port dport is 5430), we save in seq1 the TCP header SEQ field plus the packet length. Each time a packet is seen heading from h3 to h1 (that is, with source port 5430), we record in ack1 the TCP header ACK field. The packets themselves are captured as arrays of bytes, but we can determine the offset of the TCP header and read the four-byte SEQ/ACK values with appropriate helper functions:

_,p = cap1.next()                       # p is the captured packet
...
(_,iphdr,tcphdr,data) = parsepacket(p)          # find the headers
sport = int2(tcphdr, TCP_SRCPORT_OFFSET)        # extract port numbers
dport = int2(tcphdr, TCP_DSTPORT_OFFSET)
if dport == port1:                              # port1 == 5430
    seq1 = int4(tcphdr, TCP_SEQ_OFFSET) + len(data)
elif sport == port1:
    ack1 = int4(tcphdr, TCP_ACK_OFFSET)
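
The helper functions int2() and int4() are not shown above. One plausible way to write them, offered here only as a sketch that may differ from what wintracker.py actually does, is with the struct module. The offset values are those of the standard TCP header layout.

```python
import struct

# byte offsets of fields within the TCP header
TCP_SRCPORT_OFFSET = 0
TCP_DSTPORT_OFFSET = 2
TCP_SEQ_OFFSET = 4
TCP_ACK_OFFSET = 8

def int2(buf, offset):
    """Unsigned 16-bit integer in network byte order at buf[offset:offset+2]."""
    return struct.unpack_from('!H', buf, offset)[0]

def int4(buf, offset):
    """Unsigned 32-bit integer in network byte order at buf[offset:offset+4]."""
    return struct.unpack_from('!I', buf, offset)[0]
```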

Separate threads are used for each connection, as there is no variant of select() available to return the next captured packet of either connection.

Both the SEQ and ACK fields have had ISNA added to them, but this will cancel out when we subtract. The SEQ and ACK values are subject to 32-bit wraparound, but subtraction again saves us here.
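
Python integers do not overflow, so the subtraction must be reduced modulo 2^32 explicitly for the wraparound to cancel. A minimal sketch follows; the function name is an assumption.

```python
def bytes_in_flight(seq_end, ack):
    """Difference between the highest sequence number sent (SEQ plus
    packet length) and the latest ACK value, reduced modulo 2**32."""
    return (seq_end - ack) & 0xFFFFFFFF
```

This gives the right answer even when the sequence numbers have wrapped around past zero but the acknowledgment numbers have not yet done so.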

As with dualreceive.py, a timer fires every 100 ms and prints out the differences seq1-ack1 and seq2-ack2. This isn’t completely thread-safe, but it is close enough. There is some noise in the results; we can minimize that by taking the average of several differences in a row.

18.6.1.5   Synchronizing the start

The next issue is to get both senders to start at about the same time. We could use two ssh commands, but ssh commands can take several hundred milliseconds to complete. A faster method is to use netcat to trigger the start. On h1 and h2 we run shell scripts like the one below (separate values for $PORT and $CONG are needed for each of h1 and h2, which is simplest to implement with separate scripts, say h1.sh and h2.sh):

netcat -l 2345
python3 sender.py $BLOCKS 10.0.3.10 $PORT $CONG

We then start both at very close to the same time with the following on r (not on h3, due to the delay on the r–h3 link); these commands typically complete in under ten milliseconds.

echo hello | netcat h1 2345
echo hello | netcat h2 2345

The full sequence of steps is

  • On h3, start the netcat -l ... for the randomtelnet.py output (on two different ports)
  • On h1 and h2, start the randomtelnet.py senders
  • On h3, start dualreceive.py
  • On h1 and h2, start the scripts (eg h1.sh and h2.sh) that wait for the signal and start sender.py
  • On r, send the two start triggers via netcat

This is somewhat cumbersome; it helps to incorporate everything into a single shell script with ssh used to run subscripts on the appropriate host.

18.6.1.6   Reno vs Vegas results

In the Reno-Vegas graph at 16.5   TCP Reno versus TCP Vegas, we set the Vegas parameters α and β to 3 and 6 respectively. The implementation of TCP Vegas on the Mininet virtual machine does not, however, support changing α and β, and the default values are more like 1 and 3. To give Vegas a fighting chance, we reduce the queue size at r to 10 in competition.py. Here is the graph, with the packets-in-flight monitoring above and the throughput below:

_images/rv_bw2.5.svg

TCP Vegas is getting a smaller share of the bandwidth (overall about 40% to TCP Reno’s 60%), but it is consistently holding its own. It turns out that TCP Vegas is greatly helped by the small queue size; if the queue size is doubled to 20, then Vegas gets a 17% share.

In the upper part of the graph, we can see the Reno sawteeth versus the Vegas triangular teeth (sloping down as well as sloping up); compare to the red-and-green graph at 16.5   TCP Reno versus TCP Vegas. The tooth shapes are somewhat mirrored in the throughput graph as well, as throughput is proportional to queue utilization which is proportional to the number of packets in flight.

18.7   TCP Competition: Reno vs BBR

We can apply the same technique to compare TCP Reno to TCP BBR. This was done to create the graph at 15.16   TCP BBR. The Mininet approach was usable as soon as a TCP BBR module for Linux was released (in source form); to use a simulator, on the other hand, would have entailed waiting for TCP BBR to be ported to the simulator.

One nicety is that it is essential that the fq queuing discipline be enabled for the TCP BBR sender. If that is h2, for example, then the following Mininet code (perhaps in competition.py) removes any existing queuing discipline and adds fq:

h2.cmd('tc qdisc del dev h2-eth0 root')
h2.cmd('tc qdisc add dev h2-eth0 root fq')

The purpose of the fq queuing is to enable pacing; that is, the transmission of packets at regular, very small intervals.

18.8   Linux Traffic Control (tc)

The Linux tc command, for traffic control, allows the attachment of any implemented queuing discipline (19   Queuing and Scheduling) to any network interface (usually of a router). A hierarchical example appears in 19.13.2   Linux HTB. The tc command is also used extensively by Mininet to control, for example, link queue capacities. An explicit example, of adding the fq queuing discipline, appears immediately above.

The two examples presented in this section involve “simple” token-bucket filtering, using tbf, and then “classful” token-bucket filtering, using htb. We will use the latter example to apply token-bucket filtering only to one class of connections; other connections receive no filtering.

The granularity of tc-tbf rate control is limited by the cpu-interrupt timer granularity; typically tbf is able to schedule packets every 10 ms. If the transmission rate is 6 MB/s, or about four 1500-byte packets per millisecond, then tbf will schedule 40 packets for transmission every 10 ms. They will, however, most likely be sent as a burst at the start of the 10-ms interval. Some tc schedulers are able to achieve much finer pacing control; eg the ‘fq’ qdisc of 18.7   TCP Competition: Reno vs BBR above.

The Mininet topology in both cases involves a single router between two hosts, h1—r—h2. We will here use the routerline.py example with the option -N 1; the router is then r1 with interfaces r1-eth0 connecting to h1 and r1-eth1 connecting to h2. The desired topology can also be built using competition.py and then ignoring the third host.

To send data we will use sender.py (18.6.1.1   sender.py), though with the default TCP congestion algorithm. To receive data we will use dualreceive.py, though initially with just one connection sending any significant data. We will set the constant PRINT_CUMULATIVE to False, so dualreceive.py prints at intervals the number of bytes received during the most recent interval; we will call this modified version dualreceive_incr.py. We will also redirect the stderr messages to /dev/null, and start this on h2:

python3 dualreceive_incr.py 2>/dev/null

We start the main sender on h1 with the following, where h2 has IPv4 address 10.0.1.10 and 1,000,000 is the number of blocks:

python3 sender.py 1000000 10.0.1.10 5430

The dualreceive program will not do any reading until both connections are enabled, so we also need to create a second connection from h1 in order to get started; this second connection sends only a single block of data:

python3 sender.py 1 10.0.1.10 5431

At this point dualreceive should generate output somewhat like the following (with timestamps in the first column rounded to the nearest millisecond). The byte-count numbers in the middle column are rather hardware-dependent:

1.016   14079000   0
1.106   12702000   0
1.216   14724000   0
1.316   13666448   0
1.406   11877552   0

This means that, on average, h2 is receiving about 13 MB every 100 ms, which is about 1.0 Gbps.

Now we run the command below on r1 to reduce the rate (tc requires the abbreviation mbit for megabit/sec; it treats mbps as MegaBytes per second). The token-bucket filter parameters are rate and burst. The purpose of the limit parameter – used by netem and several other qdiscs as well – is to specify the maximum queue size for the waiting packets. Its value here is not very significant, but too low a value can lead to packet loss and thus to momentarily plunging bandwidth. Too high a value, on the other hand, can lead to bufferbloat (14.8.1   Bufferbloat).

tc qdisc add dev r1-eth1 root tbf rate 40mbit burst 50kb limit 200kb

We get output something like this:

1.002   477840   0
1.102   477840   0
1.202   477840   0
1.302   482184   0
1.402   473496   0

477840 bytes per 100 ms is 38.2 Mbps. That is received application data; the extra 5% or so to 40 Mbps corresponds mostly to packet headers (66 bytes out of every 1514, though to see this with WireShark we need to disable TSO, 12.5   TCP Offloading).
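
The arithmetic here can be checked with a short calculation. The function names below are assumptions made for illustration; the frame and header sizes are the 1514 and 66 bytes quoted above.

```python
def mbps(byte_count, interval_sec):
    """Convert a byte count received in interval_sec seconds to megabits/sec."""
    return byte_count * 8 / interval_sec / 1e6

def expected_payload_bytes(rate_bps, interval_sec,
                           frame_bytes=1514, header_bytes=66):
    """Application bytes delivered per interval when a link of rate_bps
    bits/sec carries full-sized frames, each with header_bytes of headers."""
    payload_fraction = (frame_bytes - header_bytes) / frame_bytes
    return rate_bps / 8 * interval_sec * payload_fraction

observed = mbps(477840, 0.1)                        # the typical line above
predicted = expected_payload_bytes(40000000, 0.1)   # for the 40mbit cap
```

The predicted value is within a fraction of a percent of the observed 477840 bytes, which is exactly 330 segments of 1448 data bytes each.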

We can also change the rate dynamically:

tc qdisc change dev r1-eth1 root tbf rate 20mbit burst 100kb limit 200kb

The above use of tbf allows us to throttle (or police) all traffic through interface r1-eth1. Suppose we want to police selected traffic only? Then we can use hierarchical token bucket, or htb. We set up an htb root node, with no limits, and then create two child nodes, one for policed traffic and one for default traffic.

_images/htb.svg

To create the htb hierarchy we will first create the root qdisc and associated root class. We need the raw interface rate, here taken to be 1000mbit. Class identifiers are of the form major:minor, where major is the integer root “handle” and minor is another integer.

tc qdisc add dev r1-eth1 root handle 1: htb default 10
tc class add dev r1-eth1 parent 1: classid 1:1 htb rate 1000mbit

We now create the two child classes (not qdiscs), one for the rate-limited traffic and one for default traffic. The rate-limited class has classid 1:2 here; the default class has classid 1:10.

tc class add dev r1-eth1 parent 1: classid 1:2 htb rate 40mbit
tc class add dev r1-eth1 parent 1: classid 1:10 htb rate 1000mbit

We still need a classifier (or filter) to assign selected traffic to class 1:2. Our goal is to police traffic to port 5430 (by default, dualreceive.py accepts traffic at ports 5430 and 5431).

There are several classifiers available; for example u32 (man tc-u32) and bpf (man tc-bpf). The latter is based on the Berkeley Packet Filter virtual machine for packet recognition. However, what we use here – mainly because it seems to work most reliably – is the iptables fwmark mechanism, used earlier in 9.6   Routing on Other Attributes. Iptables is intended for filtering – and sometimes modifying – packets; we can associate a fwmark value of 2 to packets bound for TCP port 5430 with the command below (the fwmark value does not become part of the packet; it exists only while the packet remains in the kernel).

iptables --append FORWARD --table mangle --protocol tcp --dport 5430 --jump MARK --set-mark 2

When this is run on r1, then packets forwarded by r1 to TCP port 5430 receive the fwmark upon arrival.

The next step is to tell the tc subsystem that packets with a fwmark value of 2 are to be placed in class 1:2; this is the rate-limited class above. In the following command, flowid may be used as a synonym for classid.

tc filter add dev r1-eth1 parent 1:0 protocol ip handle 2 fw classid 1:2

We can view all these settings with

tc qdisc show dev r1-eth1
tc class show dev r1-eth1
tc filter show dev r1-eth1 parent 1:0
iptables --table mangle --list
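
For repeated experiments it may be convenient to generate the setup commands from a script. The following sketch collects the htb, iptables and filter commands given earlier in this section into one Python function; the function name and its parameters are assumptions, and the command strings are exactly those shown above with the device, port, rate and mark substituted.

```python
def htb_police_commands(dev, port, rate, line_rate='1000mbit', mark=2):
    """Return, in order, the shell commands that police TCP traffic
    to the given destination port at the given rate on interface dev."""
    return [
        # root qdisc and root class
        'tc qdisc add dev %s root handle 1: htb default 10' % dev,
        'tc class add dev %s parent 1: classid 1:1 htb rate %s' % (dev, line_rate),
        # rate-limited class 1:2 and default class 1:10
        'tc class add dev %s parent 1: classid 1:2 htb rate %s' % (dev, rate),
        'tc class add dev %s parent 1: classid 1:10 htb rate %s' % (dev, line_rate),
        # mark the selected packets, then steer marked packets to class 1:2
        'iptables --append FORWARD --table mangle --protocol tcp '
        '--dport %d --jump MARK --set-mark %d' % (port, mark),
        'tc filter add dev %s parent 1:0 protocol ip handle %d fw classid 1:2'
        % (dev, mark),
    ]

cmds = htb_police_commands('r1-eth1', 5430, '40mbit')
```

Within a Mininet script, each string could then be passed to r1.cmd(), in the manner of the h2.cmd() calls of 18.7   TCP Competition: Reno vs BBR.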

We now verify that all this works. As with tbf, we start dualreceive_incr.py on h2 and two senders on h1. This time, both senders send large amounts of data:

h2: python3 dualreceive_incr.py 2>/dev/null
h1: python3 sender.py 500000 10.0.1.10 5430
h1: python3 sender.py 500000 10.0.1.10 5431

If everything works, then shortly after the second sender starts we should see something like the output below (taken after both TCP connections have their cwnd stabilize). The middle column is the number of received data bytes to the policed port, 5430.

1.000   453224   10425600
1.100   457568   10230120
1.200   461912    9934728
1.300   476392   10655832
1.401   438744   10230120

With 66 bytes of TCP/IP headers in every 1514-byte packet, our requested 40 mbit data-rate cap should yield about 478,000 bytes every 0.1 sec. The slight reduction above appears to be related to TCP competition; the full 478,000-byte rate is achieved after the port-5431 connection terminates.

18.9   OpenFlow and the POX Controller

In this section we introduce the POX controller for OpenFlow (2.8.1   OpenFlow Switches) switches, allowing exploration of software-defined networking (2.8   Software-Defined Networking). In the switchline.py Ethernet-switch example from earlier, the Mininet() call included a parameter controller=DefaultController; this causes each switch to behave like an ordinary Ethernet learning switch. By using Pox to create customized controllers, we can investigate other options for switch operation. Pox is preinstalled on the Mininet virtual machine.

Pox is, like Mininet, written in Python2. It receives and sends OpenFlow messages, in response to events. Event-related messages, for our purposes here, can be grouped into the following categories:

  • PacketIn: a switch is informing the controller about an arriving packet, usually because the switch does not know how to forward the packet or does not know how to forward the packet without flooding. Often, but not always, PacketIn events will result in the controller providing new forwarding instructions.
  • ConnectionUp: a switch has connected to the controller. This will be the point at which the controller gives the switch its initial packet-handling instructions.
  • LinkEvent: a switch is informing the controller of a link becoming available or becoming unavailable; this includes initial reports of link availability.
  • BarrierEvent: a switch’s response to an OpenFlow Barrier message, meaning the switch has completed its responses to all messages received before the Barrier and now may begin to respond to messages received after the Barrier.

The Pox program comes with several demonstration modules illustrating how controllers can be programmed; these are in the pox/misc and pox/forwarding directories. The starting point for Pox documentation is the Pox wiki (archived copy at poxwiki.pdf), which among other things includes brief outlines of these programs. We now review a few of these programs; most were written by James McCauley and are licensed under the Apache license.

The Pox code data structures are very closely tied to the OpenFlow Switch Specification, versions of which can be found at the OpenNetworking.org technical library.

18.9.1   hub.py

As a first example of Pox, suppose we take a copy of the switchline.py file and make the following changes:

  • change the controller specification, inside the Mininet() call, from controller=DefaultController to controller=RemoteController.
  • add the following lines immediately following the Mininet() call:
c = RemoteController( 'c', ip='127.0.0.1', port=6633 )
net.addController(c)

This modified version is available as switchline_rc.py, “rc” for remote controller. If we now run this modified version, then pings fail because the RemoteController, c, does not yet exist; in the absence of a controller, the switches’ default response is to do nothing.

We now start Pox, in the directory /home/mininet/pox, as follows; this loads the file pox/forwarding/hub.py:

./pox.py forwarding.hub

Ping connectivity should be restored! The switch connects to the controller at IPv4 address 127.0.0.1 (more on this below) and TCP port 6633. At this point the controller is able to tell the switch what to do.

The hub.py example configures each switch as a simple hub, flooding each arriving packet out all other interfaces (though for the linear topology of switchline_rc.py, this doesn’t matter much). The relevant code is here:

def _handle_ConnectionUp (event):
    msg = of.ofp_flow_mod()
    msg.actions.append(of.ofp_action_output(port = of.OFPP_FLOOD))
    event.connection.send(msg)

This is the handler for ConnectionUp events; it is invoked when a switch first reports for duty. As each switch connects to the controller, the hub.py code instructs the switch to forward each arriving packet to the virtual port OFPP_FLOOD, which means to forward out all other ports.

The event parameter is of class ConnectionUp, a subclass of class Event. It is defined in pox/openflow/__init__.py. Most switch-event objects throughout Pox include a connection field, which the controller can use to send messages back to the switch, and a dpid field, representing the switch identification number. Generally the Mininet switch s1 will have a dpid of 1, etc.

The code above creates an OpenFlow modify-flow-table message, msg; this is one of several types of controller-to-switch messages that are defined in the OpenFlow standard. The field msg.actions is a list of actions to be taken; to this list we append the action of forwarding on the designated (virtual) port OFPP_FLOOD.

Normally we would also append to the list msg.match the matching rules for the packets to be forwarded, but here we want to forward all packets and so no matching is needed.

A different – though functionally equivalent – approach is taken in pox/misc/of_tutorial.py. Here, the response to the ConnectionUp event involves no communication with the switch (though the connection is stored in Tutorial.__init__()). Instead, as the switch reports each arriving packet to the controller, the controller responds by telling the switch to flood the packet out every port (this approach does result in sufficient unnecessary traffic that it would not be used in production code). The code (slightly consolidated) looks something like this:

def _handle_PacketIn (self, event):
    packet = event.parsed # This is the parsed packet data.
    packet_in = event.ofp # The actual ofp_packet_in message.
    self.act_like_hub(packet, packet_in)

def act_like_hub (self, packet, packet_in):
    msg = of.ofp_packet_out()
    msg.data = packet_in
    action = of.ofp_action_output(port = of.OFPP_ALL)
    msg.actions.append(action)
    self.connection.send(msg)

The event here is now an instance of class PacketIn. This time the controller sends a packet-out message to the switch. The packet and packet_in objects are two different views of the packet; the first is parsed and so is generally easier to obtain information from, while the second represents the entire packet as it was received by the switch. It is the latter format that is sent back to the switch in the msg.data field. The virtual port OFPP_ALL is equivalent to OFPP_FLOOD.

For either hub implementation, if we start WireShark on h2 and then ping from h4 to h1, we will see the pings at h2. This demonstrates, for example, that s2 is behaving like a hub rather than a switch.

18.9.2   l2_pairs.py

The next Pox example, l2_pairs.py, implements a real Ethernet learning switch. This is the pairs-based switch implementation discussed in 2.8.2   Learning Switches in OpenFlow. This module acts at the Ethernet address layer (layer 2, the l2 part of the name), and flows are specified by (src,dst) pairs of addresses. The l2_pairs.py module is started with the Pox command ./pox.py forwarding.l2_pairs.

A straightforward implementation of an Ethernet learning switch runs into a problem: the switch needs to contact the controller whenever the packet source address has not been seen before, so the controller can send back to the switch the forwarding rule for how to reach that source address. But the primary lookup in the switch flow table must be by destination address. The approach used here uses a single OpenFlow table, versus the two-table mechanism of 18.9.3   l2_nx.py. However, the learned flow table match entries will all include match rules for both the source and the destination address of the packet, so that a separate entry is necessary for each pair of communicating hosts. The number of flow entries thus scales as O(N²), which presents a scaling problem for very large switches but which we will ignore here.

When a switch sees a packet with an unmatched (dst,src) address pair, it forwards it to the controller, which has two cases to consider:

  • If the controller does not know how to reach the destination address from the current switch, it tells the switch to flood the packet. However, the controller also records, for later reference, the packet source address and its arrival interface.
  • If the controller knows that the destination address can be reached from this switch via switch port dst_port, it sends to the switch instructions to create a forwarding entry for (dst,src)→dst_port. At the same time, the controller also sends to the switch a reverse forwarding entry for (src,dst), forwarding via the port by which the packet arrived.

The controller maintains its partial map from addresses to switch ports in a dictionary table, which takes a (switch,destination) pair as its key and which returns switch port numbers as values. The switch is represented by the event.connection object used to reach the switch, and destination addresses are represented as Pox EthAddr objects.

The program handles only PacketIn events. The main steps of the PacketIn handler are as follows. First, when a packet arrives, we put its switch and source into table:

table[(event.connection,packet.src)] = event.port

The next step is to check to see if there is an entry in table for the destination, by looking up table[(event.connection,packet.dst)]. If there is not an entry, then the packet gets flooded by the same mechanism as in of_tutorial.py above: we create a packet-out message containing the to-be-flooded packet and send it back to the switch.

If, on the other hand, the controller finds that the destination address can be reached via switch port dst_port, it proceeds as follows. We first create the reverse entry; event.port is the port by which the packet just arrived:

msg = of.ofp_flow_mod()
msg.match.dl_dst = packet.src       # reversed dst and src
msg.match.dl_src = packet.dst       # reversed dst and src
msg.actions.append(of.ofp_action_output(port = event.port))
event.connection.send(msg)

This is like the forwarding rule created in hub.py, except that we here are forwarding via the specific port event.port rather than the virtual port OFPP_FLOOD, and, perhaps more importantly, we are adding two packet-matching rules to msg.match.

The next step is to create a similar matching rule for the src-to-dst flow, and to include the packet to be retransmitted. The modify-flow-table message thus does double duty as a packet-out message as well.

msg = of.ofp_flow_mod()
msg.data = event.ofp                # Forward the incoming packet
msg.match.dl_src = packet.src       # not reversed this time!
msg.match.dl_dst = packet.dst
msg.actions.append(of.ofp_action_output(port = dst_port))
event.connection.send(msg)
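
The controller's decision logic can be modeled without Pox or OpenFlow at all, which may make the two cases easier to follow. The sketch below is such a model and is not the l2_pairs.py code: switches and addresses are plain strings, and a flow entry is represented by a hypothetical (match_src, match_dst, out_port) tuple rather than an ofp_flow_mod message.

```python
FLOOD = 'flood'

def handle_packet_in(table, switch, src, dst, in_port):
    """Return the controller's response to one PacketIn event.

    table maps (switch, address) to a switch port.  The result is
    either FLOOD or a list of (match_src, match_dst, out_port) flow
    entries to be installed on the switch.
    """
    table[(switch, src)] = in_port          # learn how to reach src
    dst_port = table.get((switch, dst))
    if dst_port is None:
        return FLOOD                        # destination not yet known
    return [
        (dst, src, in_port),                # reverse entry, for dst-to-src traffic
        (src, dst, dst_port),               # forward entry, for src-to-dst traffic
    ]
```

For example, if the first packet from h1 to h4 arrives at s2 on port 2, the result is FLOOD; if the reply from h4 to h1 then arrives at s2 on port 3, the result is the pair of entries (h1, h4, 3) and (h4, h1, 2).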

The msg.match object has quite a few potential matching fields; the following is taken from the Pox-Wiki:

Attribute   Meaning
in_port     Switch port number the packet arrived on
dl_src      Ethernet source address
dl_dst      Ethernet destination address
dl_type     Ethertype / length (e.g. 0x0800 = IPv4)
nw_tos      IPv4 TOS/DS bits
nw_proto    IPv4 protocol (e.g., 6 = TCP), or lower 8 bits of ARP opcode
nw_src      IPv4 source address
nw_dst      IP destination address
tp_src      TCP/UDP source port
tp_dst      TCP/UDP destination port

It is also possible to create a msg.match object that matches all fields of a given packet.

We can watch the forwarding entries created by l2_pairs.py with the Linux program ovs-ofctl. Suppose we start switchline_rc.py and then the Pox module l2_pairs.py. Next, from within Mininet, we have h1 ping h4 and h2 ping h4. If we now run the command (on the Mininet virtual machine but from a Linux prompt)

ovs-ofctl dump-flows s2

we get

cookie=0x0, …,dl_src=00:00:00:00:00:01,dl_dst=00:00:00:00:00:04 actions=output:3
cookie=0x0, …,dl_src=00:00:00:00:00:04,dl_dst=00:00:00:00:00:02 actions=output:1
cookie=0x0, …,dl_src=00:00:00:00:00:02,dl_dst=00:00:00:00:00:04 actions=output:3
cookie=0x0, …,dl_src=00:00:00:00:00:04,dl_dst=00:00:00:00:00:01 actions=output:2

Because we used the autoSetMacs=True option in the Mininet() call in switchline_rc.py, the Ethernet addresses assigned to hosts are easy to follow: h1 is 00:00:00:00:00:01, etc. The first and fourth lines above result from h1 pinging h4; we can see from the output port at the end of each line that s1 must be reachable from s2 via port 2 and s3 via port 3. Similarly, the middle two lines result from h2 pinging h4; h2 lies off s2’s port 1. These port numbers correspond to the interface numbers shown in the diagram at 18.3   Multiple Switches in a Line.
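
The port reasoning above can be automated with a few lines of Python. The sketch below parses the four lines shown earlier, copied as-is with the elided fields still marked by "…". The regular expression is an assumption fitted to these abbreviated lines; real ovs-ofctl output carries more fields and may need a looser pattern.

```python
import re

# The four lines shown above, copied as-is (the "…" marks elided fields).
dump = """\
cookie=0x0, …,dl_src=00:00:00:00:00:01,dl_dst=00:00:00:00:00:04 actions=output:3
cookie=0x0, …,dl_src=00:00:00:00:00:04,dl_dst=00:00:00:00:00:02 actions=output:1
cookie=0x0, …,dl_src=00:00:00:00:00:02,dl_dst=00:00:00:00:00:04 actions=output:3
cookie=0x0, …,dl_src=00:00:00:00:00:04,dl_dst=00:00:00:00:00:01 actions=output:2
"""

# Assumed layout: dl_src, then dl_dst, then a single output action.
pattern = re.compile(
    r"dl_src=([0-9a-f:]+),dl_dst=([0-9a-f:]+) actions=output:(\d+)")

def ports_by_destination(text):
    """Map each destination MAC to the output port used to reach it."""
    result = {}
    for src, dst, port in pattern.findall(text):
        result[dst] = int(port)
    return result

dst_ports = ports_by_destination(dump)
for mac, port in sorted(dst_ports.items()):
    print(mac, "-> port", port)
```

The resulting map shows h1 reached via port 2, h2 via port 1 and h4 via port 3, in agreement with the discussion above.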

18.9.3   l2_nx.py

The l2_nx.py example accomplishes the same Ethernet-switch effect as l2_pairs.py, but using only O(N) space. It does, however, use two OpenFlow tables, one for destination addresses and one for source addresses. In the implementation here, source addresses are held in table 0, while destination addresses are held in table 1; this is the reverse of the multiple-table approach outlined in 2.8.2   Learning Switches in OpenFlow. The l2 again refers to network layer 2, and the nx refers to the so-called Nicira extensions to Pox, which enable the use of multiple flow tables.

Initially, table 0 is set up so that it tries a match on the source address. If there is no match, the packet is forwarded to the controller, and sent on to table 1. If there is a match, the packet is sent on to table 1 but not to the controller.

Table 1 then looks for a match on the destination address. If one is found then the packet is forwarded to the destination, and if there is no match then the packet is flooded.

Using two OpenFlow tables in Pox requires the loading of the so-called Nicira extensions (hence the “nx” in the module name here). These require a slightly more complex command line:

./pox.py openflow.nicira --convert-packet-in forwarding.l2_nx

Nicira will also require, eg, nx.nx_flow_mod() instead of of.ofp_flow_mod().

The no-match actions for each table are set during the handling of the ConnectionUp events. An action becomes the default action when no msg.match() rules are included, and the priority is low; recall (2.8.1   OpenFlow Switches) that if a packet matches multiple flow-table entries then the entry with the highest priority wins. The priority is here set to 1; the Pox default priority – which will be used (implicitly) for later, more-specific flow-table entries – is 32768. The first step is to arrange for table 0 to forward to the controller and to table 1.

msg = nx.nx_flow_mod()
msg.table_id = 0              # not necessary as this is the default
msg.priority = 1              # low priority
msg.actions.append(of.ofp_action_output(port = of.OFPP_CONTROLLER))
msg.actions.append(nx.nx_action_resubmit.resubmit_table(table = 1))
event.connection.send(msg)

Next we tell table 1 to flood packets by default:

msg = nx.nx_flow_mod()
msg.table_id = 1
msg.priority = 1
msg.actions.append(of.ofp_action_output(port = of.OFPP_FLOOD))
event.connection.send(msg)

Now we define the PacketIn handler. First comes the table 0 match on the packet source; if there is a match, then the source address has been seen by the controller, and so the packet is no longer forwarded to the controller (it is forwarded to table 1 only).

msg = nx.nx_flow_mod()
msg.table_id = 0
msg.match.of_eth_src = packet.src     # match the source
msg.actions.append(nx.nx_action_resubmit.resubmit_table(table = 1))
event.connection.send(msg)

Now comes table 1, where we match on the destination address. All we know at this point is that the packet with source address packet.src came from port event.port, and we forward any packets addressed to packet.src via that port:

msg = nx.nx_flow_mod()
msg.table_id = 1
msg.match.of_eth_dst = packet.src     # this rule applies only for packets to packet.src
msg.actions.append(of.ofp_action_output(port = event.port))
event.connection.send(msg)

Note that there is no network state maintained at the controller; there is no analog here of the table dictionary of l2_pairs.py.

Suppose we have a simple network h1–s1–h2. When h1 sends to h2, the controller will add to s1’s table 0 an entry indicating that h1 is a known source address. It will also add to s1’s table 1 an entry indicating that h1 is reachable via the port on s1’s left. Similarly, when h2 replies, s1 will have h2 added to its table 0, and then to its table 1.
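
The h1–s1–h2 scenario can be traced with a small plain-Python model of the two tables. This is a sketch only: the class TwoTableSwitch and its method names are hypothetical, the controller's reaction is folded into the switch for brevity, and no OpenFlow messages are involved.

```python
# Plain-Python model of the two-table l2_nx switch (no Pox or OpenFlow).
# TwoTableSwitch and its method names are hypothetical.

class TwoTableSwitch:
    def __init__(self):
        self.table0 = set()    # known source addresses
        self.table1 = {}       # destination address -> output port
        self.packet_ins = 0    # packets reported to the controller

    def receive(self, src, dst, in_port):
        """Process one packet; return the output port or 'FLOOD'."""
        if src not in self.table0:
            # Table-0 miss: the packet goes to the controller, which
            # installs one entry in each table for this source.
            self.packet_ins += 1
            self.table0.add(src)
            self.table1[src] = in_port
        # In every case the packet is resubmitted to table 1.
        return self.table1.get(dst, "FLOOD")

s1 = TwoTableSwitch()
out1 = s1.receive("h1", "h2", in_port=1)   # h2 not yet known
out2 = s1.receive("h2", "h1", in_port=2)   # h1 was learned above
out3 = s1.receive("h1", "h2", in_port=1)   # both known, no controller
print(out1, out2, out3, s1.packet_ins)
```

With N hosts, each table holds at most N entries, whereas the pair-matching rules of l2_pairs can grow to one entry per (source, destination) pair.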

18.9.4   multitrunk.py

The goal of the multitrunk example is to illustrate how different TCP connections between two hosts can be routed via different paths; in this case, via different “trunk lines”. This example and the next are not part of the standard distributions of either Mininet or Pox. Unlike the other examples discussed here, these examples consist of Mininet code to set up a specific network topology and a corresponding Pox controller module that is written to work properly only with that topology. Most real networks evolve with time, making such a tight link between topology and controller impractical (though this may sometimes work well in datacenters). The purpose here, however, is to illustrate specific OpenFlow possibilities in a (relatively) simple setting.

The multitrunk topology involves multiple “trunk lines” between hosts h1 and h2, as in the following diagram; the trunk lines are the s1–s3 and s2–s4 links.

_images/multitrunk12.svg

The Mininet file is multitrunk12.py and the corresponding Pox module is multitrunkpox.py. The number of trunk lines is K=2 by default, but can be changed by setting the variable K. We will prevent looping of broadcast traffic by never flooding along the s2–s4 link.

TCP traffic takes either the s1–s3 trunk or the s2–s4 trunk. We will refer to the two directions h1→h2 and h2→h1 of a TCP connection as flows, consistent with the usage in 8.1   The IPv6 Header. Only h1→h2 flows will have their routing vary; h2→h1 flows will always take the s1–s3 path. It does not matter if the original connection is opened from h1 to h2 or from h2 to h1.

The first TCP flow from h1 to h2 goes via s1–s3. After that, subsequent connections alternate in round-robin fashion between s1–s3 and s2–s4. To achieve this we must, of course, include TCP ports in the OpenFlow forwarding information.

All links will have a bandwidth set in Mininet. This involves using the link=TCLink option; TC here stands for Traffic Control. We do not otherwise make use of the bandwidth limits. TCLinks can also have a queue size set, as in 18.6   TCP Competition: Reno vs Vegas.

For ARP and ICMP traffic, two OpenFlow tables are used as in 18.9.3   l2_nx.py. The PacketIn messages for ARP and ICMP packets are how switches learn of the MAC addresses of hosts, and also how the controller learns which switch ports are directly connected to hosts. TCP traffic is handled differently, below.

During the initial handling of ConnectionUp messages, switches receive their default packet-handling instructions for ARP and ICMP packets, and a SwitchNode object is created in the controller for each switch. These objects will eventually contain information about what neighbor switch or host is reached by each switch port, but at this point none of that information is yet available.

The next step is the handling of LinkEvent messages, which are initiated by the discovery module. This module must be included on the ./pox.py command line in order for this example to work. The discovery module sends each switch, as it connects to the controller, a special discovery packet in the Link Layer Discovery Protocol (LLDP) format; this packet includes the originating switch’s dpid value and the switch port by which the originating switch sent the packet. When an LLDP packet is received by the neighboring switch, that switch forwards it back to the controller, together with the dpid and port for the receiving switch. At this point the controller knows the switches and port numbers at each end of the link. The controller then reports this to our multitrunkpox module via a LinkEvent event.

As LinkEvent messages are processed, the multitrunkpox module learns, for each switch, which ports connect directly to neighboring switches. At the end of the LinkEvent phase, which generally takes several seconds, each switch’s SwitchNode knows about all directly connected neighbor switches. Nothing is yet known about directly connected neighbor hosts though, as hosts have not yet sent any packets.
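
The discovery exchange can be modeled in a few lines of plain Python. This is a sketch only: the wiring is a made-up two-switch example rather than the multitrunk topology, the function names are hypothetical, and no real LLDP packets are sent.

```python
# Plain-Python model of the discovery exchange; no real LLDP is sent.
# The wiring below is a made-up two-switch example, and the function
# names are hypothetical.

# Physical wiring, unknown to the controller at first:
# (dpid, port) at one end -> (dpid, port) at the other end.
wiring = {(1, 2): (2, 1), (2, 1): (1, 2)}

def send_lldp(dpid, port):
    """Have switch dpid emit a probe on the given port. The probe
    carries the originating (dpid, port). Return what the neighboring
    switch reports to the controller, or None if no switch is attached."""
    probe = (dpid, port)
    far_end = wiring.get((dpid, port))
    if far_end is None:
        return None                 # e.g. a host-facing port
    return probe, far_end           # (origin, receiver)

def discover(switch_ports):
    """Probe every port of every switch; collect the reported links."""
    links = set()
    for dpid, ports in switch_ports.items():
        for port in ports:
            report = send_lldp(dpid, port)
            if report is not None:
                links.add(report)   # one report per link direction
    return links

links = discover({1: [1, 2], 2: [1, 2]})
for origin, receiver in sorted(links):
    print(origin, "->", receiver)
```

Ports that lead to hosts yield no report, which is why host locations remain unknown until the hosts themselves send traffic.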

Once hosts h1 and h2 exchange a pair of packets, the associated PacketIn events tell multitrunkpox what switch ports are connected to hosts. Ethernet address learning also takes place. If we execute h1 ping h2, for example, then afterwards the information contained in the SwitchNode graph is complete.

Now suppose h1 tries to open a TCP connection to h2, eg via ssh. The first packet is a TCP SYN packet. The switch s5 will see this packet and forward it to the controller, where the PacketIn handler will process it. We create a flow for the packet,

flow = Flow(psrc, pdst, ipv4.srcip, ipv4.dstip, tcp.srcport, tcp.dstport)

and then see if a path has already been assigned to this flow in the dictionary flow_to_path. For the very first packet this will never be the case. If no path exists, we create one, first picking a trunk:

trunkswitch = picktrunk(flow)
path = findpath(flow, trunkswitch)

The first path will be the Python list [h1, s5, s1, s3, s6, h2], where the switches are represented by SwitchNode objects.
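
The round-robin assignment of flows to trunks might be sketched as follows. Everything here is a hypothetical stand-in: the real multitrunkpox.py may organize Flow, picktrunk() and findpath() differently, strings replace SwitchNode objects, the second path is inferred from the topology description, and only h1→h2 flows are modeled.

```python
from collections import namedtuple
from itertools import count

# Hypothetical stand-ins for the module's Flow class and helpers;
# the real multitrunkpox.py may be organized differently.
Flow = namedtuple("Flow", "ethsrc ethdst srcip dstip srcport dstport")

# Each trunk is named here by its left-hand switch, with its full path.
# Strings stand in for SwitchNode objects.
TRUNK_PATHS = {
    "s1": ["h1", "s5", "s1", "s3", "s6", "h2"],
    "s2": ["h1", "s5", "s2", "s4", "s6", "h2"],
}
TRUNKS = ["s1", "s2"]

flow_to_path = {}
_counter = count()          # number of flows assigned so far

def picktrunk(flow):
    """Round-robin choice; this sketch ignores the flow's contents."""
    return TRUNKS[next(_counter) % len(TRUNKS)]

def findpath(flow, trunkswitch):
    return TRUNK_PATHS[trunkswitch]

def path_for(flow):
    """Look up the flow's path, assigning one on first sight."""
    if flow not in flow_to_path:
        flow_to_path[flow] = findpath(flow, picktrunk(flow))
    return flow_to_path[flow]

f1 = Flow("00:00:00:00:00:01", "00:00:00:00:00:02",
          "10.0.0.1", "10.0.0.2", 59404, 22)
f2 = f1._replace(srcport=59526)     # a second connection
print(path_for(f1))
print(path_for(f2))
print(path_for(f1))     # unchanged: the flow already has a path
```

Because the TCP ports are part of the Flow key, two connections between the same pair of hosts count as distinct flows and can be given different trunks.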

The supposedly final step is to call

result = create_path_entries(flow, path)

to create the forwarding rules for each switch. With the path as above, the SwitchNode objects know what port s5 should use to reach s1, etc. Because the first TCP SYN packet must have been preceded by an ARP exchange, and because the ARP exchange will result in s6 learning what port to use to reach h2, this should work.

But in fact it does not, at least not always. The problem is that Pox creates separate internal threads for the ARP-packet handling and the TCP-packet handling, and the former thread may not yet have installed the location of h2 into the appropriate SwitchNode object by the time the latter thread calls create_path_entries() and needs the location of h2. This race condition is unfortunate, but cannot be avoided. As a fallback, if creating a path fails, we flood the TCP packet along the s1–s3 link (even if the chosen trunk is the s2–s4 link) and wait for the next TCP packet to try again. Very soon, s6 will know how to reach h2, and so create_path_entries() will succeed.

If we run everything, create two xterms on h1, and then create two ssh connections to h2, we can see the forwarding entries using ovs-ofctl. Let us run

ovs-ofctl dump-flows s5

Restricting attention only to those flow entries that match tcp, we get (with a little sorting)

cookie=0x0, …, tcp,dl_src=00:00:00:00:00:01,dl_dst=00:00:00:00:00:02,nw_src=10.0.0.1,nw_dst=10.0.0.2,tp_src=59404,tp_dst=22 actions=output:1
cookie=0x0, …, tcp,dl_src=00:00:00:00:00:01,dl_dst=00:00:00:00:00:02,nw_src=10.0.0.1,nw_dst=10.0.0.2,tp_src=59526,tp_dst=22 actions=output:2
cookie=0x0, …, tcp,dl_src=00:00:00:00:00:02,dl_dst=00:00:00:00:00:01,nw_src=10.0.0.2,nw_dst=10.0.0.1,tp_src=22,tp_dst=59404 actions=output:3
cookie=0x0, …, tcp,dl_src=00:00:00:00:00:02,dl_dst=00:00:00:00:00:01,nw_src=10.0.0.2,nw_dst=10.0.0.1,tp_src=22,tp_dst=59526 actions=output:3

The first two entries represent the h1→h2 flows. The first connection has source TCP port 59404 and is routed via the s1–s3 trunk; we can see that the output port from s5 is port 1, which is indeed the port that s5 uses to reach s1 (the output of the Mininet links command includes s5-eth1<->s1-eth2). Similarly, the output port used at s5 by the second connection, with source TCP port 59526, is 2, which is the port s5 uses to reach s2. The switch s5 reaches host h1 via port 3, which can be seen in the last two entries above, which correspond to the reverse h2→h1 flows.

The OpenFlow timeout here is infinite. This is not a good idea if the system is to be running indefinitely, with a steady stream of short-term TCP connections. It does, however, make it easier to view connections with ovs-ofctl before they disappear. A production implementation would need a finite timeout, and then would have to ensure that connections that were idle for longer than the timeout interval were properly re-established when they resumed sending.

The multitrunk strategy presented here can be compared to Equal-Cost Multi-Path routing, 9.7   ECMP. In both cases, traffic is divided among multiple paths to improve throughput. Here, individual TCP connections are assigned a trunk by the controller (and can be reassigned at will, perhaps to improve the load balance). In ECMP, it is common to assign TCP connections to paths via a pseudorandom hash, in which case the approach here offers the potential for better control of the distribution of traffic among the trunk links. In some configurations, however, ECMP may route packets over multiple links on a round-robin packet-by-packet basis rather than a connection-by-connection basis; this allows much better load balancing.

OpenFlow has low-level support for this approach in the select group mechanism. A flow-table traffic-matching entry can forward traffic to a so-called group instead of out via a port. The action of a select group is then to select one of a set of output actions (often on a round-robin basis) and apply that action to the packet. In principle, we could implement this at s5 to have successive packets sent to either s1 or s2 in round-robin fashion. In practice, Pox support for select groups appears to be insufficiently developed at the time of this writing (2017) to make this practical.

18.9.5   loadbalance31.py

The next example demonstrates a simple load balancer. The topology is somewhat the reverse of the previous example: there are now three hosts (N=3) at each end, and only one trunk line (K=1) (there are also no left- and right-hand entry/exit switches). The right-hand hosts act as the “servers”, and are renamed t1, t2 and t3.

_images/loadbalance.svg

The servers all get the same IPv4 address, 10.0.0.1. This would normally lead to chaos, but the servers are not allowed to talk to one another, and the controller ensures that the servers are not even aware of one another. In particular, the controller makes sure that the servers never all simultaneously reply to an ARP “who-has 10.0.0.1” query from r.

The Mininet file is loadbalance31.py and the corresponding Pox module is loadbalancepox.py.

The node r is a router, not a switch, and so its four interfaces are assigned to separate subnets. Each host is on its own subnet, which it shares with r. The router r then connects to the only switch, s; the connection from s to the controller c is shown.

The idea is that each TCP connection from any of the hi to 10.0.0.1 is connected, via s, to one of the servers ti, but different connections will connect to different servers. In this implementation the server choice is round-robin, so the first three TCP connections will connect to t1, t2 and t3 respectively, and the fourth will connect again to t1.

The servers t1 through t3 are configured to all have the same IPv4 address 10.0.0.1; there is no address rewriting done to packets arriving from the left. However, as in the preceding example, when the first packet of each new TCP connection from left to right arrives at s, it is forwarded to c which then selects a specific ti and creates in s the appropriate forwarding rule for that connection. As in the previous example, each TCP connection involves two Flow objects, one in each direction, and separate OpenFlow forwarding entries are created for each flow.

There is no need for paths; the main work of routing the TCP connections looks like this:

server = pickserver(flow)
flow_to_server[flow] = server
addTCPrule(event.connection, flow, server+1)        # ti is at port i+1
addTCPrule(event.connection, flow.reverse(), 1)     # port 1 leads to r
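
The round-robin choice and the paired forward and reverse rules can be sketched in plain Python. The sketch is hypothetical throughout: the real loadbalancepox.py may differ, addTCPrule() here omits the connection argument and merely records the rule, and the fourth connection's port number is invented for the example.

```python
from collections import namedtuple

# Hypothetical sketch; the real loadbalancepox.py may differ in detail.
class Flow(namedtuple("Flow", "srcip dstip srcport dstport")):
    def reverse(self):
        """The same connection seen in the opposite direction."""
        return Flow(self.dstip, self.srcip, self.dstport, self.srcport)

NUM_SERVERS = 3
flow_to_server = {}
rules = []                 # (flow, output port) pairs "installed" in s
_next_server = 0

def pickserver(flow):
    """Round-robin: returns 1, 2, 3, 1, ... for t1, t2, t3, t1, ..."""
    global _next_server
    server = _next_server + 1
    _next_server = (_next_server + 1) % NUM_SERVERS
    return server

def addTCPrule(flow, port):
    rules.append((flow, port))     # stands in for sending a flow_mod

def new_connection(flow):
    server = pickserver(flow)
    flow_to_server[flow] = server
    addTCPrule(flow, server + 1)       # ti is at port i+1
    addTCPrule(flow.reverse(), 1)      # port 1 leads to r
    return server

conns = [Flow("10.0.1.1", "10.0.0.1", 35110, 22),
         Flow("10.0.2.1", "10.0.0.1", 44014, 22),
         Flow("10.0.3.1", "10.0.0.1", 55598, 22),
         Flow("10.0.1.1", "10.0.0.1", 35112, 22)]   # invented 4th flow
servers = [new_connection(f) for f in conns]
print(servers)
```

The fourth connection wraps around to t1, and every reverse flow is sent out port 1 toward r regardless of which server was chosen.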

The biggest technical problem is ARP: normally, r and the ti would contact one another via ARP to find the appropriate LAN addresses, but that will not end well with identical IPv4 addresses. So instead we create “static” ARP entries. We know (by checking) that the MAC address of r-eth0 is 00:00:00:00:00:04, and so the Mininet file runs the following command on each of the ti:

arp -s 10.0.0.2 00:00:00:00:00:04

This creates a static ARP entry on each of the ti, which leaves them knowing the MAC address for their default router 10.0.0.2. As a result, none of them issues an ARP query to find r. The other direction is similar, except that r (which is not really in on the load-balancing plot) must think 10.0.0.1 has a single MAC address. Therefore, we give each of the ti the same MAC address (which would normally lead to even more chaos than giving them all the same IPv4 address); that address is 00:00:00:00:01:ff. We then install a permanent ARP entry on r with

arp -s 10.0.0.1 00:00:00:00:01:ff

Now, when h1, say, sends a TCP packet to 10.0.0.1, r forwards it to MAC address 00:00:00:00:01:ff, and then s forwards it to whichever of t1..t3 it has been instructed by the controller c to forward it to. The packet arrives at ti with the correct IPv4 address (10.0.0.1) and correct MAC address (00:00:00:00:01:ff), and so is accepted. Replies are similar: ti sends to r at MAC address 00:00:00:00:00:04.

As part of the ConnectionUp processing, we set up rules so that ICMP packets from the left are always routed to t1. This way we have a single responder to ping requests. It is entirely possible that some important ICMP message – eg Fragmentation required but DF flag set – will be lost as a result.

If we run the programs and create xterm windows for h1, h2 and h3 and, from each, connect to 10.0.0.1 via ssh, we can tell that we’ve reached t1, t2 or t3 respectively by running ifconfig. The Ethernet interface on t1 is named t1-eth0, and similarly for t2 and t3. (Finding another way to distinguish the ti is not easy.) An even simpler way to see the connection rotation is to run h1 ssh 10.0.0.1 ifconfig at the mininet> prompt several times in succession, and note the successive interface names.

If we create three connections and then run ovs-ofctl dump-flows s and look at tcp entries with destination address 10.0.0.1, we get this:

cookie=0x0, …, tcp,dl_src=00:00:00:00:00:04,dl_dst=00:00:00:00:01:ff,nw_src=10.0.1.1,nw_dst=10.0.0.1,tp_src=35110,tp_dst=22 actions=output:2
cookie=0x0, …, tcp,dl_src=00:00:00:00:00:04,dl_dst=00:00:00:00:01:ff,nw_src=10.0.2.1,nw_dst=10.0.0.1,tp_src=44014,tp_dst=22 actions=output:3
cookie=0x0, …, tcp,dl_src=00:00:00:00:00:04,dl_dst=00:00:00:00:01:ff,nw_src=10.0.3.1,nw_dst=10.0.0.1,tp_src=55598,tp_dst=22 actions=output:4

The three different flows take output ports 2, 3 and 4 on s, corresponding to t1, t2 and t3.

18.9.6   l2_multi.py

This final Pox controller example takes an arbitrary Mininet network, learns the topology, and then sets up OpenFlow rules so that all traffic is forwarded by the shortest path, as measured by hopcount. OpenFlow packet-forwarding rules are set up on demand, when traffic between two hosts is first seen.

This module is compatible with topologies with loops, provided the spanning_tree module is also loaded.

We start with the spanning_tree module. This uses the openflow.discovery module, as in 18.9.4   multitrunk.py, to build a map of all the connections, and then runs the spanning-tree algorithm of 2.5   Spanning Tree Algorithm and Redundancy. The result is a list of switch ports on which flooding should not occur; flooding is then disabled by setting the OpenFlow NO_FLOOD attribute on these ports. We can see the ports of a switch s that have been disabled via NO_FLOOD by using ovs-ofctl show s.

One nicety is that the spanning_tree module is never quite certain when the network is complete. Therefore, it recalculates the spanning tree after every LinkEvent.

We can see the spanning_tree module in action if we create a Mininet network of four switches in a loop, as in exercise 9.0 below, and then run the following:

./pox.py openflow.discovery openflow.spanning_tree forwarding.l2_pairs

If we run ovs-ofctl show for each switch, we get something like the following:

s1: (s1-eth2): … NO_FLOOD
s2: (s2-eth2): … NO_FLOOD

We can verify with the Mininet links command that s1-eth2 and s2-eth2 are connected interfaces. We can verify with tcpdump -i s1-eth2 that no packets are endlessly circulating.
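
To see why exactly one link of a four-switch loop ends up with flooding disabled, here is a generic breadth-first spanning-tree sketch in plain Python. It is not the code used by the Pox spanning_tree module, and which link is excluded depends on the root and on the order in which links are examined, so the excluded link here need not match the one Pox chooses.

```python
from collections import deque

# Links of a four-switch loop, as (switch, switch) pairs. The function
# below is a generic breadth-first spanning tree, not Pox's own code.
links = [("s1", "s2"), ("s2", "s3"), ("s3", "s4"), ("s4", "s1")]

def spanning_tree(links, root):
    """Return the set of links kept in a BFS spanning tree from root."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    visited = {root}
    tree = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nbr in neighbors[node]:
            if nbr not in visited:
                visited.add(nbr)
                tree.add(frozenset((node, nbr)))
                queue.append(nbr)
    return tree

tree = spanning_tree(links, "s1")
# Links outside the tree are the ones on which flooding is disabled.
no_flood = [link for link in links if frozenset(link) not in tree]
print("flooding disabled on:", no_flood)
```

A loop of four switches has four links but a spanning tree needs only three, so one link is always left out of the tree.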

We can also verify, with ovs-ofctl dump-flows, that the s1–s2 link is not used at all, not even for s1–s2 traffic. This is not surprising; the l2_pairs learning strategy ultimately learns source addresses from flooded ARP packets, which are not sent along the s1–s2 link. If s1 hears nothing from s2, it will never learn to send anything to s2.

The l2_multi module, on the other hand, creates a full map of all network links (separate from the map created by the spanning_tree module), and then calculates the best route between each pair of hosts. To calculate the routes, l2_multi uses the Floyd-Warshall algorithm (outlined below), which is a form of the distance-vector algorithm optimized for when a full network map is available. (The shortest-path algorithm of 9.5.1   Shortest-Path-First Algorithm might be a faster choice.) To avoid having to rebuild the forwarding map on each LinkEvent, l2_multi does not create any routes until it sees the first packet (not counting LLDP packets). By that point, usually the network is stable.

If we run the example above using the Mininet rectangle topology, we again find that the spanning tree has disabled flooding on the s1–s2 link. However, if we have h1 ping h2, we see that h1→h2 traffic does take the s1–s2 link. Here is part of the result of ovs-ofctl dump-flows s1:

cookie=0x0, …, priority=65535,icmp,in_port=1,…,dl_src=00:00:00:00:00:01,dl_dst=00:00:00:00:00:02,nw_src=10.0.0.1,nw_dst=10.0.0.2,…,icmp_type=8… actions=output:2
cookie=0x0, …, priority=65535,icmp,in_port=1,…0,dl_src=00:00:00:00:00:01,dl_dst=00:00:00:00:00:02,nw_src=10.0.0.1,nw_dst=10.0.0.2,…,icmp_type=0… actions=output:2

Note that l2_multi creates separate flow-table rules not only for ARP and ICMP, but also for ping (icmp_type=8) and ping reply (icmp_type=0). Such fine-grained matching rules are a matter of preference.

Here is a brief outline of the Floyd-Warshall algorithm. We assume that the switches are numbered {1,…,N}. The outer loop runs over stages k = 1,…,N; at the start of stage k, we assume that we’ve found the best path between every i and j for which every intermediate switch on the path is numbered less than k. For many (i,j) pairs, there may be no such path.

At stage k, we examine, with an inner loop, all pairs (i,j). We look to see if there is a path from i to k and a second path from k to j. If there is, we concatenate the i-to-k and k-to-j paths to create a new i-to-j path, which we will call P. If there was no previous i-to-j path, then we add P. If there was a previous i-to-j path Q that is longer than P, we replace Q with P. At the end of the k=N stage, all paths have been discovered.
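
The outline above translates into a short Python sketch. This is an illustration on hop counts only, not the l2_multi code; the function names and the nxt (first-hop) matrix used for path reconstruction are choices made for this example.

```python
# Floyd-Warshall on hop counts, following the outline above. Switches
# are numbered 1..N; dist[i][j] is the best known hop count and
# nxt[i][j] the first hop on that path. Names are illustrative only.

INF = float("inf")

def floyd_warshall(n, links):
    dist = [[INF] * (n + 1) for _ in range(n + 1)]
    nxt = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][i] = 0
    for a, b in links:                      # links are bidirectional
        dist[a][b] = dist[b][a] = 1
        nxt[a][b], nxt[b][a] = b, a
    for k in range(1, n + 1):               # stage k
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                # Concatenate the i-to-k and k-to-j paths if that is
                # shorter than the current i-to-j path (or none exists).
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt

def path(nxt, i, j):
    """Reconstruct the switch sequence from i to j, or None."""
    if i != j and nxt[i][j] is None:
        return None
    hops = [i]
    while i != j:
        i = nxt[i][j]
        hops.append(i)
    return hops

# The four-switch loop s1-s2-s3-s4-s1
dist, nxt = floyd_warshall(4, [(1, 2), (2, 3), (3, 4), (4, 1)])
print(dist[1][2], path(nxt, 1, 2))
print(dist[1][3], path(nxt, 1, 3))
```

The three nested loops give a running time proportional to N³, which is acceptable for small networks.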

18.10   Exercises

Exercises are given fractional (floating point) numbers, to allow for interpolation of new exercises. Exercise 2.5 is distinct, for example, from exercises 2.0 and 3.0. Exercises marked with a ♢ have solutions or hints at 24.13   Solutions for Mininet.

1.0. In the RIP implementation of 18.5   IP Routers With Simple Distance-Vector Implementation, add Split Horizon (9.2.1.1   Split Horizon).

2.0. In the RIP implementation of 18.5   IP Routers With Simple Distance-Vector Implementation, add support for link failures (the third rule of 9.1.1   Distance-Vector Update Rules).

3.0. Explain why, in the example of 18.9.3   l2_nx.py, table 0 and table 1 will always have the same entries.

4.0. Suppose we try to eliminate the source addresses from the l2_pairs implementation.

  • by default, all switches report all packets to the controller, and the controller then tells the switch to flood the packet.
  • if a packet from ha to hb arrives at switch S, and S reports the packet to the controller, and the controller knows how to reach hb from S, then the controller installs forwarding rules into S for destination hb. The controller then tells S to re-forward the packet. In the future, S will not report packets to hb to the controller.
  • when S reports to the controller a packet from ha to hb, then the controller notes that ha is reachable via the port on S by which the packet arrived.

Why does this not work? Hint: consider the switchline example (18.3   Multiple Switches in a Line), with h1 sending to h4, h4 sending to h1, h3 sending to h1, and finally h1 sending to h3.

5.0. Suppose we make the following change to the above strategy:

  • if a packet from ha to hb arrives at switch S, and S reports the packet to the controller, and the controller knows how to reach both ha and hb from S, then the controller installs forwarding rules into S for destinations ha and hb. The controller then tells S to re-forward the packet. In the future, S will not report packets to ha or hb to the controller.

Show that this still does not work for the switchline example.

6.0. Suppose we try to implement an Ethernet switch as follows:

  • the default switch action for an unmatched packet is to flood it and send it to the controller.
  • if a packet from ha to hb arrives at switch S, and S reports the packet to the controller, and the controller knows how to reach both ha and hb from S, then the controller installs forwarding rules into S for destinations ha and hb. In the future, S will not report packets with these destinations to the controller.
  • Unlike in exercise 4.0, the controller then tells S to flood the packet from ha to hb, even though it could be forwarded directly.

Traffic is sent in the network below:

h1     h2     h3
│      │      │
s1─────s2─────s3

(a)♢. Show that, if the traffic is as follows: h1 pings h2, h3 pings h1, then all three switches learn where h3 is.

(b). Show that, if the traffic is as follows: h1 pings h2, h1 pings h3, then none of the switches learn where h3 is.

Recall that each ping for a new destination starts with a broadcast ARP. Broadcast packets are always sent to the controller, as there is no destination match.

7.0. In 18.9.5   loadbalance31.py, we could have configured the ti to have default router 10.0.0.3, say, and then created the appropriate static ARP entry for 10.0.0.3:

ip route add to default via 10.0.0.3 dev ti-eth0
arp -s 10.0.0.3 00:00:00:00:00:04

Everything still works, even though the ti think their router is at 10.0.0.3 and it is actually at 10.0.0.2. Explain why. (Hint: how is the router IPv4 address actually used by the ti?)

8.0. As discussed in the text, a race condition can arise in the example of 18.9.4   multitrunk.py, where at the time the first TCP packet arrives the controller still does not know where h2 is, even though it should learn that after processing the first ARP packet.

Explain why a similar race condition cannot occur in 18.9.5   loadbalance31.py.

9.0. Create a Mininet network with four hosts and four switches as below:

h1────s1────────s2────h2
       │        │
       │        │
h4────s4────────s3────h3

The switches should use an external controller. Now let Pox be that controller, with

./pox.py openflow.discovery openflow.spanning_tree forwarding.l2_pairs

10.0. Create the topology below with Mininet. Run the l2_multi Pox module as controller, with the openflow.spanning_tree option, and identify the spanning tree created. Also identify the path taken by icmp traffic from h1 to h2.

_images/3x4.svg