Wednesday, August 8, 2018

Babel: a routing protocol with wireless support

by Craig Miller


Traffic
Previously I wrote about resurrecting the old forgotten routing protocol, RIPng. In a small network of more than one router, you need a routing protocol to share information between the routers. I used RIPng for about six months, turned it on, and pretty much forgot that it was running. Worked like a charm in my wired network.
I moved to a new (to me) house this summer, and thought it was a good opportunity to try out a routing protocol which not only handles wired networks but also wireless. Babel seemed just the thing for this environment.

Enter Babel


Babel is a loop-avoiding distance-vector routing protocol that is robust and efficient both in ordinary wired networks and in wireless mesh networks. Based on the loss of hellos the cost of wireless links can be increased, making sketchy wireless links less preferred.
RFC 6126 standardizes the routing protocol.There are two implementations which are supported on OpenWrt routers, babeld and bird


Creating a network with redundant paths


Like anything in networking, it starts with the physical layer (wireless is a form of physical layer). I attached the wireless links of the backup link router to the production and test routers. Thus creating redundant path of connectivity within my house.
Network Diagram

Running BIRD with Babel


I chose bird6 (the IPv6 version of bird on OpenWrt) because I already had it installed on the routers for RIPng. It was merely a matter of commenting out the RIP section in the /etc/bird6.conf file, and enabling Babel.
The Bird Documentation provides an example. Add the following to /etc/bird6.conf get Babel running in bird6
protocol babel {
    interface "wlan0", "wlan1" {
        type wireless;
        hello interval 1;
        rxcost 512;
    };
    interface "br-lan" {
        type wired;
    };
    import all;
    export all;
}
In the example above, wlan0 is the 2.4 Ghz radio, and wlan1 is the 5 Ghz radio.

Checking the path of connectivity


When determining the connectivity path, traceroute6 (the IPv6 version) is your friend. Checking between the laptop and the DNS server, the path is:
$ traceroute6 6dns
traceroute to 6lilikoi.hoomaha.net (2001:db8:ebbd:4118::1) from 2001:db8:ebbd:bac0:d999:cd8a:cd9b:2037, port 33434, from port 49819, 30 hops max, 60 bytes packets
 1  2001:db8:ebbd:bac0::1 (2001:db8:ebbd:bac0::1)  4.561 ms  0.510 ms  0.487 ms 
 2  2001:db8:ebbd:4118::1 (2001:db8:ebbd:4118::1)  2.562 ms  2.193 ms  1.927 ms 
$ 
The traceroute is showing the path going clockwise through the 2.4 Ghz wireless link.

Network Failure!


To test how well Babel can automatically route around failed links, I started a ping to the DNS server from the laptop and disabled the 2.4 Ghz radio, thus blocking the link the pings were using, and waited...
$ ping6 6dns
PING 6dns(2001:db8:ebbd:4118::1) 56 data bytes
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=1 ttl=63 time=3.54 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=2 ttl=63 time=1.64 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=3 ttl=63 time=2.02 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=4 ttl=63 time=1.64 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=5 ttl=63 time=1.51 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=6 ttl=63 time=1.65 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=7 ttl=63 time=1.58 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=8 ttl=63 time=5.80 ms
From 2001:db8:ebbd:bac0::1 icmp_seq=33 Destination unreachable: No route
From 2001:db8:ebbd:bac0::1 icmp_seq=34 Destination unreachable: No route
...
From 2001:db8:ebbd:bac0::1 icmp_seq=48 Destination unreachable: No route
From 2001:db8:ebbd:bac0::1 icmp_seq=49 Destination unreachable: No route
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=101 ttl=61 time=2.12 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=102 ttl=61 time=3.42 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=103 ttl=61 time=3.16 ms

As you can see the outage was 93 seconds (101 - 8). Not a record time, OSPF would converge much faster, but still it did fix itself without human intervention.
Checking the connectivity path with traceroute6:
$ traceroute6 6dns
traceroute to 6lilikoi.hoomaha.net (2001:db8:ebbd:4118::1) from 2001:db8:ebbd:bac0:d999:cd8a:cd9b:2037, port 33434, from port 47725, 30 hops max, 60 bytes packets
 1  2001:db8:ebbd:bac0::1 (2001:db8:ebbd:bac0::1)  0.541 ms  0.445 ms  0.437 ms 
 2  2001:db8:ebbd:2080::1 (2001:db8:ebbd:2080::1)  1.705 ms  1.832 ms  1.817 ms 
 3  2001:db8:ebbd:2000::1 (2001:db8:ebbd:2000::1)  2.273 ms  1.891 ms  2.584 ms 
 4  2001:db8:ebbd:4118::1 (2001:db8:ebbd:4118::1)  2.348 ms  2.822 ms  2.289 ms 
$
The path can now be seen to be traveling counter-clockwise around the circle via the 5 Ghz link. The Babel routing protocol is routing packets around the failure.

Wireless is great, except ...


As more and more things come online using wireless there will be more interference and contention for bandwidth, especially in the 2.4 Ghz band. Babel can enables routing of packets around sketchy wireless links due to interference in a crowded wifi environment.

Your Metric may vary


Because wireless is variable, Babel applies differing metrics to routes as the wireless signal changes. An unfortunate side effect of this is that the network is continuously converging (or changing). The route that may have been used last minute to the remote host, my be invalid the next minute.
I noticed this as my previously very stable IPv6-only servers were now disconnecting, or worse, not reachable.

Route Flapping!


As I looked at the OpenWrt syslog (using the logread command) I could see that the routes were continually changing.
Tue Jul 24 14:46:45 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:46:45 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:46:46 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 14:47:01 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:01 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:02 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 14:47:33 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:33 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:34 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 14:47:49 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:49 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:50 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 14:48:53 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:48:53 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:48:54 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
...

The problem with this route flapping is that it was being propagated to the other routers which were busy adding and removing routes, causing unreachable to parts of my network. Not a desired behaviour.

Settling things down


To rid my network of the route churn, I changed the Babel wireless interfaces to wired, giving them a stable metric, no longer tied to the variability of the wireless signal quality (signal to noise).
The /etc/bird6.conf now looks like:
protocol babel {
    interface "wlan0", "wlan1" {
        type wired;
        hello interval 5;
    };
    interface "br-lan" {
        type wired;
    };
    import all;
    export all;
}
Restarting bird6, and looking at the syslog, a brief activity can be seen, then the route churn stops, and the network is stable.

My ssh connection was dropped as the network did an initial reconverge, and then I was able log back in and examine the syslog.

Babel, still a work in progress


Babel is still being actively developed, and has a more modern approach to wireless links (something that was near non-existent when RIPng was being standardized back in 1997). Like RIPng, it is easy to set up without having to understand the complexities of OSPF. It is easy to setup on OpenWrt routers and provides redundancy in your network. That said the wireless functionality as implemented by Bird (v 1.63) is not quite there. Fortunately, there is Bird v2.0 out, and I look forward to giving it a try when it comes to OpenWrt.

Postscript


Although the route churn has subsided, I re-measured the convergence time for Babel, and it was quite long, 317 seconds, probably due to the hello timer being set to 5 seconds.
In the end, I reverted my house network to RIPng. Running the same convergence test yielded an outage of only 11 seconds with no route churn.
Perhaps many of the Babel issues are just Bird's implementation. And there may be tweaks to reduce network converge times. I'd happily give Babel another chance, but for now, I'll stick with good ol' RIPng.



** if you are running a firewall, the default on OpenWrt/LEDE, you will need to put in a rule to accept IPv6 UDP port 6696

Originally posted (with more detail) on www.makikiweb.com