A few months ago, I wrote about putting vlans of a bridge into other bridges. This turned out to be a dumb idea, because:

  1. it only gave access to half the bandwidth - only one of the two interfaces on each server is active at any given time

  2. it *blush* doesn't actually work - if you attach bridge1.123 as a port to, say, bridge v123, then attach a veth device from a VM to v123, it turns out that only broadcasts can cross the bridge. If you give v123 an IP address, it can reach the VM and the world, but the VM itself can't reach the world.

So that got the big fail. If the Dell switches (and Linux) had per-vlan STP, that might have helped; if there was a Linux implementation of MST, that would definitely have helped; if my servers had more physical ports, that would have been useful. Enough with the woulda coulda shoulda.

As a workaround, I went back to a more classic build for a Pacemaker/DRBD cluster - eth0 on boxa attached to switcha, eth0 on boxb attached to switchb, eth1 on each box tied together with a direct cable, and the switches likewise. Front end traffic goes on eth0, DRBD and pacemaker traffic on eth1.

With the correct pacemaker setup to detect upstream router unreachability, this works OK, but it isn't optimal - if a switch dies, or the crossover cable gets pulled, or an ethernet interface dies, then one of the servers will get failed out, and all the load will shift onto the other server. It feels like a blunt instrument to fail out a whole server because of an upstream fault - it would be better if both servers could keep running, albeit in an impaired fashion with reduced bandwidth.

While reading up on the Totem support for dual redundant rings, I happened across a post that essentially said "most people don't use redundant rings, they just use bonding". That got me to wondering.

Traditional load balancing bonding (LACP/802.3ad) for extra bandwidth doesn't work well between a server and two different switches, assuming you can get it to stand up at all. The MAC address of the server is learnt on both switches, and all sorts of wackiness ensues. Active-standby bonding for high availability, where only one link is active at any given time, should work to multiple switches, but then we're back at problem 1 above - we've halved the available bandwidth.

That got me wondering if I could bond vlan sub interfaces together. Initial signs aren't promising - the 'tubes are full of information about how to make vlan sub interfaces of a bond, but there's very little on how to take vlan sub interfaces of an ethernet port and bond them together. The only post I could find suggested it wasn't going to work, but didn't say why. Conceptually, there are some limitations - load balancing to multiple switches is still going to be problematic, and anything that requires communication between the switch and the server (LACP or similar) is not going to work, as the switch is not expecting the control packets to arrive tagged.

Load balancing often doesn't give you the behaviour you're expecting, particularly if there's a small number of MAC addresses at either end of the bonded link, and debugging it is a dark art, so I'm happy to steer well clear of it. In this case, I don't need the extra capacity of load balancing, I just want traffic separation - 1Gb for DRBD, and 1Gb for everything else, with sharing if one of the links goes down.

So my thought was that if I could make half a dozen matching vlans each on eth0 and eth1 and bond them into six different bond devices each running active-backup, then I could get some coarse load balancing by forcing the DRBD bond to favour the eth1.xxx slave, and all the other bonds to favour the eth0.xxx slave.

This, somewhat surprisingly, works an absolute charm. My config (for Debian squeeze) looks something like this:

# pacemaker vlan:
auto eth0.253
iface eth0.253 inet manual

auto eth1.253
iface eth1.253 inet manual

auto bond253
iface bond253 inet static
  address 192.168.253.11
  netmask 255.255.255.0
  bond-slaves eth0.253 eth1.253
  bond-mode active-backup
  bond-primary eth0.253
  bond_arp_ip_target 192.168.253.12
  bond_arp_interval 500
  bond_arp_validate 3
  up ip link set eth0.253 mtu 9000
  up ip link set eth1.253 mtu 9000
  up ip link set bond253 mtu 9000

# DRBD vlan:
auto eth0.254
iface eth0.254 inet manual

auto eth1.254
iface eth1.254 inet manual

auto bond254
iface bond254 inet static
  address 192.168.254.11
  netmask 255.255.255.0
  bond-slaves eth0.254 eth1.254
  bond-mode active-backup
  bond-primary eth1.254
  bond_arp_ip_target 192.168.254.12
  bond_arp_interval 500
  bond_arp_validate 3
  up ip link set eth0.254 mtu 9000
  up ip link set eth1.254 mtu 9000
  up ip link set bond254 mtu 9000

It's a slightly clumsy syntax (the mtu handling of bond interfaces is a bit odd), and there's an implicit race - the ethN.xxx vlan interfaces need to be created before the bond interface. In practice, on boot Debian works top to bottom down the file, and I'm not going to be hotplugging, so it's a race I can live with. All good, wish I'd thought of it three months ago!
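Once the bonds are up, it's worth checking that each one has actually settled on its preferred slave. Something like the following (interface names as per the config above) shows the active slave and the arp monitor settings:

cat /sys/class/net/bond253/bonding/active_slave    # expect eth0.253
cat /sys/class/net/bond254/bonding/active_slave    # expect eth1.254
cat /proc/net/bonding/bond254                      # full bond status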

If you want the active-standby system to actually fail over when there is a problem, there are two failure detection mechanisms in the kernel - miimon, which watches physical ethernet link state, and arp monitoring, where the bond periodically sprays arp requests out onto the network and listens for replies. Miimon is a local physical link test, while the arp monitor is an end to end test.

My experience of switch failures (and I've seen a lot more than I like to admit) is that when they fail, they very rarely have the grace to shut down ethernet ports as they go. So while the miimon code will catch a switch getting powered down, it won't be much use if the switch has a software crash, or a human removes a vlan from an interface/switch. Therefore, in most real world situations the arp monitoring approach looks superior, so long as you can live with its limitations, and are emotionally prepared for the amount of broadcast traffic it sprays around within each vlan.
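To convince yourself that the arp monitor is earning its keep, watch the active slave while you provoke an upstream fault that miimon can't see - pulling the vlan off the switch port, for example:

# bond254 as per the config above; the active slave should flip once replies stop arriving
watch -n1 cat /sys/class/net/bond254/bonding/active_slave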

The major downside of the arp monitor is that it requires that the bond interface have a valid ipv4 config. It presumably doesn't work in a v6-only environment, and it certainly doesn't work if you want to take the bond device and put it into a bridge - if you do something like this:

auto eth0.102
iface eth0.102 inet manual

auto eth1.102
iface eth1.102 inet manual

auto bond102
iface bond102 inet manual
  bond-slaves eth0.102 eth1.102
  bond-mode active-backup
  bond-primary eth0.102
  bond_arp_ip_target 10.1.1.21
  bond_arp_interval 500
  bond_arp_validate 3

auto vlan102
iface vlan102 inet static
  address 10.1.1.22
  netmask 255.255.255.0
  bridge_ports bond102
  bridge_stp off
  bridge_fd 0
  bridge_maxwait 0

then the kernel will mark both eth0.102 and eth1.102 as inactive, and you'll be connectivity free. This is presumably because the kernel wants to send its arp probes out of bond102 and can't, because bond102 doesn't have an IP address. Attaching an IP isn't an option either, as interfaces enslaved to a bridge don't normally work at L3.

This is a bit of a bore, as putting the bond device into a bridge is a requirement if your virtualisation system attaches virtual ethernet devices to bridges to get VMs connected - as almost all of them do. So I'm guessing most virtualisation systems, at least KVM and LXC, will be incompatible with arp monitoring.
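If you do have to put a bond inside a bridge, the simple fallback is to drop the arp monitor and use miimon on that bond, accepting its blind spots. The bridged stanza above would then look something like this (a sketch; bond-miimon 100 checks link state every 100ms):

auto bond102
iface bond102 inet manual
  bond-slaves eth0.102 eth1.102
  bond-mode active-backup
  bond-primary eth0.102
  bond-miimon 100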

If you do need to put your bonds into bridges and find miimon insufficient, then you could consider some kind of "truck and trailer" approach with a pair of bonds. If you had, for example, a probe bond10 (slaves eth0.10 and eth1.10) with arp monitoring and not in a bridge, and a live traffic bond20 (slaves eth0.20 and eth1.20) in a bridge with no monitoring, then you could run a process that did something like this:

#!/bin/sh
# keep bond20 (bridged, unmonitored) on the same physical link as
# bond10 (the arp-monitored probe bond)
while true; do
  # the active slaves are vlan subinterfaces (e.g. eth0.10, eth1.20),
  # so compare the physical device underneath each
  phys10=$(cat /sys/class/net/bond10/bonding/active_slave | cut -d. -f1)
  phys20=$(cat /sys/class/net/bond20/bonding/active_slave | cut -d. -f1)
  if [ "$phys10" != "$phys20" ]; then
    # force bond20 onto the same physical port as bond10
    echo "${phys10}.20" > /sys/class/net/bond20/bonding/active_slave
  fi
  sleep 1
done

I haven't implemented this, so I don't know whether it works or is even feasible. While it would still be better than miimon, it wouldn't catch vlan 20 being dropped from a trunk, so there are failure modes that native arp monitoring will pick up that this won't. In our case, we're primarily using OpenVZ VMs, so I've decided to use venet routed connectivity rather than veth in a bridge. If we ever need to use KVM or LXC, then I'll have to think again - a problem for another day.
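For reference, routed venet connectivity sidesteps the bridge question entirely - the host routes to the container over the venet device, and giving a container an address is just (container ID and address made up):

vzctl set 101 --ipadd 10.1.1.101 --save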

So the upshot of all this is that each server has eth0 to switcha and eth1 to switchb. Normally DRBD traffic is on eth1 and everything else is on eth0, and both physical links can be in use at the same time, each running at up to 1Gb/s (think of a large file upload to a VM on the shared disk, which generates front end traffic on one link and DRBD traffic on the other). A nice side effect of this build is that under normal load (all switches/links running), the traffic is more or less localised on each switch - the DRBD traffic stays in switchb, everything else in switcha. So I don't have to worry too much about congestion on the link between the switches.

If any link or switch fails, traffic will merge onto the other link, and both servers will keep running, albeit with reduced bandwidth. Given that our workload is RAM and CPU bound, that's a better result than merging all the workload onto one box.