A few months ago, I wrote about putting vlans of a bridge into other bridges. This turned out to be a dumb idea, because:

  1. it only gave access to half the bandwidth - only one of the two interfaces on each server is active at any given time

  2. it *blush* doesn't actually work - if you attach bridge1.123 as a port to, say, bridge v123, then attach a veth device from a VM to v123, it turns out only broadcasts can cross the bridge. If you give v123 an IP address, it can reach the vm and the world, but the vm itself can't reach the world.

So that got the big fail. If the Dell switches (and Linux) had per-vlan STP, that might have helped; if there was a Linux implementation of MST, that would definitely have helped; if my servers had more physical ports, that would have been useful. Enough with the woulda coulda shoulda.

As a workaround, I went back to a more classic build for a Pacemaker/DRBD cluster - eth0 on boxa attached to switcha, eth0 on boxb attached to switchb, eth1 on each box tied together with a direct cable, and the switches likewise. Front end traffic on eth0, DRBD and pacemaker traffic on eth1.

With the correct pacemaker setup to detect upstream router unreachability, this works OK, but it isn't optimal - if a switch dies, or the crossover cable gets pulled, or an ethernet interface dies, then one of the servers will get failed out, and all the load will shift onto the other server. Failing out a server because of an upstream fault feels like a blunt instrument - it would be better if both servers could keep running, albeit in an impaired fashion with reduced bandwidth.

While reading up on the Totem support for dual redundant rings, I happened across a post that essentially said "most people don't use redundant rings, they just use bonding". That got me to wondering.

Traditional load balancing bonding (lacp/802.3ad) for extra bandwidth doesn't work well between a server and two different switches, assuming you can get it to stand up at all. The MAC address of the server is learnt on both switches, and all sorts of wackiness will ensue. Active-standby bonding for high availability, where only one link is active at any given time, should work to multiple switches, but we're back at problem 1 above - we've halved the available bandwidth.

That got me wondering if I could bond vlan sub interfaces together. Initial signs aren't promising - the 'tubes are full of information about how to make vlan sub interfaces of a bond, but there's very little on how to take vlan sub interfaces of an ethernet port and bond them together. The only post I could find suggested it wasn't going to work, but didn't say why. Conceptually, there are some limitations - load balancing to multiple switches is still going to be problematic, and anything that requires communication between the switch and the server (LACP or similar) is not going to work, as the switch is not expecting the control packets to arrive tagged.

Load balancing often doesn't give you the behaviour you're expecting, particularly if there's a small number of MAC addresses at either end of the bonded link, and debugging it is a dark art, so I'm happy to steer well clear of it. In this case, I don't need the extra capacity of load balancing, I just want traffic separation - 1Gb for DRBD, and 1Gb for everything else, with sharing if one of the links goes down.

So my thought was that if I could make half a dozen matching vlans each on eth0 and eth1 and bond them into six different bond devices each running active-backup, then I could get some coarse load balancing by forcing the DRBD bond to favour the eth1.xxx slave, and all the other bonds to favour the eth0.xxx slave.

This, somewhat surprisingly, works an absolute charm. My config (for Debian squeeze) looks something like this:

# pacemaker vlan:
auto eth0.253
iface eth0.253 inet manual

auto eth1.253
iface eth1.253 inet manual

auto bond253
iface bond253 inet static
  address 192.168.253.11
  netmask 255.255.255.0
  bond-slaves eth0.253 eth1.253
  bond-mode active-backup
  bond-primary eth0.253
  bond_arp_ip_target 192.168.253.12
  bond_arp_interval 500
  bond_arp_validate 3
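  # (bond_arp_interval is in milliseconds; bond_arp_validate 3 means "all",
  #  i.e. validate ARP probes on both the active and the backup slaves)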
  up ip link set eth0.253 mtu 9000
  up ip link set eth1.253 mtu 9000
  up ip link set bond253 mtu 9000

# DRBD vlan:
auto eth0.254
iface eth0.254 inet manual

auto eth1.254
iface eth1.254 inet manual

auto bond254
iface bond254 inet static
  address 192.168.254.11
  netmask 255.255.255.0
  bond-slaves eth0.254 eth1.254
  bond-mode active-backup
  bond-primary eth1.254
  bond_arp_ip_target 192.168.254.12
  bond_arp_interval 500
  bond_arp_validate 3
  up ip link set eth0.254 mtu 9000
  up ip link set eth1.254 mtu 9000
  up ip link set bond254 mtu 9000

It's a slightly clumsy syntax (the mtu handling of bond interfaces is a bit odd), and there's an implicit race - the ethn.xxx vlan interfaces need to be created before the bond interface. In practice, on bootup Debian runs top to bottom down the file, and I'm not going to be hotplugging, so it's a race I can live with. All good - wish I'd thought of it three months ago!

If you want the active-standby system to actually fail over when there is a problem, there are two failure detection mechanisms in the kernel - miimon, which watches physical ethernet link state, and arp monitoring, where each side sprays arp requests out on to the network and listens for replies. Miimon is a local physical link test, while the arp monitor is an end to end test.

My experience of switch failures (and I've seen a lot more than I like to admit) is that when they fail, they very rarely have the grace to shut down ethernet ports as they go. So while the miimon code will catch a switch getting powered down, it won't be much use if the switch has a software crash, or a human removes a vlan from an interface/switch. Therefore, in most real world situations the arp monitoring approach looks superior, so long as you can live with its limitations, and are emotionally prepared for the amount of broadcast traffic it sprays around within each vlan.
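
Whichever monitor you choose, the bonding driver exposes its runtime state under /proc and /sys, which is handy for checking that the settings actually took - bond253 here is just the example bond from the config above:

cat /proc/net/bonding/bond253
cat /sys/class/net/bond253/bonding/active_slave
cat /sys/class/net/bond253/bonding/arp_ip_target
cat /sys/class/net/bond253/bonding/arp_interval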

The major downside of the arp monitor is that it requires that the bond interface have a valid ipv4 config. It presumably doesn't work in a v6-only environment, and it certainly doesn't work if you want to take the bond device and put it into a bridge - if you do something like this:

auto eth0.102
iface eth0.102 inet manual

auto eth1.102
iface eth1.102 inet manual

auto bond102
iface bond102 inet manual
  bond-slaves eth0.102 eth1.102
  bond-mode active-backup
  bond-primary eth0.102
  bond_arp_ip_target 10.1.1.21
  bond_arp_interval 500
  bond_arp_validate 3

auto vlan102
  iface vlan102 inet static
  address 10.1.1.22
  netmask 255.255.255.0
  bridge_ports bond102
  bridge_stp off
  bridge_fd 0
  bridge_maxwait 0

then the kernel will mark both eth0.102 and eth1.102 as inactive, and you'll be connectivity free. This is presumably because the kernel wants to start sending ARP frames out of bond102, and can't, because it doesn't have an IP address. Attaching an IP isn't an option, as interfaces enslaved to a bridge don't normally work at L3.

This is a bit of a bore, as putting the bond device into a bridge is a requirement if your virtualisation system attaches virtual ethernet devices to bridges to get VM's connected - as almost all of them do. So I'm guessing most virtualisation systems, at least KVM and LXC, will be incompatible with arp monitoring.

If you do need to put your bonds into bridges and find miimon insufficient, then you could consider some kind of "truck and trailer" approach with a pair of bonds. If you had, for eg, a probe bond10 with arp monitoring not in a bridge, and a live traffic bond20 in a bridge, with no monitoring, then you could run a process that did something like this:

#!/bin/sh
# keep the live-traffic bond20 using the same underlying link as the
# arp-monitored probe bond10
while true; do
  slave10=`cat /sys/class/net/bond10/bonding/active_slave`
  slave20=`cat /sys/class/net/bond20/bonding/active_slave`
  if [ "$slave10" != "$slave20" ]; then
    echo "+$slave10" > /sys/class/net/bond20/bonding/slaves
    echo "-$slave20" > /sys/class/net/bond20/bonding/slaves
  fi
  sleep 1
done

I haven't implemented this, so I don't know whether it works/is feasible. While it's still better than miimon, it wouldn't catch vlan 20 being dropped from a trunk, so there are failure modes that native arp monitoring will pick up that this won't. In our case, we're primarily using OpenVZ VM's, so I've decided to use venet routed connectivity rather than veth in a bridge. If we ever need to use KVM or LXC, then I'll have to think again - a problem for another day.

So the upshot of all this is that each server has eth0 to switcha, and eth1 to switchb. Normally DRBD traffic is on eth1, everything else is on eth0, and both physical links will be in use at the same time, each running up to 1Gb/s (large file upload to a VM with a shared disk causing lots of DRBD traffic). A nice side effect of this build is that under normal load (all switches/links running), the traffic is more or less localised on each switch - the DRBD traffic stays in switchb, everything else in switcha. So I don't have to worry too much about congestion on the link between the switches.

If any link or switch fails, traffic will merge onto the other link, and both servers will keep running, albeit with reduced bandwidth. Given that our workload is RAM and CPU bound, that's a better result than merging all the workload onto one box.

Posted Mon Oct 17 14:36:55 2011 Tags:

I'm working on a project to take two FortiGate 60C routers, two Dell 5424 switches, and two Dell r410 servers (each with two GE interfaces) and lash them together such that a hardware failure in any component leaves the service running. This is conceptually pretty simple, in practice various constraints mean it is not as trivial as it seems. There are several approaches I could take:

  • create a left and a right vlan, plumb one leg of each server into a different vlan on each switch, and run a routing protocol like OSPF or BGP over both legs.

this would work, but requires complexity on each server which could limit the server OS choices (we'd need support for whatever protocol is chosen).

  • create a left and a right vlan, and use the load balancer on the fortigates to monitor the servers and decide which way to send traffic.

this will work OK for inbound traffic, but creates some complexity for outbound traffic - how does each server determine which vlan to use for return traffic?

  • run RSTP on the servers, and let layer two deal with the problem, use the fortigates to load balance between the servers in one vlan.

this has charm, in that it makes the IP build of the network much simpler, OTOH it pushes some complexity further down the stack, and it places similar restrictions on server OS choice, in that whatever we run needs to support RSTP. In practice, we mainly want Debian running OpenVZ, so this shouldn't be a problem.

The fortigates don't support STP, other than basic passing or blocking of STP frames, but if we're running them as an active/standby HA pair only one will be active, so it should all work.

The Dell switches support original IEEE STP, RSTP and MST (but not Cisco PVST+ and PVRST+), which means that they should interoperate with a linux bridge running STP/RSTP. The devil, of course, is in the detail - the RSTP support in linux is not well documented.

Initially, I tested standard STP - install Debian squeeze plus the bridge-utils and vlan packages. I made a bridge in /etc/network/interfaces:

iface br1 inet dhcp
 bridge_ports eth0 eth1
 bridge_stp on

then plumbed eth0 and eth1 to ports on each switch. The switch ports were configured as access ports in a vlan with a dhcp server.

Turn on RSTP on the switches - it should be backwards compatible. Make sure you know where the root of the spanning tree is - it's important that the root sits on one of the switches, and that the Linux boxes lose any STP election, so the root doesn't end up on one of the servers. On the primary switch, I did:

spanning-tree mode rstp 
spanning-tree priority 0

and on the secondary:

spanning-tree mode rstp 
spanning-tree priority 4096

(you can also use the web GUI). Run

# ifup br1

wait 30 seconds, and

# brctl showstp br1 

should show one port forwarding, one blocking, and you should have connectivity (DHCP may time out, in which case setting a static IP address on the bridge would be simpler). You can randomly pull cables from the server, or power down the switches, and you should retain connectivity, albeit with 30-60 second gaps while STP reconverges.
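
To watch the reconvergence while you're pulling cables, something like this (plain bridge-utils, nothing exotic) does the job:

# watch -n 1 brctl showstp br1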

So far so good, onto RSTP. I gleaned most of this from the linux bridge mailing list, particularly the thread starting https://lists.linux-foundation.org/pipermail/bridge/2008-March/005765.html

First up, do the install dance:

# aptitude install build-essential
# git clone git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/rstp.git
# cd rstp
# make clean ; make ; make install

Included in the git repo is a bridge-stp script that attempts to start the rstpd daemon and enable RSTP for you. This script needs to be in /sbin/bridge-stp - it's called every time you try and enable STP on a bridge, and needs to return zero for RSTP to be enabled (there are more details in the thread linked above).

This is the most crucial point - you must have a /sbin/bridge-stp, and to enable RSTP it must return 0!

You need to read the script and decide if it's going to do what you want - there are various fixes discussed in the thread if you do decide to use it. I decided to start rstpd from init.d, and so made a null bridge-stp script:

# cat > /sbin/bridge-stp
#!/bin/sh
exit 0
^D
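
One detail that's easy to miss: the kernel runs /sbin/bridge-stp as an external helper, so the script must also be executable, or enabling RSTP will quietly fail:

# chmod 755 /sbin/bridge-stp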

I also made a simple init script for rstpd by copying and modifying /etc/init.d/skeleton:

# diff skeleton rstpd
 3,10c3,10
 < # Provides:          skeleton
 < # Required-Start:    $remote_fs $syslog
 < # Required-Stop:     $remote_fs $syslog
 < # Default-Start:     2 3 4 5
 < # Default-Stop:      0 1 6
 < # Short-Description: Example initscript
 < # Description:       This file should be used to construct scripts to be
 < #                    placed in /etc/init.d.
 ---
 > # Provides:          rstpd
 > # Required-Start:    mountkernfs $local_fs
 > # Required-Stop:     $local_fs
 > # Should-Start:      ifupdown
 > # Should-Stop:       ifupdown
 > # Default-Start:     S
 > # Default-Stop:      0 6
 > # Short-Description: Start the Rapid STP Daemon
 22,25c22,25
 < DESC="Description of the service"
 < NAME=daemonexecutablename
 < DAEMON=/usr/sbin/$NAME
 < DAEMON_ARGS="--options args"
 ---
 > DESC="Rapid STP Daemon"
 > NAME=rstpd
 > DAEMON=/sbin/$NAME
 > #DAEMON_ARGS="--options args"

Make sure you set the insserv dependencies correctly - you want rstpd up and running before the bridges are brought up, or turning on RSTP won't work. I ended up changing the dependencies in /etc/init.d/networking to require rstpd first.
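
As a rough sketch of what that looks like (the header contents here are illustrative - check what your /etc/init.d/networking already lists), the networking script's LSB header gains rstpd:

# Required-Start:    mountkernfs $local_fs ifupdown rstpd

and then insserv is re-run so the boot ordering gets regenerated:

# insserv networking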

Once rstpd is running from boot, we need to add a line to turn on RSTP in /etc/network/interfaces:

iface br1 inet dhcp
 bridge_ports eth0 eth1
 bridge_stp on
 up rstpctl rstp br1 on

Bring it up - if RSTP works, you should get:

# ifup br1
# cat /sys/class/net/br1/bridge/stp_state
2

# rstpctl showportdetail br1 eth0
Stp Port eth0: PortId: 8001 in Bridge 'br1':
Priority:          128
State:             Discarding             Uptime: 723      
PortPathCost:      admin: Auto            oper: 20000    
Point2Point:       admin: Auto            oper: Yes      
Edge:              admin: N               oper: N        
Partner:                                  oper: Rapid    
PathCost:          20000

This shows you what operating mode the port is in, and what RSTP state it is in. Now you can unplug cables and power down switches, and you should see outages of at most a couple of seconds. All good.

Dealing with VLANs.

Standard 802.1w RSTP is agnostic on the subject of vlans - it describes how you create a single spanning-tree for your entire ethernet, regardless of what you might be doing with vlans (as opposed to PVST+ and PVRST+, the cisco proprietary STP variants that run a spanning-tree per vlan). The simplicity of 802.1w has a couple of implications:

  • you can't do the vlan load balancing tricks that are common in the cisco world - rooting different vlans on different switches so that you make some use of your backup links. In practice, that means that you only get 1Gb out of each server, not 2Gb - if maximal bandwidth out of your servers is the desired outcome, RSTP isn't going to get you there.

  • it doesn't really make sense to bolt vlan interfaces into different bridges on a linux box if you're going to run STP - each bridge will try and run STP itself, and madness will ensue.

So the way that I ended up doing vlans is mildly counterintuitive - by making vlan subinterfaces of the first bridge and then putting them into another bridge that exists only within the server, and doesn't run STP. So, for eg, to make vlan 101 on the switch network available to some VE's on the box, I converted the switchports facing the linux boxes into trunks, with vlan 101 tagged. I then added this to /etc/network/interfaces:

auto br1.101
iface br1.101 inet manual
 vlan_raw_device br1

auto vlan101
iface vlan101 inet manual
 bridge_ports br1.101
 bridge_stp off
 bridge_fd 0
 bridge_maxwait 0

This gives you a bridge called vlan101 that you can then insert your OpenVZ/KVM/LXC virtual ethernet devices into; you can repeat ad nauseam for other vlans.
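
For example, attaching a VE's host-side veth device to that bridge by hand (the veth name here is hypothetical - use whatever your virtualisation tooling actually creates) is just:

# brctl addif vlan101 veth101.0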

There's an implicit race in the interfaces config above, in that br1.101 needs to exist before vlan101 can add it as a port. At the loss of some readability, a more robust approach is probably:

auto vlan101
iface vlan101 inet manual
 bridge_ports br1.101
 bridge_stp off
 bridge_fd 0
 bridge_maxwait 0
 pre-up vconfig set_name_type DEV_PLUS_VID_NO_PAD
 pre-up vconfig add br1 101
 post-down vconfig rem br1.101

Last observations

  • the native vlan under linux can be a bit of a lottery - if you're making vlan sub interfaces of any interface (hardware interfaces, or bridges, or other vlans), then you should not be surprised if the base device stops working (ie mixing eth0 and eth0.101 and eth0.1234 can lead to unpredictable behavior, and eth0 and eth0.101 and eth0.101.2345 even more so).

  • it is a good idea to enable root-guard on the switch ports facing the servers, if the switch supports it - that will stop a server setting its bridge priority to zero and forcing an election.
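
For reference, in Cisco IOS syntax root guard is a per-interface setting - I believe the Dell CLI is much the same, but treat that as an assumption and check your switch's documentation (the interface name here is only an example):

interface GigabitEthernet0/1
 spanning-tree guard root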

Posted Wed Jun 22 16:58:54 2011 Tags:

Setting up Centos to netboot and mount a root file system over NFS proved to be a whole lot harder than I thought it was going to be. There are a gazillion pages on the net about setting up DHCP/PXE/TFTP, and a gazillion more about how to netboot random distribution installers, but I couldn't find a clear guide on how to actually run centos (or RHEL) diskless. Ubuntu works more or less out of the box diskless; I guess RHEL/Centos is a server OS, and servers have disks, or something.

Anyhoo, I wanted to netboot centos in order to use the Dell firmware repos to update the firmware on some Dell r410's. I only want to run Centos for a few minutes per server, and as I'm allergic to physical media and didn't want to rely on something server specific (like a DRAC), netbooting seemed like the go. Here's what I ended up doing:

My DHCP/PXE/NFS server runs debian squeeze, largely standard. I used rinse (the logical equivalent to debootstrap in the RPM world) to make an initial centos tree. Happily, rinse is a standard package in squeeze, so is just an aptitude away:

sudo aptitude install rinse
cd /exports
sudo rinse  --directory=centos5.6 --distribution=centos-5 --add-pkg-list extra-rpms

The extra-rpms file is an optional list of extra packages you want installed, mine looks like:

# cat extra-rpms
python-libs
nano

python-libs seemed to be needed to make rinse work properly, nano because it's personally familiar.

Rinse will haul in packages from your nearest Centos mirror, at the end of the process your new directory should have about 350MB of stuff in it. Chroot into the new tree, and make a few mods:

sudo chroot centos5.6
cat > /etc/fstab
  192.168.1.30:/exports/centos5.6 /        nfs     rw,tcp,nolock  0 0
  tmpfs                           /dev/shm tmpfs   defaults       0 0
  devpts                          /dev/pts devpts  gid=5,mode=620 0 0
  sysfs                           /sys     sysfs   defaults       0 0
  proc                            /proc    proc    defaults       0 0
^D

Make sure you substitute the IP of your NFS server, and the path to your newly created centos tree. You can also set root passwords, hostname, add users, and other things that don't start daemons (any daemons you start will end up littering the server you're running the chroot on).
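
One thing glossed over here is the NFS export itself - on the Debian box, a line along these lines in /etc/exports (the options are my guess at something sensible, tune to taste), followed by re-exporting, should cover it:

/exports/centos5.6  192.168.1.0/24(rw,no_root_squash,async,no_subtree_check)

sudo exportfs -ra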

We then use yum to install the kernel package...

yum -y install kernel

... and generate a new initrd with support for NFS. Make sure you add/substitute any other ethernet modules you need for your hardware in the preload argument below. mkinitrd requires /etc/modprobe.conf to exist - it hangs if it doesn't:

touch /etc/modprobe.conf

/sbin/mkinitrd -v -f --omit-scsi-modules --omit-raid-modules \
--omit-lvm-modules --without-usb --without-multipath --without-dmraid \
--preload="tg3 e100 bnx2 e1000 nfs" --net-dev=eth0 --rootfs=nfs  \
/tmp/test.img 2.6.18-238.12.1.el5

Leave your chroot, and copy the kernel and new initrd into your pxe tftp tree. The initrd is not world readable by default, so make sure you chmod it so that the tftp server can get at it.

exit
sudo cp /nfs/centos5/boot/vmlinuz-2.6.18-238.12.1.el5 /tftpboot/pxe/centos-5.6/vmlinuz
sudo cp /nfs/centos5/tmp/test.img /tftpboot/pxe/centos-5.6/initrd.img
sudo chmod 644 /tftpboot/pxe/centos-5.6/initrd.img

If you regenerate your initrd multiple times, you don't have to copy the kernel each time, only the initrd - you only need to copy the kernel if it gets upgraded.

Add something like this to your pxe config:

LABEL CentOS 5.6 BIOS updater
MENU LABEL CentOS 5.6 BIOS updater
kernel centos-5.6/vmlinuz
APPEND vga=normal initrd=centos-5.6/initrd.img ramdisk_size=10000 ip=dhcp

Net boot a PC into the new config, and make whatever changes you need. I tend to install ssh because the KVM in the Dell DRACs is annoying:

yum -y install openssh-server

To use the Dell firmware tools, you'll need a few dependencies:

yum -y install which compat-libstdc++-33 nano wget gpg perl    

then follow the instructions on http://linux.dell.com/repo/hardware/latest/

wget -q -O - http://linux.dell.com/repo/hardware/latest/bootstrap.cgi | bash
yum install dell_ft_install
yum install $(bootstrap_firmware)

And you're done - on any new machine that needs to be firmware updated, boot the nfsroot centos, then run whatever combination of these is appropriate:

yum update
yum install $(bootstrap_firmware)
inventory_firmware
update_firmware
update_firmware --yes
Posted Fri Jun 17 10:09:49 2011 Tags:

The flag day discussed in WIX VLAN Migration went off smoothly, at least, as far as I know nobody grizzled.

There are now 101 devices in the WIX vlan, and ~25 left in the mud vlan. Traffic across the bridge isn't quite as low as I'd like, I'm expecting (praying) it'll drop as FX, NZWireless and Knossos move across.

If we ignore the route servers and the Citylink NMS, then the remaining devices in the mud vlan can be divided into two sets of customers - those working on migrating, and those Citylink hasn't talked to yet, who therefore don't know they need to migrate.

First up, the easy ones. This first list are all in progress with each provider doing their thing, they have the Citylink resources they need - there's nothing particularly for Citylink to do other than pester them to stick with it.

ISP IP MAC Comments
fx3 202.7.1.175 0:21:a0:56:2c:19
fx4 202.7.1.176 b4:14:89:8:12:20
nzwireless 202.7.0.59 0:12:1e:7b:3e:14
nzwireless2 202.7.1.69 0:18:b9:a6:68:41
knossos 202.7.0.76 0:c:76:7e:92:80 waiting on fx3
knossos-duxton 202.7.0.102 0:d:b9:1a:16:d4 "
parliament1 202.7.0.86 0:1f:9e:fd:58:c0 waiting on internal tidyups
parliament2 202.7.0.142 0:23:5e:fe:33:e0 "
datalight 202.7.0.211 0:24:dc:12:e1:1 "

The more interesting list is the list of customers Citylink hasn't yet got to:

ISP IP MAC Comments
tc2 202.7.0.77 0:90:1a:40:21:3a
telstraclear 202.7.0.70 0:90:1a:40:3a:7d
tsb1lh 202.7.0.65 0:8:20:4c:1c:1a also 202.7.0.66
telstraclear2 202.7.0.73 0:90:1a:9f:f9:7c
clear-ba1-atm1 202.7.0.192 0:d0:bb:a8:d8:c0
xtra 202.7.1.241 0:d0:97:5:64:0
asnet1 202.7.0.243 0:9:f:67:1d:f7
asnet2 202.7.1.239 0:60:d1:4f:ee:f8
businessonline 202.7.0.212 0:f:35:2c:61:56 Vector?
juniper 202.7.0.94 0:5:85:ca:d0:c0
3months 202.7.0.58 0:14:a9:73:cb:d0
natlib1 202.7.1.193 0:1b:54:e1:eb:10
inz1 202.7.0.187 0:23:33:ed:5d:8a
ssc 202.7.0.67 0:25:45:f3:50:e1

If you ignore the first five TelstraClear devices, which are something of a special case, the rest all have much the same issue, which has prevented them being migrated into the WIX vlan. They are all responding to at least two IP addresses on the interface they have attached to the MUD vlan - generally a WIX address and some kind of transit activity on the same Citylink interface. The bridge between the two vlans drops non-WIX ARP, which means that their transit activities wouldn't work so well if they were dropped into the WIX vlan.

This is inferred from a close examination of arpwatch records, some pinging, and a fair amount of ancient memory, none of which are particularly accurate. So there's a likelihood that some of the second list above don't use WIX, or don't do transit over Citylink, or both, and haven't got around to tidying up their interfaces.

I'm likely signing off from the WIX migration now - it'll fall to somebody else within Citylink to finish off, which may mean that it gets pursued with enthusiasm, or may not get pursued at all, depending on how they feel about having vlans bridged together long term.

So if a router you control is in the first list above, keep at it and get migrated ASAP - your country needs you!

If a router you control is in the second list above, and you aren't using WIX, or you aren't using a transit block, then remove the unused IP's from your router interface, let Citylink know and they'll stuff you into the appropriate vlan.

If a router you control is in the second list above, and you are still doing both peering and transit on different IP blocks, and you don't like the idea of being a second class citizen on WIX, then you need another connection to Citylink (and another interface on your kit). I'd encourage anybody in that position to get in touch with Citylink and get the provisioning process moving. It'll go a lot better if you, an actual customer, talk to them, rather than waiting for them to get around to chatting with you.

Thanks for your time!

Posted Tue Jun 14 21:59:49 2011 Tags:

I'm doing some contract work for Citylink, working on the migration of WIX peers into a vlan dedicated to WIX. At the moment, there are about 130 individual devices responding to ARP in the WIX IP range (202.7.0.0/23). Of those, 50 are already in the new WIX vlan, about 80 remain in the "mud" vlan.

In order to kick the process along a bit, at 6am on Jun 13 (the flag day), I'll be moving the majority of WIX users remaining in the mud vlan into the WIX vlan.

Of the outstanding devices:

  • 20 do modest amounts of traffic, they'll be moved into the new VLAN over the next fortnight. 15 done as of 8/6/2011

  • 26 do more significant amounts of traffic, they'll be moved in one block on the flag day. The 26 includes most of the major traffic generators on WIX - Trademe, Harmony, DTS, Xtreme, Actrix, Vodafone, ICONZ, CatalystIT/Stuff, Datacom, Revera, Orcon, VUW.

  • According to arpwatch, ~30 appear to use more than one IP address in the mud vlan (ie, they're doing transit on the same interface), or are on a port with multiple MAC addresses. Unless these peers make changes they'll remain in the mud vlan, as moving them to the WIX vlan will break whatever other transit stuff they're doing. The full list of devices is below, most notable in it are 6 TelstraClear devices, 2 each for FX and NZWireless, and a few other smaller ISP's (ACS/Linuxnet, Knossos, InspireNET, Datalight, Parliament).

Most of the multi-homed devices will need new WIX ports provisioned, and they'll need to make config changes to split their WIX/non-WIX traffic, so while I'm not counting on any of them being ready to move by the flag day, I'm hoping some will. They'll continue to peer as before, they'll see a 1-2ms increase in latency to peers in the WIX vlan only.

The rationale for the flag day is pretty straightforward - between the new vlan and the old vlan there is a linux box acting as a filtering bridge, allowing WIX v4/v6 traffic and blocking everything else. That box has only a finite capacity, by moving the majority of the high bandwidth peers at (more or less) the same time, we avoid the possibility of DOSing the filtering PC. I know it can sustain ~500Mb/s, so while I don't believe there's going to be a capacity problem, I'd rather not run the risk.

There are also some devices for whom I've no idea what they're actually doing with the WIX addresses. They don't peer with the route servers, so I'm unclear if they make any actual use of WIX or if the numbers are just residual config left over from a previous age. Once the flag day has passed, and the majority of peers are "across the bridge", then I'll be able to do packet captures on the bridge to see if some of these mystery devices are actually doing anything on WIX.

For those peers remaining in the mud vlan, this box represents a new SPOF for their WIX peering - if it dies, WIX will fail for those peers. So there are modest incentives for both Citylink and the peers concerned to migrate in a timely fashion.

While the two vlans are connected via the linux bridge, it is absolutely crucial that those splitting their services don't attach the same MAC address to both the WIX vlan and the mud vlan - things will go badly wrong for you if you do. The ultimate aim is to remove the linux bridge - once that is done, it'll be OK to attach the same MAC into both vlans.

Devices remaining in the mud vlan:

ISP IP MAC Comments
tc2 202.7.0.77 0:90:1a:40:21:3a
telstraclear 202.7.0.70 0:90:1a:40:3a:7d
tsc8lh 202.7.0.66 0:8:20:4c:1c:1a
tsb1lh 202.7.0.65 0:8:20:4c:1c:1a
telstraclear2 202.7.0.73 0:90:1a:9f:f9:7c
clear-ba1-atm1 202.7.0.192 0:d0:bb:a8:d8:c0
xtra 202.7.1.241 0:d0:97:5:64:0
knossos 202.7.0.76 0:c:76:7e:92:80
nzwireless 202.7.0.59 0:12:1e:7b:3e:14
nzwireless2 202.7.1.69 0:18:b9:a6:68:41
linuxnet/acs 202.7.0.245 0:12:44:ab:34:1b
inspire1 202.7.0.123 0:1b:c0:53:14:81
parliament1 202.7.0.86 0:1f:9e:fd:58:c0
parliament2 202.7.0.142 0:23:5e:fe:33:e0
fx3 202.7.1.175 0:21:a0:56:2c:19
fx4 202.7.1.176 0:23:4:aa:80:19
datalight 202.7.0.211 0:24:dc:12:e1:1
juniper 202.7.0.94 0:5:85:ca:d0:c0
asnet1 202.7.0.243 0:9:f:67:1d:f7
asnet2 202.7.1.239 0:60:d1:4f:ee:f8
businessonline 202.7.0.212 0:f:35:2c:61:56 Vector?
intergen1 202.7.0.121 0:12:44:ab:30:1b
tepapa 202.7.0.95 0:13:21:c9:37:2
3months 202.7.0.58 0:14:a9:73:cb:d0
natlib1 202.7.1.193 0:1b:54:e1:eb:10
inz1 202.7.0.187 0:23:33:ed:5d:8a
ssc 202.7.0.67 0:25:45:f3:50:e1
Posted Wed Jun 8 20:41:17 2011 Tags:

Back in 2008, Nate wrote a blog post about migration service providers on Citylink into their own vlans. I've included bits of his post below, as the original Citylink blog server appears MIA. Si, May 2010


Background Information

Citylink's Metro Ethernet network, commonly referred to as "PublicLAN", has traditionally comprised a single layer-2 VLAN running across a plethora of 10/100/1000Mbps Ethernet media converters, hubs and switches. The traditional Citylink model has seen both Service Providers (ISPs and other SPs) and customers attach directly to an untagged access port in the main PublicLAN VLAN. Customers and Service Providers attach with a layer-3 device (router/firewall) and present a single Ethernet MAC address to the Citylink switch/hub. Service Providers typically allocate IP address(es) to their customers which they then use to deliver services to them across the shared PublicLAN switch fabric.

The traditional PublicLAN architecture has proven to be simple and effective. Once customers have leased a Citylink PublicLAN access circuit, they are free to acquire services from as many of the multitude of Service Providers attached to the network as they desire. They can also opt to participate at the local Internet eXchange (IX) at no additional cost, thereby benefiting from the local exchange of routes with other Citylink users, optimising performance to local content and reducing transit fees.

Why the migration towards Service Provider VLANs?

The need to migrate PublicLAN from a single VLAN architecture to a multiple Service Provider VLAN architecture has come about in part due to requests from Service Providers to provide such a service along with enhanced service options, combined with the need to increase the reliability of the access network (particularly the IX fabric) and address some scalability issues.

Several Service Providers have been reluctant to provide transit services to customers via the same interface as their peering relationships. Some would prefer all their customers to appear in a separate VLAN from other Citylink users, and some have expressed an interest for each of their customers to appear as a separate logical (VLAN) interface on their router. Such a service has not been practical to date with the single VLAN architecture and existing hardware limitations. However, our use of Linux-based VLAN Termination Units (VTUs) for SPVLAN delivery gives us the ability to transport both dot1q tagged and untagged frames transparently, providing an emulated "Q-in-Q" service. Since then, Citylink now primarily does Q-in-Q natively on Cisco ME switches. Si, May 2010

Different PublicLAN services place different requirements upon the underlying network that may warrant additional security stances which have been difficult to enforce within the single VLAN architecture. Connections to the Wellington Internet eXchange, for example, often form a vital component of an organisation's Internet connectivity/operations, yet they have been open to compromise from other Citylink customers using poor quality and/or badly-configured router/firewall devices that can potentially interfere with other users on the network. Separating the WIX participants out into a dedicated peering VLAN allows Citylink engineering staff to assert and enforce a much stronger security stance that helps to reduce the chance of one user's activities interfering with another's.

Likewise, moving customers of a given Service Provider into a dedicated Service Provider VLAN has the benefit of reducing the possibility of one Service Provider's customer impacting the service of another Service Provider's customer(s). Such an environment is attractive to Service Providers who are now also able to run an extended range of protocols for customer delivery to suit their needs, which may otherwise be inappropriate for use on a shared VLAN such as the WIX.

The other important motivator for migrating PublicLAN from a single VLAN architecture to a VLAN-per-Service-Provider architecture is to address the scalability issues inherent in a large layer-2 domain. As the number of customers attached to PublicLAN has increased over time, so too has the amount of broadcast and flooded unicast traffic. These packets are duplicated to every access port in the network and are sent to every PublicLAN customer. As this noise floor has increased, so too has the amount of bandwidth consumed on inter-switch links and customer access circuits. For customers still using our lower bandwidth services such as Connect4, the level of such "noise" (meaning traffic not sourced by, or destined for, them) can noticeably impact the performance of their connection. Reducing the number of hosts attached to any given layer-2 domain reduces the noise floor for all customers in the VLAN.

There is, however, one disadvantage to the VLAN-per-Service-Provider architecture: customers will now need to advise Citylink if they wish to change Service Providers so that we can arrange to migrate their access connection into the relevant SPVLAN.

FAQs:

  • Will you be migrating everyone out of PublicLAN and into the Service Provider VLANs all at once?

    No. The migration from a single, multiple-provider VLAN to a VLAN-per-Service-Provider architecture requires significant hardware and software (configuration) changes for us which we will be rolling out gradually on the network. For existing Service Providers, a new SPVLAN will be created for them and then temporarily bridged to PublicLAN while individual customer access connections are migrated across. New Service Providers will have an SPVLAN configured on the network and their customers will be added to it right from the outset.

  • Will I experience an outage while I'm being migrated into my Service Provider's VLAN?

    In most cases, no. The use of a temporary bridge allows us to have your Service Provider's VLAN configured alongside PublicLAN, so that you can continue to access your SP regardless of which VLAN you are connecting to. For customers of a single Service Provider, the change from PublicLAN to a SPVLAN can happen transparently. Otherwise, if you're a customer of multiple Service Providers (or an ExchangeNET participant) your connection will need to be split across multiple Citylink ports. This will require changes to the way in which you attach to Citylink. We'll be in contact with you to assist if we believe this to be the case.

  • But if I'm going to need additional Citylink ports, how much is that going to cost?

    Since Citylink is driving these architectural changes, existing customers using multiple Service Providers (at the time of migration) will not face any additional charges. Citylink will provision the additional switchports required (including the installation of additional building cabling where necessary) at our expense. Note, this was 2008 - best you check with Citylink on whether this is still the case! Si, May 2010

  • What if I want to change to another Service Provider on Citylink?

    As time goes on, more and more Service Providers on the network will be providing services across their own SPVLAN rather than PublicLAN. Chances are that your existing SP and your new SP will be in separate VLANs and therefore your access connection will need to be moved into the new SPVLAN. The safest option would be to assume that we have already completed our migration to a VLAN-per-Service-Provider architecture, and advise us of your plans in advance so that we can schedule a time to make the necessary changes to your access connection. You can do this by filling out the online "Change of ISP" form at http://www.citylink.co.nz/services/forms/change-isp.html. If you have any special requirements (such as the requirement for period of "dual-service" to both Service Providers) then please contact your Citylink Account Manager.

Posted Mon Jun 16 00:00:00 2008 Tags:

Another archival post from the Citylink blog


For those who've been under a rock, Kiwicon hit town over the weekend. Citylink sponsored Cafenet 'net access for attendees, and with talk titles like "Busting Carrier Ethernet Networks" it seemed prudent to toddle along and find out what they were up to.

Having never attended a security Con before, I don't really have a comparable benchmark, but I have to say I was well impressed. 200 people through the door, and some talks that made me go "uuuhh?" - can't ask for more than that.

The organisers commented on how surprised they were at the breadth and quality of security investigation going on in NZ, and I certainly had no idea there were so many people so active on so many interesting things.

I didn't get to see all the talks, and domestic and work commitments prevented me from assisting with the assault on Mt Bartab. Of what I did see, here are the things that caught my attention:

  • Peter Gutmann talking about the Psychology of Insecurity was thought provoking - I particularly liked "if user education worked, it would have worked by now". His talk was the first of a fairly consistent meme of the conference, which is that security is increasingly not a network problem - the network guys have been harping on about using crypto and strong firewalling and whatnot for years, now everybody does it, so now the network guys go "ack, it's encrypted, sorry, you're on your own there". There was very little discussion of circumventing firewalls at Kiwicon - it was all about vulnerabilities further up the stack, or in the users.

  • Graham Neilson trojanning blackberries was great, not least of which because I immediately thought "fantastic, finally, a way to back up my Blackberry from Linux". I need to track him down and get a copy of redberry.

  • The aussie guy talking about trojaned hypervisors was just disheartening, particularly as he was quoting bits from the Intel spec where they discuss how they're trying to make it as hard as possible for a guest OS to determine if it is running under a hypervisor. He painted a future where it would be impossible to tell if your OS is running under a trojaned hypervisor.

  • The chap from google who wrote "dark elevator" made me feel positive about the future of security, in a perverse kind of way. It's a simple tool that doesn't know anything about any particular exploit, it just fossicks about inside windows looking for insecure files that might be run on startup, or by an admin, and if it finds one, eventually makes you an Admin user. It doesn't work on freshly installed boxes, but on anything that has a reasonable amount of the usual third party stuff installed (that you need to get anything done), it works pretty much all the time. So this wasn't doing anything fancy, it wasn't even bruteforcing, it just made it really easy to test for vulnerabilities. You have to hope that tools like these will make the default security stance of machines improve.

  • Metlstorm's talk about carrier ethernet security was a little bit of a let down, in that he'd been muzzled by the telcos. So he talked a little about the usual layer 2 attack vectors (CAM spoofing, CAM overflow, STP/802.1q abuse), none of which are particularly new, nor particularly difficult to prevent with the correct switch settings. He then went on to talk about a hypothetical telco which used "vlan private edge" (that's cisco's term - insert your vendor's logical equivalent) to provide security separation between users in the same VLAN. That any telco would do that beggared belief - it's such a stupid idea, it hadn't occurred to me that any telco in NZ would base a secure product on it. He didn't have any rinkydink new attack vectors, which in one sense was a relief, and in another a little bit of a let down.

So, something for everyone, and in several talks I sat there thinking "damn, so-and-so should have been here to hear this". Particularly, every programmer in Wellington should have been there. I came back from the con knowing more, and realising I know even less, which marked it as time well spent.

Cafenet didn't explode in a heap, but that's not altogether surprising - the nature of the Con means that lots of people didn't get their laptop out and run the risk of it getting 0wned. So unlike NZNOG, where when you present you see nothing but eyeballs above the serried ranks of notebook screens, at Kiwicon most attendees sat there with pen and paper and paid attention. Which was good.

The usage patterns for Cafenet over the weekend aren't markedly different from normal - they're so uninteresting, I'm not even going to post a graph. It's a shame that Dane and I didn't think of running a traffic analyser on Cafenet until halfway through the last presentation, we could have done a presentation on all the stuff folks got up to.

The venue (Rutherford House) was great - good aircon, excellent acoustics, reasonably comfy seats. The lack of power points was an issue (what was up with the guy at the back who played WoW all weekend - wouldn't it have been cheaper to go to an internet cafe?). All in all, I was well impressed - a terrific organisational job, and I'll definitely be back for the next one, whenever it may be. Well worth $50.

Posted Sun Nov 18 00:00:00 2007 Tags:

Another archival post from the Citylink blog


Loopback Saturday revisited

Since the earlier writeup about the big fault of Oct 6th, I've received questions from various folks, and we've gained further data from the network that has caused us to revisit the conclusions of the first writeup. Most significantly, we think the network fault was caused by hardware failure, rather than MAC table overflow.

Last week, we had a switch in the Majestic Centre lock up - it wasn't pingable or manageable, and the customers in that building were disconnected from the network. This is a relatively uncommon occurrence - we've had very few real hardware failures with Cisco gear (fan and PS failures), and the IOS loads for the lowend L2 Cisco switches are generally bombproof - not surprising, since most of the magic happens in hardware. So other than the early 1548's that were so unreliable they got sent back, we haven't had more than half a dozen actual Cisco switch lockups in the last 10 years.

So, when the switch in the Majestic Centre crashed during the main outage, and went offline again 20 days later, that piqued our interest. The switch was showing errors much like these when we got onto the console:

SCHAN ERROR INTR: unit=0 SRC=6 DST=5 OPCODE=20 ERRCODE=5

This was at 2am on Oct 26, so after a little Googling suggested that this was a hardware error, we swapped the 2950 for a 2960, and the on-call guys went back to bed. Following up, we found a cisco bug report that says:

Under certain level of traffic load, the switch will start logging the following messages on the console:

SCHAN ERROR INTR: SRC=6 DST=5 OPCODE=20 ERRCODE=5

And after a few seconds, the switch will stop passing any traffic.

In some cases, the switch seemed still forwarding broadcast and multicast traffic, which will cause STP problem if the switch has redundant link and is not supposed to be the root for the VLAN, as both port will go forwarding.

Two units were returned by CISCO. The units were re-screened to the latest test program, and failed the SDRAM memory test.

Customer should RMA unit back to Cisco.

During the Oct 6th outage, the Majestic Centre 2950 was on a ring, and should have been blocking on some ports. After that outage, we singlehomed all the 2950's that were multihomed, including the Majestic Centre switch, so when it failed on Oct 26, it wasn't in a position to close a loop.

So, this causes us to reconsider some of our conclusions about what caused Oct 6th. While still possible, the "something injected lots of MAC addresses" hypothesis is no longer our strongest candidate root cause for the spanning-tree instability - it now seems more likely that a hardware fault kicked off the stability problems on the network.

In terms of preventing the problem happening again, our plan hasn't changed - we've implemented all the changes discussed in the earlier post, except for the MAC filtering, which should go live shortly (there is a fair amount of latitude for it to go wrong, so we are being careful).

Various questions asked over the last few weeks:

  • "did you question everybody connected to the last half dozen switches to see if any of them were doing anything random"?

    Yes. I didn't speak to all of them personally, but we got around most of them in the following few days. I've no reason to believe any of them was doing anything out of the ordinary; I prefer the "we pulled it all apart, and then put it back together, and it worked" explanation.

  • "what do you mean by 'out of network' management"

    That wasn't my first choice of phrase - I originally wrote "out of band", but we changed it for reasons that I now don't recall.

    Currently, we manage the network from an administration VLAN within the network itself. We're careful about it, we follow the Cisco BCP's (ie, don't use VLAN 1, access lists for logins/tacacs/logging, we don't allow the management VLAN to be expressed on any ports in the field, other than the interswitch trunk links), and we monitor ARP/MAC noise inside the VLAN. This has worked as our primary mgmt method for many years, so we've never bothered to implement anything "ex network".

    At the moment, we're tossing up what the best way of ensuring emergency access to switches may be - we're looking at various mobile 3G/RT based services, and are considering ADSL/dialup, but it's looking like the best option may be to construct a second ethernet on other fibre. With single-fibre SFPs getting cheaper over the last few months, it may be that we convert our existing dual fibre circuits into a pair of single fibre circuits, and run an independent ethernet for management.

  • Citylink eventually built an out of band network to the core nodes using CWDM wavelengths and separate switches at the core nodes, with serial console terminal servers. It proved quite useful, particularly as 7609's seem to regularly forget their VTP config when rebooted.

  • How did you get this past "legal"? (a question mainly from folks outside NZ).

    New Zealand isn't a particularly litigious culture - Citylink doesn't have a legal team, we contract legal services in on the odd occasions we need them (which isn't that often). It never occurred to me that what I was writing might materially affect Citylink's legal position, and as far as I know it hasn't.

Posted Thu Nov 8 00:00:00 2007 Tags:

Another historical post from the Citylink blog


Loopback Saturday - the indepth discussion

As promised, here is a (more) indepth report on what happened Saturday last - much later than promised. I did plan to have it out earlier in the week, but we had another similar episode on Tuesday evening, which provided further data and forced various changes to the text.

Summary

On Saturday Oct 6, the Citylink ethernet suffered a city-wide failure, for 6-10 hours. On the evening of Tuesday Oct 16 (9:30-11pm) we had an issue similar in nature (although with significantly less impact). A discussion about why the network failed follows below, but before we get into that, we want to list some of the things we're doing to reduce the chance of this happening again:

  • maximum MAC address count enforcement on every customer facing port (sketched just after this list)
  • disabling keepalive (loopback) packets on interswitch links
  • development of an "out of network" management network
  • removing some of the diversity in the ethernet mesh
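
In Cisco terms, the MAC count enforcement is just port security on the customer-facing interfaces - something along these lines, with the maximum and the violation action chosen to suit (the values here are illustrative, not our production settings):

interface FastEthernet0/1
 switchport mode access
 switchport port-security
 switchport port-security maximum 10
 switchport port-security violation restrict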

In addition, several of our support systems weren't prepared for an outage of this scale, and as a result many customers could not contact us. We will be changing internal processes so that things work better in the future, including:

  • an offnet network status page that does not rely on Citylink network availability for reachability
  • improving our phone systems to allow for more concurrent calls, and automated status messaging
  • stronger internal escalation procedures
  • fixing various internal systems that have external DNS dependencies, so that they still work without connectivity, and routing NMS traffic around the spam filters so that the NMS doesn't DOS the mail server.

So, what actually happened?

At around 8:30am on Saturday Oct 6, many switches on the Citylink ethernet in Wellington started logging errors of this form:

9:07: %ETHCNTR-3-LOOP_BACK_DETECTED: Keepalive packet loop-back detected on Fast Ethernet0/24.
9:07: %PM-4-ERR_DISABLE: loopback error detected on Fa0/24, putting Fa0/24 in err-disable state
9:08: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/24, changed state to down
9:09: %LINK-3-UPDOWN: Interface FastEthernet0/24, changed state to down
9:50: %PM-4-ERR_RECOVER: Attempting to recover from loopback err-disable state on Fa0/24
9:55: %LINK-3-UPDOWN: Interface FastEthernet0/24, changed state to up
9:57: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/24, changed state to up
9:57: %ETHCNTR-3-LOOP_BACK_DETECTED: Keepalive packet loop-back detected on FastEthernet0/24.

Keepalive (loopback) packets are sent by Cisco ethernet switch interfaces every 10 seconds, with the source and destination MAC addresses in the packet set to the MAC address of the switch interface. If the interface then sees those packets coming back to it, the switch thinks a loop has occurred, and the port is err-disabled. If err-disable recovery is enabled, then some time later the switch brings the interface back up again, and the process starts over. Consistent with the above output, Citylink has err-disable recovery times set to 35-40 seconds.
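
(As an aside, disabling those keepalives on the interswitch links - one of the mitigations listed earlier - is a per-interface one-liner on IOS; the interface name below is just an example:)

interface GigabitEthernet0/1
 no keepalive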

The Citylink ethernet topology is somewhat different from the standard core-distribution-leaf that you'll find documented by many vendors - it is a set of interconnected meshes, rather than a well defined core. This makes it fairly robust when faced with fibre cuts or individual switch/building power/gear failure, but also means that we're highly dependent on spanning-tree to enforce a minimal tree of working links and prevent the physical loops in the fibre topology from becoming logical loops in the ethernet.

So, we had an ethernet loop - keepalive packets that should be dropped by the receiving switch are being forwarded on, and finding their way back to the sending switch. This is a bad thing. As part of normal operations Citylink staff regularly create fibre loops - you can't have redundancy without loops, and spanning-tree (particularly rapid spanning-tree) normally does a decent job of disabling redundant links without customer impact. Loops "ex Citylink" are not an entirely uncommon occurrence either - many customers have multiple connections around the city, and there are many alternate ways to tie those connections together - dark fibre, wireless, ethernet services from other carriers.

We'd normally expect a customer created loop perhaps once every 12-18 months, and while they make for an interesting couple of hours for the customer concerned, they're not normally particularly problematic - they certainly don't cause citywide outages.

Normally, when we get a customer loop, we track down the two edge ports involved and shut at least one of them down. This generally attracts the attention of the customer involved, and whatever they are doing to cause the loop gets resolved. This time, however, with the keepalive/err-disable recovery process on a repeating cycle of shutting down for 35 seconds and coming up for up to 10 seconds, it was difficult to reach many of the switches on the network. The further the switch is from the NMS's (in terms of switch hops), the more likely that there would be a link down somewhere between it and the NMS's, and consequently the amount of time it would be reachable from the NMS's would tend towards zero if the hop count was more than ~three. That does explain, though, why many customers had sporadic reachability through the outage (enough, for eg, to keep BGP up in lots of cases).

Essentially, we couldn't reach or manage a significant proportion of the ethernet, nor were we receiving traps back from it. At this point we were working from significantly incomplete information.

So how did we fix this?

The last time we had major loop problems, it was with the q-in-q links that we run over other carrier networks to reach points outside the Citylink fibre. All these links terminate in our node at AT/T House on Murphy St. In the absence of any other information, we posited that it could have been one of those links that was looping, so Citylink staff headed directly to Murphy St to start unplugging things.

This took somewhat longer than it should have, due to access issues - if you have after hours external access cards for AT/T House, you'd do well to test them and make sure they still work. By midday, we had disconnected most of the "looping suspects" from the network, which did not alter the behavior of the network in any material way.

At this point, clearly another approach was needed. We decided to binary chop our way to the cause of the problem, by breaking the network up into smaller ethernets and seeing which bits worked, and which didn't. This proved to be somewhat harder to achieve than you might expect - after some problems in Feb 2004 where some of the backup links between the northern and southern parts didn't work, we have provisioned a significant amount of redundancy through the centre of the network in the last three years. So to split the network into two halves, we had to drop nine separate fibre links between a dozen buildings.

By 2:30pm, we had the network south of a line through (approximately) VUW-Plimmer Towers-WCC-Te Papa working as a normal network, and the network north of that still looping frantically. This was great for the small amount of traffic that both sources and sinks into the south end of the network, but not much use to everyone else.

By about 4pm, everything in Thorndon (north of a line through Parliament/the Railway Station), including AT/T House was attached to the working south end of the network. This was a fairly important milestone to reach, as many of the major ISP's are connected in that part of town. It also got the Citylink website/mail server reachable, and the office phones back and working.

We continued adding sections of the network back, eventually getting to the point where all switches were reattached to the network except for half a dozen switches around the bottom end of The Terrace/Bowen St:

  • 33 Bowen St
  • 1 Bowen St
  • RBNZ
  • Treasury
  • Beehive
  • Met Service

If we brought up a link into that cluster of switches, the entire network failed within 20-40 seconds; if we took the link down again, the rest of the network (some 150 switches) came right within a few seconds. That suggested the problem was somewhere in that part of town. However, when we added each switch back in turn, we got all six attached, and every switch in the network was working (other than two elsewhere in the network that had wedged) by about 7pm.

Given the limited information we have available, we are unlikely to establish a specific root cause. That being the case, this week we've been focusing on

  • what happened in a general sense,
  • what we did wrong during the day, and
  • what we can do to the network to make it more robust in the future

One of the questions we've been asking ourselves is - why did this loop cause so much more trouble than previous loops? To answer that question, we need to understand what has changed to make loops so much more problematic.

In the last year, we've enabled rapid spanning-tree throughout the network. RPVST is great - it reduces the time to reconverge from 30-60 seconds to 0.5-2 seconds - but at the cost of increased CPU load on the switches involved. We've also enabled err-disable recovery on all boxes that support it. The err-disable recovery isn't enabled specifically for loopback recovery; it is there to remedy several other (far more common) errors. Here's a typical config snippet:

errdisable recovery cause udld
errdisable recovery cause bpduguard
errdisable recovery cause link-flap
errdisable recovery cause gbic-invalid
errdisable recovery cause loopback
errdisable recovery cause psecure-violation
errdisable recovery interval 43
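
The last line is also what drives the down/up cycle described earlier - an err-disabled port stays down until the recovery interval expires, comes back up, and, while the loop persists, gets err-disabled again within seconds of the next looped keepalive arriving.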

We have also spent some time understanding what circumstances cause keepalive packets to be forwarded by a switch. With the source and destination MAC in the packet both set to the MAC address of the sending interface, a keepalive shouldn't normally travel beyond the directly attached switch - it should simply drop it - yet during the outage we could see every keepalive packet being generated by every switch on the network.

So, taking all that together, we suspect that a significant number of MAC addresses were injected into the network, possibly in the Bowen St area. MAC table overflow will cause a switch to behave like a hub - flooding frames out all ports. That would cause keepalive frames to get forwarded around the network, when normally they would be dropped by the receiving switch. Once the loopback detection mechanisms started to kick in and links started flapping, the increasing CPU load caused the spanning-tree to become unstable, which caused more loops, more loopback link flaps, and a meltdown ensued.
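
As an aside, if you want to see how close your own kit is to that condition, IOS will report MAC table utilisation with something like "show mac address-table count" (the exact spelling varies a little between IOS versions) - a table sitting at or near its maximum is a warning sign that the switch is flooding frames it can no longer learn addresses for.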

Cisco note:

The problem occurs because the keepalive packet is looped back to the port that sent the keepalive. There is a loop in the network. Although disabling the keepalive will prevent the interface from being errdisabled, it will not remove the loop.

The problem is aggravated if there are a large number of Topology Change Notifications on the network. When a switch receives a BPDU with the Topology Change bit set, the switch will fast age the MAC Address table. When this happens, the number of flooded packets increases because the MAC Address table is empty. ... Keepalives are sent on ALL interfaces by default in 12.1EA based software. Starting in 12.2SE based releases, keepalives are NO longer sent by default on fiber and uplink interfaces.

http://www.cisco.com/cgi-bin/Support/Bugtool/onebug.pl?bugid=CSCea46385

That Cisco disables keepalives in newer code is instructive - reading between the lines, they seem to be acknowledging that running two largely independent loop detection/prevention systems (keepalives, and STP) is not optimal.

On Tuesday evening, we observed that boxes running 12.1 or earlier (in our case, mainly smaller 2950 switches) had significantly elevated CPU loads during the outage, whereas boxes running 12.2 (2960/2970/3550 models) didn't show anything like as much load. That is presumably mainly because the latter machines have more capable CPU's, but it may also have something to do with the fact that 2950's multihomed in a ring carry higher CPU loads than those single homed at the edge of the network.

The further a switch is from the root of the spanning-tree, the greater the chance that it will be the switch expected to disable a link in order to prevent a loop (if it's multihomed). As in many networks, our older 2950 switches have tended to migrate out to the edge of the network as more capable kit is deployed in the centre. If they are single homed and get confused about the state of the spanning-tree, there is no real problem, as all ports will be forwarding; but having a 2950 multihomed at the edge of the network appears to be a poor idea when CPU load goes up.

For other reasons (no large frame support, no optical interfaces), we have been removing our 2950 switches over the last six months, but it'll be some time before that process is complete. In the short term we are ensuring that none of the 2950's are multihomed.

If this is what happened (and it's only an informed guess at the moment), then we are pretty sure that the measures above (enforcing MAC count limits on every port, disabling keepalives on interswitch links, single homing all 2950's) will prevent the problem from recurring - it won't stop somebody looping Citylink, but it should dramatically reduce the impact if they do.

Almost all our gear now supports secure MAC address table ageing, which means that we will be able to enforce a maximum MAC count per port without each customer having to tell us what MAC's they're using, which is good - the alternative would significantly elevate admin overhead, which we are reluctant to do.
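
For the curious, the per-port settings involved look something like the snippet below. This is a sketch rather than our production config - the interface name and the numbers are placeholders - but it shows the shape of it: a cap on dynamically learnt MAC's, plus inactivity ageing so addresses fall out of the count by themselves.

interface FastEthernet0/1
 switchport mode access
 switchport port-security
 switchport port-security maximum 20
 switchport port-security violation restrict
 switchport port-security aging time 5
 switchport port-security aging type inactivity

The "violation restrict" line drops traffic from MAC's over the limit (and logs the event); using "violation shutdown" instead would err-disable the port, which is what the psecure-violation recovery cause in the earlier snippet is there to clean up after.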

As of Friday Oct 19, we have turned off the keepalive packets on all interswitch links, and converted the majority of multihomed 2950's to single homed 2950's. We have not yet implemented the MAC address restrictions - that will start to happen next week.
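
The keepalive change itself is a one-liner per interface - something like this (again, the interface name here is a placeholder):

interface GigabitEthernet0/1
 description interswitch link
 no keepalive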

If you made it this far, well done - thank you for your attention! I hope this has shed a little light on what happened. To finish off, I'll respond to a couple of themes that have popped up repeatedly over the last two weeks:

To the folks that have observed variations on "this is bound to happen with a straight L2 spanning-tree network, you should run MPLS|ATM|Token Ring|SDH|EAPS|something else", all that you say is undoubtedly true. All technology choices have a cost/risk tradeoff, though, and given that this is the first time Citylink has failed this completely in ~10 years, I'm personally relatively comfortable with the way the network works. Of course, if it blows up again in the same fashion shortly, I may well rapidly revise that view!

To the conspiracy theorists who want to know if Citylink was under intentional attack, the short answer is "I don't know". Blaming unknown blackhats is enormously tempting in all sorts of situations when you don't quite know what has gone on, no matter how implausible.

Citylink has always been a very open network that relies on everybody attached to it to play by the rules and show some common sense. In all the myriad ways people have found to DOS each other over Citylink in the last 10 years, be it proxy-arp, virii, worms, IP address duplication, BGP route hijinks, spanning-tree strangeness, L2 path problems or whatever, I've rarely had anything but genuine remorse from people when it's been pointed out that they've done something wrong. There have been many accusations of intentional attack but none ever proven, and that is the way I think an Internet exchange should be.

So while I can't categorically say that it wasn't intentional, I personally don't like blaming malice when there's so much scope for basic randomness.

Posted Fri Oct 19 00:00:00 2007