Another historical post from the Citylink blog


Loopback Saturday - the in-depth discussion

As promised, here is a (more) in-depth report on what happened Saturday last - much later than promised. I did plan to have it out earlier in the week, but we had another similar episode on Tuesday evening, which provided further data and forced various changes to the text.

Summary

On Saturday Oct 6, the Citylink ethernet suffered a city-wide failure lasting 6-10 hours. On the evening of Tuesday Oct 16 (9:30-11pm) we had an issue similar in nature, although with significantly less impact. A discussion of why the network failed follows below, but before we get into that, we want to list some of the things we're doing to reduce the chance of this happening again:

  • maximum MAC address count enforcement on every customer facing port
  • disabling keepalive (loopback) packets on interswitch links
  • development of an "out of network" management network
  • removing some of the diversity in the ethernet mesh

In addition, several of our support systems weren't prepared for an outage of this scale, and as a result many customers could not contact us. We will be changing internal processes so that things work better in the future, including:

  • an offnet network status page that does not rely on Citylink network availability for reachability
  • improving our phone systems to allow for more concurrent calls, and automated status messaging
  • stronger internal escalation procedures
  • fixing various internal systems that have external DNS dependencies, so that they still work without connectivity, and routing NMS traffic around the spam filters so that the NMS doesn't DOS the mail server.

So, what actually happened?

At around 8:30am on Saturday Oct 6, many switches on the Citylink ethernet in Wellington started logging errors of this form:

9:07: %ETHCNTR-3-LOOP_BACK_DETECTED: Keepalive packet loop-back detected on Fast Ethernet0/24.
9:07: %PM-4-ERR_DISABLE: loopback error detected on Fa0/24, putting Fa0/24 in err-disable state
9:08: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/24, changed state to down
9:09: %LINK-3-UPDOWN: Interface FastEthernet0/24, changed state to down
9:50: %PM-4-ERR_RECOVER: Attempting to recover from loopback err-disable state on Fa0/24
9:55: %LINK-3-UPDOWN: Interface FastEthernet0/24, changed state to up
9:57: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/24, changed state to up
9:57: %ETHCNTR-3-LOOP_BACK_DETECTED: Keepalive packet loop-back detected on FastEthernet0/24.

Keepalive (loopback) packets are sent by Cisco ethernet switch interfaces every 10 seconds, with the source and destination MAC addresses in the packet set to the MAC address of the switch interface. If the interface then sees those packets coming back to it, the switch thinks a loop has occurred, and the port is err-disabled. If err-disable recovery is enabled, then some time later the switch brings the interface back up again, and the process starts over. Consistent with the above output, Citylink has err-disable recovery times set to 35-40 seconds.
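
If you want to poke at this mechanism on your own Catalyst kit, the usual commands for seeing which ports have been err-disabled, and what will be automatically recovered, are below (a sketch only - exact output varies between IOS versions):

! list ports currently in the err-disable state, and the reason for each
show interfaces status err-disabled
! list which error causes will be automatically recovered, and the recovery timer
show errdisable recovery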

The Citylink ethernet topology is somewhat different from the standard core-distribution-leaf design that you'll find documented by many vendors - it is a set of interconnected meshes, rather than a well defined core. This makes it fairly robust in the face of fibre cuts or the failure of an individual switch, building power supply or piece of gear, but it also means that we're highly dependent on spanning-tree to enforce a minimal tree of working links and prevent the physical loops in the fibre topology from becoming logical loops in the ethernet.
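
For readers unfamiliar with it, the spanning-tree side of that looks roughly like the config below - a minimal sketch, not our production configuration (the VLAN number and priority values are illustrative):

! run rapid spanning-tree (802.1w) rather than classic 802.1D
spanning-tree mode rapid-pvst
! on the switch that should be the root of the tree, force a low (better) priority
spanning-tree vlan 1 priority 4096
! on edge switches, leave the priority high so they never become root
spanning-tree vlan 1 priority 61440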

So, we had an ethernet loop - keepalive packets that should have been dropped by the receiving switch were instead being forwarded on, and finding their way back to the sending switch. This is a bad thing. As part of normal operations Citylink staff regularly create fibre loops - you can't have redundancy without loops, and spanning-tree (particularly rapid spanning-tree) normally does a decent job of disabling redundant links without customer impact. Loops "ex Citylink" are not an entirely uncommon occurrence either - many customers have multiple connections around the city, and there are many alternate ways to tie those connections together: dark fibre, wireless, ethernet services from other carriers.

We'd normally expect a customer-created loop perhaps once every 12-18 months, and while they make for an interesting couple of hours for the customer concerned, they're not normally particularly problematic - they certainly don't cause citywide outages.

Normally, when we get a customer loop, we track down the two edge ports involved and shut at least one of them down. This generally attracts the attention of the customer involved, and whatever they are doing to cause the loop gets resolved. This time, however, with the keepalive/err-disable recovery process on a repeating cycle of shutting down for 35 seconds and coming up for up to 10 seconds, it was difficult to reach many of the switches on the network. The further a switch is from the NMSs (in terms of switch hops), the more likely it was that a link would be down somewhere between it and the NMSs, and consequently the amount of time it was reachable from the NMSs tended towards zero once the hop count passed about three. That does explain, though, why many customers had sporadic reachability through the outage (enough, for example, to keep BGP up in lots of cases).
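
For the curious, tracking a loop down to an edge port mostly consists of following the looping MAC address from switch to switch, then shutting the port it was learned on - something like the following (the MAC address and interface name are hypothetical, and the exact command syntax varies with IOS version):

! find which port the offending MAC address is being learned on
show mac-address-table address 0000.0c12.3456
! repeat on the next switch along that path until you reach the edge port, then:
interface FastEthernet0/13
 shutdown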

Essentially, we couldn't reach or manage a significant proportion of the ethernet, nor were we receiving traps back from it. At this point we were working from significantly incomplete information.

So how did we fix this?

The last time we had major loop problems, it was with the q-in-q links that we run over other carrier networks to reach points outside the Citylink fibre. All these links terminate in our node at AT/T House on Murphy St. In the absence of any other information, we posited that it could have been one of those links that was looping, so Citylink staff headed directly to Murphy St to start unplugging things.

This took somewhat longer than it should have, due to access issues - if you have after hours external access cards for AT/T House, you'd do well to test them and make sure they still work. By midday, we had disconnected most of the "looping suspects" from the network, which did not alter the behaviour of the network in any material way.

At this point, clearly another approach was needed. We decided to binary chop our way to the cause of the problem, by breaking the network up into smaller ethernets and seeing which bits worked and which didn't. This proved to be somewhat harder to achieve than you might expect - after some problems in Feb 2004, where some of the backup links between the northern and southern parts didn't work, we have provisioned a significant amount of redundancy through the centre of the network over the last three years. So to split the network into two halves, we had to drop nine separate fibre links between a dozen buildings.

By 2:30pm, we had the network south of a line through (approximately) VUW-Plimmer Towers-WCC-Te Papa working as a normal network, and the network north of that still looping frantically. This was great for the small amount of traffic that both originates and terminates in the south end of the network, but not much use to everyone else.

By about 4pm, everything in Thorndon (north of a line through Parliament/the Railway Station), including AT/T House, was attached to the working south end of the network. This was a fairly important milestone to reach, as many of the major ISPs are connected in that part of town. It also got the Citylink website/mail server reachable, and the office phones back and working.

We continued adding sections of the network back, eventually getting to the point where all switches were reattached to the network except for half a dozen switches around the bottom end of The Terrace/Bowen St:

  • 33 Bowen St
  • 1 Bowen St
  • RBNZ
  • Treasury
  • Beehive
  • Met Service

If we brought up a link into that cluster of switches, the entire network failed within 20-40 seconds; if we took the link down again, the rest of the network (some 150 switches) came right within a few seconds. That suggested the problem was somewhere in that part of town. However, when we added each switch in turn, we got all six attached, and every switch in the network was working (other than two elsewhere in the network that had wedged) by about 7pm.

Given the limited information we have available, we are unlikely to establish a specific root cause. That being the case, this week we've been focusing on:

  • what happened in a general sense,
  • what we did wrong during the day, and
  • what we can do to the network to make it more robust in the future

One of the questions we've been asking ourselves is - why did this loop cause so much more trouble than previous loops? To answer that question, we need to understand what has changed to make loops so much more problematic.

In the last year, we've enabled rapid spanning-tree throughout the network. RPVST is great - it reduces the time to reconverge from 30-60 seconds to 0.5-2 seconds - but at the cost of increased CPU load on the switches involved. We've also enabled err-disable recovery on all boxes that support it. The err-disable recovery isn't enabled specifically for loopback errors - it's the several other (far more common) error causes that we enable it for. Here's a typical config snippet:

errdisable recovery cause udld
errdisable recovery cause bpduguard
errdisable recovery cause link-flap
errdisable recovery cause gbic-invalid
errdisable recovery cause loopback
errdisable recovery cause psecure-violation
errdisable recovery interval 43

We have also spent some time understanding what circumstances cause keepalive packets to be forwarded by a switch. With the source and destination MAC in the packet set to the MAC address of the sending interface, a keepalive normally shouldn't travel any further than the first switch that receives it - the destination MAC is learned on the port the packet arrived on, and a switch will never forward a frame back out the port it came in on, so the packet is dropped. If, however, the receiving switch is flooding frames because its MAC address table is full (or is constantly being flushed), the keepalive gets flooded out every other port, and switches all over the network can end up seeing every keepalive packet being generated by every switch on the network.

So, taking all that together, we suspect that a significant number of MAC addresses were injected into the network, possibly in the Bowen St area. MAC table overflow will cause a switch to behave like a hub - flooding frames to all ports. That would cause keepalive frames to get forwarded around the network, when normally they would be dropped by the receiving switch. Once the loopback detection mechanisms started to kick in and links started flapping, the increasing CPU load caused the spanning-tree to become unstable, which caused more loops, more loopback link flaps, and a meltdown ensued.
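
Incidentally, if you want to check whether a switch's MAC address table is anywhere near full, the command below is the quick way to do it (the exact spelling - mac-address-table versus mac address-table - varies between IOS versions):

! show how many dynamic MAC addresses have been learned, and how much table space remains
show mac-address-table count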

Cisco note:

The problem occurs because the keepalive packet is looped back to the port that sent the keepalive. There is a loop in the network. Although disabling the keepalive will prevent the interface from being errdisabled, it will not remove the loop.

The problem is aggravated if there are a large number of Topology Change Notifications on the network. When a switch receives a BPDU with the Topology Change bit set, the switch will fast age the MAC Address table. When this happens, the number of flooded packets increases because the MAC Address table is empty. ... Keepalives are sent on ALL interfaces by default in 12.1EA based software. Starting in 12.2SE based releases, keepalives are NO longer sent by default on fiber and uplink interfaces.

http://www.cisco.com/cgi-bin/Support/Bugtool/onebug.pl?bugid=CSCea46385

That Cisco disables keepalives in newer code is instructive - reading between the lines, they seem to be acknowledging that running two largely independent loop detection/prevention systems (keepalives, and STP) is not optimal.

On Tuesday evening, we observed that boxes running 12.1 or earlier (in our case, mainly smaller 2950 switches) had significantly elevated CPU loads during the outage, whereas boxes running 12.2 (2960/2970/3550 models) didn't show anything like as much load. That is presumably mainly because the latter machines have more capable CPUs, but it may also have something to do with the fact that 2950s multihomed in a ring have higher CPU loads than those single-homed at the edge of the network.

The further a switch is from the root of the spanning-tree, the greater the chance that it will be the switch expected to disable a link in order to prevent a loop (if it's multihomed). Like many networks, our older 2950 switches have tended to migrate out to the edge of the network as more capable kit is deployed in the centre. If they are single-homed and get confused about the state of the spanning-tree, there is no real problem, as all ports will be forwarding, but having a 2950 multihomed at the edge of the network appears to be a poor idea when CPU load goes up.

For other reasons (no large frame support, no optical interfaces), we have been removing our 2950 switches over the last six months, but it'll be some time before that process is complete. In the short term we are ensuring that none of the 2950s are multihomed.

If this is what happened (and it's only an informed guess at the moment), then we are pretty sure that the measures above (enforcing MAC count limits on every port, disabling keepalives on interswitch links, single-homing all 2950s) will prevent the problem from recurring - it won't stop somebody looping Citylink, but it should dramatically reduce the impact if they do.

Almost all our gear now supports secure MAC address table ageing, which means that we will be able to enforce a maximum MAC count per port without each customer having to tell us what MACs they're using, which is good - the alternative would significantly increase admin overhead for everyone, which we are keen to avoid.
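
As a rough sketch of what that will look like on a customer-facing port (the interface name, MAC limit and ageing values here are illustrative - we haven't finalised the actual numbers):

interface FastEthernet0/1
 description customer-facing port
 switchport mode access
 switchport port-security
 ! cap the number of MAC addresses that can be learned on this port
 switchport port-security maximum 8
 ! age secure entries out after 5 minutes of inactivity, so customers don't
 ! have to register their MAC addresses with us
 switchport port-security aging type inactivity
 switchport port-security aging time 5
 ! drop frames from excess MACs rather than err-disabling the whole port
 switchport port-security violation restrict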

As of Friday Oct 19, we have turned off the keepalive packets on all interswitch links, and converted the majority of multihomed 2950s to single-homed. We have not yet implemented the MAC address restrictions - that will start to happen next week.
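
The keepalive change itself is a one-liner per interswitch interface - something like this (the interface name is an example):

interface GigabitEthernet0/1
 description interswitch link
 ! stop sending loopback keepalives on this link; STP remains the loop protection
 no keepalive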

If you made it this far, well done - thank you for your attention! I hope this has shed a little light on what happened. To finish off, I'll respond to a couple of themes that have popped up repeatedly over the last two weeks:

To the folks that have observed variations on "this is bound to happen with a straight L2 spanning-tree network, you should run MPLS|ATM|Token Ring|SDH|EAPS|something else", all that you say is undoubtedly true. All technology choices have a cost/risk tradeoff, and given that this is the first time Citylink has failed this completely in ~10 years, I'm personally relatively comfortable with the way the network works. Of course, if it blows up again in the same fashion shortly, I may well rapidly revise that view!

To the conspiracy theorists who want to know if Citylink was under intentional attack, the short answer is "I don't know". Blaming unknown blackhats is enormously tempting in all sorts of situations when you don't quite know what has gone on, no matter how implausible.

Citylink has always been a very open network that relies on everybody attached to it to play by the rules and show some common sense. For all the myriad ways people have found to DOS each other over Citylink in the last 10 years - be it proxy-arp, virii, worms, IP address duplication, BGP route hijinks, spanning-tree strangeness, L2 path problems or whatever - I've rarely had anything but genuine remorse from people when it's been pointed out that they've done something wrong. There have been many accusations of intentional attack but none ever proven, and that is the way I think an Internet exchange should be.

So while I can't categorically say that it wasn't intentional, I personally don't like blaming malice when there's so much scope for basic randomness.