Another archival post from the Citylink blog

Loopback Saturday revisited

Since the earlier writeup about the big fault of Oct 6th, I've received questions from various folks, and we've gained further data from the network that has caused us to revisit the conclusions of the first writeup. Most significantly, we think the network fault was caused by hardware failure, rather than MAC table overflow.

Last week, we had a switch in the Majestic Centre lock up - it wasn't pingable or manageable, and the customers in that building were disconnected from the network. This is a relatively uncommon occurrence - we've had very few real hardware failures with Cisco gear (fan and PS failures), and the IOS loads for the low-end L2 Cisco switches are generally bombproof - not surprising, since most of the magic happens in hardware. So other than the early 1548s that were so unreliable they got sent back, we haven't had more than half a dozen actual Cisco switch lockups in the last 10 years.

So, when the switch in the Majestic Centre crashed during the main outage, and went offline again 20 days later, that piqued our interest. The switch was showing errors much like these when we got onto the console:


This was at 2am on Oct 26, so after a little Googling suggested that this was a hardware error, we swapped the 2950 for a 2960, and the on-call guys went back to bed. Following up, we found a Cisco bug report that says:

Under certain level of traffic load, the switch will start logging the following messages on the console:


And after a few seconds, the switch will stop passing any traffic.

In some cases, the switch seemed still forwarding broadcast and multicast traffic, which will cause STP problem if the switch has redundant link and is not supposed to be the root for the VLAN, as both port will go forwarding.

Two units were returned by CISCO. The units were re-screened to the latest test program, and failed the SDRAM memory test.

Customer should RMA unit back to Cisco.

During the Oct 6th outage, the Majestic Centre 2950 was on a ring, and should have been blocking on some ports. After that outage, we single-homed all the 2950s that were multihomed, including the Majestic Centre switch, so when it failed on Oct 26, it wasn't in a position to close a loop.
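To make the ring argument concrete, here is a minimal sketch in Python. It is a toy model only: the three-switch topology, the switch names and the `has_loop` helper are all made up for illustration, and it is not our actual network or tooling. It treats forwarding links as edges in a graph and checks whether they contain a cycle, for a healthy ring (one link blocked by spanning tree), a ring where a faulted switch forwards on both uplinks, and a single-homed switch.

```python
# Toy model: switches are nodes, forwarding links are undirected edges.
# Topology and names are hypothetical, for illustration only.

def has_loop(links):
    """Return True if the undirected forwarding links contain a cycle (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True  # a and b were already connected: this link closes a loop
        parent[root_a] = root_b
    return False


# A ring: core-a -- core-b -- edge -- core-a
ring_links = [("core-a", "core-b"), ("core-b", "edge"), ("edge", "core-a")]

# Healthy ring: spanning tree blocks one link (here the edge to core-a link).
healthy = [link for link in ring_links if link != ("edge", "core-a")]

# Faulted ring: the edge switch forwards on both uplinks, so nothing is blocked.
faulted = list(ring_links)

# Single-homed: the edge switch has one uplink, so its failure cannot close a loop.
single_homed = [("core-a", "core-b"), ("core-b", "edge")]

print(has_loop(healthy), has_loop(faulted), has_loop(single_homed))
```

In this model only the faulted ring reports a loop, which is the situation the bug report describes for a switch with a redundant link. The single-homed case has no second path to forward on, whatever the switch does.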

So, this causes us to reconsider some of our conclusions about what caused Oct 6th. While still possible, the "something injected lots of MAC addresses" hypothesis is no longer our strongest candidate root cause for the spanning-tree instability - it now seems more likely that a hardware fault kicked off the stability problems on the network.

In terms of preventing the problem from happening again, our plan hasn't changed - we've implemented all the changes discussed in the earlier post, except for the MAC filtering, which should go live shortly (there is a fair amount of latitude for it to go wrong, so we are being careful).

Various questions asked over the last few weeks:

  • "did you question everybody connected to the last half dozen switches to see if any of them were doing anything random"?

    Yes. I didn't speak to all of them personally, but we got around most of them in the following few days. I've no reason to believe any of them was doing anything out of the ordinary; I prefer the "we pulled it all apart, and then put it back together, and it worked" explanation.

  • "what do you mean by 'out of network' management"

    That wasn't my first choice of phrase - I originally wrote "out of band", but we changed it for reasons that I now don't recall.

    Currently, we manage the network from an administration VLAN within the network itself. We're careful about it, we follow the Cisco BCP's (ie, don't use VLAN 1, access lists for logins/tacacs/logging, we don't allow the management VLAN to be expressed on any ports in the field, other than the interswitch trunk links), and we monitor ARP/MAC noise inside the VLAN. This has worked as our primary mgmt method for many years, so we've never bothered to implement anything "ex network".

    At the moment, we're tossing up what the best way of ensuring emergency access to switches may be - we're looking at various mobile 3G/RT based services, and are considering ADSL/dialup, but it's looking like the best option may be to construct a second ethernet on other fibre. With single-fibre SFPs getting cheaper over the last few months, it may be that we convert our existing dual-fibre circuits into a pair of single-fibre circuits, and run an independent ethernet for management.

  • Citylink eventually built an out of band network to the core nodes using CWDM wavelengths and separate switches at the core nodes, with serial console terminal servers. It proved quite useful, particularly as 7609s seem to regularly forget their VTP config when rebooted. *

  • How did you get this past "legal"? (a question mainly from folks outside NZ).

    New Zealand isn't a particularly litigious culture - Citylink doesn't have a legal team, we contract legal services in on the odd occasions we need them (which isn't that often). It never occurred to me that what I was writing might materially affect Citylink's legal position, and as far as I know it hasn't.