Monday afternoon local time, Czech network operator SuproNet apparently decided that it didn't like the way traffic from the Internet flowed towards its network. So the company made a change. The result was that a number of routers elsewhere started disconnecting from and then reconnecting to their neighbors. This increased the number of routing information updates flowing through routers worldwide from a few thousand per second to 25,000 every second for several hours.

The big routers in ISP networks—and the networks of end-users connected to multiple ISPs—use the Border Gateway Protocol (BGP) to tell each other which ranges of IP addresses are used where on the Internet. As the "border" part suggests, the protocol is run between routers belonging to different organizations, usually with a strong economic interest in what BGP does for them. As such, the protocol is a huge experiment in cooperative real-time computing on a worldwide scale.

Whenever a new BGP-speaking router joins the existing network, it injects new information, which is then propagated to all other BGP routers worldwide. Every range of IP addresses ("prefix" in BGP parlance) has a number of "path attributes" attached to it. An important one of these is the AS path: a list of the Autonomous System (AS) numbers of the network operators that provide connectivity to the destination in question. The AS path allows BGP to avoid loops: when a router finds its own AS number already present in the AS path for a prefix, it won't accept the prefix, which keeps packets from circling around the network.
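The loop check described above can be sketched in a few lines of Python. This is only an illustration: the function name `accepts_route` and the AS number 64512 are made up for the example, and a real router applies many more checks before accepting a prefix.

```python
def accepts_route(local_as: int, as_path: list[int]) -> bool:
    """Return True if a router in `local_as` may accept a route with this AS path.

    Hypothetical helper: it models only the loop check, nothing else.
    """
    # If our own AS number is already in the path, the announcement has
    # passed through us before, so accepting it could create a loop.
    return local_as not in as_path


# AS path taken from the routing table entry quoted later in the article.
path = [701, 174, 25512, 47868]

print(accepts_route(64512, path))  # 64512 is not in the path
print(accepts_route(174, path))    # 174 is already in the path
```

A router in AS 174 would reject this announcement, while a router in an AS that does not yet appear in the path would be free to accept it.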

Another function of the AS path is to convey preference: short AS paths are better than long ones. So if you're connected to two ISPs and would prefer to move some incoming traffic from one ISP to the other, you could artificially lengthen the AS path that you present to the ISP you want less traffic from, by repeating your own AS number in it. Others then see a shorter AS path through the preferred ISP and will probably decide to send traffic to you through that ISP. It looks like this is what SuproNet did, only a little too enthusiastically: the resulting AS path was more than 255 ASs long. Nothing in BGP says this isn't allowed, but it triggered one very old Cisco bug, and probably another, newer one.
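To make the effect of lengthening the path concrete, here is a rough Python sketch. The AS numbers (65001 for the customer, 65010 and 65020 for the two ISPs) and the helper names `prepend` and `best_path` are invented for the example. It also pretends that path length is the only thing that matters, whereas real routers weigh other attributes before they get to AS path length.

```python
def prepend(as_path: list[int], own_as: int, times: int) -> list[int]:
    """Return a copy of `as_path` with `own_as` repeated `times` extra times in front."""
    return [own_as] * times + as_path


def best_path(candidates: list[list[int]]) -> list[int]:
    """Pick the shortest AS path (simplification: length is the only criterion)."""
    return min(candidates, key=len)


OWN_AS = 65001  # hypothetical customer AS
ISP_A = 65010   # hypothetical preferred ISP
ISP_B = 65020   # hypothetical ISP that should carry less incoming traffic

# What the customer announces to each ISP.
to_isp_a = [OWN_AS]
to_isp_b = prepend([OWN_AS], OWN_AS, 3)

# What a remote network sees after each ISP adds its own AS number.
via_a = [ISP_A] + to_isp_a
via_b = [ISP_B] + to_isp_b

print(len(via_a), len(via_b))
print(best_path([via_a, via_b]) == via_a)
```

The path through ISP A has two entries and the one through ISP B has five, so a remote network that decides on length alone sends its traffic through ISP A.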

The old bug, which was fixed in 2003, kicks in whenever a path contains more than 126 AS numbers. In this case, BGP uses a different length encoding, which older versions of the Cisco IOS software didn't handle correctly. So the next router would see an invalid BGP packet and close the session. Shortly after that, the routers involved would reestablish the session, send a whole bunch of updates to other routers, encounter the same problem again, and break off the BGP session again... ad infinitum.
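Where the number 126 comes from can be shown with a small encoder. The sketch below rests on assumptions that are not in the article itself: it follows the path attribute layout as I read it in the BGP specification (a flags byte, a type byte, a one-byte length that becomes two bytes when the "extended length" flag is set), uses 2-byte AS numbers, and puts the whole path in a single segment. The function name `encode_as_path` is made up.

```python
import struct

# Constants as defined in the BGP-4 specification (assumed here, not from the article).
AS_PATH_TYPE = 2             # attribute type code for AS_PATH
AS_SEQUENCE = 2              # segment type for an ordered list of ASes
FLAG_TRANSITIVE = 0x40
FLAG_EXTENDED_LENGTH = 0x10  # set when the length field takes two bytes


def encode_as_path(as_path: list[int]) -> bytes:
    """Encode an AS path as one AS_SEQUENCE segment with 2-byte AS numbers."""
    if len(as_path) > 255:
        # The per-segment AS count is a single byte; longer paths need
        # several segments, which this sketch does not attempt.
        raise ValueError("a single segment holds at most 255 ASes")
    value = struct.pack("!BB", AS_SEQUENCE, len(as_path))
    value += b"".join(struct.pack("!H", asn) for asn in as_path)
    if len(value) > 255:
        # Too long for a one-byte length: switch to the two-byte form.
        header = struct.pack(
            "!BBH", FLAG_TRANSITIVE | FLAG_EXTENDED_LENGTH, AS_PATH_TYPE, len(value)
        )
    else:
        header = struct.pack("!BBB", FLAG_TRANSITIVE, AS_PATH_TYPE, len(value))
    return header + value


for count in (126, 127):
    encoded = encode_as_path([47868] * count)
    print(count, hex(encoded[0]), len(encoded))
```

With 126 ASes the attribute value is 2 + 2 × 126 = 254 bytes and still fits the one-byte length field. With 127 it is 256 bytes, so the encoder has to switch to the two-byte form, which is the case the article says older IOS versions mishandled.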

The newer bug is less publicized, but apparently it kicks in at 255 ASs in the AS path. It seems unlikely that very many routers involved in BGP are still running software more than five years old, so the old bug should be rare by now. The newer bug, however, was fixed much more recently, so a good number of people are still running vulnerable IOS versions on their routers. For now, though, it looks like these routers are in the clear:

   Network          Next Hop            Metric LocPrf Weight Path
*  94.125.216.0/21  157.130.10.233                         0 701 174 25512 47868 i

In his book "Internet Routing Architectures" about the BGP protocol, Sam Halabi writes, "Some people are surprised when networks fail. I'm surprised when they don't." I'm with you, Sam. More details in the Renesys blog.