[lore] Murphy's Law Strikes Again: AS7007

Applications of Murphy's Law: The "AS7007 Incident"
Adrian Chadd

It was an average day in 1997. The Internet was a fledgling by today's standards. Internet operators (mostly!) trusted one another. SMTP servers would be open relays; a number of open web proxies and anonymous dialout servers were available. People were worried about running out of IP space. Network Operators were worried about the CPU on their routers being taxed by a full routing table of ~45,000 entries.

Then, suddenly, the Internet stopped working.

Network Operators everywhere sprang into action to discover the cause of the lack of traffic. And there it was. As far as the routing protocols were concerned, the entire Internet existed in one location - some crappy Bay Networks router in AS7007.

The problem was fixed rather quickly - the misbehaving router was pulled from the network. But this didn't solve the problem. Routers were still crashing all over the Internet. Where were the announcements coming from? How could one stop them? Was the Internet, kept running by gaffa tape, IRC and sushi, finally coming to an end?

Everything settled down a few hours later. Network Operators around the globe began discussing the impact of this outage and how it could be prevented. The Internet did fundamentally change - but unlike a lot of other changes, the general users knew nothing about it.

What is BGP?

BGP is the protocol by which networks on the Internet do two things: announce to other networks that they exist and which networks can be reached through them, and learn how to reach the other networks on the Internet. Routers receive BGP information, decide upon the "best" path to take to a destination network and update their routing table.

BGP uses a few metrics to determine the "best" path. The most obvious metric is the number of networks between the router and the destination network - the "AS path length". A shorter AS path is generally better.
This isn't the whole story but, as you'll see, it didn't matter. The other metric is how specific the route is. A more specific route is preferred over a more general route, regardless of AS path length or any other metric. So if you see an announcement for 130.95.0.0/16 (i.e., 130.95.0.0 -> 130.95.255.255) via path A and an announcement for 130.95.0.0/24 (i.e., 130.95.0.0 -> 130.95.0.255) via path B, traffic destined to any host inside 130.95.0.0/24 will flow via path B regardless of how much closer path A is.

So there's this router in AS7007. It learnt the entire Internet routing table via BGP. It began converting most routes into /24s - i.e., routes which each covered 256 IP addresses. Somehow - and this part is fuzzy - it then managed to "leak" this table back into BGP and reannounced to the entire Internet almost every network that was available. Deaggregated down to /24s. Marked as originating from its own AS number. So the AS path was removed (i.e., every network on the Internet looked like it belonged to AS7007) and every announcement was very specific (/24).

So, as far as the routers on the Internet were concerned, every network everywhere could be reached by sending traffic to AS7007. And so they did. The Internet existed at the end of a 45 Mbit/s pipe, connected to AS7007.

This was rectified quickly. The port was shut off and the announcements ceased. But the problem didn't go away. Routers kept passing on this massive 250,000-entry routing table and, in many cases, would then crash. They'd reboot, re-learn all the routes from a peer, re-distribute them, and crash again. Not only that, but routers worked in finite time over links which transmitted at a finite data rate, with latency bounded by the speed of light. These announcements bounced around the Internet for hours.
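
The "more specific route wins" rule and the deaggregation into /24s described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any real router implements it: the best_route helper and the "path A" / "path B" labels are made up for this example, and only the standard-library ipaddress module is assumed.

```python
import ipaddress


def best_route(routes, destination):
    """Pick the route whose prefix contains `destination` and is most specific.

    `routes` is a list of (network, label) pairs. This hypothetical helper
    looks only at prefix length and ignores AS path length entirely, which
    is the point of the example.
    """
    addr = ipaddress.ip_address(destination)
    candidates = [(net, label) for net, label in routes if addr in net]
    if not candidates:
        return None
    # The longest prefix (largest prefixlen) is the most specific route.
    return max(candidates, key=lambda entry: entry[0].prefixlen)


# The two announcements from the example in the text.
routes = [
    (ipaddress.ip_network("130.95.0.0/16"), "path A"),
    (ipaddress.ip_network("130.95.0.0/24"), "path B"),
]

# A host inside the /24 follows path B, however short path A might be.
inside = best_route(routes, "130.95.0.7")[1]    # "path B"
# A host outside the /24 but inside the /16 still follows path A.
outside = best_route(routes, "130.95.1.7")[1]   # "path A"

# Deaggregation: a single /16 splits into 256 separate /24 routes.
more_specifics = list(
    ipaddress.ip_network("130.95.0.0/16").subnets(new_prefix=24)
)
count = len(more_specifics)                     # 256

if __name__ == "__main__":
    print(inside, outside, count)
```

Applied to almost every route in the table, the same split both inflates the number of routes and makes each of them at least as specific as the legitimate announcement it covers, which is why the leaked routes attracted so much traffic.
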
Many Internet backbones solved the problem by turning off all their equipment, shutting off the ports, staging reloads of their equipment, adding route announcement filters to reject the routes in the first place, and then turning their network connectivity back on.

The aftermath?

Network Operators began filtering route announcements from their peers and customers. At a coarse level - customers could only announce networks originating from their AS numbers. At a fine-grained level - some companies only accepted route announcements matching certain criteria. This involved first registering your network in the RADB, in which you would describe your network, the networks you announced and how you connected to other networks. Most networks did something in between.

Vendors began adding "magic" to their routers to allow administrators to control how many announcements a peer could send before shutting that peer off or ignoring further announcements. The usual talk of "cryptographically signed" data popped up but nothing happened for a long while.

And the owner of AS7007 was never able to live it down.

Disclaimers:

Much hand-waving has been done about IP routing here. I could be more specific but the article would be much, much longer. Email me if you're interested in a further explanation.

References:

* Someone first noticing what was going on
  http://www.merit.edu/mail.archives/nanog/1997-04/msg00340.html
* What happened
  http://www.merit.edu/mail.archives/nanog/1997-04/msg00444.html
* "Delayed Internet Routing Convergence"
  http://portal.acm.org/citation.cfm?id=347428&dl=ACM&coll=&CFID=15151515&CFTOKEN=6184618
* "Understanding BGP Misconfiguration"
  http://citeseer.ist.psu.edu/mahajan02understanding.html
* "BGP Design Principles"
  http://www.riverstonenet.com/support/bgp/design/index.htm