May 4 2019

The speed of BGP network propagation

At the end of March 2019 I gave a talk at the Annual General Meeting of INEX (Ireland’s biggest internet exchange point). I was supposed to record it; however, in a brief panic caused by HDMI not working on my laptop, I forgot to start the recording. Since people found it interesting, I figured I would turn it into a blog post instead:

I spent so much time on this opening slide that I feel the need to include it even if I’m not doing an introduction this time.

So a long time ago… in a job fa- well a year back. We were dealing with the lovely routing design of anycast.

For those who need a quick primer, it’s a routing design that gives you more natural region-based load balancing, letting you put server clusters in different regions and serve traffic local to those regions without having to play tricks with DNS:

This works by having all participating nodes announce the same IP prefixes globally. With some careful routing tuning (mainly careful selection of upstream transit/peering providers) you can get good load balancing and latency results, since traffic is served closer to the visitor’s region.
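
To make that concrete, here is a toy sketch of the selection that anycast relies on: every site announces the same prefix, and a viewer’s network simply prefers the announcement with the shortest AS_PATH, which usually corresponds to the nearest site. All AS numbers and paths here are made up for illustration; real BGP best-path selection has many more tie-breakers.

```python
# Toy anycast illustration: the same prefix is announced from several
# sites, and the route with the shortest AS_PATH wins. ASNs are fake.

announcements = {
    # site: AS_PATH as seen from a hypothetical viewer in Europe
    "fra": [3356, 64500],                # short path -> nearby site
    "sjc": [2914, 3356, 64500],          # longer path -> distant site
    "sin": [6453, 2914, 3356, 64500],
}

def best_site(paths):
    """Pick the site whose announcement has the shortest AS_PATH."""
    return min(paths, key=lambda site: len(paths[site]))

print(best_site(announcements))  # -> fra: European viewers land in Frankfurt
```

In reality the “distance” is topological rather than geographic, which is exactly why the careful provider selection mentioned above matters so much.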

Sadly, a lot of networks struggle to get consistent network announcements to work right, often resulting in totally backwards-from-logic routing:

Even for networks that get most regions right, regions like Asia are much harder to route correctly, partly because local ISPs are either dealing with overloaded links or have links that don’t follow logical geographic paths.

The crux of the problem is ensuring your routing announcements are consistent over all regions and over almost all major interconnection ISPs (Tier 1’s).

In simple setups, this really just means you need to keep your AS_PATHs as close to identical as possible in all the regions and carriers you want to have routing control over.

This basically just means you should be attempting to use the same providers and traffic engineering parameters in all regions, with AS_PATH prepending being one of the more basic tools.
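
A minimal sketch of what prepending does, using made-up ASNs: repeating your own ASN in the AS_PATH makes that announcement look longer, so other networks prefer the site that didn’t prepend.

```python
# Toy sketch of AS_PATH prepending as a traffic engineering knob.
# All ASNs are hypothetical.

MY_AS = 64500

def announce(prepends=0):
    """Return our AS_PATH with `prepends` extra copies of our own ASN."""
    return [MY_AS] * (1 + prepends)

# Two sites announce the same prefix; site B deliberately prepends
# twice to push traffic towards site A.
routes = {
    "site_a": [3356] + announce(prepends=0),  # [3356, 64500]
    "site_b": [3356] + announce(prepends=2),  # [3356, 64500, 64500, 64500]
}

winner = min(routes, key=lambda site: len(routes[site]))
print(winner)  # -> site_a
```

Which also shows the failure mode discussed next: if site B accidentally drops its prepend, the two paths tie and traffic can suddenly shift towards it.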

However, as systems get larger and more complex, eventually a mistake is going to be made. In the case of the job, a configuration misunderstanding during maintenance of a router caused it to drop a traffic engineering prepend. This caused a huge global traffic shift towards this router, almost instantly overloading the site.

This was a regrettable incident, and it became clear that while the traffic engineering prepend was useful in the past, at that point in the network it was more of a liability than a useful tool. So it was time to remove it.

But what if we were to make the same mistake again? This time we would be changing a lot more router configuration at once, so it’s worth thinking about failure modes here. There are two ways the change could fail:

The first is that the change applies to a large percentage of the routers on the network, but fails on some of them. This would cause traffic to mostly shift away from those locations towards other nearby sites. As long as not too many sites are affected, this is the best way it can fail.

The nastier way it can fail is that most routers don’t end up being changed but a small percentage of them do.

This would be a repeat of the first incident, except with more routers involved, and a lot more routers needing a rescue configuration rollback or hotfix.

The story of them changing this in a sane way is not mine to tell, however someone at the front of this change did a talk at RIPE NCC’s bi-annual event about how it was done:

The good news is, the change went through fine, and no router got left behind! A small amount of traffic churn happened while routers globally updated their routing tables and informed the other internal routers they were connected to.

During this time, it was observed that not all providers accepted this change at the same time: some seemed to reconverge almost instantly, while others were noticeably slow.

This raises the question: how long is this sort of thing generally supposed to take? Are some providers better than others? Who is the fastest? Who is the slowest?

First, though, we need to define what it means for a route to be propagated. There are two valid ways (in my eyes) this could be measured:

The “First Announcement Wins” method is quite literally what it says on the tin: when we see a BGP update message for our prefix, that provider+location combo wins (or, if they are late, loses).

This could be slightly flawed, since some networks might have hard-to-observe mechanisms for quickly sending routing information inside their network, and those initial internal route updates may not be sensible network paths.

In “First Stable Announcement Wins”, testing is done to ensure that whatever route becomes “stable” (stops changing its internal routing in the provider backbone) is declared the winner.

In my eyes this is what most network engineers are looking for; however, it also has a large issue attached to it.

Figuring out what counts as a stable route involves a non-trivial amount of complexity, and no matter how I do it, I don’t think it is measurable down to the tens of milliseconds.
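
To make that complexity concrete, here is one possible sketch of “stable”, assuming stability means the AS_PATH at a collector stops changing for a hold-down period. The hold-down length is an arbitrary choice (as are the timestamps and ASNs below), which is exactly why this metric is hard to pin down precisely.

```python
# One possible "first stable announcement" scorer. Timestamps, paths
# and the hold-down value are all made up for illustration.

HOLD_DOWN = 5.0  # seconds without a path change before we call it stable

def first_stable(updates, end_of_capture):
    """updates: time-sorted (timestamp, as_path) pairs for one collector.
    Returns the timestamp at which the route last changed, if it then
    stayed unchanged for at least HOLD_DOWN seconds; otherwise None."""
    last_change, last_path = None, None
    for ts, path in updates:
        if path != last_path:
            last_change, last_path = ts, path
    if last_change is not None and end_of_capture - last_change >= HOLD_DOWN:
        return last_change
    return None  # never settled within the capture window

updates = [
    (0.5, [174, 64500]),    # initial route via one carrier
    (0.9, [3257, 64500]),   # path change: reconverged via another
    (1.2, [3257, 64500]),   # duplicate update, path unchanged
]
print(first_stable(updates, end_of_capture=30.0))  # -> 0.9
```

Every knob here (the hold-down, what counts as a “change”, the capture window) shifts the answer, so the measured precision is limited by the definition, not the clocks.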

For this reason, the experiment we are doing is using “First Announcement Wins.”
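
The scoring itself is then trivial: given a stream of timestamped BGP update observations for the test prefix, only the first update seen per location counts. A minimal sketch, with made-up timestamps:

```python
# "First Announcement Wins" scoring: keep only the earliest observation
# per location. Timestamps and locations are illustrative, not real data.

def first_announcement_wins(updates):
    """updates: iterable of (timestamp, location).
    Returns {location: first_seen_timestamp}, ignoring later updates."""
    first_seen = {}
    for ts, location in sorted(updates):
        first_seen.setdefault(location, ts)
    return first_seen

updates = [
    (0.523, "sea"),
    (0.586, "cdg"),
    (1.521, "lhr"),  # a later lhr update...
    (0.903, "lhr"),  # ...loses to this earlier one after sorting
]
print(first_announcement_wins(updates))
# -> {'sea': 0.523, 'cdg': 0.586, 'lhr': 0.903}
```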

The propagation race works like so:

The high-precision timestamps are important here, and a detail that actually ended up being slightly devastating for the first few runs due to the inaccuracy of system clocks.

You see, I now have a stronger respect (in that I now actually believe they have worth) for the PPS and 10MHz clock inputs on a lot of high-end carrier routers, since time syncing is actually incredibly hard once you go above two systems. Locking all systems to a stable clock source is immensely nice, and before you ask, NTP does not really get that close in real-life situations with a wide range of geographically separated targets.

After a lot of time syncing and timestamp offset correction, I ended up with a linear list of announcements by server location (airport codes signify where they are, since that’s generally what the networking industry uses).
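
Once the timestamps are corrected, producing the per-location columns in the tables below (“Last update” and “Wall time”) is simple bookkeeping: wall time is seconds since the announcement, and “last update” is the gap since the previous location heard the route. A sketch with a shortened, illustrative subset of data:

```python
# Derive "last update" (ms since previous observation) and "wall time"
# (s since announcement) columns from synced absolute timestamps.
# The epoch value and offsets here are made up for illustration.

def build_rows(t0, observations):
    """observations: time-sorted list of (abs_timestamp, location)."""
    rows, prev = [], t0
    for ts, loc in observations:
        rows.append((round((ts - prev) * 1000, 1),  # delta in ms
                     round(ts - t0, 3),             # wall time in s
                     loc))
        prev = ts
    return rows

t0 = 1556928000.0  # announcement time (hypothetical epoch seconds)
obs = [(t0 + 0.523, "sea"), (t0 + 0.586, "cdg"), (t0 + 0.593, "fra")]
for delta_ms, wall_s, loc in build_rows(t0, obs):
    print(f"{delta_ms}ms  {wall_s}  {loc}")
```

The catch, per the clock discussion above, is that `ts` comes from many different machines, so the subtraction is only as good as the offset correction applied beforehand.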

For AS6453 (Tata Communications), times looked decent to start with. Given that none of the BGP route update collector nodes had Tata as a direct provider, this is basically racing how fast Tata’s peers can send routes around.

It’s interesting that there seems to be a roughly 500ms minimum, but after that routes move around the globe very fast, with the exception of EWR (New York area), which is likely an outlier.

| Last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
|---|---|---|---|---|
| 0 | 0 | Origin | 6453 | No |
| 523.2ms | 0.523 | sea | 6453 | No |
| 62.5ms | 0.586 | cdg | 6453 | No |
| 6.8ms | 0.593 | fra | 6453 | No |
| 25ms | 0.618 | lax | 6453 | No |
| 107.5ms | 0.725 | yyz | 6453 | No |
| 92.8ms | 0.818 | nrt | 6453 | No |
| 12.9ms | 0.831 | sjc | 6453 | No |
| 9ms | 0.84 | mia | 6453 | No |
| 5.4ms | 0.845 | ams | 6453 | No |
| 58.4ms | 0.903 | lhr | 6453 | No |
| 22.9ms | 0.926 | dfw | 6453 | No |
| 68.4ms | 0.994 | ord | 6453 | No |
| 26.5ms | 1.021 | fra | 6453 | No |
| 45ms | 1.066 | sin | 6453 | No |
| 455.3ms | 1.521 | lhr | 6453 | No |
| 191.2ms | 1.712 | syd | 6453 | No |
| 19764.8ms | 21.477 | dfw | 6453 | No |
| 171947.8ms | 193.425 | ewr | 6453 | No |

For AS174 (Cogent Communications), propagation seems to take a little longer: due to policy on the upstream ISP used for route collection, Cogent routes were only imported from other carriers, so there is a similar effect to Tata here. However, it is odd that Toronto (YYZ) sees the route first, since the announcement is done in London (LHR).

This is likely the impact of a route reflector or something similar inside the network.

| Last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
|---|---|---|---|---|
| 0 | 0 | Origin | 174 | No |
| 2425.1ms | 2.425 | yyz | 174 | No |
| 2202.1ms | 4.627 | yyz | 174 | No |
| 3328ms | 7.955 | yyz | 174 | No |
| 1141.3ms | 9.096 | sea | 174 | No |
| 85.1ms | 9.181 | lhr | 174 | No |
| 158ms | 9.339 | syd | 174 | No |
| 230.7ms | 9.57 | dfw | 174 | No |
| 56.2ms | 9.626 | ewr | 174 | No |
| 65.8ms | 9.692 | nrt | 174 | No |
| 107.7ms | 9.8 | fra | 174 | No |
| 18.7ms | 9.819 | lax | 174 | No |
| 49.8ms | 9.869 | mia | 174 | No |
| 33.3ms | 9.902 | cdg | 174 | No |
| 18.2ms | 9.92 | ams | 174 | No |
| 74.2ms | 9.994 | sjc | 174 | No |
| 4ms | 9.998 | sin | 174 | No |
| 16898.5ms | 26.897 | ord | 174 | No |
| 531.6ms | 27.429 | dfw | 174 | No |

For AS3257 (GTT), we are finally seeing timing data based on providers we are directly connected to. GTT does seem to send routes around the world reasonably fast, in a shiny 1.9 seconds (apart from EWR, which supports the theory that EWR is a data point error rather than anything else).

| Last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
|---|---|---|---|---|
| 0ms | 0 | Origin | 3257 | No |
| 721.1ms | 0.721 | yyz | 3257 | No |
| 64.2ms | 0.785 | lax | 3257 | Yes |
| 9.4ms | 0.794 | dfw | 3257 | Yes |
| 52.9ms | 0.847 | sea | 3257 | Yes |
| 21.4ms | 0.868 | sjc | 3257 | Yes |
| 44.9ms | 0.913 | mia | 3257 | Yes |
| 15.8ms | 0.929 | nrt | 3257 | No |
| 82.9ms | 1.012 | fra | 3257 | No |
| 20.1ms | 1.032 | cdg | 3257 | Yes |
| 36.6ms | 1.069 | sin | 3257 | No |
| 19.7ms | 1.089 | ams | 3257 | No |
| 19.1ms | 1.108 | lhr | 3257 | No |
| 256.1ms | 1.364 | syd | 3257 | No |
| 19.1ms | 1.383 | ord | 3257 | Yes |
| 208.5ms | 1.592 | lhr | 3257 | No |
| 281.2ms | 1.873 | fra | 3257 | Yes |
| 88.7ms | 1.962 | nrt | 3257 | No |
| 114745ms | 116.708 | ewr | 3257 | No |

AS1299 (Telia) has more logical timing: 0.6 seconds after we announce in London, the route appears directly in Paris and Frankfurt, and it is fully propagated to all nodes less than 2 seconds after that. However, other carriers beat Telia to their own route! If you look at ORD (Chicago) and MIA (Miami), you can see other carriers pick up the route from Telia at another location and hand it over to our provider before, some 20 seconds later, it arrives as a direct route.

| Last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
|---|---|---|---|---|
| 0 | 0 | Origin | 1299 | No |
| 632.3ms | 0.632 | cdg | 1299 | Yes |
| 11.3ms | 0.643 | fra | 1299 | Yes |
| 107.1ms | 0.75 | sea | 1299 | No |
| 76.8ms | 0.827 | ams | 1299 | Yes |
| 17.6ms | 0.845 | lhr | 1299 | No |
| 40.5ms | 0.886 | yyz | 1299 | No |
| 27.6ms | 0.914 | mia | 1299 | No |
| 59.9ms | 0.974 | sjc | 1299 | No |
| 8.2ms | 0.982 | dfw | 1299 | Yes |
| 9.4ms | 0.991 | lax | 1299 | No |
| 10.9ms | 1.002 | ewr | 1299 | Yes |
| 5.9ms | 1.008 | yyz | 1299 | Yes |
| 61.5ms | 1.07 | lhr | 1299 | Yes |
| 137.1ms | 1.207 | ord | 1299 | No |
| 211.8ms | 1.419 | nrt | 1299 | No |
| 12.3ms | 1.431 | sin | 1299 | No |
| 499ms | 1.93 | nrt | 1299 | No |
| 198.6ms | 2.129 | syd | 1299 | No |
| 21135.5ms | 23.265 | mia | 1299 | Yes |
| 3196.4ms | 26.461 | ord | 1299 | Yes |

Level 3 (AS3356) does by far the worst in this test, taking 18 seconds from the announcement of the test prefix until it appears anywhere on the internet, and it appears in SEA (Seattle) of all places. From there, other carriers pick up the route and propagate it faster than Level 3 itself. Some 30 seconds later, Level 3 has caught up and the route is seen in all places with Level 3 peering.

| Last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
|---|---|---|---|---|
| 0 | 0 | Origin | 3356 | No |
| 18508.1ms | 18.508 | sea | 3356 | Yes |
| 365.6ms | 18.874 | yyz | 3356 | Yes |
| 241.9ms | 19.116 | lhr | 3356 | No |
| 251.2ms | 19.367 | cdg | 3356 | No |
| 174.5ms | 19.541 | mia | 3356 | No |
| 87.3ms | 19.628 | fra | 3356 | No |
| 6.1ms | 19.634 | sin | 3356 | No |
| 6.9ms | 19.641 | ewr | 3356 | No |
| 53.3ms | 19.694 | ams | 3356 | No |
| 212.3ms | 19.906 | dfw | 3356 | No |
| 72.1ms | 19.978 | lax | 3356 | No |
| 0.9ms | 19.979 | nrt | 3356 | No |
| 187.5ms | 20.166 | sjc | 3356 | No |
| 194.1ms | 20.36 | syd | 3356 | No |
| 10094.8ms | 30.455 | ewr | 3356 | Yes |
| 3963.6ms | 34.419 | mia | 3356 | Yes |
| 1207.5ms | 35.627 | ord | 3356 | No |
| 1684.1ms | 37.311 | sjc | 3356 | Yes |
| 476.2ms | 37.787 | fra | 3356 | Yes |
| 434.5ms | 38.222 | dfw | 3356 | Yes |
| 1264.5ms | 39.487 | ord | 3356 | No |
| 5106.7ms | 44.594 | ams | 3356 | Yes |
| 1426.2ms | 46.02 | ord | 3356 | Yes |
| 695.5ms | 46.715 | lhr | 3356 | Yes |
| 3801.6ms | 50.517 | cdg | 3356 | Yes |

Last but not least is AS2914 (NTT Communications), who, while not the fastest at sending routes globally, did appear to be the smoothest and most consistent.

| Last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
|---|---|---|---|---|
| 0 | 0 | Origin | 2914 | No |
| 814.9ms | 0.815 | ewr | 2914 | Yes |
| 90.6ms | 0.906 | fra | 2914 | Yes |
| 290.2ms | 1.196 | lax | 2914 | Yes |
| 0.6ms | 1.197 | cdg | 2914 | Yes |
| 121.4ms | 1.318 | yyz | 2914 | No |
| 5.1ms | 1.323 | lhr | 2914 | Yes |
| 24ms | 1.347 | nrt | 2914 | Yes |
| 34ms | 1.381 | ams | 2914 | No |
| 1.1ms | 1.382 | dfw | 2914 | No |
| 29.6ms | 1.412 | sea | 2914 | No |
| 81.4ms | 1.493 | sjc | 2914 | No |
| 68.8ms | 1.562 | sea | 2914 | Yes |
| 63.4ms | 1.625 | ord | 2914 | No |
| 57.1ms | 1.682 | mia | 2914 | No |
| 75.7ms | 1.758 | ams | 2914 | Yes |
| 108.9ms | 1.867 | sjc | 2914 | Yes |
| 196.8ms | 2.064 | mia | 2914 | Yes |
| 75.8ms | 2.14 | syd | 2914 | Yes |
| 54.9ms | 2.195 | dfw | 2914 | Yes |
| 0ms | 2.195 | ord | 2914 | Yes |
| 17.5ms | 2.212 | sin | 2914 | Yes |

Now that we have covered all the carriers, you may think that’s it; however, there is a different kind of propagation we can observe:

Just as we can race the network in sending out BGP routes, we can also race it in withdrawing them!

This is a test that is harder to see on the routing table itself, so it’s easier (and much more fun) to observe it by simply doing a traceroute to a prefix and then withdrawing it from all providers:

Here you can see the route slowly being released out of all of the carriers, and then the carrier backbones, and then the carrier inter peering relationships. It also exposes some interestingly strange routing too as options to route a prefix begin to run out!

Anyway, as I said to the audience, we have had the fast bit; now we can have the furious part! If you generally like this kind of post, I aim to post once a month on various (mostly networking-related) matters. If you want to stay up to date with that, you can either use my blog’s RSS feed or follow me on Twitter for updates when the next post happens.

I would like to thank AS57782 / Cynthia Revstrom for lending some IPv4 space for this post, and helping out on the traceroute demo you see above.

If you do have any questions about this talk, please feel free to reach out on the email that is on the slide above! Until next time!