A few months ago, a complaint started popping up from users downloading or updating our apps: “Geez, your downloads are really slow!”

If you work in support, you probably have a reflexive reaction to a complaint like this. It’s vague. There’s a million possible factors. It’ll probably resolve itself by tomorrow. You hope. Boy do you hope.

Except… we also started noticing it ourselves when we were working from home. When we’d come in to the office, transfers were lightning fast. But at home, it was really, seriously getting hard to get any work done remotely at all.

So, maybe there was something screwy here?

The Video

Before digging in, here’s this story in convenient summarized video form, if you’d prefer!



[youtube https://www.youtube.com/watch?v=yh3touL9eqg&rel=0&w=750&h=422]

Now on to the details.

The Test

The Panic “network topology” is actually very simple. The Panic web servers have a single connection to the internet via Cogent. We colocate our own servers, rather than using AWS or any other PaaS, and we also don’t currently use a CDN or any other cloud distribution platform.

So, if something is making our downloads slow, it ought to be pretty easy to do some analysis and figure out why, or at least where.

We wanted to know three things:

How fast can people download from our website?

How fast can people download from a “control” website that’s not on our network?

on our network? What are people using for their internet provider?

We made an extremely simple test page that transfers 20MB of data from our server to the browser, then sends the user to run the same script on the control server, which we chose to host with Linode. (The Linode server is located in Fremont, CA, the closest we could find to us here in Portland.)

We tweeted the link out, and data started pouring in…

The Results

Here’s what we got back, comparing how fast our users could download from our control server through Linode, and from our own servers through Cogent:

(There are 1,645 samples in our target range, after filtering out TLDs with fewer than 10 occurrences, and we’ve done a box plot, which shows a spread of all the data points.)

Well, well, well. It doesn’t take statistical genius to see one glaring outlier — and that was Comcast, with download speeds often being as low as 300 kilobytes/second. And you’ll never guess what provider is used by virtually every Panic employee when they work from home? Yeah, Comcast. There is, in fact, no other cable ISP available to Portland residents.

But, before jumping to conclusions, there was something else that was weird with the Comcast data: a huge number of outliers, way more outliers than any other provider. See all those red dots on the graph, ranging from very slow to very fast?

The answer to that mystery was solved when we plotted out Comcast data across different times of the day…

Nuts. The problem reports we’d been hearing were indeed a real thing.

Our downloads really were slow — but seemingly only to Comcast users, and only during peak internet usage times. Something was up.

At first we thought, maybe Comcast bandwidth is just naturally more congested in the evening as people come home from work and begin streaming Netflix, etc. But that didn’t explain why the connections to our Linode control server from Comcast, during the exact same time windows for each tester, were downloading with good speeds.

We wondered, is Comcast intentionally “throttling” Cogent customers? And if so, why?

The Why

Peering.

Major internet pipes, like Cogent, have peering agreements with network providers, like Comcast. These companies need each other — Cogent can’t exist if their network doesn’t go all the way to the end user, and Comcast can’t exist if they can’t send their customer’s data all over the world. One core tenet of peering is that it is “settlement-free” — neither party pays the other party to exchange their traffic. Instead, each party generates revenue from their customers. Cogent generates revenue from us. Comcast generates revenue from us at home. Everyone wins, right?

After a quick Google session, I learned that Cogent and Comcast have quite a storied history. This history started when Cogent started delivering a great deal of video content to Comcast customers… content from Netflix. and suddenly, the “peering pipe” that connects Cogent and Comcast filled up and slowed dramatically down.

Normally when these peering pipes “fill up”, more capacity is added between the two companies. But, if you believe Cogent’s side of the story, Comcast simply decided not to play ball — and refused to add any additional bandwidth unless Cogent paid them. In other words, Comcast didn’t like being paid nothing to deliver Netflix traffic, which competes with its own TV and streaming offerings. This Ars Technica article covers it well. (How did Netflix solve this problem in 2014? Netflix entered into a business agreement to pay Comcast directly. And suddenly, more peering bandwidth opened up between Comcast and Cogent, like magic.)

We felt certain history was repeating itself: the peering connection between Comcast and Cogent was once again saturated. Cogent said their hands were tied. What now?

The Fix

There was only one last hope: get Comcast to fix it. I know, like we were somehow going to convince this 200 billion dollar corporation to add more capacity to their interconnection with Cogent. If I asked you to rate the possibility of that actually happening on a scale of “no” to “never”, you’d probably pick “come on man are you serious”, right?

But after a lifetime of being a “hey, it’s worth a shot” guy, I had to try. I did a real quick Google search for Comcast corporate contacts and found a person who seemed like they were involved in network operations PR, and I fired off a quick e-mail explaining the situation to Comcast.

And then, the craziest thing happened…

They wrote back quickly. Not only that, but they were on it. We set up a phone call. They took us seriously, they wanted to know the backstory, they wanted to know what our customers were seeing, and they were going to talk to the right people — they even e-mailed Cogent to connect with the right person in peering over there.

And pretty soon a call came back with a definitive-sounding statement: “Give us 1 to 2 weeks, and if you re-run your test I think you’ll be happy with the results.”

Sure enough, we waited two weeks, had our users re-run the speed test, and wouldn’t you know it…

…the problem was essentially gone. Comcast really did fix it. We were now able to measure our Comcast download speeds in megabytes/second instead of kilobytes.

According to Comcast, two primary changes were made:

Comcast added more capacity for Cogent traffic. (Exactly as we suspected, the pipe was full.) Cogent made some unspecified changes to their traffic engineering.

Here’s where I have to give Comcast credit where credit is due: they really did care about this problem, and they really did work quickly to make it go away.

(One weird thing, though: I was so prepared for a total Comcast dead-end, so sure that Comcast would never even reply, let alone help, that this incredibly positive outcome made me feel suspicious: why me? Why was I able to get this corrected with an e-mail when Cogent couldn’t?

It felt like there was no way this should have worked. If I had to guess, I’d say it’s simple: in the middle of a serious ongoing debate over net neutrality, the last thing Comcast wanted to look like was a network-throttling bad guy in this blog post. But then again, maybe I’m still being too cynical — maybe they just saw a problem they hadn’t noticed and fixed it. (But really, did they really not notice that pipe was full until I asked? Surely there are network monitoring tools?) Frankly, I have to stop thinking about it, because I’ll never know. But no matter the reason, I’m very grateful: thanks for listening to us, Comcast.)

What Does This All Mean

I’d summarize it as follows:

The internet is fragile — and that’s pretty scary.

And while this story amazingly had a happy ending, I’m not looking forward to the next time we’re stuck in the middle of a peering dispute between two companies. It feels absolutely inevitable, all the more so now that net neutrality is gone. Here’s hoping the next time it happens, the responsible party is as responsive as Comcast was this time.

Check Our Work

All of our data, our data analysis scripts, and more, is available at this GitHub repository. You can even click the button in the readme and it will take you to a running JupyterHub notebook where you can play with the data yourself, live in your browser. If you find any insights, or mistakes, please let us know.