The numbers here are from the data available as of 5/14, so they may differ slightly from the current numbers. You can see up-to-date data at a site that I have set up here: http://netneutrality.computer.

Over the past week, I’ve been studying some interesting data: the public comments on an FCC proposal. If you haven’t heard about the FCC’s current proposal, you should watch John Oliver’s recent segment. In short, the FCC is proposing to reclassify ISPs as Title I carriers, reverting the previous administration’s decision.

Back in 2014 (after another John Oliver segment), I tried to crawl the FCC’s site to cache and study the public comments, but the site had severe load issues, and there wasn’t a good API. This time around, they have a great API. I wrote some Python to crawl the public comments, hoping that I could spot some trends in the data, and perhaps determine the overall sentiment.

My main questions were:

Is there any bot activity? How much?

What does the bot activity look like?

What does the overall sentiment look like? Is it mostly Pro-Title II, or Anti-Title II?

There are two primary avenues to create a comment: you can use a form on the FCC’s site (as John Oliver directed), or you can submit comments via the API [EDIT: Since publishing this, I have found that there’s also a CSV submission method]. This means that there are many off-site forms that will allow you to submit a comment (which is totally above-board, as long as you’re not botting).

First, I wanted to figure out how to differentiate on-site comments from API comments. By looking at my own on-site comment (which you can see here), I quickly noticed that the “proceedings” field was quite beefy, while the API only specified three keys to be used in that property. For example, my comment has a field:

{"proceedings": {"total": 607069}}

This is clearly a count of the comments at the time that I submitted mine — something that offsite comments would lack.
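As a rough illustration of that check, here is a minimal sketch. It assumes each comment has already been parsed from JSON into a Python dict, and that an on-site comment can be recognized by a "total" key somewhere inside "proceedings". The function name is_onsite and the sample records are made up for this example, and the real records may need a more careful test.

```python
def is_onsite(comment):
    """Guess whether a comment came from the FCC's own form.

    Assumption: on-site records carry a "total" key inside "proceedings",
    as in the example above; API records do not.
    """
    proceedings = comment.get("proceedings", [])
    # Accept either a single dict or a list of dicts.
    if isinstance(proceedings, dict):
        proceedings = [proceedings]
    return any("total" in p for p in proceedings if isinstance(p, dict))


# Made-up sample records, only to show the call.
comments = [
    {"proceedings": {"total": 607069}},
    {"proceedings": [{"name": "example-proceeding"}]},
]
onsite = [c for c in comments if is_onsite(c)]
api = [c for c in comments if not is_onsite(c)]
```

Splitting the crawl this way gives the two groups that the rest of the analysis compares.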

Using these proceedings keys to differentiate on-site comments from API comments, I ran the numbers, finding that we have 595,479 on-site comments and 821,866 API comments. This was a little surprising, as I had expected there to be more on-site comments. However, when looking at the data, it’s clear that the API comments have had a huge boost recently:

On-site comments per day

API comments per day

Next, I wanted to look at the most common messages. In order to group the messages, I calculated a “fingerprint” for each, so that small changes in capitalization and punctuation will still result in a match. For on-site comments, the most common messages are:

You’ll notice that these are all pretty similar (they mirror what John Oliver suggested that people should write), but they do not appear to be botted or submitted by forms. There’s a ton of variation on the onsite messages, and there’s a very long tail in the message text. If there’s any significant botting here, they’re doing a good job of disguising it.
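The fingerprinting step mentioned above could look something like the sketch below. This is not necessarily the exact normalization I used; it is one simple way to make capitalization and punctuation differences collapse to the same key. The names fingerprint and most_common_messages are illustrative.

```python
import re
from collections import Counter


def fingerprint(text):
    """Normalize a comment so trivial edits still produce the same key."""
    lowered = text.lower()
    # Drop everything except ASCII letters, digits, and whitespace.
    stripped = re.sub(r"[^a-z0-9\s]", "", lowered)
    # Collapse runs of whitespace into single spaces.
    return " ".join(stripped.split())


def most_common_messages(texts, n=5):
    """Return the n most frequent fingerprints with their counts."""
    return Counter(fingerprint(t) for t in texts).most_common(n)
```

A stricter fingerprint (for example, hashing the normalized text) would save memory on a large crawl, at the cost of not being able to read the key directly.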

However, when it comes to API comments, there’s a much different picture. Looking at the most common text:

Here, we see four anti-Title II messages, totaling over 700k comments, the lion’s share of the API comments.

The “Unprecedented” message is the one that most people have noticed — it looks like this:

The unprecedented regulatory power the Obama Administration imposed on the internet is smothering innovation, damaging the American economy and obstructing job creation. I urge the Federal Communications Commission to end the bureaucratic regulatory overreach of the internet known as Title II and restore the bipartisan light-touch regulatory consensus that enabled the internet to flourish for more than 20 years.

A number of journalists have tried to trace where this message is coming from, and the best guess seems to be the Council for Individual Freedom. This was mentioned in a recent Gizmodo article, where the organization claimed that they were running an ad campaign that directed users to a form, and they provided a screenshot of that form. (Side note: if anyone can find this form, I would love to see it live.)

What indicates that these are bots?

The comment rates are suspicious

Let’s look at some comment rates from this source, zooming in on a specific time period:

Comments from the “unprecedented” bot, grouped into 10 minute buckets.

That looks a lot like a bot! There’s a near-constant rate of comments, punctuated by periods of zero comments, as if the bot was turning on and off. Now, there are plenty of periods of zero comments in the data — the FCC’s system has been going down occasionally, but this was not one of those periods. For an apples-to-apples comparison, let’s take a look at data from “Battle for the Net” over the same time period:

Comments from the “Battle for the Net” form, grouped into 10 minute buckets.

Note that there’s a constant trickle of comments, and the comments never drop to zero. This would lead me to believe that this is a legitimate form, and not bot-driven. Here’s a sample of the “outraged” bot:

Comments from the “outraged” bot, grouped into 10 minute buckets

The bot data isn’t always this “flat”, but it does always have these blank spots every few hours (or less). More recently, the outraged bot has been running a lot more, and has been on almost constantly.
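For anyone who wants to reproduce the bucketed views above, here is a minimal sketch of the grouping. It assumes the submission times have already been parsed into datetime objects; the function names are illustrative. It fills in the empty buckets explicitly, since the gaps are the interesting part.

```python
from collections import Counter
from datetime import datetime, timedelta

BUCKET = timedelta(minutes=10)


def bucket_start(ts):
    """Round a timestamp down to the start of its 10-minute bucket."""
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)


def bucket_counts(timestamps):
    """Return (bucket_start, count) for every bucket from first to last,
    including buckets with zero comments."""
    counts = Counter(bucket_start(ts) for ts in timestamps)
    if not counts:
        return []
    current, last = min(counts), max(counts)
    series = []
    while current <= last:
        series.append((current, counts.get(current, 0)))
        current += BUCKET
    return series


def empty_buckets(timestamps):
    """List the start times of buckets that received no comments."""
    return [start for start, count in bucket_counts(timestamps) if count == 0]
```

Comparing empty_buckets for one message against the same window for all comments helps separate "the bot was off" from "the FCC's system was down".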

Now, “Free Our Internet” and “Taxpayers Protection Alliance” don’t behave this way.

[EDIT: It turns out that these are being submitted via CSV, so this spacing is normal for that submission method. However, I still find the volumes (and the data itself) to be suspect.]

The top is the activity from “Free Our Internet”, the middle is “Taxpayers Protection Alliance”, and the bottom is all other activity over that time period.

The data is extremely consistent.

When people fill out forms, they don’t do it consistently. Out of the almost 450k “outraged” and “unprecedented” comments, exactly 9 do not provide a full address, name and email. The addresses, names, and emails are also formatted very consistently — like they came from a database. Someone has gone out of their way to make these seem like real submissions. When doing a simple spot-check of the data, it becomes clear that the name/email/address data looks very, very real.

There are very few email confirmations

If you choose, the FCC can provide an email confirmation of your comment, and this choice is stored in the data. There are three possible values: “true”, “false”, or the field is just missing. If the field is missing, the comment was probably filed over the API, and the form provider didn’t include the ability for you to get an email confirmation from the FCC. Obviously, if you were building a bot, you would never want to have email confirmations, especially if you were using other people’s information.

[EDIT: Since publishing, I realized that “Free our Internet” and “Taxpayer Alliance” both submit over CSV, which doesn’t give a field for emailConfirmation, so this is not anomalous in those cases]

Here’s a table of what the emailConfirmation values for these sources look like, broken down by message:

| Source           | True  | False | Missing |
|------------------|-------|-------|---------|
| unprecedented    | 68    | 5     | 147528  |
| freeourinternet  | 0     | 0     | 181122  |
| taxpayeralliance | 0     | 0     | 96177   |
| battleforthenet  | 23316 | 33    | 300     |
| outraged         | 1     | 2     | 13372   |

If Battle for the Net is botting, they are playing a risky game, or else they have control of 24k email addresses.
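A tally like the one in the table can be sketched as follows. This assumes each comment is a dict with an optional "emailConfirmation" key, and that each comment has already been labeled with its source (for example, by message fingerprint). The labeling step and the function names are assumptions for this example.

```python
from collections import Counter, defaultdict


def confirmation_value(comment):
    """Map a comment to "true", "false", or "missing"."""
    value = comment.get("emailConfirmation")
    if value is None:
        return "missing"
    # Handles both the strings "true"/"false" and real booleans.
    return str(value).lower()


def tally_confirmations(labeled_comments):
    """labeled_comments: iterable of (source_label, comment_dict) pairs.

    Returns a mapping of source -> Counter of confirmation values.
    """
    table = defaultdict(Counter)
    for source, comment in labeled_comments:
        table[source][confirmation_value(comment)] += 1
    return table
```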

Journalists have yet to find a real person who says they actually submitted a comment

This would be consistent with the idea that the bot programmers are using real people’s data, trying to overwhelm the comments with seemingly real people. In fact, prior to John Oliver’s piece, the “outraged” bot was roughly 37% of the total comments. If we assume that the FCC was going to clean the data first — removing anonymous comments, etc — the outraged bot might have ended up being roughly 50% of the total comments, due to the fact that it has consistent data.

There are a lot of “breached” accounts in the bot comments

I checked a sample of the comments against the API from haveibeenpwned.com, looking only for accounts that were in breaches that included physical addresses. I wasn’t able to check all of the comments due to rate limits, but I took random samples of 1000 emails from various sources (and a control sample), and calculated the percentage of the accounts involved in breaches (mostly River City Media and Modern Business Solutions):

“Unprecedented”: 67.4%

“Free Our Internet”: 74.2%

“Outraged”: 64.2%

“Battle for the Net”: 33.5%

“John Oliver Viewers”: 20.5%

Control Sample: 31.5%

Now, of course, the bot data is a lot cleaner, so that would skew the percentages, as a fake email would come up “clean”. But these numbers seem pretty stark, and would indicate to me that the bot programmers are working with breach data directly, or with a data warehouse whose lists ended up in one of these breaches.
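The sampling arithmetic behind those percentages is simple, and can be sketched without committing to any particular API client. In the sketch below, in_address_breach is a hypothetical callable that you would back with a real lookup (for example, a rate-limited call to haveibeenpwned.com that checks whether any returned breach included physical addresses). Nothing here calls that service directly.

```python
import random


def breach_rate(emails, in_address_breach, sample_size=1000, seed=0):
    """Estimate the percentage of emails found in an address-bearing breach.

    emails: iterable of email strings from one source.
    in_address_breach: hypothetical callable, email -> bool.
    """
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    emails = list(emails)
    sample = rng.sample(emails, min(sample_size, len(emails)))
    if not sample:
        return 0.0
    hits = sum(1 for email in sample if in_address_breach(email))
    return 100.0 * hits / len(sample)
```

Keeping the lookup as a parameter also makes it easy to cache results, which matters when the upstream service is rate limited.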

These forms don’t seem to be getting much real traction online

People aren’t tweeting about these forms, and they don’t seem to be getting much traction on Facebook. In some cases, I have no idea where the form actually is! It seems insane that they would be getting this many real people to fill them out, and yet there’s no real social traction.

So, what does this all mean?

It seems quite clear to me that there are groups using bots to manipulate the outcome of this public comment period. But what does that leave? When we subtract the bots, what is the public sentiment about this proposal? Doing some rudimentary string matching, the numbers I got are:

Pro Title II: 395,353

Anti Title II: 743

Now, these numbers are certainly not final, and they may not even be accurate. It’s hard to tune my algorithm as I can find very, very few non-bot comments that are in favor of the proposal. I would encourage you to look at the comments that I wasn’t able to easily categorize:

http://netneutrality.computer/browse?titleii=unknown

I think you’ll quickly see that there is a huge amount of real resistance to the proposal, and support for Title II regulations for ISPs.
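For a sense of what "rudimentary string matching" means here, the sketch below shows the general shape. The phrase lists are placeholders invented for this example, not the ones I actually used, and anything that matches both lists or neither is left as "unknown" rather than guessed at.

```python
# Placeholder phrase lists: assumptions for illustration only.
PRO_PHRASES = (
    "keep title ii",
    "support title ii",
    "preserve net neutrality",
    "support net neutrality",
)
ANTI_PHRASES = (
    "repeal title ii",
    "end title ii",
    "roll back title ii",
    "oppose title ii",
)


def classify(text):
    """Return "pro", "anti", or "unknown" for a comment's text."""
    lowered = " ".join(text.lower().split())
    pro = any(phrase in lowered for phrase in PRO_PHRASES)
    anti = any(phrase in lowered for phrase in ANTI_PHRASES)
    if pro and not anti:
        return "pro"
    if anti and not pro:
        return "anti"
    return "unknown"
```

Simple substring matching like this misreads negation ("do not keep Title II"), which is one reason the counts above should be treated as approximate.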

[UPDATE: For another look at this sentiment data, check out this excellent post: http://jeffreyfossett.com/2017/05/13/fcc-filings.html]

I’ll leave you with some thoughts from real people behind the comments, both pro and con: