As shown in the box plot above, redditors rate the news organizations fairly similarly; no outlet stands out dramatically from the rest.

By content

Florida’s “sunshine laws” allow public access to government materials, including arrest records. These laws’ effect on the “Florida Man” phenomenon is clear when looking at the verb distribution below. Verbs like “arrest”, “steal”, and “shoot” are very common.

Many of the top verbs, highlighted in red, pertain to events likely to appear in police records, which Florida’s “sunshine laws” make publicly accessible.

WEB APP

Here is a link to my headline explorer. The platform allows you to easily sift through the database. You can sort alphabetically or by frequency.

METHODOLOGY

The Reddit API

I pulled the data through Reddit’s API using PRAW (the Python Reddit API Wrapper), a Python library that makes it relatively straightforward to collect URLs. The following few lines of code were all that was necessary:
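For readers who don't have the notebook open, here is a minimal sketch of what such a pull might look like. It is not the exact notebook code: the credential placeholders, the subreddit name "FloridaMan", and the helper names `make_reddit_client`, `collect_posts`, and `fetch_top_posts` are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def make_reddit_client(client_id: str, client_secret: str, user_agent: str):
    """Build a read-only PRAW client from credentials of a registered Reddit app."""
    import praw  # imported lazily so the helpers below work without PRAW installed

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


def collect_posts(submissions: Iterable) -> List[Dict]:
    """Keep only the fields needed for the headline analysis."""
    return [
        {"title": s.title, "url": s.url, "score": s.score}
        for s in submissions
    ]


def fetch_top_posts(reddit, subreddit_name: str = "FloridaMan", limit: int = 1000) -> List[Dict]:
    """Fetch the all-time top submissions of a subreddit (name is an assumption)."""
    top = reddit.subreddit(subreddit_name).top(time_filter="all", limit=limit)
    return collect_posts(top)


# Hypothetical usage (requires real credentials and network access):
# reddit = make_reddit_client("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "florida-man-demo")
# posts = fetch_top_posts(reddit)
```

Separating `collect_posts` from the network call keeps the field selection easy to check against stand-in objects before touching the live API.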

The full code can be found in this notebook. Each page object provided by the API contains 100 posts, and unfortunately, Reddit only provides the top 1,000 submissions, so the database cannot be expanded to cover the entire subreddit. (Advice on how to get more data is welcome!)

Why Reddit?

When compared to Twitter, Reddit’s data is easier to parse. The main “Florida Man” handle on Twitter includes images and commentary, whereas Reddit posts are often just a URL with a headline. Furthermore, the “Florida Man” subreddit has 640k followers, and the Twitter account has 412k followers, so they provide similar indicators for an article’s popularity.

Headline parsing with NLTK

I parsed the headlines using NLTK (the Natural Language Toolkit), a Python library that can tokenize a sentence and tag each word’s part of speech. To standardize the verbs for the analysis and web app, I also used NLTK’s WordNetLemmatizer to reduce every verb to its base form (e.g. “give”, “gave”, and “giving” would all be reduced to “give”).

The function below helped me properly sort through the verbs in headlines, though some manual entry and cleaning was required.
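As a rough illustration of that approach (a sketch under assumptions, not the exact function from the notebook), verbs can be picked out by their Penn Treebank tags, which all start with "VB", and then lemmatized as verbs. The helper names `extract_verbs` and `headline_verbs` are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def extract_verbs(
    tagged_tokens: Iterable[Tuple[str, str]],
    lemmatize: Callable[[str, str], str],
) -> List[str]:
    """Return lowercased, lemmatized verbs from (token, POS tag) pairs."""
    return [
        lemmatize(token.lower(), "v")  # "v" asks the lemmatizer to treat the word as a verb
        for token, tag in tagged_tokens
        if tag.startswith("VB")  # VB, VBD, VBG, VBN, VBP, VBZ
    ]


def headline_verbs(headline: str) -> List[str]:
    """Tokenize, POS-tag, and lemmatize the verbs of one headline with NLTK."""
    # Imported lazily; NLTK also needs its tokenizer, tagger, and WordNet data
    # fetched once via nltk.download(...) (resource names vary by NLTK version).
    import nltk
    from nltk.stem import WordNetLemmatizer

    tagged = nltk.pos_tag(nltk.word_tokenize(headline))
    return extract_verbs(tagged, WordNetLemmatizer().lemmatize)


# Hypothetical usage (requires NLTK and its data files):
# headline_verbs("Florida man arrested after stealing alligator")
```

Headlines are terse and often capitalized like titles, so the tagger can mislabel words, which is one reason manual cleanup remains necessary.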

Cleaning Data

Reddit-specific posts were dropped, and posts that did not suit the web app’s format (such as cartoons or mugshots) were removed. Basic cleaning and verb parsing are documented in this notebook. While many headlines were double-checked by hand, some repeats and errors certainly remain.
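One way this kind of filtering could be sketched is shown here. The rules are assumptions for illustration rather than the notebook's actual logic: the Reddit host list, the image extensions, and the exact-title duplicate check are all hypothetical choices, as are the names `is_news_link` and `clean_posts`.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Assumed markers of posts that are not links to news articles.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")
REDDIT_HOSTS = {"reddit.com", "www.reddit.com", "i.redd.it", "v.redd.it"}


def is_news_link(url: str) -> bool:
    """Return False for Reddit-hosted content and direct image links."""
    parsed = urlparse(url)
    if parsed.netloc.lower() in REDDIT_HOSTS:
        return False
    return not parsed.path.lower().endswith(IMAGE_EXTENSIONS)


def clean_posts(posts: List[Dict]) -> List[Dict]:
    """Drop non-news links and posts whose title repeats an earlier one."""
    seen_titles = set()
    cleaned = []
    for post in posts:
        key = post["title"].strip().lower()
        if not is_news_link(post["url"]) or key in seen_titles:
            continue
        seen_titles.add(key)
        cleaned.append(post)
    return cleaned
```

An exact-title check only catches verbatim repeats; the same story reported under different headlines would still slip through and need review by hand.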

Data and notebooks

If you are interested in seeing the scripts or the data, they are available at this link, and the full cleaned dataset is available here.