Trucks, beer, and love: all things that make country music go round. I’ve said before that country music is just pop music with a slide, plus lyrics about slightly different topics than what you’ll hear in hip hop or “normal” pop music on the radio.

In my continuing quest to validate my theory that all country songs can fit into one of four different topics, this post goes through lyrics to see which artists talk about trucks, beer, and love the most. In my first post on this topic, I talked about how to get song lyrics from Genius and print them out on the command line.

The goal here, and what I’m going to walk you through, is how I stored info and lyrics for all the songs by the country artists, how I made sure that all the lyrics were unique, and then how I ran some stats on the songs. One more note before we go: a lot of data work is just janitorial. The actual code for getting “interesting” results is fairly simple. The key is to enjoy the janitor-style coding, and then you’ll be good.

If you’re interested in which country music people talk most about trucks, beer, alcohol, or small towns, skip to the end where I list out some stats. For the rest, here’s some code.

Step 1 — Save the Lyrics!

When doing anything with web scraping, the one thing to always, always keep in mind is that you want to hit the server as little as possible. With that in mind, what we’re going to do here is assume the inputs are names of artists. For each of those artists, we find all of their songs, and then for each of those songs, grab the lyrics the way I did in the first post and save them locally along with some meta information the API provides.

Now when I post the following code, don’t imagine that I knew what I wanted. Everything in here was created iteratively. Here’s a list of the features of this piece of code that came out of that process.

Directory structure — Within the folder that contains the main .py file, there’s a folder named artists. And within that folder, when the code runs, a folder with the artist’s name is created (if not already). And within that folder, there are two more folders, info and lyrics. When we run the code, I put the lyrics in /artists/artist_name/lyrics/Song Title.txt and the info from the API, containing information about the song, like annotations, title, and song API id so we can grab it again if need be, in the file /artists/artist_name/info/Song Title.txt. The key, again, being saving all the info given to avoid unnecessary requests.

Redundancy Checking — Along with making sure to save all the info given, if we run an artist for the second time, we don’t want to get lyrics that we already have. So once we have all the songs for that artist, I run a check to see if we have a file with the name of the song already, and that the file isn’t empty. If the file is there, we continue to the next song.

Lyric Error Checking — Ahh, Unicode. While it’s great for allowing a multitude of characters beyond the standard English alphabet and a few specialty characters, it’s not ideal when I’m trying to deal with simple song lyrics. When saving the lyrics, I encountered more than a few random, unnecessary characters that made Python throw encoding errors. In a semi-janky rule-based solution (which isn’t great to use, see below), whenever I saw one of these errors thrown, I would specifically replace the offending character with the correct “normal” character. I assume there’s some library out there that would take care of all the encoding issues, but this worked for me. Also, on Genius’s end, it would be sweet if they, you know, checked for abnormal characters when lyrics were uploaded and didn’t have them in the first place. It would also be cool if they included the lyrics in the API.

    def clean_lyrics(lyrics):
        lyrics = lyrics.replace(u"\u2019", "'")  # right single quotation mark
        lyrics = lyrics.replace(u"\u2018", "'")  # left single quotation mark
        lyrics = lyrics.replace(u"\u02bc", "'")  # modifier letter apostrophe
        lyrics = lyrics.replace(u"\xe9", "e")  # e with an acute accent
        lyrics = lyrics.replace(u"\xe8", "e")  # e with a grave (backwards) accent
        lyrics = lyrics.replace(u"\xe0", "a")  # a with a grave accent
        lyrics = lyrics.replace(u"\u2026", "...")  # ellipsis apparently
        lyrics = lyrics.replace(u"\u2012", "-")  # figure dash
        lyrics = lyrics.replace(u"\u2013", "-")  # en dash
        lyrics = lyrics.replace(u"\u2014", "-")  # em dash
        lyrics = lyrics.replace(u"\u201c", '"')  # left double quote
        lyrics = lyrics.replace(u"\u201d", '"')  # right double quote
        lyrics = lyrics.replace(u"\u200b", ' ')  # zero width space
        lyrics = lyrics.replace(u"\x92", "'")  # different quote
        lyrics = lyrics.replace(u"\x91", "'")  # still different quote
        lyrics = lyrics.replace(u"\xf1", "n")  # n with tilde!
        lyrics = lyrics.replace(u"\xed", "i")  # i with accent
        lyrics = lyrics.replace(u"\xe1", "a")  # a with an acute accent
        lyrics = lyrics.replace(u"\xea", "e")  # e with circumflex
        lyrics = lyrics.replace(u"\xf3", "o")  # o with accent
        lyrics = lyrics.replace(u"\xb4", "")  # just an accent, so remove
        lyrics = lyrics.replace(u"\xeb", "e")  # e with dots on top
        lyrics = lyrics.replace(u"\xe4", "a")  # a with dots on top
        lyrics = lyrics.replace(u"\xe7", "c")  # c with squiggly bottom
        return lyrics
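For what it’s worth, Python’s standard library can cover most of these replacements without a hand-written rule per character. The following is only a rough sketch, written for Python 3 (unlike the Python 2 snippets in this post). The function name `to_ascii` and the `PUNCTUATION` table are made up for illustration, and the table is an assumption about which punctuation matters, not an exhaustive list. It will not reproduce `clean_lyrics` exactly for every character.

```python
import unicodedata

# Punctuation that NFKD normalization does not turn into ASCII on its own.
# This table is an illustrative assumption, not an exhaustive list.
PUNCTUATION = {
    "\u2018": "'", "\u2019": "'", "\u02bc": "'",   # single quotes / apostrophes
    "\u201c": '"', "\u201d": '"',                  # double quotes
    "\u2012": "-", "\u2013": "-", "\u2014": "-",   # figure, en, and em dashes
    "\u200b": " ",                                 # zero width space
}
PUNCTUATION_TABLE = str.maketrans(PUNCTUATION)


def to_ascii(lyrics):
    """Best-effort conversion of lyrics to plain ASCII text."""
    lyrics = lyrics.translate(PUNCTUATION_TABLE)
    # NFKD splits an accented letter into the base letter plus a combining
    # mark, and expands compatibility characters such as the ellipsis.
    decomposed = unicodedata.normalize("NFKD", lyrics)
    # Encoding with errors="ignore" drops the combining marks and anything
    # else that still has no ASCII equivalent.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

The trade-off is that anything without an ASCII equivalent is silently dropped, so it is worth spot-checking a few songs before trusting it.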

Check out most of the main function below. If you’re looking for the actual full file, check out this gist. It’s easier to post that on GitHub than to format the entire thing here.

    import json
    import os

    import requests

    # base_url, headers, artist_names, artist_id_from_song_api_path and
    # lyrics_from_song_web_path are not shown here; see the full gist.


    def song_ids_already_scraped(artist_folder_path, force=False):
        # check for ids already scraped so we don't redo
        if force:
            return []
        song_ids = []
        for file_name in os.listdir(artist_folder_path):
            # split on the last dot so titles that contain dots still work
            name, extension = os.path.splitext(file_name)
            if extension != '.txt':
                continue
            song_id = name.split("_")[-1]
            # sometimes the file is empty, we don't want to include it if that's the case
            if os.path.getsize(os.path.join(artist_folder_path, file_name)) != 0:
                song_ids.append(song_id)
        return song_ids


    def info_from_song_api_path(song_api_path):
        song_url = base_url + song_api_path
        response = requests.get(song_url, headers=headers)
        return response.json()


    def songs_from_artist_api_path(artist_api_path):
        api_paths = []
        artist_url = base_url + artist_api_path + "/songs"
        data = {"per_page": 50}
        while True:
            response = requests.get(artist_url, data=data, headers=headers)
            songs = response.json()["response"]["songs"]
            for song in songs:
                api_paths.append(song["api_path"])
            if len(songs) < 50:
                break  # no more songs for artist
            # the first request (no page given) returns page 1,
            # so continue from page 2 instead of asking for page 1 again
            data["page"] = data.get("page", 1) + 1
        return list(set(api_paths))


    if __name__ == "__main__":
        for artist_name in artist_names:
            # setting up path to artist's directories
            artist_folder_path = "artists/%s" % artist_name.replace(' ', '_').lower()
            artist_lyrics_path = "%s/lyrics" % artist_folder_path
            artist_info_path = "%s/info" % artist_folder_path
            for path in (artist_folder_path, artist_lyrics_path, artist_info_path):
                if not os.path.exists(path):
                    os.makedirs(path)

            # only using lyrics since they're saved second
            prev_song_ids = song_ids_already_scraped(artist_lyrics_path)

            # find the artist!
            search_url = base_url + "/search"
            data = {'q': artist_name}
            response = requests.get(search_url, data=data, headers=headers)
            artist_info = response.json()
            artist_api_path = None
            for hit in artist_info["response"]["hits"]:
                song_api_path = hit["result"]["api_path"]
                artist_api_path = artist_id_from_song_api_path(song_api_path, artist_name)
                if artist_api_path:
                    break  # done searching if we found the guy
            if not artist_api_path:
                print "Could not find %s" % artist_name
                continue  # nothing to download for this artist

            # find the song api ids for the artist
            song_api_paths = songs_from_artist_api_path(artist_api_path)
            # print out how many songs we have left
            print len(song_api_paths) - len(prev_song_ids)

            for song_api_path in song_api_paths:
                api_id = song_api_path.split('/')[2]
                if api_id in prev_song_ids:
                    continue  # don't redo
                full_song_info = info_from_song_api_path(song_api_path)
                song_title = full_song_info["response"]["song"]["title"]
                song_title_path = song_title.replace('/', '_')
                song_web_path = full_song_info["response"]["song"]["path"]
                lyrics = lyrics_from_song_web_path(song_web_path)
                lyric_path = "%s/lyrics/%s_%s.txt" % (artist_folder_path, song_title_path, api_id)
                info_path = "%s/info/%s_%s.txt" % (artist_folder_path, song_title_path, api_id)
                # for record keeping purposes
                print lyric_path
                # info is saved first, lyrics second
                with open(info_path, "w") as ifile:
                    ifile.write(json.dumps(full_song_info))
                with open(lyric_path, "w") as lfile:
                    try:
                        lfile.write(lyrics)
                    except UnicodeEncodeError as error:
                        print error
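The directory layout and the “don’t download it twice” check described above can also be pulled out into two small helpers. This is a minimal sketch in Python 3, not the code from the gist. The names `ensure_artist_dirs` and `needs_download` are hypothetical, and the file naming (`Title_id.txt`) simply mirrors what the main function does.

```python
import os


def ensure_artist_dirs(artist_name, root="artists"):
    """Create root/<artist>/lyrics and root/<artist>/info if missing."""
    artist_dir = os.path.join(root, artist_name.replace(" ", "_").lower())
    lyrics_dir = os.path.join(artist_dir, "lyrics")
    info_dir = os.path.join(artist_dir, "info")
    for path in (lyrics_dir, info_dir):
        os.makedirs(path, exist_ok=True)  # also creates the parent folders
    return lyrics_dir, info_dir


def needs_download(lyrics_dir, song_title, api_id):
    """True when the lyric file is missing or empty."""
    file_name = "%s_%s.txt" % (song_title.replace("/", "_"), api_id)
    path = os.path.join(lyrics_dir, file_name)
    return not os.path.exists(path) or os.path.getsize(path) == 0
```

Checking for an empty file as well as a missing one matters because a failed write can leave a zero-byte file behind, and that song should be fetched again.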

Run the main script with a giant list of country music artists and, after a while, you’ll have a giant directory full of lyrics to play with.

Step 2 — Creating a Copy of the Lyrics

First thing, I want to copy each artist’s lyrics directory, which I named “lyrics”, to another one that I’ll call “lyrics_orig” because I couldn’t think of a better name at the moment. The reason is that I want to keep a record of all the lyrics I downloaded in the first place. That’s valuable information if, for example, I ever want to go back and look at the full range of songs that I gathered the first time I ran the script. Just like with saving the info from the API, I don’t want to remove information if I don’t have to. Below is the code for looping through the artists and copying the files over to the new dir.

    import os
    import shutil

    artist_path = "artists"
    lyric_path = "artists/%s/lyrics"
    lyric_orig_path = "artists/%s/lyrics_orig"
    song_path = "artists/%s/lyrics/%s"
    lyric_song_orig_path = "artists/%s/lyrics_orig/%s"

    artists = os.listdir(artist_path)
    for artist in artists:
        artist_lyrics_path = lyric_path % artist
        artist_lyric_orig_path = lyric_orig_path % artist
        if not os.path.exists(artist_lyric_orig_path):
            # create secondary folder
            os.makedirs(artist_lyric_orig_path)
        # copy every lyric file (with its metadata) into the backup folder
        for f in os.listdir(artist_lyrics_path):
            orig_song_path = song_path % (artist, f)
            dup_song_path = lyric_song_orig_path % (artist, f)
            shutil.copy2(orig_song_path, dup_song_path)
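The same backup can be done with one call per artist. This is a sketch in Python 3.8 or newer (needed for `dirs_exist_ok`), and the function name `backup_lyrics` is made up for illustration. The folder names are the ones used in this post.

```python
import os
import shutil


def backup_lyrics(root="artists"):
    """Copy every artist's lyrics folder to a sibling lyrics_orig folder."""
    for artist in sorted(os.listdir(root)):
        src = os.path.join(root, artist, "lyrics")
        dst = os.path.join(root, artist, "lyrics_orig")
        if os.path.isdir(src):
            # copytree uses shutil.copy2 by default, so file metadata is kept;
            # dirs_exist_ok lets the call succeed when lyrics_orig already exists.
            shutil.copytree(src, dst, dirs_exist_ok=True)
```

One thing to keep in mind: running it again after deleting duplicates will not remove anything from `lyrics_orig`, which is the behavior you want from a backup.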

Cool, now I feel comfortable destroying some of the files in the lyrics folders since I know I have a backup.

Step 3 — Removing Duplicates

This is the meat of what I’m trying to do here, so listen up. In order to get accurate information on who sings about trucks, we need to make sure that we don’t have any duplicate song lyrics, so that lyrics don’t get double counted.

I’m pretty happy with the solution I came up with, but I also want to point out here that I didn’t come up with that in my first attempt. This is real world data and finding an angle of attack doesn’t just come first try. So I’m going to outline my failures first, before showing the code and what I came up with that actually worked.

Attempt 1 — Title Rules

Most of the duplicate songs I see are the same song, just recorded on a different album or listed under a slightly different title: a song released as a single (Raise Your Bottle and Raise Your Bottle (Single)), a live version of a song (Who Knows and Who knows – live from bonnaroo), a title that also credits the featured artists (Texas Boys and Texas Boys with Pat Green & Josh Abbott), or just a different spelling of the name (Chattahoochee and Chattahoochie). Man, some of those song names are pretty brutal.

Anyway, my first thought for removing those songs was to look for keywords like “(Single)” or “Live” in the titles. With those, I should be able to pick off the duplicates.

In general, rule-based learning is tough to get right because you need deep knowledge of your data, and often you’ll get quickly overwhelmed by the number of rules needed as well as the number of one-off cases that present themselves. It’s best to avoid this. Remember above when I used rules to remove the Unicode oddities? Yeah, that was no fun, but the difference there is that the number of Unicode characters is limited and each one has a correct replacement.

Attempt 2 — Beginning of Song Titles

If you look above, many of the duplicate songs have the same title as the “original” song, but then have an extra phrase tacked on to the end. Phrases like (Single), or with Pat Green & Josh Abbott. So I figured I might be able to grab the duplicates by comparing the title of a song to the other titles, and if any of the other titles started with the full title of the comparer, we’d have a duplicate. It worked alright, but I saw I was missing too many songs. The Chattahoochee/Chattahoochie pair wouldn’t be caught because of the different spelling, and even Who Knows wouldn’t be caught because the ‘K’ in ‘Knows’ is lowercase for some reason on Genius for the live version of the song. I could have just lowercased all the letters in the titles to catch that, but it seemed inelegant, like forcing a method that wasn’t ideal.
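To make the failure concrete, here is a small sketch of the prefix idea in Python 3, run on the example titles from this post. The function name `prefix_duplicates` is hypothetical, and this is a reconstruction of the approach described above rather than the code that was actually used.

```python
def prefix_duplicates(titles):
    """Return (original, duplicate) pairs where one title starts with another."""
    pairs = []
    for title in titles:
        for other in titles:
            if other != title and other.startswith(title):
                pairs.append((title, other))
    return pairs


titles = [
    "Raise Your Bottle",
    "Raise Your Bottle (Single)",
    "Texas Boys",
    "Texas Boys with Pat Green & Josh Abbott",
    "Who Knows",
    "Who knows - live from bonnaroo",
    "Chattahoochee",
    "Chattahoochie",
]

# Only the "(Single)" and featured-artist pairs are found.
print(prefix_duplicates(titles))
```

The live version slips through because of the lowercase “k”, and the misspelled title slips through because neither spelling is a prefix of the other.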

There are just too many different reasons beyond what I listed above for duplicate titles, so there isn’t a simple way of going through and writing rules for removing those songs. Also, no way I’m going to go through all of that by hand.

Attempt 3 — Lyric Matching

I didn’t want to initially, but after failing at everything having to do with titles, I finally succumbed to the call of the lyrics and used those to remove the duplicate songs.

Here’s what I did. For each song, I read in the lyrics, remove new lines from the file, make all the letters lowercase, split it into the different words, and then put those into a set. Those are the different tokens in the song. Once that’s done, I loop through all the different song word sets and use difflib’s SequenceMatcher to compare the similarity of the words in the different songs. The SequenceMatcher gives me a ratio for each of the comparisons. If the comparison ratio is greater than 0.5, I consider that a match, use some logic to pick which of the two titles is up for deletion (based on the length of the title), and save the path of that song for later deletion!

Quick note on the 0.5 cutoff. Because of the nature of scraped data from the internet, I can’t just assume that the sets of words in the lyrics would be identical for duplicate songs. So once I had the measurement I wanted, I played around with the cutoff and looked at which pairs were returned, and 0.5 seems like a good one. The ratios I observed were pretty much all either above 0.7 or under 0.3, with a nice chasm between the two. If the ratios had instead been spread continuously across the range, finding a cutoff would be difficult because we want perfect removal of the duplicate songs, and in that case I’d need to find a new way to measure the similarity of the lyrics.

Here’s the code.

    import os
    from difflib import SequenceMatcher as sm

    artist_path = "artists"
    lyric_path = "artists/%s/lyrics"
    song_path = "artists/%s/lyrics/%s"

    artists = os.listdir(artist_path)
    for artist in artists:
        artist_lyrics_path = lyric_path % artist
        paths_for_removal = []
        songs = []
        for title in os.listdir(artist_lyrics_path):
            artist_lyrics_song_path = song_path % (artist, title)
            words = open(artist_lyrics_song_path).read()
            # unique lowercase words in the song
            words = list(set(words.strip().replace('\n', ' ').lower().split()))
            songs.append((title, words))

        for index, (title, word_list) in enumerate(songs):
            # compare against every later song, including the last one
            for compare_title, compare_word_list in songs[index+1:]:
                sm_instance = sm(None, word_list, compare_word_list)
                ratio = float(sm_instance.ratio())
                # note: a ratio of exactly 1.0 is not treated as a match here
                if ratio > 0.5 and ratio < 1.0:
                    print "%s : %s : %s" % (str(ratio), title, compare_title)
                    # keep the shorter title; break ties alphabetically
                    if len(title) == len(compare_title):
                        title_to_remove = compare_title if title < compare_title else title
                    elif len(title) < len(compare_title):
                        title_to_remove = compare_title
                    else:
                        title_to_remove = title
                    path_for_removal = song_path % (artist, title_to_remove)
                    paths_for_removal.append(path_for_removal)

        print set(paths_for_removal)
        for path in list(set(paths_for_removal)):
            os.remove(path)
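To get a feel for how the ratio behaves, here is a tiny standalone sketch in Python 3. The helper names (`tokens`, `lyric_similarity`, `is_duplicate`) are made up for illustration. One deliberate difference from the script above: the unique words are sorted before comparison, which is an assumption added here so that the result does not depend on the order a set happens to iterate in.

```python
from difflib import SequenceMatcher


def tokens(lyrics):
    """Lowercase the lyrics and return their unique words in sorted order."""
    return sorted(set(lyrics.lower().split()))


def lyric_similarity(lyrics_a, lyrics_b):
    """SequenceMatcher ratio between the two songs' unique-word lists."""
    return SequenceMatcher(None, tokens(lyrics_a), tokens(lyrics_b)).ratio()


def is_duplicate(lyrics_a, lyrics_b, cutoff=0.5):
    """True when the similarity ratio is above the cutoff."""
    return lyric_similarity(lyrics_a, lyrics_b) > cutoff


studio = "I drove my truck down to the river"
live = "I drove my truck down to the river (live)"
other = "cold beer on a friday night"

print(is_duplicate(studio, live))    # near-identical word lists
print(is_duplicate(studio, other))   # no words in common
```

The ratio is twice the number of matched tokens divided by the total number of tokens in both lists, so two songs with no words in common score 0.0 and two songs with identical word lists score 1.0.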

I didn’t realize it at the time, but there was one song that was a duplicate that didn’t even have remotely the same title. Barbed Wire Halo and Philippians 3:12-14 by Aaron Watson were the same song, but I would have had no idea if I had just used the title for removal of duplicates. Pretty happy that song was found. And with that, I now have a directory of lyrics that I’m confident have only one of each of the songs.

Step 4 — Trks

Now for the main event of this post, which country artist talks about trucks the most! Well I guess the main event was dealing with the duplicate songs, but now for the payoff here.

Here’s the code for finding the average number of truck mentions per song across a singer’s song arsenal.

    import os

    artist_path = "artists"
    lyric_path = "artists/%s/lyrics"
    song_path = "artists/%s/lyrics/%s"
    keyword = "truck"

    artists = os.listdir(artist_path)
    artist_counts = []
    for artist in artists:
        counts = {keyword: 0}
        artist_lyrics_path = lyric_path % artist
        song_titles = os.listdir(artist_lyrics_path)
        num_songs = len(song_titles)
        for song in song_titles:
            artist_lyrics_song_path = song_path % (artist, song)
            # every whitespace-separated word in the song, lowercased
            words = [word.lower() for line in open(artist_lyrics_song_path, 'r') for word in line.split()]
            for key in counts.keys():
                for word in words:
                    # exact match only
                    if word.lower() == key:
                        counts[key] += 1
        # average mentions per song for this artist
        artist_counts.append((artist, counts[keyword]/float(num_songs)))

    # highest average first
    for artist, val in sorted(artist_counts, key=lambda x: x[1], reverse=True):
        full_artist_name = ' '.join(artist.split('_')).title()
        print "%s: %s" % (full_artist_name, val)

Change the keyword from ‘truck’ to anything you’re trying to look at, and this snippet will spit out the average number of references to that keyword the artist has in their song library!
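One caveat with the exact-match comparison: a word with punctuation attached (“truck,”) or a plural (“trucks”) is not counted. Below is a hedged sketch of a more forgiving counter in Python 3. The function names `count_mentions` and `mentions_per_song` are hypothetical, and treating a trailing “s” as the plural is a simplifying assumption. The numbers in this post come from the exact-match snippet above, so this variant would give different counts.

```python
import re


def count_mentions(lyrics, keyword, include_plural=True):
    """Count keyword mentions, ignoring case and surrounding punctuation."""
    # Pull out runs of letters and apostrophes so "Truck!" becomes "truck".
    words = re.findall(r"[a-z']+", lyrics.lower())
    targets = {keyword, keyword + "s"} if include_plural else {keyword}
    return sum(1 for word in words if word in targets)


def mentions_per_song(song_lyrics, keyword):
    """Average number of keyword mentions across a list of lyric strings."""
    if not song_lyrics:
        return 0.0
    total = sum(count_mentions(lyrics, keyword) for lyrics in song_lyrics)
    return total / len(song_lyrics)
```

For example, `count_mentions("My truck, your Trucks. Truck!", "truck")` counts all three mentions, where whitespace splitting with an exact match would count none of them.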

Without waiting any longer, here’s the list of trucks per song for the artists I have in my file:

Sam Hunt: 0.619047619048

Cole Swindell: 0.470588235294

Thomas Rhett: 0.46875

Lee Brice: 0.45652173913

Brantley Gilbert: 0.36

Jason Aldean: 0.266666666667

Luke Bryan: 0.243697478992

Justin Moore: 0.2

Florida Georgia Line: 0.171875

Jake Owen: 0.159420289855

Aaron Watson: 0.153153153153

Jon Pardi: 0.137931034483

Kip Moore: 0.135135135135

Keith Urban: 0.133333333333

Billy Currington: 0.130434782609

Randy Houser: 0.130434782609

Chris Young: 0.123076923077

Toby Keith: 0.115241635688

Tim Mcgraw: 0.113636363636

Dierks Bentley: 0.110169491525

Eric Church: 0.106666666667

Thompson Square: 0.103448275862

Eli Young Band: 0.0933333333333

Joe Nichols: 0.0877192982456

Garth Brooks: 0.0862068965517

Blake Shelton: 0.0857142857143

Trace Adkins: 0.0855263157895

Kellie Pickler: 0.08

Josh Turner: 0.078125

Alan Jackson: 0.0669144981413

Kenny Chesney: 0.0536585365854

Hunter Hayes: 0.0526315789474

Zac Brown Band: 0.0519480519481

Gary Allan: 0.0353982300885

Little Big Town: 0.0333333333333

Brad Paisley: 0.031746031746

George Strait: 0.0303867403315

Miranda Lambert: 0.0288461538462

Chris Stapleton: 0.0212765957447

Randy Travis: 0.0150943396226

Clint Black: 0.0141843971631

Reba Mcentire: 0.0139664804469

Shania Twain: 0.0119047619048

Brett Eldredge: 0.0

Brett Young: 0.0

Carrie Underwood: 0.0

Darius Rucker: 0.0

Jennifer Nettles: 0.0

Kacey Musgraves: 0.0

Kalie Shorr: 0.0

Lady Antebellum: 0.0

Maren Morris: 0.0

Martina Mcbride: 0.0

The Band Perry: 0.0

Big props to Sam Hunt for winning the award for most trucks per song (TPS)! He’s aided by his song “Speakers”, where there are two truck mentions in the chorus alone, meaning 6 trucks in that song.

It’s also interesting that the women artists don’t seem to sing about trucks all that much. Among them, Kellie Pickler wins the award for most trucks per song at a measly 0.08 TPS. I went through those mentions too, and of the 5 songs in which she mentions trucks, only one is about how she herself likes trucks, while the others are about the men in her life who own trucks.

And because I know you want to know who sings about beer the most: Cole Swindell crushes the competition with a comical 0.94 mentions per song, a full 0.3 mentions more than the second place singer, Kip Moore. On a hunch, I tried ‘love’ as the keyword, and Cole Swindell came in second to last with only 0.559 mentions of love per song (Brett Young had 2.75 loves per song for reference). So I guess the moral of this article is that Cole Swindell loves trucks and beer, but really hates love.

Tune in next time for the final article in this country series, classifying the songs according to their subjects, and my theory of the 4 subjects that a country song can be about!