Well, let’s face one of the most important axioms in this world: everyone loves Breaking Bad.

I must say, I have met a lot of people who do not like it. Well, OK, fair enough, but I strongly believe that anyone with a lil’ bit of serious TV series background cannot deny that Breaking Bad is one of the best series ever made. I’d like to point out that I’m not a Breaking Bad fanboy (Scrubs FTW!), although I find myself agreeing with that last sentence. BB is precisely directed: everything has a specific place in the screenplay and there are no plot holes. The dialogue is never flat and each character has a distinct, detailed role. The series is an intricate web of human nature, criminality and sin.

Why are we here? Well, suddenly, after a rewatch, the computer scientist soul that devours me came out from hell and said erggggh, we niiidd to do someeeefhiing. I had to listen to it, so I started thinking about whether there was something interesting I could get out of BB by analyzing it.

Introduction (or WTF am I doing?)

To analyze BB I need to access and process its content, the show itself; that means processing its video, audio or script. Now, there are some issues with the first two sources of information:

to process video, one of the most common approaches would be deep learning, but I can’t afford to build a neural network and train it on the whole series;

to process audio, I really do not know enough about it :(

Thus, I decided to go the script way. Unfortunately, I didn’t find any scripts (or at least a complete set covering every episode) of the series, so I went for subtitles. Dozens of subtitles are available on the Internet and finding those for BB was an easy job: I downloaded them from here. For each episode, there are various versions of the same subtitle from different releasers. In a preprocessing step, I tried to select only subtitles from the same releaser when possible.

In the following sections, I’ll briefly explain what I did, show you some cute graphs and try to come up with some explanations. This is the first part of a series of posts in which I try to analyze BB (and maybe other TV shows).

Obviously, I don’t claim any scientific significance here; it’s just curiosity killing the cat!

I DON’T CARE, gimme da code!

All of the code I used in each preprocessing step is available on GitHub. The whole point of this post was to improve my expertise with some Python libraries such as matplotlib, scikit, etc. I don’t think I will clean up the code or make it fast or whatever for now, due to a lack of time.

Well, let’s start with this first part!

Distribution of talking time

I remember another series, Lost, in which the characters seemed to talk A LOT of the time! They were used to explain everything happening on the island and, most of the time, everyone (including the viewer) understood NOTHING. So, I was curious to find the talking time of BB.

I started by mining the talking time for each episode from its subtitles. Each row of a subtitle file (a line, or part of a line, spoken by a character) has a starting time point (STP) and an ending time point (ETP). The length of a row is the difference between ETP and STP, so the talking time for an episode is the sum of all row lengths in its subtitle file.
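To make the ETP minus STP idea concrete, here is a minimal sketch, not the code from the repository. It assumes the subtitles are in the common SubRip (.srt) format, where each row has a timing line like "00:01:02,500 --> 00:01:05,000"; the function name talking_time and the regular expression are made up for this illustration.

```python
import re
from datetime import timedelta

# Assumption: SubRip-style timing lines, e.g. "00:01:02,500 --> 00:01:05,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*"
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def talking_time(srt_text):
    """Sum (ETP - STP) over every subtitle row found in srt_text."""
    total = timedelta()
    for match in TIMING.finditer(srt_text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
        stp = timedelta(hours=h1, minutes=m1, seconds=s1, milliseconds=ms1)
        etp = timedelta(hours=h2, minutes=m2, seconds=s2, milliseconds=ms2)
        total += etp - stp  # length of this single row
    return total
```

Calling talking_time(open(path, encoding="utf-8", errors="replace").read()) on one subtitle file would give that episode's talking time as a timedelta. Note that this simple sum counts overlapping rows twice, and that it measures how long text stays on screen, which is only a rough proxy for actual speech.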

For the running time of each episode, I simply scraped the web.