Baseball is a sport rich in data and statistics. As a data geek, I wanted to analyze this data to see if I could make some discoveries about what drives baseball success. But despite this surplus of data, I found it difficult to fully understand, analyze, and gain insight from it.

For me, baseball's data challenges fell into 3 main buckets:

(1) Data is pre-aggregated and summarized. While I was able to get answers to pre-conceived questions, I wasn't able to easily answer my own questions.

(2) Data is siloed and disconnected. It is easy to see a specific set of statistics (e.g. a player, a year, or a team), but difficult to compare multiple players across multiple teams across multiple eras.

(3) Data is tabular, with few visuals. Many sites allow you to sort and pivot, but few provide the ability to visualize the data and spot trends and outliers.

As a result of these challenges, a full understanding of baseball data is often left to a small few.

Enter Sean Lahman's Data Set

I stumbled on Sean Lahman's baseball archive. Sean provides very robust data sets on every imaginable baseball statistic, covering teams, players, hitting, pitching, fielding, awards, parks, playoffs, all-star games, etc.

I decided to load this data into Qlik - to see if I could visualize and make new baseball discoveries off of this data. Here's what I found...

Baseball has a Data Modeling Issue

While most wouldn't view baseball as having a complex data model, it does present many of the real-life data challenges that most organizations face. Sean's data set provides 28 different tables that do not neatly "join" together. There's a table for Teams with Teams, Years, and Players. There's also a Batting table with Teams, Years, and Players. This same structure is then replicated for Pitching and Fielding - and again for the Post Season, All-Star Games, Hall of Famers, yearly Award winners, etc. Each of these tables creates many-to-many relationships with the others (e.g. many teams span many years and many players, many players play for many teams across many years, many all-star games are played across many years with many participating teams and players, and on and on). Furthermore, the Batting, Pitching, Fielding, and Post Season tables all share very similar fields, like Games, Player, Team, Hits, and Walks, because a single player, such as a pitcher in the NL, can pitch, field, and hit.

Anyways, without getting too techie, the data model ended up looking like this. What started as 28 different tables was merged into 15 distinct tables with 3 main fact tables.
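One way to picture merging similarly shaped tables is to stack them into a single fact table and tag each row with where it came from. The sketch below is a minimal Python illustration of that idea, not the actual Qlik load script. The field names, the sample rows, and the `stack_tables` helper are all made up for the example.

```python
# Hypothetical rows; field names are illustrative, not Lahman's exact schema.
batting = [{"player": "p1", "team": "BOS", "year": 2004, "games": 150, "hits": 175}]
pitching = [{"player": "p2", "team": "BOS", "year": 2004, "games": 33, "hits": 190}]
fielding = [{"player": "p1", "team": "BOS", "year": 2004, "games": 148}]


def stack_tables(tables):
    """Stack similarly shaped tables into one fact table.

    `tables` maps a source label to a list of row dicts. Each output row
    gets a "stat_type" field naming its source, and any field a source
    lacks is filled with None so every row has the same columns.
    """
    all_fields = sorted(
        {field for rows in tables.values() for row in rows for field in row}
    )
    fact = []
    for stat_type, rows in tables.items():
        for row in rows:
            merged = {field: row.get(field) for field in all_fields}
            merged["stat_type"] = stat_type
            fact.append(merged)
    return fact


fact_table = stack_tables(
    {"batting": batting, "pitching": pitching, "fielding": fielding}
)
for row in fact_table:
    print(row)
```

With the rows stacked this way, Player, Team, and Year each appear once as shared fields instead of being repeated in three separate tables, which is roughly what going from 28 tables down to a handful of fact tables buys you.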

What's the History of Baseball?

Now to the data. If we profile the data, we can easily see the following.

And if we trend out the average number of games over these 146 years, we now see the following:

One quick discovery: before the 20th century, game statistics were not always properly entered, and the number of games played was not equal from season to season. It wasn't until 1961 that baseball moved to a full 162-game season. You can also see a few dips in games due to strike-shortened seasons and during World War I.

Which Team Wins the Most? Which Lose the Most?

This is not a straightforward question. Winning can mean the most championships, the most postseason appearances, the most regular season wins, or the highest winning percentage (a fairer measure for newer franchises). So here are all of these statistics in one view.

And then we can see the top 5 franchises by era.

While the above graphic isn't a huge surprise, there are a few key outliers. The visual on the top left says that the "A's" have won 9 championships. That can't be right, can it? But if we drill down, we can see that 5 of their championships were won as the "Philadelphia Athletics".

If we just look at the last two eras (Long Ball from 1994-2005 and Post Steroids from 2006 onwards), we can see that during the Long Ball era, the Atlanta Braves won the most games, but only won one championship. The Yankees were second in wins and won 4 championships. But in the Post Steroids era, the Yankees won the most games, yet won only one championship (2009). Also, while not shown in the graphic below, outside of the Florida Marlins and the Chicago Cubs, it is very rare for a team that is not dominant over an era to win a championship.

Similarly, we can pivot and look at the poorest performers. From this graphic, we can see that many of the poorest performers do not stay poor across multiple eras. You can see a few reversals of fortune for some poor-performing franchises, like the Kansas City Royals and Houston Astros.

What Are These Good Franchises Doing That's Better Than the Others?

One key offensive statistic used to gauge good hitters is OPS (On-base Plus Slugging). The idea is that the more often you get on base, the better your on-base percentage (OBP). And the more bases you get per at-bat (e.g. a double is worth more than a single, a triple more than a double, and a home run more than a triple), the better your slugging percentage (SLG). If you add these two statistics together, you get OPS.
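To make the arithmetic concrete, here is a small Python sketch of the OPS calculation using the standard OBP and SLG formulas. The function name, its parameters, and the season line are my own illustration, and I'm assuming the raw counts (at-bats, hits, doubles, triples, home runs, walks, hit-by-pitch, sacrifice flies) are available for each player or team.

```python
def ops(ab, h, doubles, triples, hr, bb, hbp=0, sf=0):
    """Return (OBP, SLG, OPS) from raw counting stats."""
    singles = h - doubles - triples - hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr

    # OBP: times on base divided by at-bats + walks + hit-by-pitch + sac flies.
    obp_denominator = ab + bb + hbp + sf
    obp = (h + bb + hbp) / obp_denominator if obp_denominator else 0.0

    # SLG: total bases per at-bat.
    slg = total_bases / ab if ab else 0.0

    return obp, slg, obp + slg


# Made-up season line, for illustration only.
obp, slg, total = ops(ab=500, h=150, doubles=30, triples=5, hr=25, bb=60, hbp=5, sf=5)
print(f"OBP={obp:.3f} SLG={slg:.3f} OPS={total:.3f}")
```

Note that OBP and SLG use different denominators, so OPS is a convenient sum rather than a true rate.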

In the graphic below, over the past 146 years, we can see the following... (1) the teams that win the most score the most runs, (2) the most runs are scored by teams with a high OPS, and (3) a high OPS leads to more wins.

What about pitching? I thought that pitching wins games. Well, that's true as well. The two pitching statistics that I read about are WHIP and FIP. WHIP is the number of walks plus hits allowed per inning pitched. As with the hitting statistic, we can see that (1) teams that allow fewer runs win more, (2) teams with a low WHIP allow fewer runs, and (3) teams with a low WHIP win more.

So which is it? Pitching or Hitting?
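WHIP is simple enough to sketch in a few lines of Python. I'm assuming here that innings are stored as outs recorded (three outs per inning), which avoids the awkward "6.1 innings" notation. The function name and the sample numbers are illustrative.

```python
def whip(walks, hits, outs_recorded):
    """Walks plus hits per inning pitched, where an inning is three outs."""
    if outs_recorded == 0:
        return None  # undefined for a pitcher who recorded no outs
    innings = outs_recorded / 3
    return (walks + hits) / innings


# Made-up season: 50 walks and 150 hits over 600 outs (200 innings).
value = whip(walks=50, hits=150, outs_recorded=600)
print(f"WHIP={value:.2f}")
```

Lower is better: a WHIP of 1.00 means the pitcher allows one baserunner per inning on average.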

The graphic below shows each team's overall rank for both OPS and WHIP. Light blue indicates that the team made the playoffs, navy indicates that it won the World Series, and gray means that it didn't make the postseason. While this shouldn't be a surprise, the teams that reach the postseason are typically either (1) among the top teams in OPS or WHIP or (2) in the top 15 in both OPS and WHIP. It's impossible for a team that is below average in both OPS and WHIP to win a World Series.
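The ranking behind a view like this can be sketched in Python. This is my own rough illustration, not how the Qlik app computes it: the team names and numbers are fictional, and the cutoffs in `looks_like_contender` (top 5 in one statistic, top 15 in both) are assumptions chosen to mirror the pattern described above.

```python
def rank_teams(teams):
    """Return {team: (ops_rank, whip_rank)}, where rank 1 is best.

    `teams` maps a team name to an (ops, whip) pair. Higher OPS is better;
    lower WHIP is better. Ties are broken by sort order, not shared.
    """
    by_ops = sorted(teams, key=lambda t: teams[t][0], reverse=True)
    by_whip = sorted(teams, key=lambda t: teams[t][1])
    return {t: (by_ops.index(t) + 1, by_whip.index(t) + 1) for t in teams}


def looks_like_contender(ops_rank, whip_rank, elite_cutoff=5, both_cutoff=15):
    """Hypothetical rule: elite in one statistic, or solid in both."""
    elite_in_one = min(ops_rank, whip_rank) <= elite_cutoff
    solid_in_both = max(ops_rank, whip_rank) <= both_cutoff
    return elite_in_one or solid_in_both


# Made-up numbers for four fictional teams.
season = {
    "A": (0.800, 1.40),
    "B": (0.720, 1.15),
    "C": (0.760, 1.25),
    "D": (0.700, 1.45),
}
ranks = rank_teams(season)
for team, (ops_rank, whip_rank) in ranks.items():
    print(team, ops_rank, whip_rank)
```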

If I'm going to try to recruit good hitters with a good OPS and pitchers with a good WHIP, who's out there?

This is where the many-to-many joins come in. Teams have many players and play across many years, and players play for many teams across many years. If we look at the top hitters of all time, we can see that most are in the Hall of Fame, with a few famous exceptions.

If we filter on players from 2010 onwards, we can get a list of the following players:

And then drill down to see their individual yearly statistics.

We can do the same exercise with pitching too. We can divide pitchers into "starters" and "relievers" based on whether they start games or not. If we filter for just the current era, we can get a list of pitchers with the lowest WHIP. RIP Jose Fernandez.
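A starter/reliever split like this can be sketched as a simple rule in Python. The 50% threshold below is my assumption for illustration (a pitcher who starts at least half of his appearances counts as a starter); the actual rule used in the app may differ.

```python
def pitcher_role(games, games_started, starter_share=0.5):
    """Label a pitcher season as "starter" or "reliever".

    `starter_share` is a hypothetical cutoff: the minimum fraction of
    appearances that must be starts to count as a starter.
    """
    if games == 0:
        return None  # no appearances, so no role to assign
    if games_started / games >= starter_share:
        return "starter"
    return "reliever"


# Made-up pitcher seasons: (games, games started).
print(pitcher_role(32, 32))  # every appearance was a start
print(pitcher_role(65, 0))   # never started a game
print(pitcher_role(40, 10))  # mostly relief, with a few spot starts
```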

Are more teams "coaching" to advanced statistics?

Over the past 146 years, the average OPS has been increasing over time.

Does Salary Matter? Can You Buy a winner?

Finally, we can see how salary affects this. In the graphic below, we can see that there's a "loose" correlation between winning and salary. That is, the higher the payroll, the more likely a team is to win. We can also see that there are some very high payroll teams that don't win (bottom right in gray) and a few low payroll teams that win (top left in blue). So while it's not impossible to win on a small payroll, the odds are stacked against you.
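One way to put a number on a "loose" correlation is the Pearson correlation coefficient, which runs from -1 to 1. Here is a minimal Python sketch of it. The payroll and win figures are made up for illustration and are not taken from the Lahman data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Made-up team payrolls (in $ millions) and season wins.
payroll = [50, 80, 120, 200]
wins = [70, 85, 80, 95]
r = pearson(payroll, wins)
print(f"r={r:.2f}")
```

A value near 1 would mean payroll and wins move together almost perfectly, while a value near 0 would mean payroll tells you very little about winning.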

What does all of this mean?

This is another example of how visualizations and analytics can be applied to better understand and to gain insight into your data.

Want to see the app? Click on the link: The Baseball Database App

https://qlikdemos.qlikpoc.com/anon/sense/app/0f0948dc-7912-4813-b947-b4d62fada1bf