Machine learning is an inherently data-driven endeavour. The intent with ML is to have an algorithm that can process some data and, based on that data, make predictions. Even though there is a plethora of ML algorithms that are designed for specific problems and work in different ways, they all have one thing in common: they need data to work.

Even the best-designed ML algorithms will not perform well on a poorly prepared dataset. As I mentioned in one of my earlier blogs, the first step in designing an ML solution for a particular problem is to prepare the dataset itself. Most of the time, you’ll be given a huge dataset that you first need to analyse, after which you can begin preparing it and extracting the relevant features from it.

This analysis step is formally known as EDA (Exploratory Data Analysis).

In this blog I’ll be using a Python library called Pandas as a tool to analyse data. I figured that the best way to learn about Pandas is to use the library to analyse a dataset and answer some questions related to that dataset.

What is Pandas?

So what do I mean when I say Pandas? Do I mean the cute, adorable animals that are notoriously hard to breed in captivity? I wish. Pandas is actually a Python library for data analysis and manipulation. It is widely used across the industry because it not only offers robust tools for analysis but is also easy to learn.

With Pandas you can load data from different types of sources (such as CSV files, JSON files and Python dictionaries) into an object called a DataFrame. You can think of this object as a self-contained, queryable data structure. What that means is that it is very easy to filter through and select data from the DataFrame.
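To make the idea of loading and filtering concrete, here is a minimal sketch. The rows are made-up toy data, and the column names are assumptions chosen for illustration rather than a description of the real file.

```python
import io

import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
records = {
    "UniqueCarrier": ["AA", "WN", "AA", "DL"],
    "DepDelay": [5.0, -2.0, 30.0, 0.0],
}
df_from_dict = pd.DataFrame(records)

# The same rows as CSV text; pd.read_csv also accepts a path to a file on disk.
csv_text = "UniqueCarrier,DepDelay\nAA,5.0\nWN,-2.0\nAA,30.0\nDL,0.0\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# Filtering: keep only the rows whose departure delay is greater than zero.
delayed = df_from_csv[df_from_csv["DepDelay"] > 0]
print(delayed)
```

The expression inside the square brackets is a boolean mask, which is what makes the DataFrame feel "queryable".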

The Dataset

The dataset that we will be using is a flight records dataset. The dataset contains information about all outbound and inbound flights in the USA in the year 2008. You can download the dataset here.

We will be answering 20 questions about this dataset. The questions are provided by mlcourse.ai, which offers free online ML courses that are well suited to beginners. The current session lasts till 29th of April. You can register at any point till then.

The Questions

As mentioned before, the questions are taken from mlcourse.ai, specifically from their most recent session. I will post all the questions here, but the solutions and the code to solve these questions are posted in a Jupyter Notebook in my GitHub repository. You can access it here. You can open the notebook directly in Google Colab and run the code there.

The goal of this exercise is to learn by doing. The goal is not specifically to become an expert in Pandas, but to learn how to analyse, answer and understand information and statistics from a dataset. In my GitHub post I have, however, detailed the intuition behind every question, shown the code to solve that question, and explained the intuition behind the code itself.

If you want an in-depth tutorial on Pandas, I would highly recommend a series of tutorials by Sentdex, which can be found here.

Question 1

1. How many unique carriers are there in our dataset?
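As a hint at the kind of call involved, here is a minimal sketch on a tiny made-up frame. It is an illustration under assumptions, not the notebook's solution; only the UniqueCarrier column name comes from the questions, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the real flight records.
flights = pd.DataFrame({"UniqueCarrier": ["AA", "WN", "AA", "DL", "WN"]})

# nunique() counts the distinct values in a column.
n_carriers = flights["UniqueCarrier"].nunique()
print(n_carriers)
```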

Question 2

2. We have both cancelled and completed flights in the dataset. Check if there are more completed or cancelled flights. What is the difference?
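One possible way to approach this, sketched on toy data. The Cancelled column name and its 0/1 encoding (1 meaning cancelled) are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data; "Cancelled" as a 0/1 flag is an assumption.
flights = pd.DataFrame({"Cancelled": [0, 0, 1, 0, 1, 0]})

# Summing a 0/1 flag counts the rows where the flag is set.
cancelled = int(flights["Cancelled"].sum())
completed = len(flights) - cancelled
difference = completed - cancelled
print(completed, cancelled, difference)
```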

Question 3

3. Find a flight with the longest departure delay and a flight with the longest arrival delay. Do they have the same destination airport, and if yes, what is its code?
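A minimal sketch of how the two rows could be located, using made-up values. The Dest column name is an assumption; DepDelay and ArrDelay are the names used later in the questions.

```python
import pandas as pd

# Hypothetical toy data; "Dest" is an assumed column name.
flights = pd.DataFrame({
    "DepDelay": [10.0, 250.0, 5.0, 40.0],
    "ArrDelay": [12.0, 240.0, 300.0, 35.0],
    "Dest": ["ORD", "MSP", "MSP", "JFK"],
})

# idxmax() returns the index label of the largest value in a column,
# and .loc[] then fetches that whole row.
longest_dep = flights.loc[flights["DepDelay"].idxmax()]
longest_arr = flights.loc[flights["ArrDelay"].idxmax()]

same_dest = longest_dep["Dest"] == longest_arr["Dest"]
print(longest_dep["Dest"], longest_arr["Dest"], same_dest)
```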

Question 4

4. Find the carrier that has the greatest number of cancelled flights.
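Grouping is the natural tool here. The sketch below uses toy data and again assumes a 0/1 Cancelled column, which is not confirmed by the text above.

```python
import pandas as pd

# Hypothetical toy data; "Cancelled" as a 0/1 flag is an assumption.
flights = pd.DataFrame({
    "UniqueCarrier": ["AA", "WN", "AA", "DL", "WN", "AA"],
    "Cancelled": [1, 0, 1, 0, 1, 0],
})

# Sum the cancellation flag within each carrier, then pick the largest.
cancelled_by_carrier = flights.groupby("UniqueCarrier")["Cancelled"].sum()
worst_carrier = cancelled_by_carrier.idxmax()
print(worst_carrier)
```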

Question 5

5. Let’s examine departure time and consider distribution by hour (column DepHour that we’ve created earlier). Which hour has the highest percentage of flights?
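A short sketch of computing percentages per hour. The toy frame supplies DepHour directly with made-up values; how that column is derived in the real notebook is not shown here.

```python
import pandas as pd

# Hypothetical toy data with the DepHour column already present.
flights = pd.DataFrame({"DepHour": [6, 7, 7, 8, 7, 18, 6, 7]})

# normalize=True turns counts into fractions; multiply by 100 for percentages.
hour_share = flights["DepHour"].value_counts(normalize=True) * 100
top_hour = hour_share.idxmax()
print(top_hour, hour_share.loc[top_hour])
```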

Question 6

6. OK, now let’s examine cancelled flight distribution by time. Which hour has the lowest percentage of cancelled flights?

Question 7

7. Is there any hour that didn’t have any cancelled flights at all? Check all that apply.

Question 8

8. Find the busiest hour, or in other words, the hour when the number of departed flights reaches its maximum.

Question 9

9. Since we know the departure hour, it might be interesting to examine the average delay for the corresponding hour. Are there any cases when the planes on average departed earlier than they should have done? And if yes, at what departure hours did it happen?
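An early departure shows up as a negative delay, so one way to look at this is to average DepDelay per hour and keep the hours with a negative mean. The values below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; negative DepDelay means the plane left early.
flights = pd.DataFrame({
    "DepHour": [3, 3, 8, 8, 17, 17],
    "DepDelay": [-4.0, -2.0, 10.0, -1.0, 25.0, 15.0],
})

# Average delay within each departure hour.
mean_delay = flights.groupby("DepHour")["DepDelay"].mean()

# Hours where flights left early on average.
early_hours = mean_delay[mean_delay < 0].index.tolist()
print(early_hours)
```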

Question 10

10. Considering only the completed flights by the carrier that you have found in Question 4, find the distribution of these flights by hour. At what time does the greatest number of its planes depart?

Question 11

11. Find the top 10 carriers in terms of the number of completed flights (UniqueCarrier column).

Question 12

12. Plot distributions of flight cancellation reasons (CancellationCode).

What is the most frequent reason for flight cancellation?

Question 13

13. Which route is the most frequent, in terms of the number of flights?
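A sketch of counting routes, on toy data. It assumes that a route is a directed (origin, destination) pair and that the columns are called Origin and Dest; both are assumptions for this example.

```python
import pandas as pd

# Hypothetical toy data; "Origin" and "Dest" are assumed column names.
flights = pd.DataFrame({
    "Origin": ["SFO", "LAX", "SFO", "JFK", "SFO"],
    "Dest": ["LAX", "SFO", "LAX", "ORD", "LAX"],
})

# Group by the (Origin, Dest) pair and count the rows in each group.
route_counts = flights.groupby(["Origin", "Dest"]).size()
top_route = route_counts.idxmax()
print(top_route, route_counts.max())
```

Under this definition SFO to LAX and LAX to SFO count as different routes. Treating them as one would need the pair to be normalised first.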

Question 14

14. Find the top 5 delayed routes (count how many times they were delayed on departure). From all flights on these 5 routes, count all flights with weather conditions contributing to a delay.

Question 15

15. Examine the hourly distribution of departure times. Choose all correct statements:

Flights are normally distributed within time interval [0-23] (Search for: Normal distribution, bell curve).

Flights are uniformly distributed within time interval [0-23].

In the period from 0 am to 4 am there are considerably fewer flights than from 7 pm to 8 pm.

Question 16

16. Show how the number of flights changes through time (on the daily/weekly/monthly basis) and interpret the findings.

Choose all correct statements:

The number of flights during weekends is less than during weekdays (working days).

The lowest number of flights is on Sunday.

There are fewer flights during winter than during summer.

Question 17

17. Examine the distribution of cancellation reasons with time. Make a bar plot of cancellation reasons aggregated by months.

Choose all correct statements:

October has the lowest number of cancellations due to weather.

The highest number of cancellations in September is due to Security reasons.

April’s top cancellation reason is carriers.

Flight cancellations due to National Air System are more frequent than those due to carriers.

Question 18

18. Which month has the greatest number of cancellations due to Carrier?

Question 19

19. Identify the carrier with the greatest number of cancellations due to carrier in the corresponding month from the previous question.

Question 20

20. Examine median arrival and departure delays (in time) by carrier. Which carrier has the lowest median delay time for both arrivals and departures? Leave only non-negative values of delay times (‘ArrDelay’, ‘DepDelay’). (Boxplots can be helpful in this exercise, and it might be a good idea to remove outliers in order to build nice graphs. You can exclude delay time values higher than the corresponding .95 percentile.)
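The filtering steps in this question can be sketched as follows, on made-up numbers and without the plots. One assumption to flag: the .95 percentile is computed here over all carriers for each delay column, which is only one possible reading of "corresponding".

```python
import pandas as pd

# Hypothetical toy data for two carriers.
flights = pd.DataFrame({
    "UniqueCarrier": ["AA", "AA", "AA", "WN", "WN", "WN"],
    "ArrDelay": [10.0, -5.0, 30.0, 2.0, 4.0, 500.0],
    "DepDelay": [8.0, 0.0, 20.0, 1.0, 3.0, 400.0],
})

medians = {}
for col in ["ArrDelay", "DepDelay"]:
    # Keep only non-negative delays for this column.
    non_negative = flights[flights[col] >= 0]
    # Drop outliers above the 95th percentile (assumed: across all carriers).
    cutoff = non_negative[col].quantile(0.95)
    trimmed = non_negative[non_negative[col] <= cutoff]
    # Median delay per carrier after filtering.
    medians[col] = trimmed.groupby("UniqueCarrier")[col].median()

median_delays = pd.DataFrame(medians)
print(median_delays)
```

Each column is filtered separately, so a flight with a negative arrival delay can still contribute its departure delay.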

That’s it for this blog. I hope you found it helpful and informative. If you have any comments, thoughts, suggestions or corrections, please comment below. Any and all feedback is greatly appreciated. Thank you for reading!
