SQL for Data Analysis – Tutorial for Beginners – ep3

Today I’ll show you the most essential SQL functions that you will use for finding the maximums or the minimums ( MAX , MIN ) in a data set and to calculate aggregates ( SUM , AVG , COUNT ). Then I’ll show you some intermediate SQL clauses ( ORDER BY , GROUP BY , DISTINCT ) that you have to know to efficiently use SQL for data analysis!

And this is going to be super exciting, as we will still use our 7M+ line data set!

Note: to get the most out of this article, you should not just read it, but actually do the coding part with me! So if you are on the phone, I suggest saving this article and continuing on a desktop!

Before we start…

…I recommend going through these articles first – if you haven’t done so yet:

SQL functions to aggregate data

Okay, let’s open SQL Workbench and connect to your data server!

Can you recall our base query?

It was:

SELECT * FROM flight_delays LIMIT 10;

And it returned the first 10 lines of this huge data set.

We are going to modify this query to get 5 interesting numbers:

exactly how many flights are in our table

the summary of the airtimes (note: practically speaking, airtime is the flight time) of these flights

the average arrival delays and the average departure delays

the maximum distance of any of these flights

the minimum distance of any of these flights

All of these are going to be very easy, I promise… But make sure you are doing the coding part with me, because 100% understanding of this basic stuff will be important in the long term.

SQL COUNT function to count lines

The easiest aggregation is to count lines in your table, and this is what the COUNT function is for. The only thing you have to change – compared to the above base query – is what you SELECT from your table. Remember? It can be everything ( * ), or it can be specific columns ( arrdelay , depdelay , etc) – and now let’s expand this list with functions. Copy this query into SQL Workbench and run it:

SELECT COUNT(*) FROM flight_delays LIMIT 10;

The result is: 7275288 .

The function itself is COUNT , and it says to count the lines using every column (*) … You can change the * to any columns’ name (eg. arrdelay ) – and you will get the very same number. Try this:

SELECT COUNT(arrdelay) FROM flight_delays LIMIT 10;

Right? Same result: 7275288 .

Note 1: This is only true when you don’t have NULL values (empty cells) in your table (we don’t have NULL values in the flight_delays data set at all). I’ll get back to the importance of NULL later.

Note 2: in fact you won’t need the LIMIT clause in this SQL query, as you will have only one line of data on your screen. But I figured that sometimes it might be better to keep it there, so even if you mistype anything, your SQL Workbench won’t freeze by accidentally trying to return 7M+ lines of data.

SQL SUM function to get summary

Now we want to get the airtime for all flights – added up. In other words: find the summary of column airtime . The SUM function works with the same logic as COUNT does. The only exception, that in this case you have to specify the column (in this case airtime ). Try this:

SELECT SUM(airtime) FROM flight_delays;

The total airtime is a massive 748015545 minutes.

SQL AVG function to get the mean

Our next challenge is to find the mean of the arrival delays and the mean of the departure delays… The function for that is AVG (refers to “average”) – functioning with the exact same logic as SUM and COUNT .

SELECT AVG(depdelay) FROM flight_delays;

Result: 11.36

SELECT AVG(arrdelay) FROM flight_delays;

Result: 10.19

Cool!

SQL MAX and MIN functions to get maximums and minimums

And finally let’s find the maximum and the minimum values of a given column. Finding the maximum and minimum distances for these flights sounds interesting enough… MIN and MAX operate just like SUM , AVG and COUNT did:

SELECT MIN(distance) FROM flight_delays;

Result: 11 miles. Well, maybe take the bike next time…

SELECT MAX(distance) FROM flight_delays;

Result: 4962

Okay! That was it – these are the basic SQL functions you have to know… It’s time to tweak them a little bit.

Basic segmentation analysis with SQL – aka. the GROUP BY clause

As a Data Analyst or Scientist you will probably do segmentations all the time. For instance, it’s interesting to know the average departure delay of all flights (we have just learned that it’s 11.36 ). But when it comes to business decisions, this number is not actionable at all. However, if we turn this information into a more useful format – let’s say we break it down by airport – it will instantly become something we can act on!

Here’s a simplified chart showing how SQL performs automatic segmentation based on column values:

The process has three important steps:

STEP 1 – Specify which columns you want to work with as an input. In our case we want to use the list of the airports ( origin column) and the departure delays ( depdelay column).

STEP 2 – Specify which column(s) we want to create our segmentation from. For us it’s the origin . SQL automatically looks for every unique value in this column (in the above example – airport 1, airport 2 and airport 3), then creates groups from them and sorts each line from your data table into the right group.

STEP 3 – Finally it calculates the averages using the SQL AVG function for each group and returns the results on your screen.

The only new thing here is the “grouping” at STEP 2. We have an SQL clause for that. It’s called GROUP BY . Let’s see it in action:

SELECT AVG(depdelay), origin FROM flight_delays GROUP BY origin;

If you scroll through the results, you will see that there are some airports with an average departure delay of more than 30 or even 40 minutes. From a business perspective it’s important to understand what’s going on at those airports. On the other hand it’s also worth taking a closer look at how the good airports ( depdelay close to 0) are managing to reach this ideal phase. (Yeah, it’s over-simplified, but just for example…)

But what just happened SQL-wise? We have selected two columns – origin and depdelay . origin has been used to create the segments ( GROUP BY origin ). depdelay has been used to calculate the averages of the arrival delays in these segments ( AVG(depdelay) ).

Note: As you can see, the logic of SQL is not as linear as it was in bash. If you write an SQL query, the first line of it could highly rely on the last line. When you’ll write really long and complex queries, this might cause some unexpected errors and thus of course a little headache too… But that’s why I find it very, very important to give yourself enough time to practice the basic things and make sure that you fully understand the relationships between the different clauses, functions and other stuff in SQL.

Test yourself #1

Here’s a little assignment to practice on and to double-check that you understand everything so far! The task is:

Print the total monthly airtime!

.

.

.

Ready?

Here’s my solution:

SELECT month, SUM(airtime) FROM flight_delays GROUP BY month;

I did pretty much the same stuff that I have done before, but now I’ve created the groups based on the months – and this time I had to use the SUM function.

Test yourself #2

And another exercise:

Calculate the average departure delay by airport again, but this time use only those flights that flew more than 2000 miles (you will find this info in the distance column).

.

.

.

Here’s the query:

SELECT AVG(depdelay), origin FROM flight_delays WHERE distance > 2000 GROUP BY origin;

The takeaway from this assignment is something that you might have already realized: you can use the SQL WHERE clause to filter even those columns that are not part of your SELECT statement.

SQL ORDER BY to sort by column(s)

Let’s say we want to see which airport was the busiest in 2007. You can get the number of departures by airport really easily using the COUNT function with GROUP BY clause:

SELECT COUNT(*), origin FROM flight_delays GROUP BY origin;

But this list is not sorted by default… To have that too, you need to add only one more SQL clause: ORDER BY . When you use it, you always have to specify which column you want to order the results by… It’s pretty straightforward.

SELECT COUNT(*), origin FROM flight_delays GROUP BY origin ORDER BY count;

Note: the column you will get after the COUNT function will be a new column… And it has to have a name – so SQL automatically names it “ count ” (check the latest screenshot above). When you refer to this column in your ORDER BY clause, you have to use this new name. I’ll get back to this in my next article in detail. If you find it weird, let’s try the same query but with ORDER BY origin – and you will understand it instantly.

Hm, almost there. But the problem is that the least busy airport is on the top – in other words, we got a list in ascending order. That’s the default for ORDER BY (in our postgreSQL database at least). But you can change this to descending order by simply adding the DESC keyword after it!

SELECT COUNT(*), origin FROM flight_delays GROUP BY origin ORDER BY count DESC;

Great!

DISTINCT to get unique values only

This is the last new thing for today. And this will be a very simple one. If you are curious how many different airports are in your table:

a) you can find it out using the GROUP BY clause. (Can you figure out how? :-))

b) you can find it out even more easily by using DISTINCT

DISTINCT basically removes all duplicates. Try this:

SELECT DISTINCT(origin) FROM flight_delays;

Now you have only unique airports!

By the way, the GROUP BY version would look like this:

SELECT origin FROM flight_delays GROUP BY origin;

Though result-wise it’s pretty much the same, the preferred way to do this is to use the DISTINCT – when making more complex queries, it will help you to keep your query simpler!

Test yourself #3

Today you have learned a ton of small but useful stuff. I’ll give you one more assignment that will summarize pretty much everything – even the previous two articles (ep 1 and ep 2). This is going to be a difficult one, but you can do it! If it doesn’t work at first, try to break it down into smaller tasks, then build and test your query till you get the result.

The task is:

List the:

top 5 planes (identified by the tailnum )

by the number of landings

at PHX or SEA airport

on Sundays

(eg. if the plane with the tailnumber 'N387SW' landed 3 times in PHX and 2 times in SEA in 2007 on any Sunday, then it has a total of 5. And we need the top 5 planes with the higher total.) Ready? Set! Go!

.

.

.

Done? Here’s my solution:

SELECT COUNT(*), tailnum FROM flight_delays WHERE dayofweek = 7 AND dest IN ('PHX', 'SEA') GROUP BY tailnum ORDER BY count DESC LIMIT 5;

And with comments:

SELECT –» select…

COUNT(*), –» this function counts the number of lines in a given group; to do that it needs the group by clause later

tailnum –» this will help to specify the groups (referred in the GROUP BY function later)

FROM flight_delays –» the name of the table

WHERE dayofweek = 7 –» a filter for Sundays only

AND dest IN ('PHX', 'SEA') –» filter for PHX and SEA destinations only

GROUP BY tailnum –» this is the clause that helps us to put the lines into different groups

ORDER BY count DESC –» and let’s order by the number of lines in a given group

LIMIT 5; –» list only the top 5 elements

Conclusion

And that’s it! You have learned a lot again today – SQL functions ( MIN , MAX , AVG , COUNT , SUM ) and new important SQL clauses ( DISTINCT , ORDER BY , GROUP BY ).

If you managed to get the last exercise done by yourself, I can tell you that you have a really good basic knowledge of SQL! Congrats! If not, don’t worry, just make sure that you re-read these first 3 chapters (ep 1, ep 2, ep 3), before you continue with episode 4!

I’ve created an online course that will take you from zero to intermediate level with SQL in 7 days. Go ahead and check it out here:

More info…

If you want to learn more about how to become a data scientist, take my 50-minute video course: How to Become a Data Scientist. (It’s free!)

Also check out my 6-week online course: The Junior Data Scientist’s First Month video course.

Cheers,

Tomi Mester