Simplify your SQL Queries with Common Table Expressions (CTEs)

June 01, 2019

Writing SQL in a professional setting is never as simple as the tutorials make it out to be. You’ll reference several tables, each of which may require custom business logic in order to make them usable. To keep your code readable to yourself and others, consider using Common Table Expressions, or CTEs, to avoid packing too much logic into subqueries.

A CTE is a query whose result set you can reference in a later section of your query. It’s sort of like a view that’s local to the query that it lives in. Instead of writing a large subquery and having your logic nested within an outer query, you can write the subquery as a common table expression that a further section of your query can reference. The result is that the overall query is more modular and significantly easier to edit and understand.

The syntax of a CTE is simple:

WITH my_cte_name AS (SELECT * FROM EXAMPLE)

After you write this CTE, you can reference its result set further down in the query by selecting from result set my_cte_name as if it were a table in the database. Here’s a more concrete example showing how a complex query can be easier to read, write, and maintain using CTEs.

Example with Real Data

Here I’ll use the Stack Overflow dataset from Google’s public BigQuery data. My goal is to find Stack Overflow users who are experienced in machine learning in Python, which I’ll define as the subset of users who have received upvotes on answers they’ve submitted to questions tagged with both python and machine-learning . I’ll want to know how many of these comments they’ve submitted as a proxy for how skilled or involved in the ML community they are, and I’ll also want to know some general info about who they are (name, location, etc.).

There are three tables needed to answer this. stackoverflow_posts contains every post, its unique id, and tags identifying the subject matter. comments has each comment id, the commenter’s user id, the number of upvotes it received, and the id of the Stack Overflow question it belogs to. Last, users has user metadata saying who the user is.

Solving This with CTEs

This question can be broken into three questions, which will help to organize it into common table expressions.

1: which posts are asking questions about machine learning in Python?

2: which comments belong to these ML posts, and which of these comments received at least one upvote?

3: who are the users who submitted upvoted comments to these machine learning posts, and how many of these successful comments have they submitted?

I’ll answer the first question with a CTE. The results from the first CTE will then allow me to answer the second question by querying the comments table in a second CTE. Finally, I can inner join the second CTE onto the users table to answer my initial question of which users are most successful in answering Stack Overflow questions about Machine Learning in Python.

# StandardSQL -- find posts about machine learning in python WITH python_ml_posts AS ( SELECT id FROM `bigquery-public-data.stackoverflow.stackoverflow_posts` WHERE tags LIKE '%machine-learning%' AND tags LIKE '%python%' ), -- find comments on these ML posts with >0 upvotes and the commenter's user id ml_post_comments AS ( SELECT user_id , count ( * ) AS comment_count FROM `bigquery-public-data.stackoverflow.comments` comments INNER JOIN python_ml_posts ON python_ml_posts . id = comments . post_id WHERE score > 0 GROUP BY user_id ) -- connect users with their machine-learning-comment-count SELECT users . display_name , users . location , users . reputation , ml_post_comments . comment_count FROM `bigquery-public-data.stackoverflow.users` users INNER JOIN ml_post_comments ON ml_post_comments . user_id = users . id ORDER BY comment_count desc

Row display_name location reputation comment_count 1 Fred Foo 281432 30 2 Andreas Mueller New York, United States 14857 21 3 lejlot London, United Kingdom 47420 20 4 EdChum Berkshire, United Kingdom 178578 17 5 ogrisel Paris, France 29009 13

This is a fairly complicated question, but as you can see from the above query, it’s easy to read and write when you tackle it with CTEs. The alternative would be to query the users table, then join on the comments table with a subquery for finding each user’s comments, which in turn would need a second subquery for finding which of these comments were related to machine learning in Python. Gross! Don’t do that. Use common table expressions instead. Your future self and whoever you share a code base with will thank you.