I recently stumbled on a rather fun problem during my PhD. I wanted to generate all possible subqueries from a given SQL query. In this case an example is easily worth a 1000 thousand words. Take the following SQL query:

SELECT * FROM customers AS c , purchases AS p , shops AS s WHERE p . customer_id = c . id AND p . shop_id = s . id AND c . nationality = ' Swedish ' AND c . hair = ' Blond ' AND s . city = ' Stockholm '

Here all the possible subqueries that can be generated from the above query.

-- 1 SELECT * FROM customers AS c WHERE c . nationality = ' Swedish ' -- 2 SELECT * FROM customers AS c WHERE c . hair = ' Blond ' -- 3 SELECT * FROM customers AS c WHERE c . nationality = ' Swedish ' AND c . hair = ' Blond ' -- 4 SELECT * FROM customers AS c , purchases AS p WHERE c . nationality = ' Swedish ' AND p . customer_id = c . id -- 5 SELECT * FROM customers AS c , purchases AS p WHERE c . hair = ' Blond ' AND p . customer_id = c . id -- 6 SELECT * FROM customers AS c , purchases AS p WHERE c . nationality = ' Swedish ' AND c . hair = ' Blond ' AND p . customer_id = c . id -- 7 SELECT * FROM customers AS c , purchases AS p , shops AS s WHERE c . nationality = ' Swedish ' AND p . customer_id = c . id AND p . shop_id = s . id -- 8 SELECT * FROM customers AS c , purchases AS p , shops AS s WHERE c . hair = ' Blond ' AND p . customer_id = c . id AND p . shop_id = s . id -- 9 SELECT * FROM customers AS c , purchases AS p , shops AS s WHERE c . nationality = ' Swedish ' AND c . hair = ' Blond ' AND p . customer_id = c . id AND p . shop_id = s . id -- 10 SELECT * FROM customers AS c , purchases AS p , shops AS s WHERE c . nationality = ' Swedish ' AND s . city = ' Stockholm ' AND p . customer_id = c . id AND p . shop_id = s . id -- 11 SELECT * FROM customers AS c , purchases AS p , shops AS s WHERE c . hair = ' Blond ' AND s . city = ' Stockholm ' AND p . customer_id = c . id AND p . shop_id = s . id -- 12 SELECT * FROM customers AS c , purchases AS p , shops AS s WHERE c . nationality = ' Swedish ' AND c . hair = ' Blond ' AND s . city = ' Stockholm ' AND p . customer_id = c . id AND p . shop_id = s . id -- 13 SELECT * FROM customers AS c , purchases AS p , shops AS s AND s . city = ' Stockholm ' AND p . customer_id = c . id AND p . shop_id = s . id -- 14 SELECT * FROM purchases AS p , shops AS s AND s . city = ' Stockholm ' AND p . shop_id = s . id -- 15 SELECT * FROM shops AS s AND s . city = ' Stockholm '

You might notice that I’ve omitted subqueries that don’t include a WHERE statement. Indeed I’m only interested in queries that filter on some attribute. Regardless, the number of possible subqueries is HUGE in comparison with the simplicity of the query (at least, in my opinion). I’m lazy so I didn’t take the time to find a nice formula that could tell me exactly how many subqueries to expect. However I wrote a Python script that enumerates all subqueries and saves to a JSON file. It took me some time to figure out so I thought it would be worth sharing.

I’ll go through the steps of my code and then give you the link to the research project I’m using it for on GitHub. The code is rather short and it doesn’t require any dependencies. However, the code is a bit of a mouthful and isn’t easy to understand without a little effort. Anyway, here goes.

I first wrote a function to parse an SQL query. This function, parse_query , returns a 3 dict s that contain the different part of the query.

import collections import re def remove_whitespaces ( string ) : """ remove_whitespaces( ' cake ' ) => ' cake ' """ return string . lstrip ( ) . rstrip ( ) def replace_between ( query ) : """ replace_between( ' BETWEEN 1 AND 3 ' ) => ' >= 1 AND <= 3 ' """ while ' BETWEEN ' in query : idx = query . find ( ' BETWEEN ' ) right_val = re . search ( " AND ([ ' a-zA-Z0-9]+)? " , query [ idx : ] ) . group ( ) . split ( ' ' ) [ 1 ] att = remove_whitespaces ( query [ : idx ] . split ( ' AND ' ) [ - 1 ] ) query = query [ : idx ] + re . sub ( r ' AND .+? \ s ' , f ' AND {att} <= {right_val} ' , query [ idx : ] , 1 ) query = query . replace ( ' BETWEEN ' , ' >= ' , 1 ) return query def parse_query ( query ) : query = query . replace ( '

' , ' ' ) from_part = query . split ( ' FROM ' ) [ 1 ] . split ( ' WHERE ' ) [ 0 ] where_part = query . split ( ' WHERE ' ) [ 1 ] where_part = replace_between ( where_part ) froms = dict ( reversed ( list ( map ( remove_whitespaces , name . split ( ' AS ' ) ) ) ) for name in from_part . split ( ' , ' ) ) joins = { } wheres = collections . defaultdict ( list ) # Parse the WHERE statement for where in list ( map ( remove_whitespaces , re . split ( r ' AND \ s*(?![^()]* \ )) ' , where_part ) ) ) : if ' OR ' in where : rel = where . split ( ' . ' ) [ 0 ] [ 1 : ] wheres [ rel ] . append ( where ) continue op = re . search ( ' | ' . join ( f ' ({op}) ' for op in OPERATORS ) , where ) . group ( ) left , right = where . split ( op ) left = remove_whitespaces ( left ) right = remove_whitespaces ( right ) op = remove_whitespaces ( op ) if op == ' = ' and ' . ' in right : l_rel = left . split ( ' . ' ) [ 0 ] r_rel = right . split ( ' . ' ) [ 0 ] joins [ tuple ( sorted ( [ l_rel , r_rel ] ) ) ] = where continue rel = left . split ( ' . ' ) [ 0 ] wheres [ rel ] . append ( where ) return froms , wheres , joins

I know, it’s not very pretty. But you can at least understand the outputs. froms is a dict where the keys are the relation aliases and the values are the relation names.

{ ' c ' : ' customers ' , ' p ' : ' purchases ' , ' s ' : ' shops ' }

As for wheres , it lists the WHERE clauses of each relation.

{ ' c ' : [ " c.nationality = ' Swedish ' " , " c.hair = ' Blond ' " ] , ' s ' : [ " s.city = ' Stockholm ' " ] }

Finally, joins maps pairs of relations to their associated join clause.

{ ( ' c ' , ' p ' ) : ' p.customer_id = c.id ' , ( ' p ' , ' s ' ) : ' p.shop_id = s.id ' }

The parse_query function is the weakest part of my script. It relies on the fact that the provided query has a certain structure that I won’t dwell on. Basically each relation must have an alias (e.g. shops AS s ) and the WHERE conditions should all be separated by AND s (not OR s). I tailored this to work for the TPC-DS benchmark queries. If you wanted to reuse my script for other types of queries, then I believe that you would mostly have to tinker with parse_query .

The second main function I wrote is called yield_subqueries . It’s a generator that will return each subquery one by one.

def powerset ( iterable ) : """ powerset([1,2,3]) => (1,) (2,) (3,) (1, 2) (1, 3) (2, 3) (1, 2, 3) """ s = list ( iterable ) return itertools . chain . from_iterable ( itertools . combinations ( s , r ) for r in range ( 1 , len ( s ) + 1 ) ) def yield_subqueries ( query ) : froms , wheres , joins = parse_query ( query ) relations = list ( sorted ( froms . keys ( ) ) ) adjacencies = [ [ 0 for _ in range ( len ( relations ) ) ] for _ in range ( len ( relations ) ) ] for pair in joins . keys ( ) : i , j = relations . index ( pair [ 0 ] ) , relations . index ( pair [ 1 ] ) adjacencies [ i ] [ j ] = adjacencies [ j ] [ i ] = 1 combos = set ( ) def enum ( i , combo , seen ) : candidates = [ ] for j , ok in enumerate ( adjacencies [ i ] ) : if j not in seen and ok : candidates . append ( j ) for pick in powerset ( candidates ) : yield combo + [ relations [ p ] for p in pick ] for p in pick : for new_combo in enum ( p , combo + [ relations [ p ] for p in pick ] , { * seen , * candidates } ) : yield new_combo for i , rel in enumerate ( relations ) : combos . add ( ( rel , ) ) for combo in enum ( i , [ rel ] , { i } ) : combos . add ( tuple ( sorted ( combo ) ) ) for combo in combos : relevant_joins = list ( filter ( None . __ne__ , [ joins . get ( tuple ( sorted ( pair ) ) ) for pair in itertools . combinations ( combo , 2 ) ] ) ) relevant_wheres = list ( itertools . chain ( * [ wheres [ r ] for r in combo ] ) ) from_stmt = ' , ' . join ( f ' {froms[rel]} AS {rel} ' for rel in combo ) join_stmt = ' AND ' . join ( relevant_joins ) sql = f ' SELECT * FROM {from_stmt} ' if join_stmt : sql + = ' AND ' + join_stmt for where_combo in powerset ( relevant_wheres ) : subquery = ( sql + ' AND ' + ' AND ' . join ( where_combo ) ) . replace ( ' AND ' , ' WHERE ' , 1 ) subquery = subquery = re . sub ( ' \ s+ ' , ' ' , subquery ) . strip ( ) yield subquery

Again, it’s hard to understand but it’s worked really well for me and is quite fast. The idea is to build an adjacency matrix which tells us which relations each relation is connected to in the query. In our case this would be:

[ [ 0 , 1 , 0 ] , # customers [ 1 , 0 , 1 ] , # purchases [ 0 , 1 , 0 ] # shops ]

The idea is to then loop over each row and “walk” the adjacency matrix until having enumerated all possible join combinations. For each join combination we’re also going to enumerate all the possible WHERE statements. These are called relevant_joins and relevant_wheres . combos is a set that contains all the possible join combinations. I know, it’s a bit of a mouthful.

Inside the main() of my script I simply looped over the TPC-DS benchmark queries and saved all the subqueries I could enumerate.

import glob import json def main ( ) : results = [ ] for path in sorted ( glob . glob ( ' join-order-benchmark/*.sql ' ) ) : query_name = os . path . basename ( path ) . split ( " . " ) [ 0 ] if query_name in ( ' fkindexes ' , ' schema ' ) : continue # Load and tidy the query query = open ( path ) . read ( ) . rstrip ( ) . rstrip ( ' ; ' ) # Enumerate the subqueries and save them for i , subquery in enumerate ( yield_subqueries ( query ) ) : print ( subquery ) results . append ( { ' mother_query ' : query_name , ' sql ' : subquery } ) print ( f ' Enumerated {i + 1} subqueries for query {query_name} ' ) break with open ( ' results.json ' , ' w ' ) as f : json . dump ( results , f )

The TPC-DS benchmark contains 113 queries. My script produces a total of 5,122,790 queries in under 4 minutes. Here’s the breakdown per query:

Query 10a: 477 Query 10b: 225 Query 10c: 140 Query 11a: 1842 Query 11b: 1842 Query 11c: 1471 Query 11d: 869 Query 12a: 1730 Query 12b: 922 Query 12c: 1730 Query 13a: 261 Query 13b: 981 Query 13c: 981 Query 13d: 261 Query 14a: 760 Query 14b: 1424 Query 14c: 1032 Query 15a: 4724 Query 15b: 12946 Query 15c: 2504 Query 15d: 596 Query 16a: 258 Query 16b: 54 Query 16c: 122 Query 16d: 258 Query 17a: 63 Query 17b: 38 Query 17c: 38 Query 17d: 38 Query 17e: 38 Query 17f: 38 Query 18a: 241 Query 18b: 3678 Query 18c: 279 Query 19a: 64218 Query 19b: 115806 Query 19c: 12945 Query 19d: 2393 Query 1a: 88 Query 1b: 124 Query 1c: 68 Query 1d: 68 Query 20a: 1063 Query 20b: 1504 Query 20c: 1063 Query 21a: 5598 Query 21b: 5598 Query 21c: 5598 Query 22a: 10337 Query 22b: 10337 Query 22c: 10337 Query 22d: 2993 Query 23a: 5586 Query 23b: 3289 Query 23c: 5586 Query 24a: 33458 Query 24b: 69108 Query 25a: 852 Query 25b: 2340 Query 25c: 852 Query 26a: 4400 Query 26b: 4400 Query 26c: 2608 Query 27a: 28616 Query 27b: 16904 Query 27c: 28616 Query 28a: 41839 Query 28b: 41839 Query 28c: 41839 Query 29a: 2245875 Query 29b: 1302387 Query 29c: 832050 Query 2a: 16 Query 2b: 16 Query 2c: 16 Query 2d: 16 Query 30a: 7302 Query 30b: 12550 Query 30c: 4678 Query 31a: 2421 Query 31b: 11141 Query 31c: 1908 Query 32a: 7 Query 32b: 7 Query 33a: 12339 Query 33b: 6609 Query 33c: 12339 Query 3a: 25 Query 3b: 25 Query 3c: 25 Query 4a: 68 Query 4b: 68 Query 4c: 68 Query 5a: 178 Query 5b: 358 Query 5c: 178 Query 6a: 32 Query 6b: 32 Query 6c: 32 Query 6d: 32 Query 6e: 32 Query 6f: 19 Query 7a: 3901 Query 7b: 2137 Query 7c: 6547 Query 8a: 1540 Query 8b: 13679 Query 8c: 52 Query 8d: 52 Query 9a: 7771 Query 9b: 7771 Query 9c: 874 Query 9d: 486

I’m sure it’s possible to do this faster and with more obvious code, but this is as could I judged necessary with respect to the rest of my PhD work. The full script is available in this GitHub repository in the scripts directory.

I hope someday this helps someone!