

Author: “No Bugs” Hare Follow: Job Title: Sarcastic Architect Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,

Calling a Spade a Spade, Keeping Tongue in Cheek

[rabbit_ddmog vol=”7″ chap=”Chapter 27(b) from “beta” Volume VII”]

As it was already noted – execution plans are the cornerstone of database query optimization; however – surprisingly few people out there realize how they work. On the other hand – IMO the basic understanding of “how this execution plan will perform”, and “what is the ideal execution plan we’d like to get” is very far from any black magic and is pretty easy to achieve, as long as we’re familiar with basic data structures (discussed at length in [Knuth-Vol1] and [Knuth-Vol3]), more specifically – trees, lists, and hash tables. For the purposes of this Chapter, we’ll use C++ and Java to illustrate these structures; however – this understanding is not language-specific and is actually all about big-O complexities of basic operations over trees, lists and hash tables.

That being said, it is important to realize inherent limitations of such analysis:

This kind of analysis is a big-O analysis, which means that it is inherently approximate. It may easily happen that whatever-execution-plan-we-devised-along-the-lines-discussed-here, is inferior to whatever-your-SQL-compiler-generates. Still, IF we see that your SQL-compiler’s-version is orders of magnitude worse (~=”has worse big-O asymptotic”) than our-own-execution-plan – it is a very strong reason to start questioning the optimality of the plan currently generated by our SQL-compiler’s.

“ When we have that ideal-execution-plan – we still need to convince our SQL compiler to use it Most of the time 1 – it is not a problem; as soon as we create an index which is necessary to enable that ideal-execution-plan – most of decent SQL compilers tend to use it and get pretty close to the ideal. However, pretty often it happens that after creating the index, SQL compiler still doesn’t want to go for our ideal execution plan. It means one of two things: Our plan is not really ideal, or We need to use some kind of RDBMS-dependent black magic to convince SQL compiler to kill his darling execution plan, and to make it use an apparently-better-our-one. This process is highly dependent on a specific DB you have, and can use anything from officially-supported optimizer hints to outright black magic such as “adding OR 1=0 to WHERE clause” 2 (and if you’re on PostgreSQL, you’re out of luck at least for the foreseeable future 3 ).



Approximating Execution Plans using Familiar Terms

“Let’s note that all the examples below are NOT “a thing which we’ll write to manipulate DB”, and even NOT “how DB works internally”The whole point of this exercise is to take execution-plan primitives (those which we discussed above) and to express them in the terms-which-are-familiar-to-us (as in “us=non-DB developers”). Let’s note that all the examples below are NOT “a thing which we’ll write to manipulate DB”, and even NOT “how DB works internally”; instead – they are “how we can approximate behavior of execution plans performance-wise, using terms-which-are-familiar to-non-DB-developers to understand execution plans better”. Sure, as any approximation – this one will be, well, rather approximate <wink />; on the other hand – it has been seen to provide extremely good insights for optimizing SQL queries (we’ll see a very practical example of it a bit later).

Let’s name all our approximations made in this section, “big-O approximations of execution plans”.

Approximating Tables and Indexes

For our purposes – we can approximate each table as a single-linked (and actually unordered, with list order potentially changing on each run(!)) list consisting of rows.

And each B-tree index can be approximated as a tree-based map, mapping index key into pointers to these rows-sitting-within-the-list. As for hash indexes – they can be approximated as similar hash-based maps.

It can be summarized as the following table:

RDBMS C++ big-O approximation Java big-O approximation Big-O approximation using terms from Knuth Table std::list<Row>4 List<Row> Single-linked list5 B-tree Index std::map<Key,Row*> TreeMap<Key,Row> Tree (with added pointers to the next node, to facilitate O(1) iterations) Hash Index std::unordered_map<Key,Row*> HashMap<Key,Row> Hash table

Approximating Scans

With this in mind – we can proceed with approximating Table Scans and Index Scans:

RDBMS C++ big-O approximation Java big-O approximation Big-O approximation using terms from Knuth Vol. I Complexity Table Scan Iteration over std::list<> Iteration over List<> Iteration over list O(N) Index Search / Unique Index Scan (B-tree Index) std:: map<>:: find() TreeMap.get() Search in tree O(log(N)) Index Search / Unique Index Scan (Hash Index) std:: unordered_map<>:: find() HashMap.get() Search in hash table O(1) Range Index Scan (B-tree Index)6 Iteration over std::map<>7 Iteration over TreeMap Traversing tree O(N)8 Range Index Scan with LIMIT/FETCH FIRST/SELECT TOP/… (B-tree Index) Iteration over std::map<> with early exit by the number of items specified in LIMIT etc. Iteration over TreeMap with early exit Iteration over tree with early exit O(log(N))+

min(O(M),L)

=O(log(N)+O(1)

=O(log(N))9

To make it more obvious – let’s write down our approximation of the Range-Index-Scan-with-LIMIT in C++:10

using Index std::map<Key,Row*>; ResultSet<Row*> RangeIndexScanWithLimit( Index& index, Key rangeBegin, Key rangeEnd, size_t limit ) { ResultSet<Row*> ret; size_t n = 0; for( Index::iterator it=index.find(rangeBegin), n <= limit && it != index.end() && (*it).first <= rangeEnd, ++it, ++n ) { ret.append( (*it).second ); } return ret; }

There isn’t much to explain here: we’re just finding the boundary of the range, then iterate through our std::map<> and then simply break our search if we run out of the range, OR if we reach pre-defined limit.

The same thing expressed in Java terms, would look very similar:

public class Index extends TreeMap<Key,Row> {} ResultSet<Row> RangeIndexScanWithLimit( Index index, Key rangeBegin, Key rangeEnd, int limit ) { ResultSet<Row> ret = new ResultSet<Row>(); int n = 0; for( Map.Entry<Key,Row> entry : index.subMap(rangeBegin,rangeEnd) ) { if( n++ >= limit ) break; ret.append( entry.getValue() ); } return ret; }

And if speaking in terms of Knuth, it is still about the same:

We have to find one boundary of the range (which is an O(log(N) operation)

Then we have to traverse our tree in the order of our key, starting from found range boundary Traversing has to last until either we run into the end of the range (which is O(N)) – or until we got all our LIMIT rows (L); it means that such traversing is a min(O(M),L) operation



Approximating Joins

After we described our approximations for tables, indexes, and scans – the only thing we need to understand to proceed further, is how to approximate execution-plan-level joins (Nested Loop Join, Sorted Merge Join, and Hash Join). Apparently, approximating them is not a rocket science too.

Nested Loop Join

In our C++ big-O approximation, Nested Loop Join (a.k.a. NL Join) would look along the following lines:

ResultSet<Row*> NestedLoopJoin( ResultSet<Row2*> &outer, SearchableDataSet<Row*>& inner) { //SerchableDataSet allows to perform searches within; // in practice – usually it is either Index Scan, or // Table Scan, or another ResultSet<> ResultSet<Row*> ret; for( auto outeritem: outer ) { Key key = outeritem->key; Row* found = inner.find(key); if( found ) ret.append(found); } return ret; }

In Java, the same thing would look similar to the following:

ResultSet<Row> NestedLoopJoin( ResultSet<Row2> outer, SearchableDataSet<Row> inner ) { ResultSet<Row> ret = new ResultSet<Row>(); for( OuterEntry outeritem : outer.entrySet() ) { Key key = outeritem.getKey(); Row found = inner.get(key); if( found != null ) ret.append(found); } return ret; }

And in terms of our Knuth-based big-O approximation, we can describe the same thing as follows:

We’re going along the outer data set (which actually can be any kind of collection), iterating through its items

For each such item, we’re extracting an item – and get an item key out of it (in terms of RDBMS – the key is usually 11 one of the sides of equation in JOIN … ON statement)

one of the sides of equation in JOIN … ON statement) For each such item key – we’re performing a search within the second data set to find if there is a match.

Whatever of the C++/Java/[Knuth] approximation we’re using – the performance-wise result is still the same: we have to make N (where N is number of items in the outer data set) searches over the inner data set. As a result – if the inner data set is actually a Table Scan – we’re speaking about the cost of NL Join being O(N*M)12 <crazy! />, and if the inner data set is an Index Scan – it is more like O(N*log(M)).

Sort Merge Join

When speaking of Sort Merge Joins – they’re also simple; the most important thing is that we’re assuming that both input data sets for Sort Merge Join are already sorted. Then, in our C++ approximation, Sort Merge Join will look along the following lines:

SortedResultSet<Row*> SortMergeJoin( SortedResultSet<Row2*> &a, SortedResultSet<Row*>& b) { SortedResultSet<Row*> ret; auto ait = a.begin(); auto bit = b.begin(); while( ait != a.end() && bit != b.end() ) { if( (*ait)->key < (*bit)->key ) ++ait; else if( (*ait)->key > (*bit)->key ) ++bit; else { ret.append( *bit.second ); ++ait; ++bit; } } return ret; }

In Java, it would look more or less as follows:

SortedResultSet<Row> SortMergeJoin( SortedResultSet<Row2> a, SortedResultSet<Row> b ) { SortedResultSet<Row> ret = new SortedResultSet<Row>(); SortedResultSetIterator<Key,Row2> ait = a.iterator(); SortedResultSetIterator<Key,Row> bit = b.iterator(); if( !ait.hasNext() || !bit.hasNext() ) return ret; Map.Entry<Key,Row> lasta = ait.next(); Map.Entry<Key,Row> lastb = bit.next(); while( ait.hasNext() && bit.hasNext() ) { int cmp = lasta.getKey().compare(lastb.getKey()); if( cmp < 0 ) lasta = ait.next(); else if( cmp > 0 ) lastb = bit.next(); else { ret.append( lastb.getValue() ); lasta = ait.next(); lastb = bit.next(); } } return ret; }

And in terms of our Knuth-based big-O approximation (actually, it is “Merge Sort” from Vol. 3), it can be expressed along the following lines:

We’re going over two sorted lists, and compare the values at current positions within the lists

If current value within the first list is smaller – we’re advancing in the first list, if current value in the second list is smaller – we’re advancing in the second list.

If values in both lists are equal – we found the match, so it belongs to our output result set

Hash Join

With Hash Join – the situation is a bit more interesting. Here, we’ll be using a temporary in-memory structure to handle the join. C++ approximation of a Hash Join may look along the following lines:

ResultSet<Row*> HashJoin( ResultSet<Row2*> &a, ResultSet<Row*>& b) { //preparing temporary hash table std::unordered_map<Key,Row*> tmp; for( auto itema: a ) tmp.insert(itema->key,itema); ResultSet<Row*> ret; for( auto itemb: b ) { auto found = tmp.find( itemb->key ); if( found != tmp.end() ) ret.append( found.second ); } return ret; }

A similar thing in Java would look somewhat similar to the following:

ResultSet<Row> HashJoin( ResultSet<Row2> a, ResultSet<Row> b ) { //preparing temporary hash table HashMap<Key,Row> tmp = new HashMap<Key,Row>(); for( Map.Entry<Key,Row2> itema : a.entrySet() ) tmp.put(itema.getKey(),itema.getValue()); ResultSet<Row> ret = new ResultSet<Row>(); for(Map.Entry<Key,Row> itemb : b.entrySet()) { Row found = tmp.get( itemb.getKey()); if( found != null ) ret.append( found ); } return ret; }

And for our Knuth-inspired big-O approximation, we can say that:

We convert one of the data sets into a temporary in-memory mapping (usually hash-based one)

Then, iterating over another data set – we’re making a very quick search within this temporary mapping

Most importantly, regardless of the way we represent it – we can see that for a Hash Join the following observations stand:

It makes sense ONLY if one of the source data sets is small enough to fit in RAM

As long as the above stands, Hash Join is not that bad: as search in a temporary hash map is O(1), we can get O(N) complexity for the whole Hash Join (where N is a number of items in the larger set)

A Very Practical Real-World Example

Now, with all the items in our vocabulary defined, let’s see how this approach can be applied to a very practical example from the real world.13

“In pretty much all kinds of operational databases, there is an extremely common request: “Gimme last 100 games by this particular player”In pretty much all kinds of operational databases, there is an extremely common request: “Gimme last 100 games by this particular player”. For games – the need of such a request usually follows from the need to show most recent games to the player. However – it is certainly NOT restricted to games: in particular, for a shopping site it can be “last 20 orders to show on the first ‘last orders’ page”, and for a bank it can be “last 50 account entries to show”. NB: let’s note that the final result of our exercise – while being unknown at least to 90% of DBAs out there – is certainly NOT novel. What is interesting, however – is how we can use analogies from C++/Java/Knuth to arrive to this not-so-trivial end result.

Once upon a time, there was a system which needed to perform this “last 100 games for a particular player” request some thousands of time per day. Pretty soon – it became a significant bottleneck. And while in the appropriate GAMES table there were indexes both by TIME and by PLAYERID – it didn’t really work fast. Analysis of execution plans has revealed that optimizer could use either index by TIME, or the one by PLAYERID, but none of them was good:14

If using index by TIME with LIMIT – the system would need to scan all the index entries by all the players until 100 rows for this particular player was reached. So, if our player played little enough to have his 100 last games to go into last year – we were in a Deep Trouble™. With this index – the complexity of the execution plan effectively was O(Nplayers).

If using index by PLAYERID – LIMIT wouldn’t work (we won’t have any sorting after the index by mere PLAYERID), so if the player in question was an active one (with tens of thousands games overall) – then it also caused significant slowdowns. With such an index by PLAYERID, the complexity of the execution plan effectively was O(Ngames-for-specific-player).

At this point – we started to think “what can be improved here?”. And the most important thing-we-have-done – was starting to think:

If I was a compiler-with-ability-to-create-indexes-as-I-need-it – what would I do to make The Ideal Execution Plan™ for this request?

Keeping the primitives above in mind, we came up with the following Knuth-inspired approximation:

We’d use some kind of index to jump to the very last-of-the-games-for-this-specific-player

Then, we’d scan this index (ONLY by this-specific-player) backwards for just 100 records.

Bingo! The cost would be only reading 100 rows, which is going to be pretty fast (in fact – for a sizeable table with lots of different players it is ideally fast, you cannot possibly beat this approach). From a Big-O point of view – the complexity of this approach had an upper bound of O(100)=O(1) <sic! />.

If we’d use C++ big-O approximations at that time, our Ideal Execution Plan™ would look as follows:

ResultSet<Row*> last100Games( int playerID, int limit, std::map<Key,Row*>& index ) { ResultSet<Row*> ret; size_t n = 0; for( Index::iterator it=index.find_last_by_time(playerID), n <= limit && it != index.end(), --it, ++n ) { ret.append( (*it).second ); } return ret; }

And in Java, it would look similar to the following:

ResultSet<Row> last100Games( int playerID, int limit, TreeMap<Key,Row> index ) { ResultSet<Row> ret = new ResultSet<Row>(); int n = 0; for( Map.Entry<Key,Row> entry : index.tailMap_by_time(playerID) ) { //to work, index should have DESCENDING sort order if( n++ >= limit ) break; ret.append( entry.getValue() ); } return ret; }

Ok, with this nice Ideal Execution Plan in mind, now there was a second question: whether such-an-index-we-need-to-use-here, exists? Apparently – it does; in fact – if we make our Key as a tuple of (PlayerID,TIME) – the whole thing will work perfectly as designed.15 The trick here is that as soon as we have a set which is ordered by a tuple (with a quick search by that tuple available) – we can easily and extremely efficiently work with any of the subsets-of-this-set where subset is defined by first X fields of the tuple (in our case – it is a subset defined by a single field PlayerID). Most importantly – this is NOT a property which is unique to B-trees and RDBMS indexes; rather – it is a common property of ALL the ordered sets, including iterations over tree-based structures (including RDBMS B-trees, Knuth trees, C++ std::map<>, and Java’s TreeMap).

With this in mind – we thought that

Hey, if we add a composite index by (PLAYERID,TIME) DESCENDING, 16 it should allow SQL compiler to use our Ideal-Execution-Plan, improving things for this particular request greatly.

In that real-world case I’m speaking about, the trickiest part was to convince our database developers that such an index can help. However – after we managed to push it through (and after waiting for half a day for the index to be created over that-table-with-a-hundred-of-millions-rows) – SQL compiler DID pick the index up, and performance improved between 10x and 100x(!).

And if we think about it – it is not surprising; given-the-data-which-was-stored-in-the-table-at-that-time – it is exactly the kind of gains one could expect from using such an index, for this type of request.

BTW, if we take a closer look at it – we’ll see that it is almost-equivalent to our approximated Ranged-Scan-With-LIMIT (and it was actually the execution plan used by complier after it has been provided with this appropriate composite index).

Different incarnations of the very same problem

If you think that all the stuff above is very obvious – good for you (and even better for your employer <wink />) but I can assure you that there are LOTS of people who don’t realize this kind of things.

Actually, this particular problem with such a composite index by (CUSTOMERID, TIME) missing (or unused), is very common in the industry; I’ve seen its manifestations much more than I’d like to. The most obvious one (in addition to the example above) was a case when a payment processor, to recover from “transaction is in unknown state”, provided an API which allowed merchant to request merchant’s transactions by MERCHANTID;17 however – this request was Really Slow™, so they had to restrict such requests to at-most-2-hours-of-transactions-at-a-time (and such a request took like 10 minutes to get reply from them back); as you can imagine – implementing a recovery which worked correctly if their site was down for more-than-2-hours – was quite involving (and for no real reason too18). TBH, I didn’t have a chance to see their DB, but with everything-I-know, I can be 99+% sure that their problem was that they didn’t have that composite index by (CUSTOMERID, TIME) (which, as we have seen, can be more-or-less easily deduced from the Ideal Execution Plan™ within the approximation paradigm above).

TL;DR for Approximations of Execution Plans

Phew; I hope that I managed to convince you that big-O approximations such as the one above, MAY be a Really Good Thing™. On the other hand, I am the first one to admit that they’re not perfect:

Essentially, such approximations provide only big-O-like analysis; it means that they’re very VERY approximate to start with. On the other hand – if your big-O analysis (for example, using approximations as discussed above) shows that your current execution plan can be improved – we can be easily speaking about performance gains of the orders of magnitude. This in turn means, that for every query which is problematic, such a big-O analysis should be the very first step when trying to optimize. Sure, there are many more other things which can be taken into account when optimizing (such as including non-key fields into the index) – but they tend to give much smaller gains than optimizations-which-improve-big-O-behavior. Even after we have this Ideal-Execution-Plan and added all the necessary indexes – chances are that our specific almighty SQL compiler won’t be smart enough to pick it up. In this case – it becomes a job (IMO more like Black Magic™) for your DBAs to convince the compiler to do whatever-it-should-do. Still – more often than not, the compiler is smart enough to use indexes at least in OLTP settings (as noted above – IMHO it is like 80%, though YMMV greatly). Moreover – for all-DBs-with-hints-I-heard-about – more often than not, “convincing” SQL compiler 19 is possible (especially if your DBAs are good, and do have experience with torturing “convincing” this particular RDBMS to do the right things).



[[To Be Continued…

This concludes beta Chapter 27(b) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”.

Stay tuned for beta Chapter 27(c), where we’ll continue our discussion on well-known ways for database optimization]]

[ + ] References [Knuth-Vol1] Donald E Knuth, “The Art of Computer Programming, vol. 1: Fundamental Algorithms” [Knuth-Vol3] Donald E Knuth, “The Art of Computer Programming, vol. 3: Sorting and Searching”

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.