The optimizer starts with an RSE (Record Selection) block created by the parser, along with an initial Record Source Block (RSB), which simply lists all the tables to be joined.

Examining the node tree that is part of the RSE block, the optimizer creates an RSB tree. Each RSB node describes a join, a boolean restriction, or a type of retrieval from a table.

The final structure describes the joins to be made, and in which order they should be made.

Table 1 – Main RSB types

The execution plan available through the isql commands SET PLAN or SET PLANONLY is simply a subset of the RSB tree created by the optimizer, that has been converted to more human-readable format.

The execution plan always has a format:

Where plan_definition consists of blocks enclosed in parentheses. Each block represents a data stream and the method for its processing. Blocks could be nested to represent the tree structure for joining of individual streams. Each block includes stream identification (for example table name) and access method (INDEX and ORDER with index name, or keyword NATURAL for sequential scan), and finally, descriptions of the operation performed on the streams (MERGE, JOIN, SORT).

As its first step, the optimizer looks at the RSE to identify key elements and to create more, derived elements. In this step optimizer will

2. If there are streams left that are not part of the river, form another best-case river, repeating until there are no streams left. As the last step, the optimizer merges rivers. The key here is the algorithm identifying best join order.

For all joins, the optimizer looks for indexed relationships between the tables in a join and tries to combine join fields using an index. Indexed streams are then joined in sequences to form “rivers”. Streams and rivers are processed in a loop, in which the optimizer:

The best join order for each particular, leftmost stream is computed in a recursive loop (for the next joined stream) as follows:

If a plan is specified, the order is already present within the streams vector, so the optimizer must settle for the one order as described by explicit plan. In this case the optimizer just verifies that the path specified by explicit plan really exists. The optimizer looks for the best join order in a loop, considering each data stream in turn as the leftmost stream in the join order. The best order from each consideration is then determined and placed into the opt block. So at the end of this loop, the opt block holds the best order.

In cases where more than one index matches the same boolean condition, the optimizer picks the one with better selectivity. Selectivity is a number that represents the ratio between the total count of index keys versus the count of duplicate key values, and is used together with table cardinality to compute the estimated number of data pages that must be read. However, when a condition matches only part of index key, index selectivity doesn’t necessarily correspond to the actual efficiency of the index for that particular condition. For example INDEX(NAME, SURNAME) would contain fewer duplicates than INDEX(NAME). Were the condition to be defined only on NAME, then the real selectivity of both indices would be the same.

Indices can be considered only for conditions that compare the full length or the starting part of column values. Therefore, indices are not used for the CONTAINING predicate, a LIKE predicate that begins with a pattern-matching character or for transformation functions (EXTRACT, SUBSTR, UDF etc.). However, Firebird is able to use more than one index for optimizing different parts of a filter condition, or use the same index for more than one part of the condition.

Decisions about using of indices, result from matching index keys with conditional elements of the statement (booleans).

On the other hand, where you often use complex conditions over multiple columns, it is better to define a composite index over these columns, because composite indices are more efficient for composite conditions (the engine can process one index instead of multiple ones) and usually have better selectivity. Searching by index doesn’t automatically mean better performance. Because Firebird must read additional pages (the index) in addition to data pages, using an index is of benefit only if the total number of processed pages would be less than a natural scan. The weight of an index is evaluated by the optimizer as part of the data retrieval cost computation. The outcome of such an evaluation may be that the optimizer ultimately decides not to use the index. The cost is a function of the estimated cardinality of the relation, index selectivity, and total boolean selectivity.

If it’s not necessary to speed up sorting or for referential integrity, it’s preferable to define simple indices instead of composite ones. Firebird is able to combine indices if necessary, and simple indices are much more flexible. More than that, keys of composite indices often overlap, and thus force the optimizer to decide which one is better for a particular case. Although optimizer will do its best, it’s not guaranteed that it will always pick the best one.

Execution plan - examples

The easiest way to read the execution plan and understand the optimization logic is to use some typical examples. The following examples use isql and the standard Firebird sample database EMPLOYEE.GDB. For simplicity, only indices listed are those that are defined for the tables used.

Table JOB:

RDB$PRIMARY2 (JOB_CODE,JOB_GRADE,JOB_COUNTRY) RDB$FOREIGN3 (JOB_COUNTRY)–REFERENCES COUNTRY (COUNTRY); DESCENDING MAXSALX (JOB_COUNTRY,MAX_SALARY) MINSALX (JOB_COUNTRY,MIN_SALARY)

Table COUNTRY:

RDB$PRIMARY1 (COUNTRY)

Table PROJECT:

RDB$PRIMARY12 (PROJ_ID) UNIQUE RDB$11 (PROJ_NAME) RDB$FOREIGN13 (TEAM_LEADER)– REFERENCES EMPLOYEE (EMP_NO) UNIQUE PRODTYPEX (PRODUCT,PROJ_NAME)

Table EMPLOYEE:

RDB$PRIMARY7 (EMP_NO) RDB$FOREIGN8 (DEPT_NO)–REFERENCES DEPARTMENT (DEPT_NO) RDB$FOREIGN9 (JOB_CODE,JOB_GRADE,JOB_COUNTRY)–REFERENCES JOB (JOB_CODE,JOB_GRADE,JOB_COUNTRY) NAMEX (LAST_NAME,FIRST_NAME)

Table PROJ_DEPT_BUDGET:

RDB$PRIMARY17 (FISCAL_YEAR,PROJ_ID,DEPT_NO) RDB$FOREIGN18 (DEPT_NO)–REFERENCES DEPARTMENT (DEPT_NO) RDB$FOREIGN19 (PROJ_ID)–REFERENCES PROJECT (PROJ_ID)

Example 1: A Simple SELECT:

SET PLANONLY ON; SELECT * FROM JOB ; PLAN (JOB NATURAL)

This plan is really simple:

Only one table is processed – JOB. Data stream is processed by a sequential read – keyword NATURAL.

Now let’s look at this same SELECT statement, but with different filter conditions in WHERE clause.

Example 2: Filter condition on non-indexed column:

SELECT * FROM JOB WHERE MIN_SALARY > 50000 ; PLAN (JOB NATURAL)

Plan is the same as before, because there is no index on MIN_SALARY.

Example 3: Filter condition on indexed column:

SELECT * FROM JOB WHERE JOB_CODE ='Admin' ; PLAN (JOB INDEX (RDB$PRIMARY2))

Keyword INDEX in the execution plan indicates that the index is used to narrow the read only to candidate rows. There is no index defined on JOB_CODE itself, but JOB_CODE is at the start of the key in the primary index (RDB$PRIMARY2), so this index is used.

Example 4: Filter condition on column with more than one index:

SELECT * FROM JOB WHERE JOB_COUNTRY ='Japan' ; PLAN (JOB INDEX (MINSALX))

Column JOB_COUNTRY is at the beginning of three indices: the foreign key to table COUNTRY, MAXSALX and MINSALX. The pptimizer picked the MAXSALX index because it has the best selectivity.

Note: Index selectivity is stored in the system table RDB$INDICES, and you can get it by issuing the following statement:

SELECT RDB$INDEX_NAME,RDB$STATISTICS FROM RDB$INDICES ;

Selectivity is a number between 0 and 1. A lower number means better selectivity.

Example 5: Filter condition on more columns with indices:

SELECT * FROM JOB WHERE JOB_COUNTRY ='Japan' AND JOB_CODE ='Admin' ; PLAN (JOB INDEX (RDB$PRIMARY2,MINSALX))

When a filter condition is defined on multiple columns that match to multiple indices, Firebird merges bitmaps produced from each index by appropriate bitwise AND or OR operation into single bitmap.

Example 6: Output sorting I:

SELECT * FROM JOB WHERE JOB_GRADE =2 ORDER BY MIN_SALARY; PLAN SORT ((JOB NATURAL))

Inner parenthesis represents the access to the data stream (JOB), while outer parenthesis defines the sorting operation – SORT (data_stream). The table is read sequentially, because there isn’t an index on JOB_GRADE, neither does this column appear at the beginning of any index. By the same rule, there are no defined indices that could be used for MIN_SALARY.

Example 7: Output sorting II:

SELECT *FROM JOB WHERE JOB_COUNTRY ='Japan ' ORDER BY MAX_SALARY; PLAN SORT ((JOB INDEX (MINSALX)))

Data access is optimized by index MINSALX, but sorting over MAX_SALARY by navigational index is not possible.

Example 8: Output sorting III:

SELECT * FROM JOB WHERE JOB_CODE ='Admin' ORDER BY JOB_CODE; PLAN (JOB ORDER RDB$PRIMARY2)

In this case, the sort is optimized by a navigational index (keyword ORDER). When a navigational index is used, it’s not possible to decipher from the plan if other indices are used for filter conditions. In this particular case, the filter condition is optimized also by RDB$PRIMARY2.

Example 9: Join of two tables (Inner join):

SELECT * FROM JOB JOIN COUNTRY ON JOB.JOB_COUNTRY = COUNTRY.COUNTRY; PLAN JOIN (JOB NATURAL, COUNTRY INDEX (RDB$PRIMARY1))

Stream joins are indicated by JOIN (list of joined streams), and it’s realized as a loop over the first (left) stream with a lookup in the second (right) stream. Because this join doesn’t contain filter conditions, it’s necessary to process all rows from first table.

The optimizer picked up the loop over JOB table, which will be processed sequentially, and joined it with the corresponding rows from COUNTRY using index RDB$PRIMARY1. This index is obviously the best bet, as the index for the primary key has unique key values, so the total number of processed rows would be minimal. The second factor in this decision is the number of rows in both tables (after filtering). In this particular case (JOB contains 31 rows, whilst COUNTRY contains 14 rows), the difference in size between both tables in favour of COUNTRY was not significant enough to get rid of the benefit of using the unique index.

If the statement had defined the join the other way around, it would have had no impact on the final decision taken by optimizer:

SELECT * FROM COUNTRY JOIN JOB ON COUNTRY.COUNTRY = JOB.JOB_COUNTRY; PLAN JOIN (JOB NATURAL, COUNTRY INDEX (RDB$PRIMARY1))

But a different situation will occur on slightly different statement:

SELECT * FROM DEPARTMENT D JOIN PROJECT P ON D.MNGR_NO = P.TEAM_LEADER; PLAN JOIN (D NATURAL, P INDEX (RDB$FOREIGN13))

In this case the optimizer picked up the loop over PROJECT table, because neither candidate index is unique, so the size ratio between the two tables (DEPARTMENT has 21 rows, PROJECT has 6 rows) does matter here.

Example 10: Join of two tables (outer join):

SELECT * FROM COUNTRY C LEFT OUTER JOIN JOB J ON C.COUNTRY = J.JOB_COUNTRY; PLAN JOIN (COUNTRY NATURAL, JOB INDEX (MINSALX))

The one-sided outer join doesn’t allow the optimizer to pick the leftmost table for the join, because all the rows from the outer table must be processed. But there are multiple indices on the JOB table that are applicable to JOB_COUNTRY, so the optimizer has to choose one. In this case the MINSALX index is chosen due to its selectivity, although selectivity doesn’t really matter in this particular case.

Example 11: Join with filter condition:

SELECT * FROM JOB JOIN COUNTRY ON JOB.JOB_COUNTRY = COUNTRY.COUNTRY WHERE JOB.JOB_CODE ='Admin'; PLAN JOIN (JOB INDEX (RDB$PRIMARY2), COUNTRY INDEX (RDB$PRIMARY1))

This statement differs from the one in example 9 only by the additional filter condition on the JOB_CODE column. The join order remains the same, but the table JOB is filtered by index before the join:

SELECT * FROM JOB JOIN COUNTRY ON JOB.JOB_COUNTRY = COUNTRY.COUNTRY WHERE JOB.JOB_COUNTRY ='Japan'; PLAN JOIN (COUNTRY INDEX (RDB$PRIMARY1), JOB INDEX (MINSALX,MAXSALX))

Again the same join, but this time with a filter condition on JOB_COUNTR, that happens to be a foreign key. This time the optimizer completely changed the join order, because JOB_COUNTRY in table JOB is a foreign key in a table that is involved in a join. Thanks to distributed equivalencies (do you remember when I talked about them?) this condition is also applicable to the COUNTRY table. Because the condition applied to the COUNTRY table matches the primary key, and thus evaluates to a single row, it’s better to “loop” over this one record and perform a lookup in the JOB table.

JOB table is also processed via index, actually twice

First, the index is used for filter condition.

2. Second, the index is used for lookup in join loop. Both lookups are over the same column – JOB_COUNTRY, but it’s really performed twice. The first lookup is for the filter condition and the resultant bitmap persists over the processing of the whole statement. The second lookup is performed for each COUNTRY row processed in the join loop (actually, only one). The fact that the optimizer picks two different indices (MINSALX and MAXSALX) for each lookup is one of the optimizer’s mysteries.

Example 12: Join of three tables:

SELECT * FROM PROJECT JOIN EMPLOYEE ON PROJECT.TEAM_LEADER = EMPLOYEE.EMP_NO JOIN JOB ON EMPLOYEE.JOB_CODE = JOB.JOB_CODE WHERE JOB.JOB_COUNTRY = 'Japan'; PLAN JOIN (JOB INDEX (MINSALX), EMPLOYEE INDEX (RDB$FOREIGN9), PROJECT INDEX (RDB$FOREIGN13))

When more than two tables are joined, the join order has a great impact on the speed of joins. In this case, the statement defines the join as EMPLOYEE->PROJECT->JOB. Because of the filter condition on the JOB table, the optimizer changed the order to JOB->EMPLOYEE->PROJECT.

Note: The optimizer doesn’t know the real size of data sets after filtering; it performs a crude guess based on table size, index selectivity, type of condition and other factors.

For the same join but without the filter condition, the join order would be different:

SELECT * FROM PROJECT JOIN EMPLOYEE ON PROJECT.TEAM_LEADER = EMPLOYEE.EMP_NO JOIN JOB ON EMPLOYEE.JOB_CODE = JOB.JOB_CODE; PLAN JOIN (PROJECT NATURAL, EMPLOYEE INDEX (RDB$PRIMARY7), JOB INDEX (RDB$PRIMARY2))

As you can see, in this case the optimizer rightly favoured the order with lookups using unique indices.

Example 13: Complex join of four tables

The strengths and weaknesses of any optimizer show up in complex queries. For example, say we want to list all projects with budgets for individual departments working on projects in 1994, but just for those projects whose team leaders work in countries with a dollar currency. It’s a pretty wild request that would require a complex query.

All data we want to list are taken from joining PROJECT->PROJ_DEPT_BUDGET, but we must define an additional PROJECT->EMPLOYEE->COUNTRY join for filter condition.

One way to write this request would be:

SELECT P.PROJ_ID,P.PROJ_NAME,P.PRODUCT,PB.DEPT_NO,PB.PROJECTED_BUDGET FROM PROJECT P JOIN PROJ_DEPT_BUDGET PB ON P.PROJ_ID = PB.PROJ_ID JOIN (PROJECT P2 JOIN EMPLOYEE E ON P2.TEAM_LEADER = E.EMP_NO JOIN COUNTRY C ON E.JOB_COUNTRY = C.COUNTRY)ON P.PROJ_ID = P2.PROJ_ID WHERE C.CURRENCY ='Dollar' AND PB.FISCAL_YEAR = 1994; PLAN JOIN (PB INDEX (RDB$PRIMARY17), P2 INDEX (RDB$PRIMARY12), E INDEX (RDB$PRIMARY7), C INDEX (RDB$PRIMARY1), P INDEX (RDB$PRIMARY12))

The optimizer did a good job with join order here, but didn’t detect the double use of the PROJECT table as P and P2, in order to reduce it to a single use. The next statement performs the same query, but the join is specified differently, without double use of PROJECT table:

SELECT P.PROJ_ID,P.PROJ_NAME,P.PRODUCT,PB.DEPT_NO,PB.PROJECTED_BUDGET FROM (PROJECT P JOIN EMPLOYEE E ON P.TEAM_LEADER = E.EMP_NO JOIN COUNTRY C ON E.JOB_COUNTRY = C.COUNTRY) JOIN PROJ_DEPT_BUDGET PB ON P.PROJ_ID = PB.PROJ_ID WHERE C.CURRENCY ='Dollar' AND PB.FISCAL_YEAR = 1994; PLAN JOIN (PB INDEX (RDB$PRIMARY17), P INDEX (RDB$PRIMARY12), E INDEX (RDB$PRIMARY7), C INDEX (RDB$PRIMARY1))

As you can see, the new execution plan is optimal in the same way as before, but this time without doubling the processing of the PROJECT table.

Example 14: Sorting and aggregation

The next statement lists all projects with departmental budgets for individual years:

SELECT P.PROJ_NAME,P.PRODUCT,PB.DEPT_NO,PB.FISCAL_YEAR,PB.PROJECTED_BUDGET FROM PROJECT P JOIN PROJ_DEPT_BUDGET PB ON P.PROJ_ID = PB.PROJ_ID ORDER BY P.PRODUCT, P.PROJ_NAME; PLAN SORT (JOIN (PB NATURAL, P INDEX (RDB$PRIMARY12)))

According to this plan, the engine joins the source streams into one, then sorts that. Sorting could be optimized by an index. In this particular case, the ORDER BY clause matches the PRODTYPEX index, but this index cannot be used, because table PROJECT is not the controlling (leftmost) table for the join.

In next example, the situation is completely different:

SELECT P.PROJ_NAME,P.PRODUCT,PB.DEPT_NO,PB.FISCAL_YEAR,PB.PROJECTED_BUDGET FROM PROJECT P JOIN PROJ_DEPT_BUDGET PB ON P.PROJ_ID = PB.PROJ_ID ORDER BY PB.FISCAL_YEAR; PLAN JOIN (PB ORDER RDB$PRIMARY17, P INDEX (RDB$PRIMARY12))

Ordering is defined on the indexed column FISCAL_YEAR, which is part of controlling table for the join. Thus it is possible to perform the join loop with a navigational index to get a sorted list without sorting. The use of navigational index is indicated by the ORDER keyword in the plan.