The model for the cost of operations used by the SQL Server query optimizer to produce an execution plan is rather simple. It has been largely unchanged since SQL Server version 7, RTM in 1998. Back then, SQL Server had not yet achieved tier 1 status with commensurate development resources. Some adjustments were made in version 2005. One might question how something so old can possibly be relevant today, considering how much hardware has changed over the intervening years. The cost model is still adequate for pure transaction processing scenarios in which queries have highly selective search arguments. Its limitations become pronounced in parallel execution plans.

The salient aspect of modern hardware is that compute capability is distributed over very many processor cores, possibly with irregularities due to non-uniform memory access (NUMA). As such, the decision in parallelism cannot be just on/off with a cost threshold and one-size fits all with max degree of parallelism. It is necessary to have a degree of sophistication in the employment of resources for parallel processing. For some queries, it might be best to utilize a low to moderate level of parallelism. In others, perhaps all cores in one socket. And in exceptional circumstances, the whole set of cores.

For this, the existing SQL Server cost model does not have sufficient depth to be the foundation from which an effective strategy for deploying resources can be formulated. It should not be too difficult to build a suitable model, considering the wealth of knowledge that has (or should have been) accumulated over the years. There will be some resistance to changing the cost model given that people have become accustomed to the plans generated from the current cost model. But that could be accommodated as a legacy mode, so as to not stand in the way of progress.

The SQL Server Plan Cost Model Overview

The cost model of the SQL Server Query Optimizer is explored in more detail elsewhere. Other authors have written on the query optimizer itself. Only a brief description of the SQL Server cost model is given here.

Buried in Books Online in the section on the query governor cost limit option, is the statement: "Query cost refers to the estimated elapsed time, in seconds, required to complete a query on a specific hardware configuration." The reference system existed when SQL Server version 7 was in development, so this could have a been a Pentium Pro system, but some people have said that it was a Pentium. My investigation in this topic began around year 2000, though I never went back to look at whether SQL Server version 6 and 6.5 used the same model.

Each operator in an execution plan can have two components, CPU and IO. The IO component represents the time to complete the IO operation. There is in fact also CPU involved to perform an IO, and it is unclear whether this is part of the IO or represented in the CPU component. The assumption is that leaf level pages are not in memory at the start of query execution, but maybe retained if the execution touches a page more than once. The complete cost of operator factors in the CPU, IO, number of rows and number of executions. The CPU and IO may represent the cost for only a single execution, while the operator cost incorporates the entire operation as a function of the number of executions.

The IO cost is based on a model of the disparate nature of hard disk performance in random and sequential access at fixed queue depth. SQL Server version 7 and 2000 had two baselines, but from version 2005 on, the IO cost model is that random IO performance is 320 IOPS, and sequential IO is 1350 pages per sec (10.8MB/s). As cost is in seconds, the random IO from key lookup and loop join operations or the first page of a scan is 1/320 = 0.003125. Each successive page in a scan or range seek is 1/1350 = 0.000740740 (the 740 sequence repeats).

The plan cost does not to model the cost of logic beyond the basic operation. For example, an aggregation of one column has the same cost as two or more. The interpretation of this is that the purpose of the plan cost model is to find the best index and table access sequence and method. Ancillary logic that must be executed in all cases does not affect the choice of plan.

Hard Disks 1995-2005

It might seem that given the pace of hardware change, such old model cannot possibly valid, resulting horrible execution plans. Around 1995 or so, the high-performance HDD was 7200RPM with a sequential bandwidth of 4-5MB/s. The mean rotational latency for 7200RPM is 4ms. An average seek time of 8.5ms seems reasonable, though I have not kept any documentation from this period. This would correspond to 80 IOPS per disk at queue depth 1 per HDD. So, it seems curious that the SQL Server cost model is based on the random IOPS of 4 disks, but the sequential IO of 2 HDDs.

Performance HDDs progressed first to 10K RPM around 1996, and then to 15K around 2000, with corresponding rotational latencies of 3 and 2ms respectively. Average seek time was reduced over time to 3ms with developments in powerful rare-earth magnets. The 10K HDD could support 125 IOPS and 200 IOPS for the 15K HDD. But no further progress was made on HDD rotational speed. In same time period, hard disk sequential IO phenomenally quickly exceeding 100MB/s in the fourth generation 15K HDD around 2005.

In other words, the SQL Server cost model is based on a 1350/320 = 4.2 ratio. But 15K HDDs in the mid-2000's were 100MB/s × 128 pages/MB = 12,800 pages/sec sequential to 200 IOPS random for a ratio of 64:1. It turns out that achieving nearly the sequential IO capability of HDDs from SQL Server required a special layout strategy, as outlined in the Fast Track Data Warehouse Reference Architecture papers, which few people followed. This was due to the fixed, inflexible IO pattern of SQL Server, which required the disk configuration to match that of SQL Server instead of being able to adjust the IO pattern to match that of the storage configuration.

In a SAN system, where the vendor was exclusively focused on random IO, the typical configuration supported sequential performance of about 150 IOPS at 64KB, for 9.6MB/s or 1200 pages per sec. Quite by accident, this was approximately in line with the SQL Server query optimizer. In any case, regardless of the actual random and sequential values, the important criterion is the sequential to random IO ratio, on the assumption that leaf level pages are not in-memory. And this was usually not true by the 2005 timeframe, when hot data was likely to be in memory for a well-tuned transaction processing system on a maximum memory configuration 4-way system.

Queue Depth

All the numbers cited above for HDD random IOPS performance were for queue depth 1 per disk. Of course, SCSI HDDs support elevator seeking, in which IO is issued at high queue depth, and the disk reorders the sequence of operations for higher IOPS. The SQL Server engine facilitated this by issuing IO at high queue depth when the estimated of number executions of a key lookup or loop join was 25 or more. Below this, IO was issued at queue depth 1. But the plan cost did not have a model for HDD random IO at different queue depths.

It should be pointed out that individual query performance optimization is not always the prime objective. For a transaction processing system that also supports reporting queries, the first priority is usually transaction latency. In this case the preferred strategy is to sacrifice reporting query performance by not issuing IO at excessively high-queue depth to keep latency low.

SSDs

SSDs started to pick up momentum around 2010. Early use was for extreme requirements. Now it is the more practical choice for almost all main line-of-business systems over HDDs. This comes at a time when the system may also happen to have enough memory for almost the entire database. With data on SSD, the access times for sequential and random might reflect a lower ratio than the optimizer model as opposed to a much higher ratio for HDDs in the FTDW RA. Still, the CPU cost for loading 64KB or more into memory with a single IO is lower than evicting pages and loading multiple individual pages. The scan should have a lower cost per page.

There is a less appreciated aspect of SSD storage in being able to achieve massive IOPS capability far in excess of what is really necessary. In an HDD system, it was possible to support sufficiently high random IOPS at less than excessive latency for transactions. It was also possible to achieve high bandwidth for large block IO in support of DW. What was not entirely practical was to simultaneously support low latency IOPS for transactions and high bandwidth for DW. This is now possible with an SSD storage system and this should be the configuration objective. It would be helpful for SQL Server to implement the correct IO queue depth strategy for this capability.

Key Lookup - Scan Cross-over

For pure transactions, queries have a highly selective search argument. In such cases, the plan cost model is not hugely important, regardless of wild differences between the plan cost model and various implementations of real hardware. The cost model is more important for queries involving an intermediate number of rows, in which the true cross-over point from an index seek with key lookup to scan operation is of greater interest. But even then, we may prefer one of the two for other reasons.

Parallel Cost Model Inconsistency

There is a curious inconsistency in the cost model for parallel execution plans. The CPU component of operator cost is scaled down in proportion to the degree of parallelism, except for the last doubling of the number of processor (cores). At DOP 2, the CPU cost is half of that at DOP 1. On a system with 20 logical processors and unrestricted parallelism, the CPU cost is reduced by a factor of 10, regardless of whether it is 10 physical cores with Hyper-Threading or 20 cores without HT. The inconsistency occurs in IO cost. There is one model for source access, that is seeks and scans to table or indexes. And there is a different model for intermediate operations, examples being Hash Match and Sort operators.

Parallel Cost - Table and Index IO

The IO component in table and index access operations, including index seek, scan, and key lookup, does not change with parallelism. The interpretation of this is that IO system is saturated with a single thread. Hence, a parallel plan does not reduce IO time in accesses to tables and indexes. This is a very reasonable assumption for the original reference system having 2 or 4 HDDs.

It is a not an accurate representation for scan operations on a system configured to the FTDW RA, capable of multi-GB/s IO bandwidth. The saturated IO model might be somewhat applicable for random IO because a single processor core from about 2006 on could drive 50K IOPS depending on the situation and only a very large array of perhaps 200 15K HDDs could support that volume at queue depth 1 per HDD.

On a modern system with properly configured SSD storage, the expectation should be that both sequential and random IO scales with parallelism, fully up to the number of physical cores. An example of a properly configured system is test system from Parallel Execution in SQL Server 2016, one Xeon E5 2640 v4 10-core processor with 4 Intel P750 PCI-E SSDs.

Parallel Cost - Hash and Sort Operations

Hash and Sort operations for a small data set only have a CPU component. When the number of rows times the row-size exceeds a value, there will be an IO component as well. The set point is some fraction of memory, either the max server memory or system memory if the prior is unlimited. This appears to be just over 5 MB per GB of memory. This value is also per thread. At higher DOP, each thread can accommodate up to the set point. Beyond the set point, the IO is somewhat greater than nonlinear, at perhaps IO 25 for 10MB, and 300 for 100MB. However, the IO cost in hash and sort operations does scale with parallelism, unlike IO cost is index seek and scan operations.

This might seem to be the min query memory configuration setting and the resource governor max memory setting in actual runtime behavior. But it might be that the plan cost model only factors in the max server memory setting.

Plan Cost Scaling - Parallelism

Below is the ratio of plan cost at DOP 1 to higher DOP between 2 and 10 for the TPC-H SF 100 database on the Xeon E5-2630 v4 system with 64GB, but SQL Server limited to 32GB. In essence, this is the predicted speedup from parallelism by the plan cost model.

In several queries, there is very little reduction in plan cost at higher DOP. This occurs when the bulk of the costs are from the table and index access operations, in which parallelism does not reduce IO cost, only the CPU cost is reduced.

In a few queries, examples being 8, 16 and 18, there is significant scaling in plan cost. This occurs primarily when much of the cost is in the hash match or sort operations, in which case, both CPU and IO are reduced with parallelism.

In any case, actual scaling is always better than plan scaling, as shown below. See Parallel Execution in SQL Server 2016 for a more complete discussion.

The SQL Server plan cost model has very little predictive value on the matter of scaling. The plan cost model only predicts scaling when a significant portion of the cost is in the IO component of hash and sort operations. When the cost is largely in the IO of table and index scans and range seeks, the plan predicts weak scaling. In fact, many operations seem to scale well when the IO subsystem is not saturated from a single thread.

True Cost versus Plan Cost

As mentioned earlier, the plan cost model was based on a reference system that was probably a Pentium processor from the mid-1990's. In that era, each successive generation of processor increased rapidly in performance. It would not make sense to recalibrate the cost model to newer generation processors. And furthermore, it does not really matter what the processor is. Plan cost tends to be dominated by the IO component of table and index accesses based on 320 IOPS random and 10.5MB/s sequential, or from the IO cost of hash and sort operations. Still, one might be curious as how plan cost correlates to actual cost.

The sum of the plan cost for the 22 TPC-H SF100 queries at DOP 1 is 200,839.65. Note that many of the execution plans have large hash operations for which the IO cost is dependent on system memory. The sum of the execution times for the same 22 queries also at DOP 1 is 3,515.15sec, although it would be less if some queries were manually tuned, resulting in higher plan cost, but lower runtime. The ratio of the sum of the plan costs to actual runtime is 57.135.

Curiously, this is the approximate order of magnitude difference between a Pentium and modern processor core. There was no particular expectation of this because the plan cost model represents mostly IO wait while the test system has sufficiently powerful IO that the ratio of worker time to elapsed time was 98%.

The figure below shows actual query runtime divided by plan cost scaled by 57.1, for the 22 TPC-H queries.

The range in variation between actual execution time relative to plan cost scaled by 57 is between 0.32 and 2.53. This is actually not a bad range, because the query optimizer plan cost does not attempt to model the true cost of a query, only the methods of table and index access.

Two of the outliers on either end are Q2 at 0.32 and Q13 at 2.53. The critical part of the plan for Q2 is shown below.

The large portion of the plan cost is for the loop join to PARTSUPP. As pointed out earlier and elsewhere, the plan cost for random IO is fairly high. Most of the operations in the TPC-H queries are scans or range seeks. The bulk of weight in the measured 57X scaling relative to plan cost is driven by the plan cost of sequential IO.

Also, disregarding the plan cost model, the SQL Server engine actually issues IO at high queue depth when the estimated rows is more than 25. The expectation is that a storage system with 4 NVMe PCI-E SSDs blows through the loop join faster than the 57X average scaling.

The other extreme is Q13, for which the core plan component is below. There is nothing unusual in the plan.

However, in the query search argument is the following:

AND CUSTOMER not like '%%special%%requests%%'

This expression is expected to be expensive to evaluate, and is not modeled in the plan cost, being treated as just any other predicate.

Compression

With compression, the plan cost for the 22 TPC-H SF100 queries at DOP 1 decreases by 32% from 200,839 to 136,792, reflecting lower IO cost from compressed tables. The actual execution time increases by 21% from 3,515 to 4,275sec, reflecting the overhead of compression significantly outweighing the reduced IO.

Cost Threshold for Parallelism

The original scale of plan cost was that 1 represented a simple estimate of execution time in seconds. This has long lost its original meaning. For all practical purposes, from a few years after its arrival, we treated cost as some arbitrary unit of measure. Except for the fact that plan cost was used in the cost threshold for parallelism. In the original model, the intent was that the query optimizer does not consider parallelism until the estimated execution time of the non-parallel plan exceeds 5sec, which was once a very reasonable value.

Today, plan cost 5 might be a query that executes in less than 0.1 sec without parallelism. This is most definitely well below the point at which parallelism should be engaged. The overhead of coordinating multiple threads is heavy in relation to the actual work itself. But we should also not wait until plan cost 285, corresponding to 57 × 5, for the original 5 sec duration. Part of this is because people today are far less patient than in the previous era. The other part is that today we have so many processor cores at our disposal that we can now be more liberal in their use than before.

Parallelism Operators

Having discussed both the cost model and the to some degree actual scaling, it is now the time to lay the ground work for a new model of plan cost and parallelism. A parallel plan involves splitting the work to be done among multiple threads. This involves additional work not in a non-parallel plan, comprising one or more of: distribute, repartition or gather streams.

The effectiveness of a parallel plan depends on both the ability to divide the work with some degree of uniformity and such that the extra work necessary to coordinate multiple threads not being too large. There is also some effort necessary to start a thread or to acquire threads from a pool.

The Bitmap Operator

There is also another factor in parallelism. As discussed elsewhere, there are a number of queries in which the parallel execution plan runs with lower CPU time than a non-parallel plan, which should not happen. It turns out that the parallel plan has a bitmap operator that is not in the non-parallel plan.

The following is from Microsoft TechNet on Bitmap Showplan Operator: " SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans. Bitmap filtering speeds up query execution by eliminating rows with key values that cannot produce any join records before passing rows through another operator such as the Parallelism operator. A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. By removing unnecessary rows early in the query, subsequent operators have fewer rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap is selective enough to be useful and in which operators to apply the filter. For more information, see Optimizing Data Warehouse Query Performance Through Bitmap Filtering. "

As it turns out, the bitmap operator might have the effect of improving plan efficiency by up to 30% depending on the circumstances. Microsoft could allow the bitmap operator to occur in plans above a certain cost threshold, but not parallel for whatever reason. That is, a separate the cost threshold for the bitmap operator from the cost threshold for parallelism. I would not hold my breath on this.

Processor and System Architecture

The foundation of the parallelism strategy must encompass the nature of the modern processor and system. The largest Intel Broadwell EP/EX of 2016 has 24 cores, the full set of which is available in the Xeon E7-v4, but only 22 in the E5-v4.

While we may have been conditioned from past history into viewing a server system as having multiple processor sockets, that fact is that a system having a single processor socket today can be incredibly powerful. There are significant benefits for a single socket system. One of which is that all memory is local. Another substantial benefit is that the effort to coordinate between multiple threads is low. This leads to excellent scaling characteristics even when there appears to be significant inter-thread redistribution activity.

For those who cannot let go of the multi-socket mind-lock, below are representations of a 2-socket E5

and a 4-socket Xeon E7.

The inescapable fact is that coordinating between threads running on cores in different sockets is very substantial. And so, there is a very substantial degradation in scaling when threads are on different sockets. For this reason, the parallelism strategy must respect the nature of the system architecture. Low to moderate parallelism must use cores on a common socket. There should probably be one strategy in employing parallelism up to the number of cores in one socket and a separate strategy for high-parallelism spanning cores on multiple sockets.

A Modern Parallelism Strategy

With this knowledge, the parallelism strategy needs to blend the method for partitioning of work with the nature of the modern server processor and system. There are very many cores, i.e., compute resources that can be allocated to query plans, in either a single or multi-socket system. For this, we need something more than the very basic, legacy configuration controls: 1) Cost Threshold for Parallelism and 2) Max Degree of Parallelism. To Connect request on this matter, Microsoft replied that the newer additional controls in Resource Governor is sufficient. In essence, the SQL Server team is abrogating their responsibility in having an intelligent mechanism to properly utilize the capabilities of a modern system.

The first step might be to rescale the plan cost so that the cost threshold for parallelism is in more understandable terms. This is not absolutely essential, and we could just state that plan cost 60 might correspond roughly to 1 second. It did not make sense to rescale plan cost in the days when processor core performance changed substantially with every generation. But performance at the core level over the last several years have been more stable, improving at a slow or moderate pace, and may continue to do so barring a radical event. We might take this opportunity to reset the plan cost.

Second, the cost model needs to be more comprehensive. The cost of touching each row in a scan or other sequential access is larger than the current model represents and should be corrected. For this, the impact of lock-level might be considered. The cost of touching each column is both non-trivial and not accounted for at all in the current model. The cost of ancillary logic also needs to be modeled. And of course, the cost of access to compressed pages should be accounted for. The above items may not impact the structure of the execution plan, which is why it was not considered back in version 7 or 2008 for compression. But it is relevant in the cost threshold for parallelism assessment, and also impacts scaling projections.

Third, the IO cost model needs to be fixed. This might involve modeling the actual IO subsystem performance characteristics. Or it might just assume that IO performance scales with parallelism. Also important is that parallelism scaling does not stop at the half the number of cores, instead going up to the full number of physical cores. It would help if the optimizer knew the difference between physical and logical cores. The IO cost in hash and sort operations already scale with parallelism, so fix the IO cost for source access would make the complete model consistent.

As a rough guess, (current) plan cost 30, corresponding to roughly one-half of one second execution time might be a good default cost threshold for parallelism, settable of course. For a query with plan cost 30 expected to execute in 0.5 CPU-sec (worker time), a DOP of 2 and not higher might be a good idea. One suggestion for the DOP strategy might be nonlinear. Employ DOP 2 at the base value of the cost threshold. Then double the DOP for every 4× over of the base cost. If base is 30, then plan cost 120 (4×30) is the threshold for DOP 4 and 480 (16×30) is threshold for DOP 8. This could be the strategy for DOP up to the number of cores in one socket with one threshold for the base value and another for escalating the DOP.

The expectation is that scaling is not so simple or excellent for parallelism on cores in different sockets. It would be good if the plan cost could model this. In any case, there might be an even higher threshold for employing cores over multiple sockets. We should not normally set DOP to be somewhat over the numbers cores in one socket. It should be up to the number of cores in one socket for most circumstances, or employ most of the cores of multiple sockets for exceptional circumstances. Furthermore, if the desired number of cores in one socket are not available, then reduce the DOP rather than employ cores on different sockets.

Summary

Both cost model and the simple controls for parallelism have become seriously obsolete for modern high core count processors. An improved cost model, encompassing cost structure that was not necessary for optimal single thread table access strategy, is now needed for the parallelism strategy. The cost model needs to correctly model parallelism on modern hardware, which means fixing the IO cost model that currently assumes a single thread saturates the IO. A proper parallelism strategy also needs more complexity than the current on/off based on the cost threshold for parallelism, and the all or nothing with only a MAX DOP setting. A moderately more complex model has been suggested but alternatives are possible.