In SQL Server 2019 CTP 3.1 a new index option was added called OPTIMIZE_FOR_SEQUENTIAL_KEY that is intended to address throughput issues that may occur when a workload has a high number of concurrent inserts. These workloads commonly face an issue known as last page insert contention.



In SQL Server, indexes are stored as B-trees which are ordered by the index key. In a clustered index, the leaf level contains the data pages themselves, ordered by the index key. This means that when you are inserting rows into a table that has a clustered index where the key is a value that is always increasing, such as an identity column or a datetime column, all the new rows will be inserted on the last page of the B-tree. As concurrency increases, contention for this page in memory will increase, which limits scalability of such workloads. This problem can be identified by many threads waiting with a wait type of PAGELATCH_xx on the same page resource in a user database. In an OLTP system that is frequently inserting data, PAGELATCH waits will often be observed, particularly on a busy system.



There are many different methods to address this issue that have been suggested in the past, most of them involve making changes to either the application or the structure of the index. The following article outlines some of the most common solutions:



How to resolve last-page insert PAGELATCH_EX contention in SQL Server

https://support.microsoft.com/kb/4460004



When thinking about addressing this problem automatically in SQL Server, these options as well as others were considered, but most of them require some sort of performance trade-off and would involve major changes to the database engine. What we decided to do instead was to look closer at the contention itself and see if there was some way to improve scalability without making major changes to the way the data is physically stored.



While PAGELATCH contention does introduce additional latency in an application, it is something called a latch convoy that is truly detrimental to application scalability. This blog article gives a good description of convoys and why they happen. Traffic jams are a common analogy used to describe the problem. If you have a road that is at maximum capacity, as long as all the traffic continues to move at the same speed, throughput will be consistent. If you introduce a bottleneck on the road that forces all the cars to go through a single lane, this will cause traffic to back up behind the bottleneck. Traffic will continue to make progress at a slower rate, but if the rate of cars entering the highway remains constant, the traffic jam will get larger and larger unless the bottleneck is removed. In this situation, if something occurs that causes the drivers entering the single lane to hit the brakes, such as a slow driver or a hazard on the road, the rate of throughput drops severely and traffic really piles up. At this point, throughput won't recover until the rate of cars entering the road slows down dramatically, much lower than what the road would normally be able to handle.



With last page insert contention, as the number of insert threads increases, the queue for the page latch increases which in turn increases latency. Throughput will also decrease, but if something slows down one of the threads that is holding the latch, this can trigger a convoy and throughput suddenly falls off a cliff. This typically happens when a page fills up and a new page must be added to the index (also known as a page split). The insert that triggers the new page will naturally have to hold the latch for longer than normal while the new page operation completes. This causes the queue to build up behind the latch. Adding a new page also requires an exclusive latch on the parent page, which can cause latch requests to queue at that level as well. At this point, throughput falls off a cliff.

OPTIMIZE_FOR_SEQUENTIAL_KEY aims to do two things – control the rate at which new threads are allowed to request the latch, and favor threads that are likely to keep the throughput high. These techniques will not prevent the contention or reduce latency, but they will help keep throughput consistent as concurrency increases.

When using this option, you may not see a decrease in PAGELATCH waits, in fact you may even see an increase in waits with a new wait type called BTREE_INSERT_FLOW_CONTROL. Despite these waits, you should be able to achieve much better throughput, and scale to a much higher number of concurrent threads, without hitting the proverbial cliff. If you're not experiencing the convoy phenomenon in your workload, you may not see a huge benefit from this option, and you may even see a slight degradation in performance due to the new flow control waits. You should only use this option if you have a very heavily contentious workload – one where the number of threads inserting into the index is much higher than the number of schedulers – on a clustered index with a sequential key (note that non-clustered indexes can experience this problem as well, but because they have a smaller row size they don’t have as high a tendency to form convoys so they are less likely to benefit from this option). We are also working on providing some new reporting in Azure Data Studio and SQL Server Management Studio that will help detect this scenario and recommend the option where appropriate.

Hopefully that’s enough of an explanation for the average user, but I know many of you out there are thinking “Yeah but what are you DOING under the covers? How does this really work?” To dig a little deeper, as you may have guessed from the name of the new wait type, enabling OPTIMIZE_FOR_SEQUENTIAL_KEY turns on what we call flow control for the index in question. When OPTIMIZE_FOR_SEQUENTIAL_KEY is set to ON for an index, we introduce an upstream flow control mechanism for all the threads that request to enter the critical section of SQL Server code that is protected by a page latch. Unlike a semaphore which simply controls the traffic based on a fixed number of threads, this new mechanism controls the flow of traffic based on a set of conditions such as which CPU the thread is on, the state of the thread, and/or the rate at which the thread is completing work.

With OPTIMIZE_FOR_SEQUENTIAL_KEY, we will limit the number of threads allowed to request the latch to one per scheduler (i.e. CPU core), which should reduce time spent in the runnable queue once the latch is acquired. In addition to this, we will also favor threads that are more likely to make progress once they acquire the latch. This “unfair” mechanism gives preference to threads that are likely to complete their work in a single quantum over threads that will undergo several context switches while holding the latch. While this unfairness may result in some threads experiencing additional latency, overall throughput will be increased because less time will be spent waiting for other resources while holding the latch and blocking other threads.

Keep in mind that this flow control is not going to improve individual insert latency, in fact quite the opposite. Think of it like the traffic lights on a highway on-ramp – the cars coming onto the highway are going to wait longer before they are allowed to enter. It’s going to make them a bit slower than normal, but the result is the traffic keeps moving at a constant (albeit slower) rate instead of piling up. Our initial testing shows up to 30% improvement in throughput on some workloads, but of course your mileage may vary. The other solutions to last page insert contention outlined in the KB article referenced above are also still valid, and in many cases will yield an even better throughput improvement over OPTIMIZE_FOR_SEQUENTIAL_KEY. The best way to address the problem is to test the various options and choose the one that works best for your application and environment.

If you'd like to see an example of OPTIMIZE_FOR_SEQUENTIAL_KEY in action, head over to the SQL Server Samples repository on GitHub and download the sample demo.