This is a sketch of how to provide highly concurrent read and update access to sorted paged lists while requiring minimal locking. This particular trick has probably been covered before but if so I’ve missed it and haven’t seen anyone else using it.

I implemented it a long time ago in a closed-source and now-defunct piece of software. It worked really well, and I’ve never since seen it done quite this way. There was some thought that it might be patentable and a literature search back then came up empty. But I’d be amazed if this weren’t being used here and there; who knows, maybe it’s crept into the undergrad CS curriculum while I wasn’t looking.

Page Table · Suppose we have a large ordered list containing some type that the “ < ” operator works with, and we’ve organized it into pages; we access them through a simple page table whose entries need contain only a base value for the page and a pointer to it.

Searching · For this to work, we assume we can search both the page table and the pages in a way that depends only on the ordering of the elements, and can tolerate duplicate elements (as in where the same object appears in two or more successive slots in a page). An example of such an algorithm is binary search, which is in fact what I used in my implementation, and I think is generally a good choice.
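A duplicate-tolerant binary search can be sketched as a “lower bound” search: it returns the first slot whose value is not less than the key, so the same value sitting in two or more adjacent slots can’t confuse it. The `Page` struct and the names here are illustrative assumptions of mine, not the original code:

```c
#include <stddef.h>

#define PAGE_CAP 8  /* arbitrary capacity for this sketch */

typedef struct {
    size_t count;          /* declared number of live elements */
    int    items[PAGE_CAP];
} Page;

/* Duplicate-tolerant binary search ("lower bound"): returns the index
   of the first slot whose value is >= key, or count if none. Depends
   only on ordering, so duplicate adjacent elements are harmless. */
size_t page_lower_bound(const Page *p, int key) {
    size_t lo = 0, hi = p->count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (p->items[mid] < key) lo = mid + 1;
        else                     hi = mid;
    }
    return lo;
}
```

The same routine works unchanged on the page table, searching base values instead of items.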

Reading · To find something in the list, you search the page table to find which page it’s in, then search the appropriate page, then you’re done. If you just want to retrieve information, you don’t need to acquire any locks at any time.

Shuffling · This is the lowest level of the process by which we actually add or delete items in the table. I think the explanation will run more smoothly if I cover shuffling first.

Let’s represent a page in the table as a number declaring the element count, followed by a series of numbers representing the items. For example, a page with four items:

4: 10, 20, 30, 40

Now suppose we want to add the value 15, which will fall between 10 and 20. Let’s assume there are more slots available in the page; the count 4 describes the number that are actually used.

I’ll refer to this process as shuffling-in. First we write a duplicate of the top element into the slot one past the declared end of the page.

4: 10, 20, 30, 40, 40

Now we increment the element count.

5: 10, 20, 30, 40, 40

Now we shuffle the elements up, going from the top down.

5: 10, 20, 30, 30, 40

5: 10, 20, 20, 30, 40

Finally, we overwrite the occupant of the correct position with the new value.

5: 10, 15, 20, 30, 40

Note that at every point in this process, a binary search through the page, based on the declared number of elements, will produce a correct result, assuming it can tolerate the presence of duplicates (which binary search by default does).
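The shuffle-in steps above can be sketched in C. The `Page` struct and the names are my own illustrative assumptions, not the original implementation; the important thing is the order of the stores (duplicate first, then bump the count, then shift from the top down), so a reader trusting `count` always sees a sorted, searchable page:

```c
#include <stddef.h>

#define PAGE_CAP 8  /* arbitrary capacity for this sketch */

typedef struct {
    size_t count;          /* declared number of live elements */
    int    items[PAGE_CAP];
} Page;

/* Shuffle-in: insert `value` keeping the page sorted at every
   intermediate state (duplicates allowed). Returns 0 on success,
   -1 if the page is full and would need splitting. */
int shuffle_in(Page *p, int value) {
    if (p->count >= PAGE_CAP) return -1;
    size_t n = p->count;
    /* 1. Duplicate the top element one past the declared end
          (on an empty page, just place the value there). */
    if (n > 0) p->items[n] = p->items[n - 1];
    else       p->items[0] = value;
    /* 2. Increment the element count; the duplicate is now visible. */
    p->count = n + 1;
    /* 3. Shuffle elements up, going from the top down, until we
          reach the slot where the new value belongs. */
    size_t i = n;
    while (i > 0 && p->items[i - 1] > value) {
        p->items[i] = p->items[i - 1];
        --i;
    }
    /* 4. Overwrite the occupant of the correct position. */
    p->items[i] = value;
    return 0;
}
```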

I’ll call the inverse process shuffling-out; here we shuffle out the newly-added 15.

First, we copy the to-be-deleted element’s right neighbor on top of it, creating a duplicate of the neighbor, and repeat the process until we’ve reached the top of the page.

5: 10, 15, 20, 30, 40

5: 10, 20, 20, 30, 40

5: 10, 20, 30, 30, 40

5: 10, 20, 30, 40, 40

Finally we decrement the element count.

4: 10, 20, 30, 40, 40

Once again, at no point during the shuffle have we disrupted anyone’s ability to binary-search the page correctly.
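Shuffling-out is even simpler in code. Again a sketch with hypothetical names: copy each right neighbor leftward toward the top, then shrink the declared count, leaving a stale duplicate beyond the end that readers will never look at:

```c
#include <stddef.h>

#define PAGE_CAP 8  /* arbitrary capacity for this sketch */

typedef struct {
    size_t count;          /* declared number of live elements */
    int    items[PAGE_CAP];
} Page;

/* Shuffle-out: delete the element at index `idx`, keeping the page
   sorted and searchable at every intermediate state. Returns 0 on
   success, -1 if idx is out of range. */
int shuffle_out(Page *p, size_t idx) {
    if (idx >= p->count) return -1;
    /* Copy each right neighbor on top of its left neighbor,
       working up toward the top of the page. */
    for (size_t i = idx; i + 1 < p->count; ++i)
        p->items[i] = p->items[i + 1];
    /* Finally decrement the count; the leftover duplicate in the
       top slot is now beyond the declared end. */
    p->count -= 1;
    return 0;
}
```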

Simple Updates · Shuffling-in and shuffling-out allow a writer to update a page of the list while readers are active, without getting in their way. Thus an updater who wants to add or delete an element from the list needs to do the following:

1. Search the page table to find out which page will be affected.
2. Acquire an update lock on that page.
3. Shuffle in or out.
4. Release the lock.
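Under those assumptions, the simple-update path might look like the following sketch, with pthread mutexes standing in for whatever POSIX primitive the original used, and all the names (`Page`, `PageTable`, `find_page`, `list_insert`) hypothetical. Note that readers never touch the lock:

```c
#include <pthread.h>
#include <stddef.h>

#define PAGE_CAP 8

typedef struct {
    pthread_mutex_t lock;   /* update lock: writers only; readers never take it */
    int    base;            /* base value for the page */
    size_t count;           /* declared number of live elements */
    int    items[PAGE_CAP];
} Page;

typedef struct {
    size_t npages;
    Page  *pages[4];        /* toy page table, ordered by base value */
} PageTable;

/* Find the page whose range covers `value` (a linear scan for
   brevity; the real thing would binary-search the table). */
static Page *find_page(PageTable *t, int value) {
    size_t i = t->npages - 1;
    while (i > 0 && t->pages[i]->base > value) --i;
    return t->pages[i];
}

/* Shuffle-in as described earlier: duplicate the top element, bump
   the count, shift from the top down, overwrite the freed slot. */
static void shuffle_in(Page *p, int value) {
    size_t n = p->count;
    if (n > 0) p->items[n] = p->items[n - 1];
    else       p->items[0] = value;
    p->count = n + 1;
    size_t i = n;
    while (i > 0 && p->items[i - 1] > value) {
        p->items[i] = p->items[i - 1];
        --i;
    }
    p->items[i] = value;
}

/* The simple-update protocol: search the table, lock only the one
   affected page, shuffle, release. */
int list_insert(PageTable *t, int value) {
    Page *p = find_page(t, value);
    pthread_mutex_lock(&p->lock);
    if (p->count >= PAGE_CAP) {    /* would need a split: not shown here */
        pthread_mutex_unlock(&p->lock);
        return -1;
    }
    shuffle_in(p, value);
    pthread_mutex_unlock(&p->lock);
    return 0;
}
```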

Difficult Updates · Things are a bit more complex when the page table needs to be updated, which happens in two cases: When a page is too big and needs to be split, or is too small and needs to be merged with another (or alternately, is empty and needs to be removed).

It turns out that you can shuffle the page table too. It’s a bit more complex; for example, when you’re splitting a page with X entries whose base value is N into two pages, each with approximately X/2 entries, the second of which has base value N2, you’d do this:

1. Acquire the page table update lock, then the lock of the page to be split.
2. Construct the new page, starting at N2, with the top half of the entries.
3. Shuffle it into the table. There will now be a bunch of entries duplicated between the big page beginning at N and the small page beginning at N2, but that won’t keep a reader’s search from running correctly.
4. Change the count declaration in the page starting at N.
5. Release the locks.
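Here’s a sketch of the split under the same illustrative assumptions as before, with the locking elided and all the names mine. The point is that the page table gets shuffled exactly like a page, and the old page’s count shrinks only after the new entry is visible, so the transient duplicates never break a reader’s search:

```c
#include <stddef.h>
#include <stdlib.h>

#define PAGE_CAP  8
#define TABLE_CAP 8

typedef struct {
    int    base;            /* base value for the page (N or N2) */
    size_t count;
    int    items[PAGE_CAP];
} Page;

typedef struct { int base; Page *page; } Entry;

typedef struct {
    size_t count;
    Entry  entries[TABLE_CAP];   /* ordered by base value */
} PageTable;

/* Shuffle a new entry into the table: the same trick as for pages.
   Duplicate the top entry, bump the count, shift from the top down,
   overwrite the freed slot; sorted and searchable at every step. */
static void table_shuffle_in(PageTable *t, Entry e) {
    size_t n = t->count;
    if (n > 0) t->entries[n] = t->entries[n - 1];
    t->count = n + 1;
    size_t i = n;
    while (i > 0 && t->entries[i - 1].base > e.base) {
        t->entries[i] = t->entries[i - 1];
        --i;
    }
    t->entries[i] = e;
}

/* Split the page in table slot `slot` (locking elided; the lock order
   from the prose applies). Move the top half into a fresh page,
   shuffle its entry into the table, and only then shrink the old
   page's count, making the duplicated items vanish from view. */
int split_page(PageTable *t, size_t slot) {
    if (t->count >= TABLE_CAP) return -1;
    Page  *old  = t->entries[slot].page;
    size_t half = old->count / 2;

    Page *neu = calloc(1, sizeof *neu);
    neu->base  = old->items[half];        /* this is N2 in the prose */
    neu->count = old->count - half;
    for (size_t i = 0; i < neu->count; ++i)
        neu->items[i] = old->items[half + i];

    table_shuffle_in(t, (Entry){ neu->base, neu });
    old->count = half;                    /* duplicates disappear */
    return 0;
}
```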

Similarly, merging small pages requires a bit of housekeeping around the shuffle, but it’s not rocket science and is left as an exercise for the reader. The trick is to keep things sorted and usable by read-only code at all times.

Finally, there’s a corner-case to the simple update where you delete the lowest entry in a page; this also requires locking the page table.

Hot Lock · In my implementation, I had every updater acquire the page table lock long enough to look at the target page and figure out whether the page table would need to be updated, then, assuming not, release it before it went to work on the target page.

I was worried that that lock would get hot and bottleneck the system, so I thought of some dodges, including storing page metadata in the page table, and taking a two-pass approach with back-off: going straight to the page and then, if it needed splitting or merging, backing off and acquiring the page-table lock.

Tuning and Experience · In practice, the page-table lock wasn’t a problem, probably since the amount of work involved in seeing if a page is going to split or merge is very small.

Another sin of omission was that I never bothered to merge pages, just deleted them if they emptied out entirely.

I was actually worried about cache coherency; when the system is running hot, it’s absolutely the case that two processes will be issuing simultaneous read and write requests to the same address. But the algorithm is pretty resistant to update latencies and I never observed any problems in practice.

Obviously this is not transactional; a reader process can fail to see an insertion that started some time ago if the updater is shuffling its way through a large page.

There’s only one obvious tuning parameter: page size. In my implementation there was a very high query rate but with less than 10% updates, and whatever page size I picked more or less at random on the first time through was good enough. It was fairly small; my intuition was that the cost of big shuffles would be higher than the cost of searching a larger page table.

This was done in a C program; the entries were fixed-size structs packed into huge mmap-backed buffers, so the entries were all just 64-bit pointers. I didn’t even persist the page-table or page structures, just constructed them when the program started up. The locking primitives were basic POSIX stuff, can’t remember which.

I could run hundreds of processes handling immense numbers of queries and quite a few updates per second through about as much memory as you would reasonably put on a generic Debian server a decade-and-a-bit ago. I never actually figured out how fast it would go or how to tune it because it was never the critical point in the system. My intuition is that you could distribute it across multiple systems without too much pain, but I haven’t thought it through.

I haven’t really any useful information about the range of update patterns and frequencies that this scheme might tolerate before becoming unworkable. My intuition is that there are some patterns where it’ll outdo quite a few of the other off-the-shelf options, but I have no hard evidence.

That’s All Folks · The other night, I had insomnia and for some reason thought about this trick, and figured I should write it down before I forget it, in case someone else finds it useful.