Retroactive data structures Today's lecture is our second and final lecture on time travel, or more precisely, temporal data structures. Here we will study retroactive data structures, which mimic the "plastic timeline" model of time travel. For example, in Back to the Future, Marty goes back in time, makes changes, and returns to the present that resulted from those changes and all the intervening years. In a partially retroactive data structures, the user can go back in time, perform an additional operation, return to the present that results, and query the resulting data structure. In this way, we maintain a single (changing) timeline, consisting of the sequence of update operations. Retroactive data structures can add, or remove, an update at any time, not just the end (present). A fully retroactive data structure can furthermore query the data structure at any time in the past. Finally, a third type of retroactivity, called "nonobvious", puts queries on the timeline as well, and reports the first query whose answer changed. As you might expect, retroactivity is hard to obtain in general, but good bounds are known for several data structures of interest. [Details and references] This lecture introduces the retroactive data structure and a new computation model, the cell probe model. Retroactive data structure maintains a linear timeline and allows updates to be performed at any time [Demaine, Iacono, Langerman — 2003 / T. Alg 2007] . Partial retroactivity only permit queries at the present time, while full retroactivity allows queries at any time. A third kind of retroactivity, non-oblivious retroactivity, lets the user put a series of queries on the time line, and whenever an update is performed, the data structure reports the first query whose value has changed. For partial and full retroactivity, there are two easy cases. The first one concerns commutative and invertible updates. The second, being slightly more complex, concerns decomposable search problem [Bentley & Saxe — J. Alg 1980], which is solved using a segment tree. The harder situation with general transformation [Demaine, Iacono, Langerman — 2003 / T. Alg 2007] can be solved naively using rollback. Concerning general transformation's lower bound, it has been proven that Ω(r) overhead can be necessary [Frandsen, Frandsen, Mitlersen — I&C 2001]. The same paper also proves a Ω((r/lg r)1/2) lower bound for cell-probe model. Cell probe model is "a model of computation where the cost of a computation is measured by the total number of memory accesses to a random access memory with log n bits cell size. All other computations are not counted and are considered to be free." [NIST] A better lower bound for cell probe model, such as Ω(r/poly lg r), is open. With the help of a geometric representation, we show a partially retroactive priority queue with O(lg n) insert and delete-min [Demaine, Iacono, Langerman — 2003 / T. Alg 2007] . Other data structures mentioned include queue, deque, union-find, a fully retroactive priority queue with O(m1/2 lg m) per op and the successor problem [Giora & Kaplan — T. Alg 2009]. A better fully persistent priority queue is open. The best successor problem solution requires fractional cascading from lecture 3 and the van Emde Boas data structure from lecture 11. For non-oblivious retroactivity [Acar, Blelloch, Tangwongsan — CMU TR 2007], a priority queue is shown with the help of a geometric representaiton.

L03 Feb. 23 [+] Geometric: point location via persistence, dynamic via retroactive; orthogonal range queries, range trees, layered range trees, dynamizing augmentation via weight balance, fractional cascading Scribe Notes [src] Erik's Notes Video

In this lecture, the first of two about geometric data structures, we'll talk about two major problems—point location and range searching—and tie them to several major data structural techniques: persistence, retroactivity, dynamization of augmentation through weight balance, and fractional cascading. In (planar) point location, we need to preprocess a planar map to support queries of the form "which face contains this point?" This problem is useful for figuring out where you clicked in a GUI, or for figuring out what city/country/etc. contain your GPS coordinates. The static (preprocessing) version of this problem can be solved optimally (in logarithmic time per query) by applying persistence to a classic sweep-line technique. The dynamic version, where the map can be modified, can be solved for orthogonal maps in optimal logarithmic time per operation using retroactive successor from Lecture 2. In (orthogonal) range searching, we need to preprocess n points in d dimensions to support queries of the form "which points are in this box?" Simple range trees solve this problem in O(logd n) query time and O(n logd−1 n) space. A cool crosslinking technique (called layered range trees) reduce the query time to O(logd−1 n). A general technique for dynamizing augmented search trees, called weight-balanced trees, allows us to additionally support insertion and deletion of points in O(logd n) time per update. The crosslinking technique is a part of a more general technique called fractional cascading, which can further improve query time by another logarithmic factor.

L04 Feb. 28 [+] Geometric: O(log n) 3D orthogonal range searching via fractional cascading; kinetic data structures Scribe Notes [src] Erik's Notes Video

In this lecture, we continue (and finish) the theme of geometric data structures. First, we'll finish our coverage of fractional cascading from Lecture 3 by illustrating a really cool application: 3D orthogonal range searching in O(log n) time. This gives us the second log factor improvement mentioned in Lecture 3, giving O(logd−2 n) for d dimensions. Second, we'll cover a style of data structures for moving data, e.g., where 2D points move at known velocity and acceleration. Although the main study here is in 2D (and 3D), we'll focus on two 1D problems that are well-understood and fit in a lecture: kinetic predecessor and kinetic priority queues.

L05 Mar. 1 [+] Dynamic optimality: binary search trees, analytic bounds, splay trees, geometric view, greedy algorithm Scribe Notes [src] Erik's Notes Video

Have you ever wondered how to represent a binary search tree on n nodes as a set of n points in 2D? Today's lecture will reveal a surprising new (2009) such connection, which provides an approach to tackling the venerable dynamic optimality conjecture. The dynamic optimality conjecture is one of the oldest and biggest open problems in data structures, asking a fundamental question: is there one best binary search tree? Traditional wisdom is that “splay trees” are such a tree (up to constant factors), but we'll see a new “greedy” tree data structure that we think is more obviously best—though we still can't prove it. Dynamic optimality remains an active and exciting area of research, and this lecture and the next focus on the current state-of-the-art.

L06 Mar. 6 [+] Dynamic optimality: independent rectangle, Wilber, and Signed Greedy lower bounds; key-independent optimality; O(lg lg n)-competitive Tango trees Scribe Notes [src] Erik's Notes Video

In this lecture, we'll see the best binary search tree we know, in the sense of achieving the best known competitive ratio of O(lg lg n) compared to the offline optimal. To analyze these “Tango trees”, we compare against a lower bound. Specifically, we describe a Signed Greedy algorithm that, for a given access sequence, computes a number of node touches that every BST in the world must perform when given that access sequence. Indeed, the Signed Greedy algorithm adds points row by row just like the Greedy algorithm we saw last lecture. The only catch is that Signed Greedy doesn't actually satisfy the point set, so it doesn't correspond to a BST… yet it is tantalizingly close to Greedy, which we saw last lecture corresponds to an online BST, suggesting that the two bounds are within a constant factor of each other (making Greedy dynamically optimal). While this gap hasn't been closed yet, we can still use the lower bound to analyze Tango trees, and get the exponential improvement over the trivial O(lg n) competitive ratio held by any balanced binary search tree.

The topic of the next three lectures is cache-efficient data structures. A classic result here is that B-trees are good at exploiting that data is transferred in blocks between cache and main memory, and between main memory and disk, and so on. B-trees achieve O(log B N) insert/delete/predecessor/ successor for N items and memory block transfers of size B. What's more recent and surprising is that you can achieve the same performance even if you don't know what B is, or in other words, simultaneously for all architectures with all values of B. This is a result in the “cache-oblivious” model, and we'll see how to do it over the next two lectures. Cache-oblivious data structures show some exciting practical promise, too, as they form the foundation of a database startup, Tokutek.

This lecture continues our theme of cache-oblivious data structures. First we'll finally cover the black box we used last lecture to obtain cache-oblivious B-trees: maintain an ordered file with O(1)-size gaps in O(lg2 N) moves per insert/delete in the middle of the file. As an extra bonus, we'll see how to maintain a linked list subject to order queries in O(1) time per insert/delete, which is a black box we used back in Lecture 1 to linearize the version tree when implementing full persistence. Second we'll cover cache-oblivious priority queues, supporting insert and delete-min in what's usually subconstant time per operation (!), O((1/B) log M/B (N/B)). This will be the first time we see how to adapt to the cache size M, not just the block size B, in a cache-oblivious data structure.

L09 Mar. 15 [+] Memory hierarchy: distribution sweeping via lazy funnelsort; cache-oblivious orthogonal 2D range searching: batched and online Scribe Notes [src] Erik's Notes Video

Our third and final lecture on memory hierarchies is a fun crossover between cache-oblivious data structures and geometric data structures. We'll start with an optimal cache-oblivious sorting algorithm (something we left as a black box in Lecture 8), called Lazy Funnelsort, though we'll skip the analysis, as it's similar to the priority queue. Then we'll see how to combine that algorithm with the sweepline technique from computational geometry to solve batched geometric problems. Specifically, we'll see how to solve batched orthogonal 2D range searching in the sorting bound using this “distribution sweeping” technique. Finally, we'll cover a series of algorithms for regular online orthogonal 2D range searching, including an impressive linear-space cache-oblivious data structure for 2-sided (quarterplane) range queries, as well as the trick for shaving off a log log factor of space from a regular 4-sided range search data structure.

L10 Mar. 20 [+] Dictionaries: universal, k-wise independent, simple tabulation hashing; chaining, dynamic perfect hashing, linear probing, cuckoo hashing Scribe Notes [src] Erik's Notes Video

Today's lecture gives a fairly thorough overview of different approaches to hashing. First, we talk about different hash functions and their properties, from basic universality to k-wise independence, to a simple but effective hash function called simple tabulation. Then we talk about different approaches to using these hash functions in a data structure: basic chaining, (dynamic) perfect hashing, classic but tricky-to-analyze linear probing, and the more recent cuckoo hashing.

This lecture marks our full entry into integer data structures (though hashing was also one), as well as our first of three lectures on the predecessor problem: supporting insert, delete, and predecessor/successor on a set of n w-bit integers. I'll give an overview of what's known about this problem, as well as the appropriate models of computation, and then proceed to the first main result: data structures achieving O(lg w) time per operation. When w is polylogarithmic in n (a common case), this bound is O(lg lg n), an exponential improvement over binary search trees. The first such data structure is called (and by) van Emde Boas, and it uses Θ(2w) space. A little addition of hashing reduces the space to optimal Θ(n). We'll also see simpler ways to achieve the same bounds, using a simple tree view, indirection, and structures called x-fast and y-fast trees by Willard (who was actually the first to get Θ(n) space).

This lecture describes a tour-de-force in integer data structures, called fusion trees (so named because they were published a year after the 1989 cold-fusion "scandal", and were perhaps just as surprising--though more correct). Fusion trees solve predecessor and successor among n w-bit integers in O(log w n) time per operation on the word RAM. The basic idea is to build a B-tree with branching factor wε. The tricky part is to build a wε-size node supporting constant-time predecessor/successor. We'll see three major techniques for doing so. First, sketching reduces the w1+ε bits in a node down to w “essential” bits, by reducing each word down to w1−ε “essential” bits. The impressive part is sketching a word in constant time using a single multiplication. Second, parallel comparison lets us compare all of these sketches with a single query in constant time. Third, we'll see how to find the most significant set bit of a w-bit word in constant time on a word RAM, by using most of the fusion techniques again; this problem has tons of applications beyond fusion trees as well. (As a result, most significant set bit is a built-in instruction on most architectures; see e.g. StackOverflow.)

This lecture is our first (of two) about integer data structure lower bounds. In particular, we'll prove that the min of van Emde Boas and fusion trees is an optimal (static) predecessor data structure up to a log log factor, assuming polynomial space. The exact trade-off (up to constant factors) is known, but messier; we'll see the bound but not prove it. In particular, it is known that the log log factor improvements are actually impossible to achieve with roughly linear space, making the min of van Emde Boas and fusion trees actually optimal. These lower bounds hold in the powerful cell-probe model of computation, where we just count the number of words read by the query algorithm. The proof uses a beautiful technique called round elimination, in a communication-complexity view of data structures, and is actually quite clean and simple given an appropriate lemma.

L14 Apr. 10 [+] Integer: sorting in linear time for w = Ω(lg2+ε n), priority queues Scribe Notes [src] Erik's Notes Video

This lecture is about the state-of-the-art in sorting and priority queues on a word RAM. An equivalence by Thorup shows that any sorting algorithm can be transformed into a priority queue with operations taking 1/nth the time to sort. So these are really one and the same problem. The best results we know for sorting in linear time (and thus for constant-time priority queues) is when w = O(lg n) and when w = Ω(lg2+ε n). The first result is just radix sort. The second result is the main topic of the lecture: a fancy word-RAM algorithm called signature sorting. It uses a combination of hashing, merge sort, and parallel sorting networks. The range of w in between lg and lg2+ε remains unsolved. The best algorithm so far runs in O(n √lg lg n) expected time.

It's amazing what you can do in constant time using only linear space. This lecture is about such data structures for jumping around (static) rooted trees. Given two nodes, we'll see how to find their least common ancestor in constant time. Given a node and a number k, we'll see how to find the kth ancestor of the node in constant time. Both data structures use a nice collection of ideas, including a powerful trick we haven't used before: lookup tables for small inputs. Along the way, we'll solve another seemingly unrelated problem: preprocess an array of numbers to support queries of the form: what's the smallest number in a given interval of the array? Again, we'll show that constant-time queries are possible using only linear space.

L16 Apr. 19 [+] Strings: suffix tree, suffix array, linear-time construction for large alphabets, suffix tray, document retrieval Scribe Notes [src] Erik's Notes Video

This lecture is about efficient data structures for searching in static strings. Think of the strings we're searching in as (large) files, or entire disks. First we'll see how to store a bunch of strings in linear space subject to searching for a string of length P in O(P) time, and answering predecessor queries for a string of length P in O(P + lg Σ) time, where Σ is the size of the alphabet. Then we'll see how to store one string (or more) in linear space subject to determining whether it has a given substring of length P in O(P) time, counting the number of such substring occurrences in O(P) time, and listing k occurrences in O(P + k) time. Finally, we'll see how to build this data structure in linear time. We'll make use of several tools developed in previous lectures: weight-balanced binary search trees (similar to L3), leaf trimming (L15), Cartesian trees (similar to L15), range minimum queries (L15), least common ancestor (L15), and level ancestor (L15).

This lecture is the first of two about succinct data structures—data structures using very close to the minimum amount of space (“just the data”). In particular, we'll see how to store an n-node binary trie in 2n + o(n) bits (which is optimal up to the little-oh term) and still be able to navigate to a node's left child / right child / parent in constant time per operation, as well as be able to compute the size of the subtree rooted at a node in constant time. Along the way, we'll see how to store an n-bit string in n + o(n) bits (which is again optimal up to little-oh) and be able to compute the number of 1 bits left of a given bit, and find the ith 1 bit, in constant time per operation. These data structures form the basis of many succinct data structures.

This lecture continues our theme of succinct data structures. Building on our expertise with succinct tries, we'll continue on to the practical problem of indexing text with suffix arrays/trees, where several compact and some succinct results are known, with reasonably small (though suboptimal) slowdowns in query time. We'll cover a simple version (not the best) of the historically first such result, which is a compact suffix array that incurs a logε overhead in queries. Then we'll see how to turn any suffix array into a suffix tree with very little (succinct) additional space overhead. We'll get to use our friends (rank, select, balanced parentheses) from last lecture.

This lecture is about a cool data structure for maintaining rooted trees (potentially very unbalanced) in O(log n) time per operation. The operations include linking two trees together by adding an edge, and cutting an edge to split a tree into two trees, so the data structure is called link-cut trees. This result is our first solving a dynamic graph problem (where in general we can insert and delete edges); next lecture will see other solutions for trees and generalization from trees to undirected graphs. Link-cut trees have specific advantages in that they can represent information over root-to-node paths; for example, they can easily compute the min/max/sum of values stored in nodes or edges along any root-to-node paths. Most importantly, link-cut trees introduce two powerful tree decompositions: preferred-path decomposition (which we already used in Tango trees) and heavy-light decomposition. As we will cover them, link-cut trees also demonstrate how a surprisingly “care-free” use of splay trees is actually still efficient.

L20 May 3 [+] Dynamic graphs: Euler tour trees, decremental connectivity in trees in O(1), fully dynamic connectivity in O(lg2 n), survey Scribe Notes [src] Erik's Notes Video

This lecture continues our theme of dynamic graphs. Beyond surveying results for several such problems, we'll focus on dynamic connectivity, where you can insert and/or delete edges, and the query is to determine whether two vertices are in the same connected component (i.e., have a path between them). We'll cover a few different results for this problem: Link-cut trees (from last lecture) already solve trees in O(lg n). Euler-tour trees are a simpler way to solve trees in O(lg n), which allow us to aggregate over subtrees instead of paths. If we just insert or just delete edges, we can solve trees in O(1), using our friend leaf trimming plus some simple bit-vector tricks. For general graphs, we'll see how to support updates in O(lg2 n) and queries in O(lg n / lg lg n). This will be our culmination of data structures for dynamic graphs; the next (final) lecture is about matching lower bounds.

L21 May 8 [+] Dynamic graphs: Ω(lg n) lower bound for dynamic connectivity Scribe Notes [src] Erik's Notes Video

This lecture covers our second of two lower bounds. This one is work by Mihai Pătraşcu (former 6.851 TA) and myself. We'll show that maintaining a graph subject to edge insertion, edge deletion, and connectivity queries (are v & w connected by a path?) requires Ω(lg n) time per operation, even if the graph is just a bunch of paths. This in particular proves optimality of the logarithmic dynamic tree data structures, and shows that the O(lg2 n) data structure we saw for general graphs is pretty good. The lower bound introduces a new but very simple technique, which at the time was the first “truly logarithmic” lower bound for a natural problem, yet the whole proof is relatively clean.

L22 May 19 [+] History of memory models: idealized 2-level, red-blue pebble game, external memory, HMM, BT, (U)MH, cache oblivious [bonus lecture] Erik's Notes Video