The undo-history-discarding issue is only half the story, though. There has also been a steady trickle of reports of corrupted undo history, resulting in either errors or (worse still) garbled buffer text when undoing. Unfortunately, I still have yet to be sent a recipe for reliably reproducing this issue, and I can't reproduce it myself. So I'm still shooting in the dark on this one. However, I have a sneaking suspicion a race condition with the Emacs garbage collector might be responsible…

To understand why, you need to understand a little of how Emacs' undo system works behind the scenes, and how undo-tree manages to replace this at the Elisp level, without changing any Emacs c code. Whenever you make any modification to the text in a buffer, Emacs pushes a description of those changes onto the front of the list stored in the buffer-local buffer-undo-list variable. This variable gets special treatment during garbage collection. Whenever GC runs, it checks if the size of buffer-undo-list exceeds undo-limit (or one of the other undo size limits), and deletes enough older undo history from the tail of buffer-undo-list to bring it back within the configured size limits.

Much of this mechanism is hard-coded deep down in Emacs c code, and runs outside the context of the Elisp interpreter. Undo-tree stays well away from this mechanism, and doesn't mess with it at all. In undo-tree-mode , undo history still accumulates in buffer-undo-list . But whenever you call any undo-tree command ( undo-tree-undo , undo-tree-redo , undo-tree-visualize etc.), the first thing it does is to transfer any undo history that has accumulated in buffer-undo-list since the last time, into the undo-tree package's own buffer-undo-tree data structure. In this way, undo-tree doesn't have to reimplement any of the code for recording undo history (which would anyway be very difficult to do from pure Elisp code). Instead, it can rely on Emacs' built-in, deeply embedded, thoroughly road-tested implementation of this.

However, this design means undo history continues to accumulate in buffer-undo-list until the next undo-tree command is called. If the user doesn't run any undo-tree commands for a long period, buffer-undo-list can easily grow larger than undo-limit and get garbage collected before any undo-tree command is called, i.e. before undo-tree even has a chance to see it. It's critical there are no gaps in the undo history; it must contain a continuous, unbroked chain of changes. Otherwise, undoing across a gap in the history would certainly lead to corrupted buffer text at best, errors at worse.

With older undo history still stored in buffer-undo-tree , somewhat more recent history deleted before undo-tree saw it, and the newest history still stored in buffer-undo-list , doesn't this mean undo-tree will inevitably end up corrupting the undo history? Is this the source of the bugs? No! It would be a very poor design if that were the case. The potential GC race condition is more subtle.

Let's think through the scenario where Emacs garbage collects undo history exceeding undo-limit before undo-tree gets to see it. Emacs only ever garbage collects just enough undo history to bring it back within the limit, in order to preserve as much undo history as possible. So the undo data remaining in buffer-undo-list after GC is as much as Emacs is allowed to store, according to the current undo-limit setting. Therefore, the correct thing for undo-tree to do in this scenario is to discard the contents of buffer-undo-tree entirely, and build a fresh undo tree from scrach using just the contents of buffer-undo-list . This freshly built tree will already contain as much undo history as undo-tree is allowed to store, according to the configured undo-limit .

All that's required is some mechanism to detect that Emacs has garbage collected undo history since undo-tree last saw the contents of buffer-undo-list . We could try to check the size of buffer-undo-list against undo-limit (and the other settings). But this would be very unreliable. Emacs only discards undo history in chunks ("undo changesets" in Emacs terminology); it doesn't discard individual changes. A very large changeset that just barely pushes buffer-undo-list over undo-limit would, when discarded, leave a buffer-undo-list that's much smaller than the limit. But the same undo history could equally well have been produced by a sequence of small changes that never pushed buffer-undo-list over undo-limit at all.

Instead, undo tree adds a special indicator entry to the very end of buffer-undo-list . When undo tree transfers undo history from buffer-undo-list to buffer-undo-tree , it checks whether this special indicator is still present or not. Since garbage collection deletes the oldest undo history from the end of buffer-undo-list first, the first thing that gets deleted is this special indicator entry. If it's missing, it means Emacs has discarded some undo history since undo-tree last saw it, and a fresh buffer-undo-tree should be built from the remaining contents of buffer-undo-list .

The undo-tree package's undo-list-transfer-to-tree function is the only interface between Emacs' low-level undo history recording mechanisms, and undo-tree's system. This is quite a simple function: all it does is pop entries off buffer-undo-list , and add them to buffer-undo-tree . It doesn't manipulate the undo data itself in any way. And when it comes to actually performing undo operations, undo-tree doesn't even do these itself! Rather, it passes this same undo data back to Emacs' own primitive-undo function – the very same function that Emacs' vanilla undo system uses. So there's very little scope for corrupting the undo data, and very little code footprint where this could occur. The simple undo-list-transfer-to-tree function is the only plausible candidate.

But Emacs' GC runs outside the Elisp interpreter. It can be triggered at any point whilst running Elisp code, between pretty much any pair of Elisp instructions. What if GC happens to run whilst undo-list-transfer-to-tree is in the middle of popping undo entries off buffer-undo-list ? As long as GC deletes part of buffer-undo-list that comes after the entry undo-list-transfer-to-tree is currently processing, when it resumes after GC undo-list-transfer-to-tree will just notice the missing indicator when it gets to the end of buffer-undo-list , and build a fresh buffer-undo-tree from the undo history it's just processed. But if GC happens to delete part of the changeset undo-list-transfer-to-tree is in the middle of processing, it could end up transferring a partial undo changeset. This could potentially lead to corrupted undo history.

Worse still is if GC deletes part of buffer-undo-list starting from a point before the entry undo-list-transfer-to-tree has already reached. Emacs' mark-and-sweep GC shouldn't delete any elements reachable from Elisp, which in this scenario would include the whole of buffer-undo-list . But that variable is a glaring exception. Its contents are always reachable from Elisp, via buffer-undo-list itself. Yet Emacs needs to delete part of it during GC, nonetheless. So the Emacs garbage collector takes a blunt approach: cons cells get unlinked from buffer-undo-list by GC, regardless of whether something else is holding a reference to them. In this scenario, when it resumes, undo-list-transfer-to-tree will continue processing cons cells that have been unlinked from buffer-undo-list and the behaviour is undefined, or at least undocumented.

Like many race conditions, this one is extremely hard to test. It requires GC to be triggered at just the right point. Moreover, GC runs outside the context of the Elisp interpreter. So from within Elisp or the Elisp debugger, you only get to see the state of buffer-undo-list and undo-list-transfer-to-tree before and after GC, but not during. However, when testing duianto's recipe for reproducing the undo-tree history discarding issue discussed above, I sporadically saw some odd behaviour that didn't make sense from Elisp, and which I couldn't reliably reproduce. And other comments on the same Spacemacs git issue tracker circumstantially hint at GC being implicated. In any case, whether it's responsible for the trickle of undo-tree corruption bug reports or not, this potential GC race condition is still something that needs addressing.

In the latest undo-tree release, I've implemented multiple measures to reduce the likelihood of – and eliminate the effects of – this race condition:

undo-list-transfer-to-tree now manually triggers GC by calling the garbage-collect function, just before it starts processing buffer-undo-list . This should significantly reduce the likelihood of GC occurring again whilst in the middle of processing buffer-undo-list . Previously, undo-list-transfer-to-tree popped undo entries one by one from buffer-undo-list , and built new sections of the buffer-undo-tree data structure as it went along. This gave a relatively large time window when GC could be triggered whilst still in the middle of processing buffer-undo-list . In the new version, undo-list-transfer-to-tree first deep-copies the entire contents of buffer-undo-list to a new variable, then processes the copy. In this way, GC could only affect things if it occurs whilst copying, which should be a singificantly shorter time window. Finally, undo-tree now installs a function in the post-gc-hook which sets an undo-tree-gc-flag variable to t whenever GC runs. undo-list-transfer-to-tree resets this flag to nil just before deep-copying buffer-undo-list , and checks if it's still nil afterwards. If not, it keeps retrying, until it succeeds in copying buffer-undo-list without GC being triggered during copying.

The first two measures should significantly reduce the chance of hitting this GC race condition. The third should elliminate any possibility of bad side-effects, in the unlikely event GC does still run at a bad time.

As it's extremely difficult to reliably reproduce the race condition, let alone be sure if this is what's behind the unreproducible bug reports, it's very hard to test how successful these measures will be before the release. On the other hand, it's also possible that many or all of the bug reports related to the undo history discarding non-bug discussed above, which measures discussed there should alleviate.

The new release is now out on both GNU Elpa and here. Time will tell whether these measures stem the tide (or rather, divert the rivulet).