Whilst in the process of writing a webassembly VM for a personal project, an Undo Dev came across a strange bug that they couldn’t figure out. Using reversible debugging however, they were able to identify the root cause of the problem, and work back to the line of code that caused the bug.

The process of writing the VM requires iterating through a sequence of opcodes and keeping track of various places (aka program counters) that are potential destinations of jumps. These locations are termed "labels".

When certain opcodes (block, loop, if, etc.) are encountered, a label is recorded, by pushing the program counter onto a stack. When other opcodes (end, etc.) are seen, the label is no longer needed and is popped from the stack.

In this case, we are using std::deque<int> as the stack, and when we run a bit of the code we get a segfault:

pc=12 Execute: i32.const 1 pc=13 Execute: i32.ne pc=14 Execute: if Condition is 0 Segmentation fault



So, we can fire up undodb, the reversible debugger for Linux, to have a look at what's happening.

Program received signal SIGSEGV, Segmentation fault. 0x000000000042c561 in __gnu_cxx::new_allocator<int>::construct<int, int const&> (this=0x7ffe6f59d150, __p=0x1fc) at /usr/lib/gcc/x86_64-pc-linux-gnu/5.4.0/include/g++-v5/ext/new_allocator.h:120 120 { ::new((void *)__p) _Up(std::forward<_Args>(__args)...); } (udb) bt #0 0x000000000042c561 in __gnu_cxx::new_allocator<int>::construct<int, int const&> ( this=0x7ffe6f59d150, __p=0x1fc) at /usr/lib/gcc/x86_64-pc-linux-gnu/5.4.0/include/g++-v5/ext/new_allocator.h:120 #1 0x000000000042b3c2 in std::allocator_traits<std::allocator<int> >::construct<int, int const&> ( __a=..., __p=0x1fc) at /usr/lib/gcc/x86_64-pc-linux-gnu/5.4.0/include/g++-v5/bits/alloc_traits.h:530 #2 0x000000000042b42f in std::deque<int, std::allocator<int> >::_M_push_back_aux<int const&> ( this=0x7ffe6f59d150) at /usr/lib/gcc/x86_64-pc-linux-gnu/5.4.0/include/g++-v5/bits/deque.tcc:475 #3 0x000000000042a051 in std::deque<int, std::allocator<int> >::push_back (this=0x7ffe6f59d150, __x=@0x7ffe6f59d0e0: 14) at /usr/lib/gcc/x86_64-pc-linux-gnu/5.4.0/include/g++-v5/bits/stl_deque.h:1526 #4 0x00000000004285ee in execute (code=std::vector of length 2971, capacity 2971 = {...}, stack=std::deque with 0 elements, locals=std::map with 0 elements, globals=std::map with 0 elements, module=...) at wasm_execute.cpp:131 #5 0x0000000000428dae in execute (f=..., nparams=0, stack=std::deque with 0 elements, globals=std::map with 0 elements, module=...) at wasm_execute.cpp:244 #6 0x00000000004291ba in start (f=..., globals=std::map with 0 elements, module=...) at wasm_execute.cpp:268 #7 0x00000000004297f3 in wasm::run (module=...) at wasm_execute.cpp:310 #8 0x0000000000404253 in main (argc=2, argv=0x7ffe6f59e0d8) at wast.cpp:248

So we're getting a segfault when adding a label to the stack. a-priori one might imagine that it would be the action of popping from an empty stack that is more likely to cause a segfault. But that's not what's happening. Sadly, the next guess as to what might be trashing the internal pointers in a std::deque would be some sort of awful stack corruption.

Luckily, undodb can help with this. We just need to work out which pointers in the std::deque are invalid, and then work out when they became invalid. Let's have a look at the std::deque itself.

(udb) up 3 #3 0x000000000042a051 in std::deque<int, std::allocator<int> >::push_back (this=0x7ffe6f59d150, __x=@0x7ffe6f59d0e0: 14) at /usr/lib/gcc/x86_64-pc-linux-gnu/5.4.0/include/g++-v5/bits/stl_deque.h:1526 1526 _M_push_back_aux(__x); (udb) print *this $1 = std::deque with -1 elements = {948468568, 32623, 948468568, 32623, 0, 0, 1920300033, -1, -32, 1679958117, 0, 0, 0, 0, 0, 0, 1868770817, -1, -32, 842230056, 0, 0, 0, 0, 0, 0, 1764237313, -1, -32, 1969190505, 0, 0, 0, 0, 0, 0, 544502529, -1, -32, 1852793646, 0, 0, 0, 0, 0, 0, 1701539585, -1, -32, 862529568, 0, 0, 0, 0, 0, 0, 1763844097, -1, -32, 1684633193, 0, 0, 0, 0, 0, 0, 1852385281, -1, -32, 577986413, 0, 0, 0, 0, 0, 0, 691019777, -1, -32, 1679848037, 0, 0, 0, 0, 0, 0, 1970562305, -1, -32, 572548459, 0, 0, 0, 0, 0, 0, 1717986817, -1, -32, 1953721967, 0, 0, 0, 0, 0, 0, 1953654017, -1, -32, 1986947368, 0, 0, 0, 0, 0, 0, 1764237313, -1, -32, 690565408, 0, 0, 0, 0, 0, 0, 1970562305, -1, Cannot access memory at address 0x6f766e69ffffffe0 (udb)

There's a smoking gun if ever there was one. A std::deque with -1 elements? Let's look at the actual structure rather than what udb has pretty printed for us.

(udb) disable pretty-printer 141 printers disabled 0 of 141 printers enabled (udb) print *this $2 = { <std::_Deque_base<int, std::allocator<int> >> = { _M_impl = { <std::allocator<int>> = { <__gnu_cxx::new_allocator<int>> = {<No data fields>}, <No data fields>}, members of std::_Deque_base<int, std::allocator<int> >::_Deque_impl: _M_map = 0xc8e660, _M_map_size = 8, _M_start = { _M_cur = 0xc8e6b0, _M_first = 0xc8e6b0, _M_last = 0xc8e8b0, _M_node = 0xc8e678 }, _M_finish = { _M_cur = 0x1fc, _M_first = 0x0, _M_last = 0x200, _M_node = 0xc8e670 } } }, <No data fields>}

Sadly, it looks like the size of the std::deque is a calculated value (there are no entries equal to -1 in the structure). However, the _M_finish pointers look bogus. Let's guess that they are indicative of the problem. We can set a watchpoint on the pointers which will help us monitor their behaviour in the execution history more easily.

(udb) watch -l this->_M_impl->_M_finish._M_cur Hardware watchpoint 1: -location this->_M_impl->_M_finish._M_cur (udb) watch -l this->_M_impl->_M_finish._M_first Hardware watchpoint 2: -location this->_M_impl->_M_finish._M_first (udb) watch -l this->_M_impl->_M_finish._M_last Hardware watchpoint 3: -location this->_M_impl->_M_finish._M_last

Now we need to travel back in time and see where the corruption occurred. We do this in undodb by using the backwards-continue command, bcont , which will continue running in reverse until the last time our watchpoint was changed.

(udb) bcont Hardware watchpoint 1: -location this->_M_impl->_M_finish._M_cur Old value = (int *) 0x1fc New value = (int *) 0xc8e6b0 0x000000000042b6b3 in std::deque<int, std::allocator<int> >::_M_pop_back_aux (this=0x7ffe6f59d150) at /usr/lib/gcc/x86_64-pc-linux-gnu/5.4.0/include/g++-v5/bits/deque.tcc:535 535 this->_M_impl._M_finish._M_cur = this->_M_impl._M_finish._M_last - 1; (udb) bt #0 0x000000000042b6b3 in std::deque<int, std::allocator<int> >::_M_pop_back_aux (this=0x7ffe6f59d150) at /usr/lib/gcc/x86_64-pc-linux-gnu/5.4.0/include/g++-v5/bits/deque.tcc:535 #1 0x000000000042a1aa in std::deque<int, std::allocator<int> >::pop_back (this=0x7ffe6f59d150) at /usr/lib/gcc/x86_64-pc-linux-gnu/5.4.0/include/g++-v5/bits/stl_deque.h:1580 #2 0x00000000004287f0 in execute (code=..., stack=..., locals=..., globals=..., module=...) at wasm_execute.cpp:175 #3 0x0000000000428dae in execute (f=..., nparams=0, stack=..., globals=..., module=...) at wasm_execute.cpp:244 #4 0x00000000004291ba in start (f=..., globals=..., module=...) at wasm_execute.cpp:268 #5 0x00000000004297f3 in wasm::run (module=...) at wasm_execute.cpp:310 #6 0x0000000000404253 in main (argc=2, argv=0x7ffe6f59e0d8) at wast.cpp:248

Bingo! It was a pop from an empty stack that has caused the problem, and we have the exact line ( wasm_execute.cpp:175 ) where this occurred. Let's look at the offending code (with some uninteresting bits removed and replaced with ellipses).

case wasm::opcode_t::_if: { ... labels.push_back(pc); // <- This is line 131, the location of the original segfault. ... if (!condition) { // skip forward to find else if (else_pos != -1) pc = else_pos; else { if (labels.empty()) throw std::runtime_error("Labels underflow"); labels.pop_back(); pc = end_pos; } std::cerr << "Jumping to " << std::dec << pc << "

"; } labels.pop_back(); // <- This is line 175, the location of the real problem. break; }

With the knowledge that it is the second pop_back() that is the source of the problem, it is easy to appreciate the bug. If the "if" condition is false and there is no else clause (i.e. else_pos == -1 ) then we accidentally pop the label twice. Indeed, the final labels.pop_back() shouldn't be there at all and is the result of a sloppy cut-and-paste. On a first casual glance (we do a push at the start of the opcode, and a pop at the end) everything looks ok, but the insight given us by undodb allows the problem to be identified. It is in fact a lot more staightforward to see this error when removing chunks of code and replacing with ellipses!