Gdb

Or the few gdb commands I use all the time because they are enough for 95% of cases: run, backtrace, frame, print, list, break, watch, continue. What follows is a real scenario.

A classical problem

The speech synthesis from Android, pico, is of very good quality, but when you build and run it on a 64-bit machine, boom, it crashes beautifully:

    € pico2wave -w test.wav foo
    zsh: segmentation fault  pico2wave -w test.wav foo

What can we do? Gdb! Quite often, gdb's answer is enlightening!

First, we have to rebuild the program with the -g option and avoid optimizations, which often make debugging tedious. Typically:

€ CFLAGS="-g -O0" ./configure && make

Beware, however, that using -O0 sometimes makes the bug disappear! In that case we really have to keep the optimization options, but gdb may then have trouble showing some values.

    € ./pico2wave -w test.wav foo
    zsh: segmentation fault  pico2wave -w test.wav foo

Phew, it still segfaults. That is a very good thing: it is systematically reproducible!

run, backtrace, frame, print, list

A useful tip is gdb's --args option, which lets us simply prepend gdb --args to the command that we just tried.

    € gdb --args ./pico2wave -w test.wav foo
    GNU gdb (GDB) blabla
    blablabla
    ...
    (gdb)

gdb is ready, so we simply type r (shortcut for run):

    (gdb) r
    Starting program: /tmp/svox-1.0+git20100205/pico/.libs/lt-pico2wave -w test.wav foo
    Program received signal SIGSEGV, Segmentation fault.
    0x00007ffff7711579 in picoos_deallocate (this=0x7ffff7285470, adr=0x7ffff736d100) at lib/picoos.c:579
    579             if (cr->size > 0) {
    (gdb)

Here is a very interesting line! A priori, the segfault can only come from dereferencing cr. Let's use p (shortcut for print) to see the content of cr:

    (gdb) p cr
    $1 = (MemCellHdr) 0x8000f736f9d8

A strange-looking pointer indeed (judging from /proc/$(pidof pico2wave)/maps), and gdb confirms:

    (gdb) p *cr
    Cannot access memory at address 0x8000f736f9d8

Let's see where that comes from thanks to bt (shortcut for backtrace):

    (gdb) bt
    #0  0x00007ffff7711579 in picoos_deallocate (this=0x7ffff7285470, adr=0x7ffff736d100) at lib/picoos.c:579
    #1  0x00007ffff7730462 in sigDeallocate (mm=0x7ffff7285470, sig_inObj=0x7ffff736d090) at lib/picosig2.c:367
    #2  0x00007ffff772da10 in sigSubObjDeallocate (this=0x7ffff736ad78, mm=0x7ffff7285470) at lib/picosig.c:238
    #3  0x00007ffff76f911f in picodata_disposeProcessingUnit (mm=0x7ffff7285470, this=0x7ffff7286c20) at lib/picodata.c:638
    #4  0x00007ffff76f753c in ctrlSubObjDeallocate (this=0x7ffff7286b80, mm=0x7ffff7285470) at lib/picoctrl.c:274
    #5  0x00007ffff76f911f in picodata_disposeProcessingUnit (mm=0x7ffff7285470, this=0x7ffff7285448) at lib/picodata.c:638
    #6  0x00007ffff76f7c2e in picoctrl_disposeControl (mm=0x7ffff7285470, this=0x7ffff7285448) at lib/picoctrl.c:487
    #7  0x00007ffff76f80b5 in picoctrl_disposeEngine (mm=0x7ffff7120030, rm=0x7ffff7120538, this=0x602c60) at lib/picoctrl.c:671
    #8  0x00007ffff76effce in pico_disposeEngine (system=0x7ffff7120010, inoutEngine=0x602c60) at lib/picoapi.c:569
    #9  0x0000000000401edc in main (argc=4, argv=0x7fffffffdc68) at bin/pico2wave.c:319
    (gdb)

Err, cryptic, isn't it? In a lot of cases cr would have been a simple parameter of the current function, and we would have had to walk up to the caller with the up command and keep using p to check where the value comes from. That is apparently not the case here: cr is not a parameter. Let's look around the guilty line inside the picoos_deallocate function (l is a shortcut for list):

    (gdb) l
    574         this->usedSize -= c->size;
    575
    576         cr = (MemCellHdr)((picoos_objsize_t)c + c->size);
    577         cl = c->leftCell;
    578         if (cl->size > 0) {
    579             if (cr->size > 0) {
    580                 crr = (MemCellHdr)((picoos_objsize_t)cr + cr->size);
    581                 crr->leftCell = cl;
    582                 cl->size = ((cl->size + c->size) + cr->size);
    583                 cr->nextFree->prevFree = cr->prevFree;

Err, this code goes against the most common programming rules: there are casts everywhere! So cr comes from c + c->size. Let's see those:

    (gdb) p c
    $2 = (MemCellHdr) 0x7ffff736f9d8
    (gdb) p c->size
    $3 = 4294967296

Ouch again: the value of c seems reasonable (we can even dereference it), but the value of c->size is completely wrong. It is even exactly 0x100000000...
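The arithmetic behind the bad pointer can be checked outside gdb. The following is a minimal C sketch, not pico's code: next_cell_address is a hypothetical helper that mirrors the expression on line 576 with plain integers, so that nothing is actually dereferenced.

```c
#include <stdint.h>

/* Hypothetical helper mirroring line 576 of lib/picoos.c:
 *     cr = (MemCellHdr)((picoos_objsize_t)c + c->size);
 * Addresses are handled as plain 64-bit integers, so no memory is read. */
uint64_t next_cell_address(uint64_t cell_address, int64_t cell_size)
{
    return cell_address + (uint64_t)cell_size;
}
```

With the values printed by gdb, next_cell_address(0x7ffff736f9d8, 4294967296) gives 0x8000f736f9d8, the value of cr. Adding 0x100000000 pushes the address just above 0x00007fffffffffff, the upper bound of user space on x86_64 with 48-bit virtual addresses, which is why gdb cannot access it.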

A couple of tries on the side

Where does this odd value come from? Either some computation is bogus, or we have an overflow from somewhere else.

Luckily, only a few lines of code assign a value to size, so we can quickly add printfs alongside them. We only get reasonable values, however, and never 0x100000000, so the problem is elsewhere... What can we do?

valgrind is a buffer overflow specialist, and usually very efficient at detecting pointer and allocation errors. It is however ineffective here: apart from a lot of "Conditional jump or move depends on uninitialised value(s)" messages, it also ends on this cr->size line without having given any clue before (that is because pico uses its own home-made allocator, which valgrind thus cannot check). Electric-Fence, a more lightweight buffer overflow detection tool, is ineffective here for the same reason.
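To see why a home-made allocator hides such overflows, here is a toy sketch under stated assumptions. It is not pico's allocator: pool, toy_allocate, toy_cell_size and demo_overflow are hypothetical names. All cells live inside one big array, so writing past the end of one cell lands in the header of the next one while staying inside memory that is perfectly valid from the point of view of an external tool.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical toy allocator, NOT pico's: cells are carved out of one
 * static pool, each payload being preceded by a long holding its size. */
enum { POOL_SIZE = 256 };
static unsigned char pool[POOL_SIZE];
static size_t pool_used;

/* Return a payload of byte_size bytes, or NULL when the pool is exhausted. */
unsigned char *toy_allocate(size_t byte_size)
{
    long size = (long)byte_size;
    size_t free_bytes = POOL_SIZE - pool_used;
    unsigned char *cell;

    if (free_bytes < sizeof size || byte_size > free_bytes - sizeof size)
        return NULL;
    cell = pool + pool_used;
    memcpy(cell, &size, sizeof size);      /* write the size header */
    pool_used += sizeof size + byte_size;
    return cell + sizeof size;
}

/* Read back the size header stored just before a payload. */
long toy_cell_size(const unsigned char *payload)
{
    long size;
    memcpy(&size, payload - sizeof size, sizeof size);
    return size;
}

/* Allocate two 16-byte cells, fill one byte too many in the first one,
 * and return the size now recorded for the second one (-1 on failure). */
long demo_overflow(void)
{
    unsigned char *a = toy_allocate(16);
    unsigned char *b = toy_allocate(16);

    if (a == NULL || b == NULL)
        return -1;
    memset(a, 0xff, 16 + 1);   /* the extra byte hits b's size header */
    return toy_cell_size(b);
}
```

The size recorded for the second cell no longer reads 16 after the overflow, yet every byte written belongs to pool, so there is no invalid access to report.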

break, watch

There is still the watch solution. Indeed, when executing the program several times with address randomization disabled (echo 0 > /proc/sys/kernel/randomize_va_space), we can notice that the address of c->size is always the same:

    (gdb) p &c->size
    $1 = (long int *) 0x7ffff736f9d8

We can just ask gdb to stop when this memory value changes. Let's first restart from scratch and stop at the beginning of the main function, before things go bad (b is a shortcut for break):

    (gdb) b main
    Breakpoint 1 at 0x4012ef: file bin/pico2wave.c, line 73.
    (gdb) r
    The program being debugged has been started already.
    Start it from the beginning? (y or n) y
    Starting program: /tmp/svox-1.0+git20100205/pico/.libs/lt-pico2wave -w test.wav foo
    Breakpoint 1, main (argc=4, argv=0x7fffffffdc58) at bin/pico2wave.c:73
    73          char * wavefile = NULL;

We use watch to tell gdb which value should be monitored:

    (gdb) watch *(long int *) 0x7ffff736f9d8
    Hardware watchpoint 2: *(long int *) 0x7ffff736f9d8

Take note of "Hardware" here: we only monitor an integer in memory, so gdb can ask the processor to do the monitoring itself, which costs essentially nothing in performance!

Note: watch can also simply observe an existing variable (watch mavariable), but here we don't really know when it starts existing...

We can now continue execution (yes, negative size values are "normal"...):

    (gdb) c
    Continuing.
    Hardware watchpoint 2: *(long int *) 0x7ffff736f9d8
    Old value = 0
    New value = 40120
    picoos_allocate (this=0x7ffff7285470, byteSize=1024) at lib/picoos.c:539
    539         c->size = cellSize;
    Current language:  auto
    The current source language is "auto; currently c".
    (gdb) c
    Continuing.
    Hardware watchpoint 2: *(long int *) 0x7ffff736f9d8
    Old value = 40120
    New value = 1040
    picoos_allocate (this=0x7ffff7285470, byteSize=1024) at lib/picoos.c:540
    540         c2->leftCell = c;
    (gdb) c
    Continuing.
    Hardware watchpoint 4: *(long int *) 0x7ffff736f9d8
    Old value = 1040
    New value = -1040
    picoos_allocate (this=0x7ffff7285470, byteSize=1024) at lib/picoos.c:556
    556         adr = (void *)((picoos_objsize_t)c + this->usedCellHdrSize);
    (gdb) c
    Continuing.
    Hardware watchpoint 4: *(long int *) 0x7ffff736f9d8
    Old value = -1040
    New value = -4294967296
    memset () at ../sysdeps/x86_64/memset.S:954
    954     ../sysdeps/x86_64/memset.S: No such file or directory.
            in ../sysdeps/x86_64/memset.S
    Current language:  auto
    The current source language is "auto; currently asm".

Ah, here is something interesting! Let's see who calls memset and thus overwrites our size field:

    (gdb) bt
    #0  memset () at ../sysdeps/x86_64/memset.S:954
    #1  0x00007ffff77151fa in picopal_mem_set (dest=0x7ffff736f63c, byte_val=0 '\000', length=928) at lib/picopal.c:258
    #2  0x00007ffff7710bfb in picoos_mem_set (dest=0x7ffff736f63c, byte_val=0 '\000', length=928) at lib/picoos.c:184
    #3  0x00007ffff7730cca in mel_2_lin_lookup (sig_inObj=0x7ffff736d090, scmeanMGC=10) at lib/picosig2.c:572
    #4  0x00007ffff772e3d6 in sigProcess (this=0x7ffff736ad78, inReadPos=0, numinb=64, outWritePos=0, numoutb=0x7fffffffd758) at lib/picosig.c:540
    #5  0x00007ffff772f2d1 in sigStep (this=0x7ffff736ad78, mode=0, numBytesOutput=0x7fffffffd7da) at lib/picosig.c:1104
    #6  0x00007ffff76f72ab in ctrlStep (this=0x7ffff736f63c, mode=0, bytesOutput=0x7fffffffd826) at lib/picoctrl.c:153
    #7  0x00007ffff76f821e in picoctrl_engFetchOutputItemBytes (this=0x7ffff7285428, buffer=0x7fffffffd9a0 "8\365\330\003", bufferSize=128, bytesReceived=0x7fffffffd99c) at lib/picoctrl.c:762
    #8  0x00007ffff76f012c in pico_getData (engine=0x7ffff7285428, buffer=0x7fffffffd9a0, bufferSize=128, bytesReceived=0x7fffffffd99c, outDataType=0x7fffffffd99a) at lib/picoapi.c:650
    #9  0x0000000000401d10 in main (argc=4, argv=0x7fffffffdc58) at bin/pico2wave.c:278
    (gdb) p/x 0x7ffff736f63c + 928
    $5 = 0x7ffff736f9dc

By having gdb do a little computation, we can see that the overflow into the size field was really small: the memset ends at 0x7ffff736f9dc, just 4 bytes past the start of the field at 0x7ffff736f9d8... This memset is wrapped in several functions to look nicer, so we would have to use up several times to walk up to the caller, the mel_2_lin_lookup function (frame #3 in the call stack). We can go faster thanks to frame:

    (gdb) frame 3
    #3  0x00007ffff7730cca in mel_2_lin_lookup (sig_inObj=0x7ffff736d090, scmeanMGC=10) at lib/picosig2.c:572
    572         picoos_mem_set(XXr + m1, 0, i);
    Current language:  auto
    The current source language is "auto; currently c".
    (gdb) l
    567         XXr[0] = (picoos_int32) ((picoos_single) c1[0] * K1);
    568         for (nI = 1; nI < m1; nI++) {
    569             XXr[nI] = c1[nI] << shift;
    570         }
    571         i = sizeof(picoos_int32) * (PICODSP_FFTSIZE + 1 - m1);
    572         picoos_mem_set(XXr + m1, 0, i);
    573         dfct_nmf(m4, XXr); /* DFCT directly in fixed point */
    574
    575         /* *****************************************************************************************
    576            Linear frequency scale envelope through interpolation.

Here is the culprit, which overwrites our field. We have seen above that it overflows by 4 bytes. Looking closer, the memset clears the XXr buffer from byte 4*m1 up to byte 4*(PICODSP_FFTSIZE + 1). Couldn't the +1 be superfluous?! Let's have a look at the code which allocates this pointer (it actually comes from a wcep_pI field, whose real name is int_vec28):

    d32 = (picoos_int32 *) picoos_allocate(mm, sizeof(picoos_int32) * PICODSP_FFTSIZE);
    if (NULL == d32) {
        sigDeallocate(mm, sig_inObj);
        return PICO_ERR_OTHER;
    }
    sig_inObj->int_vec28 = d32;

The +1 indeed does not seem appropriate! Or is the allocation not big enough?! At this point we should discuss with the author, but in any case we have a clear culprit! Thanks, gdb!
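The effect of that extra element can be reproduced in isolation. The following is only a sketch under assumptions, not pico's code: clear_tail and FFT_SIZE are hypothetical names, the buffer is assumed to be immediately followed by the 8-byte size field of the next cell, and the machine is assumed to be little-endian (as x86_64 is). It applies a memset of the same shape as lines 571-572 and returns what the neighbouring size field has become.

```c
#include <stdint.h>
#include <string.h>

enum { FFT_SIZE = 8 };   /* stand-in for PICODSP_FFTSIZE; the value is arbitrary */

/* Hypothetical layout: FFT_SIZE 32-bit samples immediately followed by the
 * 64-bit size field of the next allocator cell. Returns the value of that
 * field after the tail of the sample buffer has been cleared. */
int64_t clear_tail(size_t m1, int64_t next_cell_size)
{
    unsigned char memory[FFT_SIZE * sizeof(int32_t) + sizeof(int64_t)];
    unsigned char *next_header = memory + FFT_SIZE * sizeof(int32_t);
    int64_t result;

    if (m1 > FFT_SIZE)           /* keep the length computation below sane */
        return next_cell_size;

    memset(memory, 0x55, sizeof memory);
    memcpy(next_header, &next_cell_size, sizeof next_cell_size);

    /* Same shape as lines 571-572: clears (FFT_SIZE + 1 - m1) samples
     * starting at sample m1, that is one sample more than the buffer holds. */
    memset(memory + m1 * sizeof(int32_t), 0,
           sizeof(int32_t) * (FFT_SIZE + 1 - m1));

    memcpy(&result, next_header, sizeof result);
    return result;
}
```

Called with a size of -1040, this returns -4294967296 on a little-endian machine, the same transition as the one reported by the watchpoint: the low 4 bytes of the field are cleared while the high 4 bytes keep their 0xffffffff sign extension.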

Conclusion

You don't actually need to know a lot of gdb commands. The first steps, r, bt, l, and p, are actually enough in most cases! For most other cases, a couple of b and watch commands can help a lot to track down bugs. You can also try to combine those with reverse-continue.

Appendix

Threads

When debugging a multithreaded program, it is useful to switch between threads:

    (gdb) r
    Starting program: /tmp/test
    [Thread debugging using libthread_db enabled]
    [New Thread 0x7ffff7861710 (LWP 10981)]
    [New Thread 0x7ffff7060710 (LWP 10982)]
    ^C
    Program received signal SIGINT, Interrupt.
    0x00007ffff7bcabe5 in pthread_join (threadid=140737346148112, thread_return=0x0) at pthread_join.c:89
    89      pthread_join.c: No such file or directory.
            in pthread_join.c
    (gdb) bt
    #0  0x00007ffff7bcabe5 in pthread_join (threadid=140737346148112, thread_return=0x0) at pthread_join.c:89
    #1  0x00000000004005fd in main ()
    (gdb)

Here, we are in the main thread, which is simply waiting for the threads it created. Let's see the list of threads:

    (gdb) info thread
      3 Thread 0x7ffff7060710 (LWP 10982)  0x00000000004005ac in f ()
      2 Thread 0x7ffff7861710 (LWP 10981)  0x00000000004005ac in f ()
    * 1 Thread 0x7ffff7fce700 (LWP 10968)  0x00007ffff7bcabe5 in pthread_join (
        threadid=140737346148112, thread_return=0x0) at pthread_join.c:89
    (gdb)

And switch between them:

    (gdb) thread 2
    [Switching to thread 2 (Thread 0x7ffff7861710 (LWP 10981))]#0  0x00000000004005ac in f ()
    (gdb) bt
    #0  0x00000000004005ac in f ()
    #1  0x00007ffff7bc98ba in start_thread (arg= ) at pthread_create.c:300
    #2  0x00007ffff793102d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #3  0x0000000000000000 in ?? ()
    (gdb) thread 3
    [Switching to thread 3 (Thread 0x7ffff7060710 (LWP 10982))]#0  0x00000000004005ac in f ()
    (gdb) bt
    #0  0x00000000004005ac in f ()
    #1  0x00007ffff7bc98ba in start_thread (arg= ) at pthread_create.c:300
    #2  0x00007ffff793102d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #3  0x0000000000000000 in ?? ()
    (gdb)

One can also use thread apply all to go faster:

    (gdb) thread apply all bt
    Thread 3 (Thread 0x7ffff7060710 (LWP 10982)):
    #0  0x00000000004005ac in f ()
    #1  0x00007ffff7bc98ba in start_thread (arg= ) at pthread_create.c:300
    #2  0x00007ffff793102d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #3  0x0000000000000000 in ?? ()
    Thread 2 (Thread 0x7ffff7861710 (LWP 10981)):
    #0  0x00000000004005ac in f ()
    #1  0x00007ffff7bc98ba in start_thread (arg= ) at pthread_create.c:300
    #2  0x00007ffff793102d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
    #3  0x0000000000000000 in ?? ()
    Thread 1 (Thread 0x7ffff7fce700 (LWP 10968)):
    #0  0x00007ffff7bcabe5 in pthread_join (threadid=140737346148112, thread_return=0x0) at pthread_join.c:89
    #1  0x00000000004005fd in main ()
    (gdb)

Heisenbugs

A Heisenbug is a bug which disappears as soon as one tries to debug it. We talked earlier about the optimization issue. In the case of a multithreaded program, it often happens that merely starting it in gdb makes the bug disappear.

Either the program crashes, in which case one can use the core which it dumps (use ulimit -c unlimited if the program didn't dump a core):

$ gdb ./monprog core ...

One can then examine the whole process the same way as if it were alive, except that one cannot resume execution.
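When changing the shell's limit with ulimit -c unlimited is not convenient (a daemon, a test harness), the program can also raise its own limit at startup. Here is a minimal sketch, assuming a POSIX system; enable_core_dumps is a hypothetical name, and the soft limit can only be raised up to the hard limit that the process inherited.

```c
#include <sys/resource.h>

/* Raise the soft core-size limit of the current process to its hard limit,
 * much like "ulimit -S -c hard" in the shell.
 * Returns 0 on success, -1 on failure (errno is set by the system call). */
int enable_core_dumps(void)
{
    struct rlimit limit;

    if (getrlimit(RLIMIT_CORE, &limit) != 0)
        return -1;
    limit.rlim_cur = limit.rlim_max;
    return setrlimit(RLIMIT_CORE, &limit);
}
```

Calling enable_core_dumps() early in main is enough. Whether a core file actually appears still depends on system settings outside the program, such as a hard limit of zero.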

Or the program hangs. One can force it to dump a core with control-\ (this is like control-C, except that it also creates a core). One can also attach to the living process:

$ gdb ./monprog $(pidof monprog) ...

Misc tips

Here are a few things I use quite often:

When using libraries, one needs to have debugging symbols for them as well. See for instance how to install debugging packages on Debian. You will probably want to get symbols for all the libraries your program uses. A quick way to check that is to break on main, run, and then type info sharedlibrary: a (*) will show in front of libraries for which symbols are missing.

To get the current context, one can use the list command, but one can also press control-x a to have gdb use ncurses to keep the listing at the top of the screen (or pass the -tui option to the gdb command, or type tui enable at the prompt). One can press control-x 2 to cycle between different combinations of C source code, assembly code, and registers. Note that the up/down arrows will now scroll the source instead of browsing the history; one can use control-p and control-n to browse the history. wh src -10 can be used to reduce the size of the source window.

It sometimes happens that some tool tells me that some instruction at some address did something bad, or something similar which gives me an instruction address (e.g. 0x1234). To know where that is, use

    (gdb) l * 0x1234

or use the addr2line tool.

One can print several adjacent memory locations thanks to @, for instance:

    (gdb) p t[8]@16

prints the 16 elements starting at t[8].

Reverse debugging is extremely powerful. In the case shown above, we could have used it to spot the bug. First we run the program with recording enabled:

    € gdb --args ./pico2wave -w test.wav foo
    (gdb) break main
    Breakpoint 1...
    (gdb) run
    Starting program
    Breakpoint 1, main ()
    (gdb) record
    (gdb) continue
    ...
    Program received signal SIGSEGV, Segmentation fault.
    0x00007ffff7711579 in picoos_deallocate (this=0x7ffff7285470, adr=0x7ffff736d100) at lib/picoos.c:579
    579             if (cr->size > 0) {

We get the address of the faulty area in the same way as above:

    (gdb) p cr
    $1 = (MemCellHdr) 0x8000f736f9d8
    ...
    (gdb) p c->size
    $3 = 4294967296
    (gdb) p &c->size
    $1 = (long int *) 0x7ffff736f9d8

and now we can reverse-watch this area to see what brought that odd value. We need to disable hardware watchpoints since it is gdb which recorded the changes, and then we can just reverse-continue:

    (gdb) set can-use-hw-watchpoints 0
    (gdb) watch *(long int *) 0x7ffff736f9d8
    Watchpoint 3...
    (gdb) reverse-continue
    Old value = -4294967296
    New value = -1040

and there we are! We can also use reverse-next, reverse-step, etc.

It may happen that the bug doesn't reproduce all the time. It would be tedious to relaunch the program several times, but one can automate this thanks to the command command:

    (gdb) break main
    Breakpoint 1: ...
    (gdb) command 1
    record
    continue
    end
    (gdb) break _exit
    Breakpoint 2: ...
    (gdb) command 2
    run
    end
    (gdb) set pagination off

Here we asked gdb to stop on main to activate recording and continue execution, and to stop on _exit (the normal termination of the process) to simply... restart from scratch. One also disables pagination so that it keeps looping:

    (gdb) r
    ...

will loop until a segfault or another error stops the program!

There are probably other very powerful ways of using command!

DUEL

dl

dl head-->next->val
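Assuming DUEL's --> operator follows a pointer field repeatedly until it reaches NULL, the expression head-->next->val yields the val field of every node of a linked list. A rough C equivalent is sketched below; struct node and collect_vals are hypothetical, only the names head, next and val come from the expression above.

```c
#include <stddef.h>

/* Hypothetical list node: only the field names next and val come from
 * the DUEL expression. */
struct node {
    int val;
    struct node *next;
};

/* Store the val of each node reachable from head into out (at most max
 * entries) and return the total number of nodes visited. */
size_t collect_vals(const struct node *head, int *out, size_t max)
{
    size_t count = 0;
    const struct node *n;

    for (n = head; n != NULL; n = n->next) {
        if (count < max)
            out[count] = n->val;
        count++;
    }
    return count;
}
```

The point of the DUEL one-liner is precisely that this loop does not have to be written, compiled in, and called by hand from the debugger.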

Copyright (c) 2011-2013, 2015-2016, 2018, 2020 Samuel Thibault

This text is available under the Creative Commons Attribution-ShareAlike Licence (V3.0), as described on http://creativecommons.org/licenses/by-sa/3.0/ and whose full text is available on http://creativecommons.org/licenses/by-sa/3.0/legalcode . Contact me if the ShareAlike clause poses a problem.