Segfault using recent Gtk
- Found in: 2008-07-21 in Ubuntu hardy and intrepid, and Debian lenny
This bug was originally found because the gtk-powerbox-hook test case segfaulted. It appears to be a more general problem, affecting other Gtk programs.
$ pola-run -fw / -e leafpad
(leafpad:1424): GLib-WARNING **: getpwuid_r(): failed due to: Invalid argument.
*** glibc detected *** /usr/bin/leafpad: munmap_chunk(): invalid pointer: 0x00002ab2dd42f690 ***
======= Backtrace: =========
/work/plash/plash/lib/libc.so.6(cfree+0x1b6)[0x2ab2d8d3e026]
/work/plash/plash/lib/libdl.so.2(dlerror+0x19c)[0x2ab2d87f23ac]
/usr/lib/libgmodule-2.0.so.0[0x2ab2d85e9458]
/usr/lib/libgmodule-2.0.so.0(g_module_symbol+0xac)[0x2ab2d85e957c]
/usr/lib/libgmodule-2.0.so.0(g_module_open+0x3b2)[0x2ab2d85e9bf2]
/usr/lib/libpango-1.0.so.0[0x2ab2d7eec0db]
/usr/lib/libgobject-2.0.so.0(g_type_module_use+0x6c)[0x2ab2d83c783c]
/usr/lib/libpango-1.0.so.0[0x2ab2d7eec1c9]
/usr/lib/libpango-1.0.so.0[0x2ab2d7eec289]
/usr/lib/libpango-1.0.so.0[0x2ab2d7eef45a]
/usr/lib/libpango-1.0.so.0[0x2ab2d7eefa82]
/usr/lib/libpango-1.0.so.0(pango_itemize_with_base_dir+0x6c)[0x2ab2d7eefd3c]
/usr/lib/libpango-1.0.so.0[0x2ab2d7ef7e7e]
/usr/lib/libpango-1.0.so.0[0x2ab2d7ef9041]
/usr/lib/libpango-1.0.so.0(pango_layout_get_pixel_extents+0x50)[0x2ab2d7ef9fe0]
/usr/lib/libpango-1.0.so.0(pango_layout_get_pixel_size+0x1e)[0x2ab2d7efa05e]
/usr/bin/leafpad[0x40b7d1]
/usr/bin/leafpad[0x40b879]
/usr/bin/leafpad[0x409f9d]
/usr/bin/leafpad[0x4090fe]
/usr/bin/leafpad[0x408726]
/work/plash/plash/lib/libc.so.6(__libc_start_main+0xf4)[0x2ab2d8ce2f14]
/usr/bin/leafpad[0x408369]
This appears to be a problem with PlashGlibc because it also occurs with a gutsy plash-pkg environment on top of hardy:
$ echo 'deb http://localhost:9999/ubuntu gutsy main universe' >sources.list
$ plash-pkg-update-avail
$ plash-pkg-install leafpad -c packaging/examples/leafpad.pkg
$ plash-pkg-launch --app-dir leafpad -e /usr/bin/leafpad
(leafpad:4116): GLib-WARNING **: getpwuid_r(): failed due to: Invalid argument.
*** glibc detected *** /usr/bin/leafpad: munmap_chunk(): invalid pointer: 0x00002b71c3c7d580 ***
======= Backtrace: =========
/lib64/libc.so.6(cfree+0x1b6)[0x2b71c1b91026]
/lib64/libc.so.6(__libc_dlsym+0x80)[0x2b71c1c36630]
...
Debugging with valgrind
It's possible to use valgrind with Plash in a limited way. Use run-uninstalled.sh but change tests/wrapper.sh to the following:
#!/bin/sh
exec valgrind "$PLASH_LDSO_PATH" --library-path "$PLASH_LIBRARY_DIR" "$@"
valgrind does not appear to offer a way to set environment variables for the program it runs, so LD_LIBRARY_PATH cannot be set for that program alone; setting it in valgrind's own environment interferes with valgrind itself.
The --db-attach=yes option does not work in many cases (also reported here).
glibc 2.7
If I revert to using PlashGlibc based on glibc 2.7, leafpad no longer dies. When exiting leafpad, valgrind gives:
==21728== Invalid free() / delete / delete[]
==21728==    at 0x4A1BB2E: free (vg_replace_malloc.c:323)
==21728==    by 0x6CC1FCA: free_mem (in /work/plash/plash/lib/libc.so.6)
==21728==    by 0x6CC1BD9: __libc_freeres (in /work/plash/plash/lib/libc.so.6)
==21728==    by 0x481231C: _vgnU_freeres (vg_preloaded.c:60)
==21728==    by 0x6BD9CEA: exit (in /work/plash/plash/lib/libc.so.6)
==21728==    by 0x6BC317A: (below main) (in /work/plash/plash/lib/libc.so.6)
==21728==  Address 0x84e0830 is not stack'd, malloc'd or (recently) free'd
(__libc_freeres is called by valgrind as an extra check. glibc contains lots of functions named free_mem.)
If I run 2.7's ld.so with 2.8's libc.so, it works, with the valgrind message above. If I then switch to 2.8's ld.so, there is an error on startup: a segfault, or with valgrind, the following message:
==22727== Invalid free() / delete / delete[]
==22727==    at 0x4A1BB2E: free (vg_replace_malloc.c:323)
==22727==    by 0x66D23AB: dlerror (in /work/plash/plash/lib/libdl.so.2)
==22727==    by 0x64C9457: (within /usr/lib/libgmodule-2.0.so.0.1600.3)
==22727==    by 0x64C957B: g_module_symbol (in /usr/lib/libgmodule-2.0.so.0.1600.3)
==22727==    by 0x64C9BF1: g_module_open (in /usr/lib/libgmodule-2.0.so.0.1600.3)
==22727==    by 0x5DCC0DA: (within /usr/lib/libpango-1.0.so.0.2000.1)
==22727==    by 0x62A783B: g_type_module_use (in /usr/lib/libgobject-2.0.so.0.1600.3)
==22727==    by 0x5DCC1C8: (within /usr/lib/libpango-1.0.so.0.2000.1)
==22727==    by 0x5DCC288: (within /usr/lib/libpango-1.0.so.0.2000.1)
==22727==    by 0x5DCF459: (within /usr/lib/libpango-1.0.so.0.2000.1)
==22727==    by 0x5DCFA81: (within /usr/lib/libpango-1.0.so.0.2000.1)
==22727==    by 0x5DCFD3B: pango_itemize_with_base_dir (in /usr/lib/libpango-1.0.so.0.2000.1)
==22727==  Address 0xbf0f690 is not stack'd, malloc'd or (recently) free'd
Test case
The problem is with dlerror. This small test case reproduces the invalid free() call:
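#include <dlfcn.h>

int main(void)
{
    /* This dlopen() call fails, which records an error string. */
    dlopen("foo", 0);
    /* Fetching the error is what makes the invalid free() call. */
    dlerror();
    return 0;
}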
I have been unable to get a useful stack backtrace out of gdb at the failing free() call, or to get gdb to set a breakpoint on free(), and valgrind's gdb-invoking feature is not working. So I have resorted to printf debugging. In ld.so, that becomes _dl_printf debugging.
The problem occurs in the second free() call in dlfcn/dlerror.c's __dlerror(). I added a printf and the string it is trying to free looks fine. The string should be allocated by elf/dl-error.c's _dl_signal_error(), which sets a flag to indicate whether the error string was malloc()'d or not. (Before libc.so has been loaded, ld.so uses its own malloc/free implementation which later gets overridden; they are weak symbols.)
The strange thing is, adding _dl_printf calls into elf/dl-error.c made the problem go away. After I took them out, the problem was still not there. So this looks like a build problem. I rebuilt glibc from scratch, and the problem was still not there.
Comparing build trees shows some differences in config.log. The earlier one contains mentions of -Wl,-Bsymbolic-functions, missing in my newer from-scratch tree. The ld man page says about this option: "When creating a shared library, bind references to global function symbols to the definition within the shared library, if any." This could be causing ld.so to fail to link against libc.so's malloc/free. libdl.so would be trying to free() a block that was allocated by ld.so's private malloc().
Where could -Wl,-Bsymbolic-functions be coming from? I have noticed that dpkg-buildpackage prints messages about setting CFLAGS; it could be adding this option to LDFLAGS in the environment in an attempt to be "helpful" by stopping shared libraries from interacting with each other inadvertently. (config.log does not explicitly record which arguments have come from environment variables.)
Yes, this explanation is correct:
- DistCompilerFlags on the Ubuntu wiki documents how dpkg-buildpackage sets LDFLAGS.
- Here is an example of a problem that was solved with -Bsymbolic: "png2/3 problem apparently successfully solved with -Bsymbolic"
- Ubuntu's glibc became very broken when this change was introduced; see bug #201673 (which has plenty of duplicates).
According to this forum thread, Ubuntu's glibc was fixed in version 2.7-9ubuntu2. This just gives a changelog message. Where can I find the actual change?
This change (revision 135) contains the fix.
I found the Bazaar branch "glibc-2.5-package" linked from "Ubuntu Toolchain Hackers" (which I found using Google). It is an oddly-named branch, considering that it does not use glibc 2.5 any more. There is a "glibc-2.7-package" branch which appears to be stale. They are not linked from https://launchpad.net/ubuntu/+source/glibc or https://launchpad.net/glibc.
Despite my original report, this is not broken on Debian lenny. Debian has not changed dpkg-buildpackage. Debian seems to be more cautious about these things; they did not change gcc to use -fstack-protector by default either. This page may be relevant: http://wiki.debian.org/Hardening
Original post and patch introducing relro feature: RFC PATCH: Little hardening DSOs/executables against exploits. Also see Ian Lance Taylor's blog post.
How to test for the breakage: In a correct ld.so, malloc is listed in the dynamic relocation table. In a broken ld.so, it is missing:
$ objdump -R /lib/ld-linux-x86-64.so.2 | grep malloc
000000000021d008 R_X86_64_JUMP_SLOT malloc
$ objdump -R glibc-build/elf/ld.so | grep malloc
Would this have been caught by glibc's test suite? I don't think there is a test for dlerror(). I think elf/check-localplt checks libc.so but not ld.so.
Lessons for me:
- Believe the stack traces that valgrind gives.
- I need to implement StoryTest2 ("Test packages under Plash in bulk").
- ContinuousIntegration should do clean builds from scratch, so that I can rule out problems caused by reusing build trees.
General lessons:
- Having a build system based on wrappers around wrappers is not a good idea. I'm thinking of dpkg-buildpackage here. Having it set environment variables -- which are not present if you try building "by hand" with ./configure && make -- is a recipe for confusion.
