ELF chainloader notes
Overmapping the chainloader
The chainloader is not relocatable; it is built to be loaded at a fixed address, and this is the same default fixed address that normal executables use (0x08048000 on i386 and 0x00400000 on amd64). This saves us from having to do the relocation tricks that ld.so does on startup. As a happy accident, it means that when ld.so loads the main executable, the memory mappings overwrite the chainloader's mappings, so the chainloader does not clutter up the address space. However, the downside is that the chainloader cannot load fixed-address static executables, although we do not yet have a need for that. There may be unusual programs such as Wine that require changing the chainloader.
rtldi goes to some unusual lengths to try to be relocatable, but in the end it is loaded at a fixed address (0x10000), which is set using a custom linker script (rtldi.lds). It avoids using C string literals, but this does not appear to be necessary.
Segment gap on amd64
Actually, the chainloader is not completely overwritten on amd64, because executables there have a 2MB gap between the end of the code segment (which starts at 0x00400000) and the start of the data segment (>0x00600000). The reason for the gap is to avoid mapping the trailing end of the code segment as writable where it shares a page with the data segment, which is what happens on i386. This stops the program from accidentally writing to that part of the code segment.
This is referred to in section 5.1 of the x86-64 ABI: "To save space the file page holding the last page of the text segment may also contain the first page of the data segment. The last data page may contain file information not relevant to the running process. Logically, the system enforces the memory permissions as if each segment were complete and separate; segments' addresses are adjusted to ensure each logical page in the address space has a single set of permissions. In the example above, the region of the file holding the end of text and the beginning of data will be mapped twice: at one virtual address for text and at a different virtual address for data."
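The size of the hole in the address space can be sketched with a little page arithmetic. This is only an illustration: the 0x4f80-byte code segment is a made-up size, and the data start addresses assume the data segment is placed one page (i386) or 2MB (amd64) past the end of the code, keeping the same offset within the page.

```python
PAGE_SIZE = 0x1000  # 4kB pages on both i386 and amd64


def unmapped_gap(code_start: int, code_size: int, data_start: int,
                 page_size: int = PAGE_SIZE) -> int:
    """Bytes of address space left unmapped between the code and data mappings.

    Mappings cover whole pages, so the end of the code segment is rounded up
    to a page boundary and the start of the data segment is rounded down.
    """
    code_end_page = (code_start + code_size + page_size - 1) & ~(page_size - 1)
    data_start_page = data_start & ~(page_size - 1)
    return data_start_page - code_end_page


if __name__ == "__main__":
    # Hypothetical 0x4f80-byte code segment in both layouts.
    print(hex(unmapped_gap(0x08048000, 0x4F80, 0x0804DF80)))  # i386-style layout
    print(hex(unmapped_gap(0x00400000, 0x4F80, 0x00604F80)))  # amd64-style layout
```

With these example numbers the i386-style layout leaves no unmapped pages, while the amd64-style layout leaves a hole of just under 2MB, which is where leftover chainloader mappings can survive.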
The gap appears to be implemented by the following line of the linker script (/usr/lib/ldscripts/elf_x86_64.x):
.ldata ALIGN(CONSTANT (MAXPAGESIZE)) + (. & (CONSTANT (MAXPAGESIZE) - 1)) :
It rounds up the position to the next MAXPAGESIZE multiple, and adds on the old position modulo MAXPAGESIZE. The net effect is to add MAXPAGESIZE to the position, unless the position is already a multiple of MAXPAGESIZE, in which case it adds nothing.
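The behaviour of that expression can be checked with a small model. This is a sketch, not the linker's code: align_up stands in for the linker script's ALIGN(), pos stands in for the location counter ".", and the sample positions are arbitrary.

```python
MAXPAGESIZE = 0x200000  # 2MB, the amd64 value discussed in these notes


def align_up(pos: int, alignment: int) -> int:
    """Model of ALIGN(): round pos up to a multiple of alignment (a power of two)."""
    return (pos + alignment - 1) & ~(alignment - 1)


def ldata_start(pos: int, maxpagesize: int = MAXPAGESIZE) -> int:
    """Evaluate ALIGN(MAXPAGESIZE) + (. & (MAXPAGESIZE - 1)) with '.' equal to pos."""
    return align_up(pos, maxpagesize) + (pos & (maxpagesize - 1))


if __name__ == "__main__":
    for pos in (0x400000, 0x400ABC, 0x5FFFFF):
        new = ldata_start(pos)
        print(f"{pos:#x} -> {new:#x} (moved by {new - pos:#x})")
```

An aligned position is left alone, and any other position moves forward by exactly MAXPAGESIZE, so the offset within the page is preserved.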
What's not clear is why the increment is 2MB (MAXPAGESIZE) rather than 4kB (the actual page size, which would leave no gap). Perhaps it is to make the linker script more generic. Note that ELF libraries on amd64 specify an alignment of 2MB (displayed as "2**21" by objdump), which the system ignores. mmap() doesn't provide a way to ask for an alignment anyway.
-- I think the reason is that it makes the executable (not just the linker script) portable to systems that use a 2MB page size.
There is a thread about this on freebsd-amd64, but it doesn't say why the gap exists.
-- Actually, it is the DATA_SEGMENT_ALIGN expression in the linker script that does this, and it happens on i386 too. See Jakub Jelinek: PATCH: Smarter aligning of data segment.
Also see Problem with AMD64 ld with linker script.
Workaround 1
The chainloader's writable segment remains mapped in that gap. The chainloader shouldn't really need a writable segment at all. Currently it contains errno, which we don't read, and __vsyscall, which we don't write (because we don't use dietlibc's startup code which would initialise it from the auxv; it remains pointing to the default int $0x80 routine). However, __vsyscall is not used on amd64; dietlibc just hardcodes the syscall instruction in this case.
I worked around this by removing the need for the chainloader to have a writable segment. I reimplemented dietlibc's __unified_syscall to not store errno (since we don't read it at present). This means the chainloader is loaded in a single page.
Another way to fix this would be to link with -Wl,-z,max-page-size=0x1000.
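Whichever fix is used, it is easy to check whether the resulting binary still has a writable segment by looking at its PT_LOAD program headers. The following is a minimal sketch that only handles little-endian ELF32 and ELF64 images; the function names are made up for this example.

```python
import struct

PT_LOAD = 1
PF_W = 2  # p_flags bit: segment is writable


def load_segments(image: bytes):
    """Return (vaddr, memsz, flags) for each PT_LOAD entry of a little-endian ELF image."""
    if image[:4] != b"\x7fELF" or image[5] != 1:
        raise ValueError("not a little-endian ELF image")
    is64 = image[4] == 2  # EI_CLASS: 1 = ELF32, 2 = ELF64
    if is64:
        (phoff,) = struct.unpack_from("<Q", image, 0x20)
        phentsize, phnum = struct.unpack_from("<HH", image, 0x36)
    else:
        (phoff,) = struct.unpack_from("<I", image, 0x1C)
        phentsize, phnum = struct.unpack_from("<HH", image, 0x2A)
    segments = []
    for i in range(phnum):
        off = phoff + i * phentsize
        if is64:
            # Elf64_Phdr: type, flags, offset, vaddr, paddr, filesz, memsz, align
            p_type, flags, _, vaddr, _, _, memsz, _ = struct.unpack_from("<IIQQQQQQ", image, off)
        else:
            # Elf32_Phdr: type, offset, vaddr, paddr, filesz, memsz, flags, align
            p_type, _, vaddr, _, _, memsz, flags, _ = struct.unpack_from("<IIIIIIII", image, off)
        if p_type == PT_LOAD:
            segments.append((vaddr, memsz, flags))
    return segments


def has_writable_segment(image: bytes) -> bool:
    """True if any PT_LOAD segment has the PF_W flag set."""
    return any(flags & PF_W for _, _, flags in load_segments(image))
```

To use it, read the chainloader binary with open(path, "rb").read() and pass the bytes to has_writable_segment(); a False result means every loadable segment is read-only.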
Problem: heap doesn't work
The heap is normally allocated by the kernel to appear after the main executable's data segment. The main executable can be a normal fixed-position executable (such as /bin/cat) or it can be relocatable (such as /lib/ld-linux.so.2), in which case the kernel picks a different address from usual (0x56555000 on i386) so that there is plenty of address space for the heap to expand into. Since the heap is directly after the bss, the unused fraction of a page after the bss can be used as part of the heap.
The heap is extended with the brk() system call. If you mmap pages on top of the heap, the heap stops working, and brk() will refuse to allocate more heap space. This is what is happening after the main executable gets loaded on top of the chainloader. As well as blatting out the chainloader, it blats out the heap. Normally this would not be a problem. glibc's malloc() normally uses the heap for small allocations and mmap() for large allocations, but if it fails to allocate space on the heap, it will fall back to using mmap().
However, bash provides its own malloc implementation for some reason, which is implemented using sbrk() (a glibc-provided wrapper around brk()) and doesn't have a fallback to using mmap(). If the heap is not working, bash will not work and will give an error or segfault (depending on the version of bash).
We'll have to make the heap work again. The simplest way is to load the chainloader at a different fixed address. This involves rewriting a linker script. We should abandon the idea of having the chainloader mappings disappear. The chainloader can't unmap itself. The heap will follow on from the chainloader's data segment, although I'm not sure whether ld.so and libc.so will be able to use the otherwise-unused fractional page. (There is a special case in ld.so to check whether the heap is directly after its own bss.) It might still be worthwhile eliminating the chainloader's data segment, just to save 4k.
Making executables relocatable
Part of ld.so's startup is to relocate itself. See the comments in elf/rtld.c:
/* Relocate ourselves so we can do normal function calls and
data access using the global offset table. */ ...
/* Now life is sane; we can call functions and access global data. ... */
Is it possible to produce a position-independent executable that does not require any relocation to be done at run time on startup?
Step 1: Use -shared, even though this is for an executable. Define a function called _start (the entry point).
Step 2: Use -fPIC, as with normal shared libraries. One effect of this is to change how data (such as string literals) is accessed. In a fixed-position executable, the address of a string literal is fixed and is put into a register with a mov instruction that contains the fixed address. In a position-independent executable, we can get the address by adding an offset to the program counter (PC). On i386 the PC cannot be read like a normal register, so gcc uses a trick. The call instruction pushes the return address (the PC of the following instruction) onto the stack, so gcc introduces a function with a name like __i686.get_pc_thunk.bx, which copies that return address from the stack into a register:
    000001a1 <__i686.get_pc_thunk.bx>:
     1a1:   8b 1c 24    mov    (%esp),%ebx
     1a4:   c3          ret
Step 3: Use -fvisibility=hidden. By default, when linking shared libraries, externally visible functions are given PLT entries, and all calls to such a function go via its PLT entry. That makes functions overridable by other libraries, but it also means the PLT has to be set up at startup. If you mark a symbol as having hidden visibility, though, it is not put into the PLT.
If you're successful, the resulting ELF object should have no relocations (list them with objdump -R).
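The same check can be scripted. The sketch below counts the entries in every SHT_REL and SHT_RELA section listed in the section header table of a little-endian ELF image. It is an approximation of what objdump -R reports, assuming the file is a linked object that still has its section headers (on an unlinked .o file it would count static relocations instead); the function name is made up for this example.

```python
import struct

SHT_RELA = 4  # relocation entries with explicit addends
SHT_REL = 9   # relocation entries with implicit addends


def count_relocations(image: bytes) -> int:
    """Count entries in all SHT_REL/SHT_RELA sections of a little-endian ELF image."""
    if image[:4] != b"\x7fELF" or image[5] != 1:
        raise ValueError("not a little-endian ELF image")
    if image[4] == 2:  # EI_CLASS: ELF64
        (shoff,) = struct.unpack_from("<Q", image, 0x28)
        shentsize, shnum = struct.unpack_from("<HH", image, 0x3A)
        fmt = "<IIQQQQIIQQ"
    else:  # ELF32
        (shoff,) = struct.unpack_from("<I", image, 0x20)
        shentsize, shnum = struct.unpack_from("<HH", image, 0x2E)
        fmt = "<IIIIIIIIII"
    total = 0
    for i in range(shnum):
        # Section header fields: name, type, flags, addr, offset, size,
        # link, info, addralign, entsize (same order in ELF32 and ELF64).
        fields = struct.unpack_from(fmt, image, shoff + i * shentsize)
        sh_type, sh_size, sh_entsize = fields[1], fields[5], fields[9]
        if sh_type in (SHT_REL, SHT_RELA) and sh_entsize:
            total += sh_size // sh_entsize
    return total
```

A result of 0 for the chainloader binary would suggest that nothing needs to be relocated at startup.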
I suspect that the comments in ld.so's rtld.c are no longer accurate. They predate the introduction of get_pc_thunk in gcc (which I think replaces use of the global offset table for within-shared-object references) and they predate glibc's use of ELF's hidden visibility symbols. ld.so still needs relocating though: malloc() and co are defined in ld.so but later get overridden by libc.so, so they need PLT entries.
Would -pie and -fPIE do instead of -shared and -fPIC? This removes the need for -fvisibility=hidden. gcc's docs are not clear on how PIE differs from PIC. Linking with -pie adds a PT_INTERP entry, so the resulting executable will use ld.so and is not standalone. Linking with -pie or -shared adds writable sections for dynamic linking. Is it possible to produce a "static PIE" executable? (-shared and -static can be combined, but this changes the meaning of -static.) A page on Hardened Gentoo mentions "static PIE" executables. It may be possible to link with -static and change the ELF file type from ET_EXEC to ET_DYN afterwards to mark it as relocatable. Maybe that is what Hardened Gentoo's old -y etdyn option did.
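The ET_EXEC to ET_DYN idea amounts to changing a single 16-bit field in the ELF header. The sketch below shows only that mechanical step for little-endian images; the function name is made up, and this is not known to be what Hardened Gentoo's option did. Changing the type only relabels the file: the code inside must already be position-independent and free of relocations for the result to work when loaded elsewhere.

```python
import struct

ET_EXEC = 2  # fixed-address executable
ET_DYN = 3   # shared object or position-independent executable
E_TYPE_OFFSET = 16  # e_type follows the 16-byte e_ident in both ELF32 and ELF64


def mark_as_et_dyn(image: bytes) -> bytes:
    """Return a copy of a little-endian ELF image with e_type changed from ET_EXEC to ET_DYN."""
    if image[:4] != b"\x7fELF" or image[5] != 1:
        raise ValueError("not a little-endian ELF image")
    (e_type,) = struct.unpack_from("<H", image, E_TYPE_OFFSET)
    if e_type != ET_EXEC:
        raise ValueError(f"expected ET_EXEC, found e_type={e_type}")
    patched = bytearray(image)
    struct.pack_into("<H", patched, E_TYPE_OFFSET, ET_DYN)
    return bytes(patched)
```

The function refuses to touch a file that is not ET_EXEC, so running it twice on the same image raises an error rather than silently doing nothing.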
