Changes made to Native Client (NaCl)
This is a list of changes involved in porting glibc to Native Client.
General approach
The approach I have taken in porting glibc to NaCl is to start out by building glibc for Linux using nacl-gcc, and gradually change it to work with NaCl. Rather than starting out with no Linux-specific code, I have just removed Linux-specific code when it has become a problem.
Steps:
- Build statically-linked version of glibc. Make some changes to get glibc to build with --disable-shared (which is rarely used with glibc these days and hence doesn't normally work).
- Resulting test program segfaults in glibc initialisation code. Some necessary fix ups:
  - glibc expects to find an auxv in the stack after the argv. Change sel_ldr to provide an empty auxv.
  - glibc wants to call a Linux syscall to set up the %gs register (used for TLS). Change to use analogous NaCl syscall.
- After limited fix ups, the executable works (in sel_ldr's debug mode) but is using some Linux syscalls. The Linux syscalls being made are visible in strace output. Simple Linux syscalls such as exit work, but syscalls involving memory addresses will fail.
- It is then a case of fixing each verifier failure one by one.
Fixing disallowed instructions
The glibc build started out with a number of instructions that are disallowed by the NaCl verifier.
- ret (must use popl %ecx; nacljmp %ecx instead)
  - Removed some i386-specific files from the source tree. Fall back to unoptimized versions.
  - Did search-and-replace in other cases.
- system calls (int $0x80)
  - Many syscall functions are generated in glibc by make-syscalls.sh, so changed this script to generate stubs.
  - Other cases use ENTER_KERNEL, so changed this macro to return ENOSYS.
- bts and btr: a micro-optimization for setting/unsetting bits in FD_SET/FD_CLR in sysdeps/i386/bits/select.h. Fixed by removing this header file.
- jumping into the middle of instructions: this is done as a micro-optimization to skip lock prefixes on some instructions when there is only one thread. Fixed by making the lock unconditional.
- exotic instructions, e.g. ld.so uses ud2 (two byte long undefined instruction) to abort the process when hlt would have presumably done the job.
Calling convention
ld.so uses a special form of the ret instruction, ret $0xc, in its implementation of lazy linking. It uses it to preserve all registers when jumping to the lazily linked function. The instruction pops some values off the stack and then does a ret. NaCl cannot allow this instruction because it does not force the return address to be masked. Unfortunately, the NaCl replacement for ret always overwrites a register. This means that NaCl cannot support calling conventions in which all registers are preserved. gcc's regparm function attribute could generate code that does not work.
ncval-stubout
The remaining verifier failures are fixed by a tool called ncval-stubout, which simply replaces disallowed instructions with hlt instructions. I realised that fixing the remaining failures by changing the glibc source by hand would not be a good use of my time. I would be changing code which in many cases would not need to be run yet. By changing code without running it I could be introducing mistakes which make debugging harder in the future. By using ncval-stubout, the problematic code will segfault when run, and it can be fixed up as needed.
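The stubbing-out idea can be illustrated with a minimal Python sketch. This is not the ncval-stubout source: the function name stub_out and the (offset, length) input format are assumptions made for illustration, and the real tool would obtain the disallowed ranges from the verifier. The 0xf4 byte is the hlt opcode that appears in the PLT listings in this document.

```python
HLT = 0xF4  # one-byte x86 hlt opcode


def stub_out(code, bad_ranges):
    """Return a copy of `code` with every (offset, length) range filled with hlt.

    `bad_ranges` is a hypothetical input format: the byte ranges of the
    instructions that the verifier rejected.
    """
    patched = bytearray(code)
    for offset, length in bad_ranges:
        if offset < 0 or length < 0 or offset + length > len(patched):
            raise ValueError("range (%d, %d) is outside the code" % (offset, length))
        # Same length in, same length out: no other instruction moves.
        patched[offset:offset + length] = bytes([HLT]) * length
    return bytes(patched)


# xor %eax,%eax ; ud2 ; nop  (ud2 is the two-byte sequence 0f 0b)
code = bytes([0x31, 0xC0, 0x0F, 0x0B, 0x90])
patched = stub_out(code, [(2, 2)])
```

Because each disallowed instruction is overwritten in place with the same number of bytes, the layout of the surrounding code is unchanged, and the stubbed code faults only if it is actually reached.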
Fixing instruction block alignment
Function addresses must be 32-byte-aligned in NaCl. The proper way to achieve this is for ELF object files to declare that their sections require 32 byte alignment. Initially the NaCl assembler did not make 32 byte alignment the default. The way to get the assembler to set the alignment is to add .p2align NACLENTRYALIGN to assembler files, which is what upstream NaCl's newlib patch does; also nacl-gcc puts in these declarations by default. glibc contains a lot of assembler files, however, and I did not want to change them all to add this declaration. Instead, I changed binutils so that the assembler marks ELF .text sections as needing 32 byte alignment.
My earlier fix for alignment was to add SUBALIGN specifiers to the linker script. For example,
.fini :
{
KEEP (*(.fini))
} =0x90909090
becomes
.fini : SUBALIGN(32)
{
KEEP (*(.fini))
} =0x90909090
Fixing padding
We need to ensure that when the linker pads a code section out to a multiple of 32 bytes, it pads it with nop instructions rather than zeros. (Zero bytes decode to multi-byte x86 instructions which usually end up straddling instruction blocks, causing the verifier to reject the code.) The "0x90909090" fill pattern in the linker script fragment above achieves this (0x90 is the opcode of the x86 nop instruction). This fill pattern is already present in binutils' i386 linker scripts, but it only applies to the section names that are listed explicitly in the linker script.
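The effect of the fill pattern can be modelled in a few lines of Python. This is only a sketch of the arithmetic, not what ld does internally; the name pad_with_nops is made up for the example.

```python
BUNDLE_SIZE = 32  # NaCl instruction block size in bytes
NOP = 0x90        # one-byte x86 nop opcode


def pad_with_nops(code, alignment=BUNDLE_SIZE):
    """Pad `code` with nop bytes up to the next multiple of `alignment`."""
    padding = -len(code) % alignment
    return code + bytes([NOP]) * padding


section = bytes([0x31, 0xC0])   # xor %eax,%eax
padded = pad_with_nops(section)
```

Padding with zeros instead would leave bytes that decode as multi-byte instructions, which is the case the fill pattern avoids.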
Allowing mmapping of code in sel_ldr
In order to allow dynamic linking, sel_ldr has been extended so that the mmap syscall can map executable code.
The executable regions in the NaCl process's address space are no longer contiguous; they will be interspersed with data mappings. This means that NaCl's previous x86 segment layout - where the x86 code segment is a subset of the x86 data segment - can no longer be used. Instead, the x86 code and data segments map to disjoint ranges of address space. This has the downside of halving the maximum amount of address space NaCl could present to NaCl processes. However, on hardware with the NX bit we could combine the x86 code and data segments again and use the NX bit to control which pages are executable.
The mmap implementation is not yet safe. In order to be safe it will do the following:
- mmap some memory outside the NaCl sandbox's address space and read the code into that.
- verify the code.
- set the memory to be read-only with mprotect().
- (on Linux) use mremap() to move the code mappings into the x86 code segment.
NaCl should refuse to unmap code or map over existing code mappings.
I will have to investigate whether Windows has an equivalent to mremap().
This is assuming that mremap() is atomic, or at least that the individual pages are mapped atomically.
If Windows does not provide such an operation we can at least still do this in the single-threaded case.
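The rule that NaCl should refuse to unmap code or map over existing code mappings amounts to an overlap check against the set of known code ranges. The following Python sketch shows only that bookkeeping; the class and method names are hypothetical and do not correspond to anything in sel_ldr.

```python
class CodeMappings:
    """Tracks validated code ranges as half-open (start, end) intervals."""

    def __init__(self):
        self._ranges = []

    def overlaps_code(self, start, length):
        end = start + length
        return any(start < r_end and r_start < end
                   for r_start, r_end in self._ranges)

    def map_code(self, start, length):
        """Record a new code mapping; refuse if it covers existing code."""
        if self.overlaps_code(start, length):
            return False
        self._ranges.append((start, start + length))
        return True

    def may_map_data(self, start, length):
        """Data mappings are allowed only where no code is mapped."""
        return not self.overlaps_code(start, length)

    def may_unmap(self, start, length):
        """Unmapping is allowed only if it leaves all code mappings intact."""
        return not self.overlaps_code(start, length)


mappings = CodeMappings()
first = mappings.map_code(0x10000, 0x2000)    # accepted
second = mappings.map_code(0x11000, 0x1000)   # refused: overlaps the first
```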
On Linux, in some situations we can safely mmap the code from the file directly so that the memory is shared between processes, if we know that the file will never be modified. The issue is that Linux does not support full copy-on-write. If a page is mapped into a process with MAP_PRIVATE, writes that the process makes to the mapped page are not reflected in the file, but writes to the file can be reflected in the mapped page.
Code/data split
Normally, in Linux binaries, code and read-only data are combined into the same ELF segment, and so are mmapped together. This allows the kernel to track one mapping instead of two (a small saving). It means that the read-only data is mapped as executable even though it doesn't need to be. In NaCl, the data cannot be mapped as executable (it is very unlikely to pass the verifier). The linker scripts have to be changed to split the code and read-only data into two segments.
A further problem is the ELF headers (PHDR). These are normally mapped as part of the code segment too. ld.so wants to mmap them for the bookkeeping information they contain. They are also exposed to other libraries via _dl_iterate_phdr. We need to put them into a non-executable data segment. Ideally they could go into the normal read-only data segment, but the linker does not make their placement configurable, so I created a new PT_LOAD segment for PHDRs. Maybe the rodata segment could be moved to the start of the binary, immediately after PHDRs, in order to combine the mappings.
Padding code to page boundaries
The ELF code segment needs to be padded to a page boundary (4k), because of the need to separate code and data pages. This assumes that we are going to mmap the code segments, though; if we are not going to mmap code directly from files, this requirement goes away. Note that on Windows, file mappings have to be aligned to the 64k allocation granularity rather than to the 4k page size, for dubious historical reasons.
The linker script fragment for this padding looks like this:
.text : SUBALIGN(32)
{
*(.text .stub .text.* .gnu.linkonce.t.*)
KEEP (*(.text.*personality*))
/* .gnu.warning sections are handled specially by elf32.em. */
*(.gnu.warning)
/* Putting .fini here makes the align pad correctly when .fini is empty.
Listing the __libc* sections is also necessary to make padding work. */
KEEP (*(.fini))
*(__libc_freeres_fn)
*(__libc_thread_freeres_fn)
. = ALIGN(CONSTANT (MAXPAGESIZE)); /* nacl wants page alignment */
} =0x90909090
The assignment to . needs to be put inside the section description, otherwise the padding does not get filled with nops. Also, this padding does not work properly if the output section is empty, or if there are text sections that are not listed in the linker script, so I have had to put more sections than usual into the description of .text.
Linker scripts
Linker scripts are configuration files normally found in /usr/lib/ldscripts. The standard linker scripts are generated by shell scripts in binutils. To make NaCl-specific changes quickly I have edited them by hand.
ld can find linker scripts by searching the library path (as specified by -L), but normally the default linker scripts are compiled in to ld in which case it doesn't bother to search. (It is also possible to specify the linker script with -Wl,-T,script, but this can't make use of ld's ability to find the right script for the link type (static vs. dynamic).) I disabled the compiling-in, so that I could change the linker scripts independently of binutils.
binutils: PLT entries
binutils had to be changed so that ld generates PLT entries that are usable in NaCl. A PLT (Procedure Linkage Table) entry is a stub used for calling a function in another library.
A normal i386 PLT entry (for position independent code) looks something like this:
$ objdump -d /lib/libc.so.6
...
000161c8 <malloc@plt>:
161c8: ff a3 18 00 00 00 jmp *0x18(%ebx)
161ce: 68 18 00 00 00 push $0x18
161d3: e9 b0 ff ff ff jmp 16188
The first instruction is the fast case. It loads a function address from the GOT (the Global Offset Table) and jumps to it. (In position independent code, the calling function will have set up %ebx to point to the GOT.) The other two instructions are the slow case, used during lazy binding. When using lazy binding the dynamic linker will set the GOT entry so that it points to the second instruction of the PLT entry. This pushes the GOT index onto the stack and jumps to the dynamic linker's fixup routine, which fills out the GOT entry so that future calls through this PLT entry are fast. Notice that the PLT entry fits neatly into 16 bytes.
The NaCl PLT entry looks like this:
$ nacl-objdump -d install/lib/libc.so.6
...
00001100 <malloc@plt>:
1100: 8b 8b 18 00 00 00 mov 0x18(%ebx),%ecx
1106: 83 e1 e0 and $0xffffffe0,%ecx
1109: ff e1 jmp *%ecx
110b: f4 hlt
110c: f4 hlt
110d: f4 hlt
...
1120: 68 18 00 00 00 push $0x18
1125: e9 d6 fe ff ff jmp 1000
112a: f4 hlt
112b: f4 hlt
112c: f4 hlt
...
Rather than doing an indirect jump to an address loaded from memory, we have to do a nacljmp, and we have to load the address into a register (we overwrite the register and cannot restore it). The two instruction sequences have to be padded with hlts to 32 bytes, so the whole PLT entry takes 64 bytes. Note that to see the <malloc@plt> label at the correct address in the disassembly, you must use nacl-objdump instead of the host system's objdump; the latter assumes 16 byte PLT entries.
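As a cross-check of the layout, the following Python sketch reassembles the bytes of the NaCl PLT entry shown in the listing above. It is not the code in binutils: the function and parameter names are invented for the example, and the opcodes are copied from the disassembly.

```python
import struct

HLT = 0xF4
BUNDLE_SIZE = 32


def pad_with_hlt(code):
    """Pad one instruction sequence with hlt bytes to a full 32-byte block."""
    return code + bytes([HLT]) * (-len(code) % BUNDLE_SIZE)


def nacl_plt_entry(entry_addr, got_offset, push_arg, slow_target):
    """Build a 64-byte PLT entry: a fast-path block and a slow-path block."""
    fast = (
        b"\x8b\x8b" + struct.pack("<i", got_offset)  # mov got_offset(%ebx),%ecx
        + b"\x83\xe1\xe0"                            # and $0xffffffe0,%ecx
        + b"\xff\xe1"                                # jmp *%ecx
    )
    slow_addr = entry_addr + BUNDLE_SIZE
    push = b"\x68" + struct.pack("<i", push_arg)     # push $push_arg
    # A rel32 jump is relative to the address of the next instruction.
    next_insn = slow_addr + len(push) + 5
    jmp = b"\xe9" + struct.pack("<i", slow_target - next_insn)
    return pad_with_hlt(fast) + pad_with_hlt(push + jmp)


# Values taken from the malloc@plt listing: entry at 0x1100, jump target 0x1000.
entry = nacl_plt_entry(0x1100, 0x18, 0x18, 0x1000)
```

The and instruction clears the low five bits of the loaded address, which is why the jump target must be 32-byte-aligned.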
binutils: rewriting TLS accesses
The ELF ABI for TLS (Thread Local Storage) involves rewriting generic code sequences for accessing TLS variables into more specific and efficient code sequences. The 12-byte General Dynamic (GD) code sequence involves a function call to ___tls_get_addr:
leal x@tlsgd(,%ebx,1),%eax
call ___tls_get_addr@plt
Inside libc.so, this would get rewritten by ld to the Initial Exec (IE) code sequence:
movl x@gotntpoff(%ebx),%eax
movl %gs:(%eax),%eax
The problem is that the NaCl assembler inserts no-op instructions before the call instruction, because a call must end at a 32-byte block boundary (so that its return address is 32-byte-aligned). The linker ends up overwriting the no-ops instead of overwriting the call, leaving invalid code. To address this I disabled the linker's rewriting of GD code sequences.
The linker will give a warning like this when it is overwriting instruction opcodes that it did not expect:
.../nacl/bin/ld: BFD (GNU Binutils) 2.18 assertion fail ../../binutils-2.18/bfd/elf32-nacl.c:3098
Raised as NaCl bug 237.
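The mismatch between what the linker expects and what the assembler emits can be shown with a small Python model. It is a simplification under stated assumptions: the leal is taken to be 7 bytes and the call 5 bytes (together the 12-byte GD sequence), the leal is assumed not to straddle a block boundary itself, and the function names are invented for the example.

```python
BUNDLE_SIZE = 32
LEAL_LEN = 7  # leal x@tlsgd(,%ebx,1),%eax
CALL_LEN = 5  # call ___tls_get_addr@plt


def padding_before_call(leal_offset):
    """No-op bytes needed so that the call ends on a 32-byte boundary."""
    unpadded_end = leal_offset + LEAL_LEN + CALL_LEN
    return -unpadded_end % BUNDLE_SIZE


def rewrite_covers_call(leal_offset):
    """True if the 12 bytes starting at the leal include the whole call.

    The linker rewrites those 12 bytes assuming the call immediately
    follows the leal, which only holds when no padding was inserted.
    """
    return padding_before_call(leal_offset) == 0


lucky = padding_before_call(20)   # 20 + 12 == 32, so no padding is needed
typical = padding_before_call(0)  # the call must be pushed to bytes 27..31
```

In the typical case the 12 bytes that the linker overwrites are the leal plus the first 5 bytes of padding, and the original call is left in place after the rewritten sequence.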
binutils fixes
There was a bug in gas's assembly of references to the GOT (Global Offset Table).
gcc fixes
I hit one case in which nacl-gcc was not generating code suitable for NaCl: the use of computed gotos in stdio-common/vfprintf.c. gcc was outputting an indirect jump without using nacljmp. Passing -dP to gcc reveals which machine description rule is used to generate each instruction. Fixing this was just a matter of changing the *indirect_jump rule.
