Dynamic loading in Native Client
This page has been moved to http://code.google.com/p/nativeclient/wiki/DynamicLoadingOptions
Introduction
Native Client needs to ensure that only validated code can be executed. This requires a mechanism to separate code and data. If we are to extend Native Client to support dynamic loading of code, there are two classes of interface we might provide for doing it:
InterleavedCode: Code can be loaded with a validate-and-mmap operation which allows code and data to be interleaved in address space. This can be implemented using the no-execute (NX) bit, which is supported by some but not all systems. On older x86 systems that do not support NX page protection, we can use a "Harvard architecture" approach in which the x86 code and data segments are disjoint.
ContiguousCode: A contiguous range of address space (typically at the bottom of address space) is reserved for code. Only validated code can be loaded into this range, and nothing can be executed outside this range.
Current scheme
The current sandboxing scheme supports statically-linked executables only. It uses the following address space layout:
0: bottom 64k, unmapped (technically, this is mapped but with no permission bits set)
0x10000: syscall trampolines (trusted code)
0x20000: executable's code segment (untrusted but validated code)
code_top: executable's data segment
The Native Client process is limited so that the only instructions it can execute are below code_top. There are two mechanisms for doing this:
- On x86: via segmentation. The x86 code segment is set to be code_top bytes in size. Jumping outside this segment will cause a segmentation fault. This requires code_top to be page-aligned.
- On other architectures: by requiring masking instructions before indirect jumps. Indirect jump instructions must be preceded by a masking instruction that limits the range of the destination address. This requires code_top to be a power of 2 so that the address range can be restricted with a simple bitmask.
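To illustrate why a power-of-2 code_top allows a single-instruction mask, here is a minimal Python sketch of the arithmetic. The names mask_jump_target and CODE_TOP, and the 256MB value, are assumptions made for illustration only. A real sandbox would emit the mask as a machine instruction and would typically also enforce bundle alignment, which is left out here.

```python
# Hypothetical value: any power of 2 works; 256MB is chosen only for illustration.
CODE_TOP = 0x10000000


def is_power_of_two(n: int) -> bool:
    """True if n has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def mask_jump_target(addr: int, code_top: int = CODE_TOP) -> int:
    """Force an indirect jump target into [0, code_top) with a single AND."""
    if not is_power_of_two(code_top):
        # Without a power of 2, (code_top - 1) is not a contiguous run of 1 bits,
        # so one AND cannot express the range check.
        raise ValueError("code_top must be a power of 2 for a single-AND mask")
    return addr & (code_top - 1)
```

An in-range target passes through unchanged, while an out-of-range target is folded back into the code region rather than rejected.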
Interface 1: ContiguousCode
This interface scheme adds an extra region into address space into which code can be dynamically loaded. This region appears after the executable's code segment, before its data segment, so that the address space layout becomes:
0: bottom 64k, unmapped
0x10000: syscall trampolines (trusted code)
0x20000: executable's code segment (untrusted but validated code)
dynamic_code_start: dynamic code region (untrusted but validated code)
code_top: executable's data segment
Implementation: CC-HLTRewrite
One proposed implementation for this interface works as follows:
- The size of the dynamic code region is specified implicitly by the initially-loaded executable: it is simply the space between the executable's code and data segments.
- The dynamic code region is allocated by sel_ldr on startup and filled with HLT instructions (or the architecture's equivalent of HLT).
- sel_ldr maps the region into its address space twice: once into the NaCl process's address space (as read-execute; the "read view"), and again in a location that the untrusted code cannot access (as read-write; the "write view").
- On Unix this double-mapping is done using POSIX shared memory.
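The double mapping can be sketched in Python as follows. This is only an illustration under several assumptions: it backs the region with a temporary file rather than POSIX shared memory, it maps the read view as read-only rather than read-execute, and the name make_double_mapping and the region size are hypothetical.

```python
import mmap
import tempfile

HLT = 0xF4  # x86 HLT opcode
REGION_SIZE = 2 * mmap.ALLOCATIONGRANULARITY  # tiny region, for illustration only


def make_double_mapping(size: int = REGION_SIZE):
    """Map one backing object twice and return (backing, write_view, read_view)."""
    backing = tempfile.TemporaryFile()
    backing.truncate(size)  # give the backing file its full length
    # Write view: would live where untrusted code cannot reach it.
    write_view = mmap.mmap(backing.fileno(), size, access=mmap.ACCESS_WRITE)
    # Read view: stands in for the read-execute mapping seen by the NaCl process.
    read_view = mmap.mmap(backing.fileno(), size, access=mmap.ACCESS_READ)
    # Fill the whole region with HLTs through the write view.
    write_view[:] = bytes([HLT]) * size
    return backing, write_view, read_view
```

Bytes stored through the write view become visible through the read view, while the read view itself rejects writes.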
When the NaCl process requests loading code dynamically (via a syscall), the runtime does the following:
- Copies the code into a temporary area.
- Runs the validator on the code.
- Checks that the destination region contains HLTs. (Whether this is necessary depends on the strategy for allocating space in the dynamic code region. If we always allocate sequentially, this check is not necessary.)
- Copies the code into the writable view of the dynamic code region, in reverse order: high addresses are copied first, for safety in the presence of multiple threads.
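The four steps above can be sketched as follows. This is a toy model, not sel_ldr's code: validate is a placeholder for the real validator, load_dynamic_code is a hypothetical name, the write view is modelled as a bytearray, and the copy is done byte by byte (a real implementation might copy in larger units).

```python
HLT = 0xF4  # x86 HLT opcode


def validate(code: bytes) -> bool:
    """Placeholder for the NaCl validator (hypothetical stand-in).

    It only rejects empty input; the real validator inspects the instructions.
    """
    return len(code) > 0


def load_dynamic_code(write_view: bytearray, offset: int, untrusted: bytes) -> None:
    """Validate a chunk and copy it over HLTs, highest address first."""
    # 1. Copy into a temporary area so the bytes cannot change after validation.
    temp = bytes(untrusted)
    # 2. Run the validator on the private copy.
    if not validate(temp):
        raise ValueError("code failed validation")
    # 3. Check that the destination lies in the region and still contains HLTs.
    end = offset + len(temp)
    if offset < 0 or end > len(write_view):
        raise ValueError("destination is outside the dynamic code region")
    if any(b != HLT for b in write_view[offset:end]):
        raise ValueError("destination is already in use")
    # 4. Copy in reverse order: high addresses first, low addresses last.
    for i in range(len(temp) - 1, -1, -1):
        write_view[offset + i] = temp[i]
```

Because the lowest byte is written last, a thread that reaches the start of the chunk before the copy completes still finds a HLT there.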
Pros:
- Memory mappings for code do not need to be changed after the NaCl process has been started, so we do not need to worry about race conditions involving the underlying OS's mmap calls.
- Allows small pieces of code to be loaded (<4k, <64k), so this is suitable for loading JITted code.
Cons:
- Writing and executing code concurrently might trigger CPU bugs.
- We could address this by not allowing dynamic loading when there are multiple threads, or by requiring other threads to be in a "parked" state. However, this would not be good for loading JITted code.
- The size of the dynamic code region has to be specified up front, when the executable is loaded; address space has to be traded off between code and data.
- Worse, memory must be allocated for the whole dynamic code region from the underlying OS: all the pages for the dynamic code region must be dirtied in order to fill them with HLT instructions. So, if an executable requests a dynamic code region of 256MB, it will use 256MB worth of memory bandwidth; if this region is unused and is swapped out, it will cause 256MB worth of disk I/O and consume 256MB of swap space.
- This could be a problem for limited-memory devices such as phones or netbooks. It will waste energy as well as memory.
- We could run out of swap space, cause thrashing or (on Linux) trigger the OOM killer if many NaCl processes are running.
- We lose the ability to do useful resource accounting for web sites using NaCl.
- Does not allow memory occupied by libraries to be shared between processes.
Implementation: CC-MProtect
We could fix the memory usage problem of CC-HLTRewrite by using page protections. Instead of filling the entire dynamic code region with HLTs on startup, we initially set the page mappings in the read view to be unreadable (no permission bits set). Whenever we need to allocate a page, we use the write view to fill it with HLTs, and then make the read view mapping readable using mprotect() (or its Windows equivalent, VirtualProtect()). This assumes that the OS allocates pages on demand, when we write the HLTs rather than when we map the pages.
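The effect on memory usage can be illustrated with a toy model in Python. It does not call mprotect(); the class LazyCodeRegion and its methods are hypothetical, and the 64k page size is an assumption. It only shows that pages are dirtied when first allocated rather than all at startup.

```python
HLT = 0xF4           # x86 HLT opcode
PAGE_SIZE = 0x10000  # assumed 64k allocation unit, for illustration


class LazyCodeRegion:
    """Toy model: pages start inaccessible and are HLT-filled on first use."""

    def __init__(self, num_pages: int):
        self.num_pages = num_pages
        self.pages = {}  # page index -> bytearray, present only once committed

    def commit_page(self, index: int) -> bytearray:
        if not 0 <= index < self.num_pages:
            raise IndexError("page is outside the dynamic code region")
        if index not in self.pages:
            # Stands in for: fill the page with HLTs through the write view,
            # then make the read view mapping accessible with mprotect().
            self.pages[index] = bytearray([HLT]) * PAGE_SIZE
        return self.pages[index]

    def committed_bytes(self) -> int:
        """Memory actually dirtied so far."""
        return len(self.pages) * PAGE_SIZE
```

In this model a 256MB region (4096 pages of 64k) costs nothing until pages are committed, and committing the same page twice does not dirty it again.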
Implications for dynamic linking
One of the main points of dynamic linking is that programs don't know in advance what they will be loading, so they don't know how much space to reserve. With the scheme above, programs will want to reserve a large amount of space to be on the safe side. For example, if address space is 1GB, we might reserve a large proportion of that for code, maybe 512MB or 256MB. Otherwise, a process could get into a situation where it can't continue when, say, a required plugin cannot be loaded because insufficient space was reserved up front for code, while plenty of address space remains for data and plenty of memory remains available.
If we knew in advance what libraries we were going to load, we wouldn't need dynamic loading support. Instead we could concatenate our libraries (.so files) and executable into one big executable before running sel_ldr. We could fudge the dynamic linker to find the pre-loaded libraries in memory rather than loading them from the virtual filesystem.
Dynamic library segment layout
ELF dynamic libraries are normally set up so that a library's data segment immediately follows its code segment. (On x86-64 systems there is a ~2MB gap between the code and data segments in order to support systems with a 2MB page size. The resulting address space wastage is not considered significant when you have a 48-bit address space to play with.) This means that code and data are interleaved in address space, i.e.
0x100000: library 1 code
0x180000: library 1 data
0x200000: library 2 code
0x280000: library 2 data
If NaCl's dynamic loading facility does not support interleaved code and data, we would have to change the layout to something like this:
0x00100000: library 1 code
0x00180000: library 2 code
0x10100000: library 1 data
0x10180000: library 2 data
This example assumes that at most 256MB (0x10000000) is set aside for loading code.
Once an ELF shared library is linked to become a .so file, its code and data segments are fixed relative to each other (with a caveat). The ELF shared library is relocatable, but whatever offset the code segment is moved by, the data segment must be moved by too. This is what the ELF Program Headers format assumes, and any dynamic loader will assume. This means the size of the gap between segments gets linked in to the shared library at link time. It is straightforward to specify the gap size by changing the linker script. Shared libraries linked with different segment gap sizes will be difficult to load together (because of the difficulty of allocating address space), so we may want to choose a standard segment gap size, such as 256MB.
This layout wastes some data address space. Suppose library 1's code segment is 192k and its data segment is 64k (after rounding up to a page size of 64k). We will have to set aside 192k of address space for the data segment so that library 2's segments (which are fixed relative to each other) can fit in after library 1's segments. Hence we waste 128k of address space. However, we don't waste memory because nothing needs to be mapped into this space. Furthermore, this is really fragmentation rather than wastage because mmap() could still allocate from these gaps. Note that space in the code region can be similarly wasted if the library's data segment is larger than the code segment, but it is more usual for the code segment to be larger.
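The arithmetic in this example can be checked with a short Python sketch. The function place_libraries and its field names are hypothetical, and the sketch assumes that each library gets one slot sized to the larger of its two (page-rounded) segments, with the same slot offset used in both the code and data regions.

```python
PAGE = 0x10000            # 64k page size, as in the example above
SEGMENT_GAP = 0x10000000  # 256MB standard segment gap


def round_up(n: int, align: int = PAGE) -> int:
    """Round n up to a multiple of align."""
    return (n + align - 1) // align * align


def place_libraries(libs, code_base=0x00100000, gap=SEGMENT_GAP):
    """Place (name, code_size, data_size) tuples with a fixed code-to-data gap."""
    placements = []
    code_addr = code_base
    for name, code_size, data_size in libs:
        code_len = round_up(code_size)
        data_len = round_up(data_size)
        # Both segments move together, so the slot must fit the larger one.
        slot = max(code_len, data_len)
        placements.append({
            "name": name,
            "code": code_addr,
            "data": code_addr + gap,
            "unused_data_space": slot - data_len,
            "unused_code_space": slot - code_len,
        })
        code_addr += slot
    return placements
```

For a library with 192k of code and 64k of data, this reports 128k of unused data address space, matching the figure above.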
There are two ways in which we might defer the choice of segment gap size until after linking the .so:
- Use ld's --emit-relocs option (this is the caveat mentioned above). This tells ld to include all ELF relocations in the output. It may be possible to use this information to rewrite the .so file to move the segments relative to each other. However, it is not clear that this is the purpose of --emit-relocs, and it may be easier to simply re-link the .so file from the original inputs to ld.
- We could use this to change sets of libraries and executables en masse from one segment gap size to another. So if a gap size of 256MB turns out to be too small and we want to load more than 256MB of code, we can rewrite our ELF objects to use a gap size of 300MB without having to rebuild them.
- Extend ELF with a new Program Headers format in which code and data segments are not fixed relative to each other. In this scenario, each dynamic relocation gains an extra flag to say whether it is relative to code or data, and the dynamic linker is extended to understand these relocations.
This means libraries and executables do not have an inbuilt preferred segment gap size. The main advantage of this scheme is that it can avoid the address space wastage mentioned above. The resulting executables would contain a load of TEXTRELs (relocations in the code/text segment). TEXTRELs are usually considered to be bad, because they prevent the memory for the code from being shared between processes. Although NaCl does not currently implement sharing code via mmap, pervasive use of TEXTRELs prevents this sharing from being introduced in the future, so this seems like a step in the wrong direction. Extending ELF would involve a lot of toolchain work to implement a feature that no-one else would use, just to save some address space fragmentation, so implementing this does not seem worthwhile.
Interface 2: InterleavedCode
In this interface scheme, NaCl's mmap() call is extended to provide a verify-and-map-code operation. Code and data may be interleaved in the NaCl process's address space. Code may be mapped with page granularity, which is 64k for compatibility with Windows.
Implementation: IC-NXBit
On systems that support it, this can be implemented using NX page protection.
Architectures:
- x86: recent processors support the NX bit
- x86-64: all systems have NX bit?
- ARM: has an NX bit as of ARMv7, which I believe covers all ARM systems we are likely to want to target
There are other architectures that we may not target ourselves. However, the ability of NaCl to be ported to them may affect community adoption:
- PowerPC: appears to have an NX bit
- MIPS: appears not to have an NX bit
Implementation: IC-HarvardX86
On x86 systems without the NX bit, we implement this using x86 segmentation. The x86 code segment is set up to be disjoint from the x86 data segment. Pages that contain validated code are mapped into the region covered by the x86 code segment. Other pages are mapped into the x86 data segment. As an example, sel_ldr's address space would contain the following after loading the initial executable:
x86 code segment covers this 512MB region: | x86 data segment covers this 512MB region:
code_start + 0: unmapped 64k | data_start + 0: unmapped 64k
code_start + 0x10000: syscall trampolines (trusted code) | data_start + 0x10000: unmapped 64k
code_start + 0x20000: executable's code segment (validated code) | data_start + 0x20000: unmapped (alternatively, the code could be mapped as read-only)
code_start + executable_code_size: unmapped | data_start + executable_code_size: executable data segment
unmapped | heap
unmapped | unmapped
unmapped | stack
(Note that as before, "unmapped" really means "mapped with no permission bits set". The x86 segments are shown side by side to illustrate the correspondence though really one is after the other.)
This is a Harvard architecture-style approach, because any given address in the NaCl process's address space could potentially have two meanings depending on whether it is used for code or data access. In practice we would probably not allow differing code and data pages to be mapped at the same address, because of the potential for confusion that this could cause.
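The two meanings of an address can be sketched in Python. The base addresses CODE_START and DATA_START and the function host_address are hypothetical; in reality the translation is done by the x86 segment base registers, not by software.

```python
SEGMENT_SIZE = 512 * 1024 * 1024  # 512MB per x86 segment, as in the layout above

# Hypothetical base addresses inside sel_ldr's address space,
# chosen so that the data segment directly follows the code segment.
CODE_START = 0x20000000
DATA_START = 0x40000000


def host_address(nacl_addr: int, is_code_access: bool) -> int:
    """Translate a NaCl address as the segment hardware would."""
    if not 0 <= nacl_addr < SEGMENT_SIZE:
        # Models the segment limit check, which would fault on real hardware.
        raise ValueError("address is outside the segment limit")
    base = CODE_START if is_code_access else DATA_START
    return base + nacl_addr
```

The same NaCl address, for example 0x20000, resolves to different locations in sel_ldr's address space depending on whether it is fetched as an instruction or accessed as data.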
x86 code segment size
The layout above halves the amount of address space available to the NaCl process from 1024MB to 512MB. However, we could reduce the address space loss if it were acceptable for address space to be non-uniform.
Suppose Windows lets us allocate 1024MB of sel_ldr's address space. We could allocate 300MB to the x86 code segment and 724MB to the x86 data segment. Hence the NaCl process sees 724MB of address space, but it can map code only into the bottom 300MB. However code is still allowed to be interleaved with data. The remaining 424MB can only contain data. Though the resulting address space is non-uniform, it is still more flexible than the ContiguousCode scheme.
Evaluation
Pros:
- Less need to pre-allocate address space (or actual memory) to trade off expected code and data sizes.
- Closer to existing ABIs.
- More amenable to allowing memory occupied by libraries to be shared between processes.
- Could reduce the need to have the ELF loader as trusted code. Startup could reduce to mmap operations: the invoker could provide a list of segments to map on startup. Parsing the ELF executable could be done in an untrusted JavaScript library.
Cons:
- May require the underlying OS to provide a mechanism to map pages atomically. Linux has mremap(), which overwrites the destination mappings. Windows does not have an equivalent; the original pages must be unmapped first. However, it may be sufficient to have an mprotect() call which can switch executable permission on.
Hybrid interfaces
We could combine the benefits of interleaved code and data with the benefits of loading chunks of code that are smaller than page size. NaCl could provide operations to map some HLT-filled pages with code, and incrementally fill those pages with validated code using the HLT-overwriting technique above.
Deallocation of code
In some situations it will be desirable to be able to unload code (e.g. dlclose()) so that the space can be reused.
Dealing with jumps
Code is loaded in chunks. If chunks are allowed to contain internal, unaligned jumps (i.e. jumps to validated instructions in the middle of instruction bundles), we must ensure that code is unloaded in chunks too, which means we must record which chunks have been loaded. This applies whether we load code by mapping pages or by overwriting HLTs. If all direct jumps are required to be aligned, this saves us the trouble of having to remember chunks.
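Recording loaded chunks could look like the following Python sketch. The class ChunkRegistry is hypothetical; it only models the bookkeeping needed to ensure that code is unloaded in exactly the units in which it was loaded.

```python
class ChunkRegistry:
    """Remembers which chunks were loaded so unloading uses the same units."""

    def __init__(self):
        self._chunks = {}  # start address -> size in bytes

    def record_load(self, start: int, size: int) -> None:
        end = start + size
        for s, n in self._chunks.items():
            # Two half-open ranges overlap if each starts before the other ends.
            if start < s + n and s < end:
                raise ValueError("range overlaps a loaded chunk")
        self._chunks[start] = size

    def unload(self, start: int, size: int) -> None:
        # Partial unloads are refused: a surviving part of the chunk could
        # still contain unaligned jumps into the part being removed.
        if self._chunks.get(start) != size:
            raise ValueError("must unload exactly one previously loaded chunk")
        del self._chunks[start]
```

If all direct jumps are required to be aligned, this bookkeeping is unnecessary and any aligned range can be unloaded.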
Dealing with multiple threads
What happens in the presence of multi-threading? We must deal with the case where other threads are executing the code that we are attempting to unload.
For CC-HLTRewrite: Although it appears to be possible to overwrite HLTs safely, going in the opposite direction -- overwriting multi-byte instructions with HLTs -- probably cannot be made safe. This means we need a way to "park" the other threads. It would be good if the OS provides a way to pause threads; otherwise, the threads must voluntarily perform NaCl syscalls to enter a parked state. If we solve this problem it may help with loading code as well as unloading code.
For InterleavedCode: If we unmap a page (or remove its execute permission), any thread executing code from that page should eventually fault. The problem is that we must find some way to ensure that the thread has had a chance to execute (or else is not currently executing code from that page) before mapping new code into the same location.
