Dynamic loading in Native Client

This page has been moved to http://code.google.com/p/nativeclient/wiki/DynamicLoadingOptions

Introduction

Native Client needs to ensure that only validated code can be executed. This requires a mechanism to separate code and data. If we are to extend Native Client to support dynamic loading of code, there are two classes of interface we might provide for doing it:

Current scheme

The current sandboxing scheme supports statically-linked executables only. It uses the following address space layout:

The Native Client process is limited so that the only instructions it can execute are below code_top. There are two mechanisms for doing this:

Interface 1: ContiguousCode

This interface scheme adds an extra region to the address space, into which code can be dynamically loaded. This region appears after the executable's code segment and before its data segment, so that the address space layout becomes:

Implementation: CC-HLTRewrite

One proposed implementation for this interface works as follows:

Pros:

Cons:

Implementation: CC-MProtect

We could fix the memory usage problem of CC-HLTRewrite by using page protections. Instead of filling the entire dynamic code region with HLTs on startup, we initially set the page mappings in the read view to be inaccessible (no permission bits set). Whenever we need to allocate a page, we use the write view to fill it with HLTs, and then make the read view mapping readable using mprotect() (or VirtualProtect() on Windows). This assumes that the OS allocates pages on demand, when we write the HLTs, rather than when we map the pages.
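The following is a minimal POSIX sketch of the idea, not sel_ldr's actual code. The names code_region, code_region_init and code_region_alloc are hypothetical, and a temporary file stands in for whatever shared memory object would really back the two views. The read view is only made readable, as described above; a real implementation would also have to decide when to grant execute permission.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HLT_OPCODE 0xf4 /* one-byte x86 HLT instruction */

/* Hypothetical bookkeeping for the dynamic code region. */
struct code_region {
    uint8_t *write_view; /* the trusted runtime writes through this mapping */
    uint8_t *read_view;  /* the mapping seen by the sandboxed process */
    size_t size;
};

/* Reserve the region. The read view starts with no permission bits set. */
int code_region_init(struct code_region *r, size_t size) {
    /* Assumption: a temporary file backs both views of the same pages. */
    FILE *backing = tmpfile();
    if (backing == NULL)
        return -1;
    int fd = fileno(backing);
    if (ftruncate(fd, (off_t)size) != 0) {
        fclose(backing);
        return -1;
    }
    void *wv = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *rv = mmap(NULL, size, PROT_NONE, MAP_SHARED, fd, 0);
    fclose(backing); /* the mappings keep the underlying file alive */
    if (wv == MAP_FAILED || rv == MAP_FAILED) {
        if (wv != MAP_FAILED)
            munmap(wv, size);
        if (rv != MAP_FAILED)
            munmap(rv, size);
        return -1;
    }
    r->write_view = (uint8_t *)wv;
    r->read_view = (uint8_t *)rv;
    r->size = size;
    return 0;
}

/* Allocate pages on demand: fill them with HLTs through the write view,
 * then make the same range readable in the read view.
 * offset must be page-aligned, otherwise mprotect() fails. */
int code_region_alloc(struct code_region *r, size_t offset, size_t len) {
    if (offset > r->size || len > r->size - offset)
        return -1;
    memset(r->write_view + offset, HLT_OPCODE, len);
    return mprotect(r->read_view + offset, len, PROT_READ);
}
```

Pages that are never passed to code_region_alloc() are never written, so under the on-demand allocation assumption above they should not consume memory.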

Implications for dynamic linking

One of the main points of dynamic linking is that programs don't know in advance what they will be loading, so they don't know how much space to reserve. With the scheme above, programs will want to reserve a large amount of space to be on the safe side. For example, if the address space is 1GB, we might reserve a large proportion of that for code, maybe 512MB or 256MB. Otherwise, a process could get into a situation where it can't continue when, say, a required plugin cannot be loaded because insufficient space was reserved up front for code, while plenty of address space remains for data and plenty of memory remains available.

If we knew in advance what libraries we were going to load, we wouldn't need dynamic loading support. Instead we could concatenate our libraries (.so files) and executable into one big executable before running sel_ldr. We could fudge the dynamic linker to find the pre-loaded libraries in memory rather than loading them from the virtual filesystem.

Dynamic library segment layout

ELF dynamic libraries are normally set up so that a library's data segment immediately follows its code segment. (On x86-64 systems there is a ~2MB gap between the code and data segments in order to support systems with a 2MB page size. The resulting address space wastage is not considered significant when you have a 48-bit address space to play with.) This means that code and data are interleaved in address space, i.e.

If NaCl's dynamic loading facility does not support interleaved code and data, we would have to change the layout to something like this:

This example assumes that at most 256MB (0x10000000) is set aside for loading code.

Once an ELF shared library is linked to become a .so file, the positions of its code and data segments are fixed relative to each other (with a caveat). The ELF shared library is relocatable, but whatever offset the code segment is moved by, the data segment must be moved by the same offset. This is what the ELF Program Headers format assumes, and what any dynamic loader will assume. This means the size of the gap between segments is fixed in the shared library at link time. It is straightforward to specify the gap size by changing the linker script. Shared libraries linked with different segment gap sizes will be difficult to load together (because of the difficulty of allocating address space), so we may want to choose a standard segment gap size, such as 256MB.

This layout wastes some data address space. Suppose library 1's code segment is 192k and its data segment is 64k (after rounding up to a page size of 64k). We will have to set aside 192k of address space for the data segment so that library 2's segments (which are fixed relative to each other) can fit in after library 1's segments. Hence we waste 128k of address space. However, we don't waste memory because nothing needs to be mapped into this space. Furthermore, this is really fragmentation rather than wastage because mmap() could still allocate from these gaps. Note that space in the code region can be similarly wasted if the library's data segment is larger than the code segment, but it is more usual for the code segment to be larger.
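The arithmetic above can be sketched as follows. This is only an illustration under the assumptions already stated in the text (64k pages, a standard 256MB segment gap); the names lib_slot and place_library are hypothetical and do not correspond to any existing loader interface.

```c
#include <stddef.h>

#define NACL_PAGE_SIZE 0x10000u    /* 64k page granularity */
#define SEGMENT_GAP    0x10000000u /* assumed standard 256MB segment gap */

/* Round a size up to a multiple of the 64k page size. */
static size_t round_up_page(size_t size) {
    return (size + NACL_PAGE_SIZE - 1) & ~(size_t)(NACL_PAGE_SIZE - 1);
}

/* Hypothetical result of placing one library. */
struct lib_slot {
    size_t code_addr;   /* where the code segment goes */
    size_t data_addr;   /* always code_addr + SEGMENT_GAP */
    size_t next_offset; /* offset at which the next library can be placed */
    size_t wasted_data; /* data address space left unused in this slot */
    size_t wasted_code; /* code address space left unused in this slot */
};

/* Place a library at the given offset from code_base. Because the two
 * segments are fixed relative to each other, the slot must be as large
 * as the larger of the two segments. */
struct lib_slot place_library(size_t code_base, size_t offset,
                              size_t code_size, size_t data_size) {
    size_t code = round_up_page(code_size);
    size_t data = round_up_page(data_size);
    size_t slot = code > data ? code : data;
    struct lib_slot s;
    s.code_addr = code_base + offset;
    s.data_addr = s.code_addr + SEGMENT_GAP;
    s.next_offset = offset + slot;
    s.wasted_data = slot - data;
    s.wasted_code = slot - code;
    return s;
}
```

For library 1 in the example (192k of code, 64k of data), place_library() reports a 192k slot with wasted_data equal to 128k and wasted_code equal to 0.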

There are two ways in which we might defer the choice of segment gap size until after linking the .so:

Interface 2: InterleavedCode

In this interface scheme, NaCl's mmap() call is extended to provide a verify-and-map-code operation. Code and data may be interleaved in the NaCl process's address space. Code may be mapped with page granularity, which is 64k for compatibility with Windows.

Implementation: IC-NXBit

On systems that support it, this can be implemented using NX page protection.

Architectures:

There are other architectures that we may not target ourselves. However, the ability of NaCl to be ported to them may affect community adoption:

(sources: Gentoo, PaX)

Implementation: IC-HarvardX86

On x86 systems without the NX bit, we implement this using x86 segmentation. The x86 code segment is set up to be disjoint from the x86 data segment. Pages that contain validated code are mapped into the region covered by the x86 code segment. Other pages are mapped into the x86 data segment. As an example, sel_ldr's address space would contain the following after loading the initial executable:

x86 code segment covers this 512MB region | x86 data segment covers this 512MB region
code_start + 0: unmapped 64k | data_start + 0: unmapped 64k
code_start + 0x10000: syscall trampolines (trusted code) | data_start + 0x10000: unmapped 64k
code_start + 0x20000: executable's code segment (validated code) | data_start + 0x20000: unmapped (alternatively, the code could be mapped as read-only)
code_start + executable_code_size: unmapped | data_start + executable_code_size: executable data segment
unmapped | heap
unmapped | unmapped
unmapped | stack

(Note that as before, "unmapped" really means "mapped with no permission bits set". The x86 segments are shown side by side to illustrate the correspondence though really one is after the other.)

This is a Harvard architecture-style approach, because any given address in the NaCl process's address space could potentially have two meanings depending on whether it is used for code or data access. In practice we would probably not allow differing code and data pages to be mapped at the same address, because of the potential for confusion that this could cause.
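To make the "two meanings" concrete, the sketch below models what x86 segmentation does with an address used by the NaCl process: it is checked against the segment size and added to the segment base. The struct and function names are hypothetical, and any base addresses passed in are placeholders rather than values sel_ldr actually uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of an x86 segment: a base and a size in bytes. */
struct x86_segment {
    uint32_t base;
    uint32_t size;
};

/* Translate an address as seen by the NaCl process into a linear address
 * in sel_ldr's address space. Returns false if the address lies outside
 * the segment, which on real hardware would fault. */
bool segment_translate(const struct x86_segment *seg, uint32_t nacl_addr,
                       uint32_t *linear_out) {
    if (nacl_addr >= seg->size)
        return false;
    *linear_out = seg->base + nacl_addr;
    return true;
}
```

With a code segment and a data segment that have different bases, the same nacl_addr translates to two different linear addresses, one for instruction fetches and one for data accesses.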

x86 code segment size

The layout above halves the amount of address space available to the NaCl process from 1024MB to 512MB. However, we could reduce the address space loss if it were acceptable for address space to be non-uniform.

Suppose Windows lets us allocate 1024MB of sel_ldr's address space. We could allocate 300MB to the x86 code segment and 724MB to the x86 data segment. Hence the NaCl process sees 724MB of address space, but it can map code only into the bottom 300MB. Within that 300MB, code can still be interleaved with data. The remaining 424MB can only contain data. Though the resulting address space is non-uniform, it is still more flexible than the ContiguousCode scheme.
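A mapping request under this non-uniform layout could be checked roughly as follows. This is a sketch using the 300MB/724MB split from the example and the 64k mapping granularity mentioned earlier; can_map_code and can_map_data are hypothetical helper names, not part of any NaCl interface.

```c
#include <stdbool.h>
#include <stdint.h>

#define MB (1024u * 1024u)
#define NACL_PAGE_SIZE 0x10000u        /* 64k mapping granularity */
#define CODE_SEGMENT_SIZE (300u * MB)  /* example split from the text */
#define DATA_SEGMENT_SIZE (724u * MB)

/* True if [addr, addr + len) is 64k-aligned, non-empty and below limit. */
static bool range_ok(uint32_t addr, uint32_t len, uint32_t limit) {
    if (len == 0 || addr % NACL_PAGE_SIZE != 0 || len % NACL_PAGE_SIZE != 0)
        return false;
    return addr <= limit && len <= limit - addr;
}

/* Code may only be mapped into the bottom 300MB. */
bool can_map_code(uint32_t addr, uint32_t len) {
    return range_ok(addr, len, CODE_SEGMENT_SIZE);
}

/* Data may be mapped anywhere in the 724MB the process can see. */
bool can_map_data(uint32_t addr, uint32_t len) {
    return range_ok(addr, len, DATA_SEGMENT_SIZE);
}
```

The comparison is written as len <= limit - addr rather than addr + len <= limit so that a large request cannot wrap around in 32-bit arithmetic.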

Evaluation

Pros:

Cons:

Hybrid interfaces

We could combine the benefits of interleaved code and data with the benefits of loading chunks of code that are smaller than the page size. NaCl could provide operations to map HLT-filled pages as code, and then incrementally fill those pages with validated code using the HLT-overwriting technique above.

Deallocation of code

In some situations it will be desirable to be able to unload code (e.g. dlclose()) so that the space can be reused.

Dealing with jumps

Code is loaded in chunks. If chunks are allowed to contain internal, unaligned jumps (i.e. jumps to validated instructions in the middle of instruction bundles), we must ensure that code is unloaded in chunks too, which means we must record which chunks have been loaded. This applies whether we load code by mapping pages or by overwriting HLTs. If all direct jumps are required to be aligned, this saves us the trouble of having to remember chunks.
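One way to remember chunks is a simple table of (start, length) records that only permits unloading a range that exactly matches a previously loaded chunk. The sketch below is an assumption about how such bookkeeping might look, with hypothetical names and an arbitrary fixed table size; it does not check for overlapping chunks and does not touch the code pages themselves.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHUNKS 64 /* arbitrary limit for this sketch */

struct chunk {
    uint32_t start;
    uint32_t len;
};

struct chunk_table {
    struct chunk chunks[MAX_CHUNKS];
    size_t count;
};

/* Record a chunk at load time. Returns false if the table is full. */
bool chunk_record(struct chunk_table *t, uint32_t start, uint32_t len) {
    if (t->count == MAX_CHUNKS)
        return false;
    t->chunks[t->count].start = start;
    t->chunks[t->count].len = len;
    t->count++;
    return true;
}

/* Forget a chunk at unload time. Succeeds only if the range is exactly
 * one recorded chunk, so that no internal jump target is left dangling
 * by a partial unload. */
bool chunk_unload(struct chunk_table *t, uint32_t start, uint32_t len) {
    for (size_t i = 0; i < t->count; i++) {
        if (t->chunks[i].start == start && t->chunks[i].len == len) {
            t->chunks[i] = t->chunks[t->count - 1]; /* swap-remove */
            t->count--;
            return true;
        }
    }
    return false;
}
```

If all direct jumps were required to be aligned, this table would be unnecessary, which is the saving mentioned above.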

Dealing with multiple threads

What happens in the presence of multi-threading? We must deal with the case where other threads are executing the code that we are attempting to unload.

NativeClient/DynamicLoading (last edited 2009-12-17 12:22:54 by MarkSeaborn)