ptrace()-based jailing for Plash
Status: planning
Aim: replace ChrootSetuidJail (which has various limitations) with a ptrace-based process monitor.
The process monitor should disable a set of syscalls and let the others through. We are not trying to interpose on syscalls in general; the only exceptions are a few calls such as fork(). We will still be using PlashGlibc. Interposing on calls such as open() is subject to various race conditions and cannot be done securely using ptrace().
The PTRACE_SYSCALL_MASK extension would make ptrace() efficient for our purposes, but it is not included in the mainline kernel. See Ptrace for general discussion of limitations of ptrace().
Difficulties
execve()
The monitor cannot allow the execve() syscall through because it involves a filename. We may need to implement execve() in userspace using mmap(). See UserModeExec.
Implementing execve() in userspace would be easier if memory mappings could be altered by a second process. User Mode Linux has a patch to the host kernel to provide such a facility. What is its status?
Ostia added an fexecve() call to Linux. It is not clear how this worked, because execve() needs to load both the executable and the dynamic linker, and the latter is named inside the executable.
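To illustrate the point about the dynamic linker: its path is stored in the executable's PT_INTERP program header, so a userspace execve() would have to read that header before mapping anything. The following is a minimal sketch, assuming a 64-bit ELF image in host byte order that has already been read into memory. find_interp is a hypothetical helper name, not part of Plash or UserModeExec.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the NUL-terminated interpreter
 * path inside `image`, or NULL if the image is not a well-formed 64-bit ELF
 * file or has no PT_INTERP segment. Assumes host byte order. */
const char *find_interp(const unsigned char *image, size_t len)
{
    Elf64_Ehdr eh;

    assert(image != NULL);
    if (len < sizeof eh)
        return NULL;
    memcpy(&eh, image, sizeof eh);              /* avoids unaligned reads */
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64 ||
        eh.e_phentsize != sizeof(Elf64_Phdr))
        return NULL;

    for (size_t i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;

        /* Program header i must lie entirely inside the image. */
        if (eh.e_phoff > len || (len - eh.e_phoff) / sizeof ph <= i)
            return NULL;
        memcpy(&ph, image + eh.e_phoff + i * sizeof ph, sizeof ph);
        if (ph.p_type != PT_INTERP)
            continue;

        /* The path must be in bounds and NUL-terminated. */
        if (ph.p_filesz == 0 || ph.p_offset > len ||
            ph.p_filesz > len - ph.p_offset)
            return NULL;
        if (image[ph.p_offset + ph.p_filesz - 1] != '\0')
            return NULL;
        return (const char *)(image + ph.p_offset);
    }
    return NULL;                                /* statically linked */
}
```

A NULL result for a valid image would mean the executable is statically linked, in which case only the executable itself would need to be mapped.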
Calls using process IDs
Calls such as wait() and kill() will have to be proxied via a trusted server. We will have to implement a process ID namespace in user space and keep track of which processes are part of the sandbox. wait() is the most important call to implement.
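A user-space process ID namespace could be as simple as a table mapping small virtual PIDs to real kernel PIDs, with the trusted server refusing any PID that is not in the table. This is only a sketch under assumptions: the names (pid_table and its functions) and the fixed table size are hypothetical, and slots are reused immediately, which a real implementation would probably want to delay.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define MAX_PROCS 64   /* arbitrary limit for the sketch */

/* real[i] holds the kernel PID for virtual PID i+1, or 0 if the slot is free. */
struct pid_table {
    pid_t real[MAX_PROCS];
};

void pid_table_init(struct pid_table *t)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_PROCS; i++)
        t->real[i] = 0;
}

/* Register a sandboxed process; returns its virtual PID, or -1 if full. */
int pid_table_add(struct pid_table *t, pid_t real_pid)
{
    assert(t != NULL && real_pid > 0);
    for (int i = 0; i < MAX_PROCS; i++) {
        if (t->real[i] == 0) {
            t->real[i] = real_pid;
            return i + 1;
        }
    }
    return -1;
}

/* Virtual to real. Returns -1 for PIDs outside the sandbox, so a proxied
 * kill() can fail (for example with ESRCH) instead of reaching the kernel. */
pid_t pid_table_lookup(const struct pid_table *t, int vpid)
{
    if (vpid < 1 || vpid > MAX_PROCS || t->real[vpid - 1] == 0)
        return -1;
    return t->real[vpid - 1];
}

/* Real to virtual, e.g. for translating the result of a proxied wait(). */
int pid_table_reverse(const struct pid_table *t, pid_t real_pid)
{
    if (real_pid <= 0)
        return -1;
    for (int i = 0; i < MAX_PROCS; i++)
        if (t->real[i] == real_pid)
            return i + 1;
    return -1;
}

void pid_table_remove(struct pid_table *t, int vpid)
{
    if (vpid >= 1 && vpid <= MAX_PROCS)
        t->real[vpid - 1] = 0;
}
```

Because sandboxed code only ever sees virtual PIDs, it cannot name a process outside the table, which is the property the proxying is meant to provide.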
Delivery of signals such as SIGCHLD would have to be simulated, and this needs to work with select()/pselect().
- Signals can be sent by the tracer process. When a process is created, it can be given a capability for sending signals to itself. Does being ptraced affect the signal delivery mechanism?
Would we implement job control and process groups?
Costs
There will be an extra process per sandbox, assuming the monitor process is kept separate from the existing server process.
- Not necessarily: one ptracer can manage multiple processes and multiple sandboxes. There will be no explicit notion of a sandbox.
Limitations
A process cannot be ptraced by multiple processes, so strace and gdb would not work inside the ptrace jail.
System calls to allow
- dup, dup2, close
- read, write, send/sendto/sendmsg, recv/recvfrom/recvmsg, pread/pwrite, readv/writev, sendfile, splice
- pipe, socketpair
- select, poll
- nanosleep, setitimer, gettimeofday, time, times (unless you want to deny access to timers)
- fstat, ftruncate
- fcntl
- getsockopt, setsockopt, shutdown
- flock
- ioctl (maybe)
- fsync, fdatasync, sync (maybe)
- sgetmask, ssetmask (signal)
- mmap, munmap, mprotect, mremap, madvise, brk (sbrk is a library wrapper around brk, not a syscall)
- mlock, mlockall, munlock, munlockall
- uname (but very easy to proxy)
Requiring special handling:
- clone, fork
- execve
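For clone and fork, the special handling mostly means noticing that a child was created so it can be traced too. Assuming the tracer sets PTRACE_O_TRACESYSGOOD together with PTRACE_O_TRACEFORK, PTRACE_O_TRACEVFORK and PTRACE_O_TRACECLONE, the kind of stop can be read from the waitpid() status alone. A sketch of that decoding follows; the enum and function names are made up for illustration.

```c
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

enum stop_kind {
    STOP_NOT_STOPPED,   /* tracee exited or was killed */
    STOP_SYSCALL,       /* syscall entry or exit */
    STOP_NEW_CHILD,     /* fork, vfork or clone created a process */
    STOP_SIGNAL         /* an ordinary signal is pending for the tracee */
};

/* The ptrace event code lives above the low 16 bits of a stop status. */
static int stop_event(int status)
{
    assert(WIFSTOPPED(status));
    return (status >> 16) & 0xffff;
}

enum stop_kind classify_stop(int status)
{
    if (!WIFSTOPPED(status))
        return STOP_NOT_STOPPED;

    /* PTRACE_O_TRACESYSGOOD marks syscall stops with bit 7 set. */
    if (WSTOPSIG(status) == (SIGTRAP | 0x80))
        return STOP_SYSCALL;

    if (WSTOPSIG(status) == SIGTRAP) {
        switch (stop_event(status)) {
        case PTRACE_EVENT_FORK:
        case PTRACE_EVENT_VFORK:
        case PTRACE_EVENT_CLONE:
            return STOP_NEW_CHILD;
        default:
            break;
        }
    }
    return STOP_SIGNAL;
}
```

On STOP_NEW_CHILD the tracer would normally fetch the new PID with PTRACE_GETEVENTMSG and add it to whatever bookkeeping it keeps for the sandbox.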
Not sure:
- vm86
See list of all Linux syscalls
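The lists above amount to a three-way, default-deny policy. One possible way to express it is a switch over syscall numbers. This is a sketch only: it covers a handful of the calls listed, the names are hypothetical, and the #ifdef guards are there because some numbers (dup2, fork, and on some architectures mmap and fstat) are not defined everywhere.

```c
#include <assert.h>
#include <sys/syscall.h>

enum verdict { DENY, ALLOW, SPECIAL };

/* Anything not named explicitly is denied. */
enum verdict syscall_verdict(long nr)
{
    switch (nr) {
    case SYS_read:
    case SYS_write:
    case SYS_close:
    case SYS_dup:
    case SYS_munmap:
    case SYS_mprotect:
    case SYS_brk:
#ifdef SYS_dup2
    case SYS_dup2:
#endif
#ifdef SYS_mmap
    case SYS_mmap:
#endif
#ifdef SYS_fstat
    case SYS_fstat:
#endif
        return ALLOW;

    case SYS_clone:
    case SYS_execve:
#ifdef SYS_fork
    case SYS_fork:
#endif
        return SPECIAL;     /* needs handling by the monitor */

    default:
        return DENY;
    }
}

/* Startup sanity check: calls that take a filename must never be
 * passed straight through. */
void policy_self_check(void)
{
    assert(syscall_verdict(SYS_execve) != ALLOW);
    assert(syscall_verdict(SYS_openat) != ALLOW);
}
```

Default-deny means a syscall added to the kernel later is blocked until someone decides it belongs on the list.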
Alternatives
Linux seccomp patch: This leaves too few syscalls for our requirements: only read(), write(), exit() and sigreturn(). We need sendmsg()/recvmsg() and mmap(), among others.
lcall
At one point there was a (now-obsolete) system call mechanism called lcall (an alternative to the "int 0x80" syscall mechanism), which was not intercepted by ptrace. Apparently this was fixed by the User Mode Linux project.
See:
http://user-mode-linux.sourceforge.net/slides/ists2002/img11.htm
http://www.eros-os.org/pipermail/e-lang/2004-July/009885.html
Tasks
- Write ptracer that blocks a set of system calls
- Add support for following fork()/clone() and trace the new process
- Hook up UserModeExec in PlashGlibc
- Handling process tree:
  - Implement a process set object; processes are added to set by ptracer on fork()
  - Reimplement waitpid() in glibc by invoking the process set object
  - Add to ptracer the ability to send signals to processes
  - When a process exits, send SIGCHLD to its parent process
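The process set object from the task list might look roughly like the following. It is a sketch under assumptions, not a design decision: all names are hypothetical, the table is fixed-size, and pset_wait() behaves like waitpid(-1, &status, WNOHANG) restricted to the sandbox. The ptracer would call pset_add() on fork() and pset_exit() when a process exits, then send SIGCHLD to the PID that pset_exit() returns.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

#define PSET_MAX 64   /* arbitrary limit for the sketch */

enum pstate { P_FREE = 0, P_RUNNING, P_ZOMBIE };

struct pentry {
    pid_t pid;
    pid_t parent;       /* 0 means the parent is outside the set */
    enum pstate state;
    int status;         /* exit status, valid once state is P_ZOMBIE */
};

struct pset {
    struct pentry e[PSET_MAX];
};

void pset_init(struct pset *s)
{
    assert(s != NULL);
    memset(s, 0, sizeof *s);    /* every slot becomes P_FREE */
}

/* Called by the ptracer on fork(). Returns 0, or -1 if the set is full. */
int pset_add(struct pset *s, pid_t pid, pid_t parent)
{
    for (int i = 0; i < PSET_MAX; i++) {
        if (s->e[i].state == P_FREE) {
            s->e[i] = (struct pentry){ pid, parent, P_RUNNING, 0 };
            return 0;
        }
    }
    return -1;
}

/* Called when a process exits. Returns the parent to send SIGCHLD to,
 * or -1 if the process is not in the set. */
pid_t pset_exit(struct pset *s, pid_t pid, int status)
{
    for (int i = 0; i < PSET_MAX; i++) {
        struct pentry *p = &s->e[i];
        if (p->state == P_RUNNING && p->pid == pid) {
            p->state = P_ZOMBIE;
            p->status = status;
            return p->parent;
        }
    }
    return -1;
}

/* Reap one exited child of `parent`. Returns its PID, 0 if children exist
 * but none has exited yet, or -1 if there are no children. */
pid_t pset_wait(struct pset *s, pid_t parent, int *status)
{
    int have_children = 0;

    for (int i = 0; i < PSET_MAX; i++) {
        struct pentry *p = &s->e[i];
        if (p->state == P_FREE || p->parent != parent)
            continue;
        have_children = 1;
        if (p->state == P_ZOMBIE) {
            pid_t pid = p->pid;
            if (status != NULL)
                *status = p->status;
            p->state = P_FREE;
            return pid;
        }
    }
    return have_children ? 0 : -1;
}
```

A blocking waitpid() could be built on this by having the glibc side wait for a notification from the server and retry when pset_wait() returns 0.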
