Potentially useful Linux kernel changes
This is a list of changes to the Linux kernel that could be useful for Plash.
See also LinuxKernelChanges.
fconnectat()
A symlink-race-free variant of connect() is needed to fix PlashIssues/ConnectRaceCondition.
System call mask
Allow system calls to be disabled selectively. This could replace ChrootSetuidJail. Jeff Dike implemented PTRACE_SYSCALL_MASK for ptrace() but this has not made it into the mainline kernel.
Allow send() on pipes
write() on a pipe blocks when the pipe's buffer is too full to accept all the data. In an event-loop-based process, the more useful behaviour is to write whatever data will fit and return. With sockets, such a non-blocking operation is available by calling send() with the MSG_DONTWAIT flag. However, send() fails with ENOTSOCK on pipes (despite the man page stating that "With zero flags parameter, send() is equivalent to write()"). It is possible to put a pipe into non-blocking mode by using fcntl() with F_SETFL to set O_NONBLOCK. However, if the file descriptor is shared between processes, there is a race condition: another process could unset O_NONBLOCK between the fcntl() call and the write(), so that the write() blocks and stalls the event loop, causing a denial of service.
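The difference between pipes and sockets described above can be observed with a small experiment. This is only a sketch: the helper names send_dontwait() and write_what_fits() are made up for illustration, and it assumes a Linux system where ctypes can reach libc's send().

```python
import ctypes
import errno
import fcntl
import os
import socket

# Call libc's send() directly so the kernel's answer is visible unmodified.
libc = ctypes.CDLL(None, use_errno=True)
libc.send.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_size_t, ctypes.c_int]
libc.send.restype = ctypes.c_ssize_t


def send_dontwait(fd, data):
    """Try send(fd, data, MSG_DONTWAIT); return (result, errno)."""
    ctypes.set_errno(0)
    n = libc.send(fd, data, len(data), socket.MSG_DONTWAIT)
    return n, ctypes.get_errno()


def write_what_fits(fd, data):
    """Fallback for pipes: set O_NONBLOCK, then write as much as fits."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
    try:
        return os.write(fd, data)
    except BlockingIOError:
        return 0  # the buffer was already full


pipe_r, pipe_w = os.pipe()
sock_a, sock_b = socket.socketpair()

pipe_result = send_dontwait(pipe_w, b"x")             # rejected on a pipe
sock_result = send_dontwait(sock_a.fileno(), b"x")    # accepted on a socket
written = write_what_fits(pipe_w, b"\0" * (1 << 21))  # partial write, no blocking
print(pipe_result, sock_result, written)
```

On a pipe the send() call is expected to fail with ENOTSOCK, while the O_NONBLOCK fallback writes only part of the 2 MiB buffer. The fallback works here only because nothing else is changing the flags on the same pipe.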
Denys Vlasenko has proposed changing send() to work on pipes and other non-socket FDs: "O_NONBLOCK is broken". As of 2007/12/01, this has not been changed in the kernel.
(Note that that discussion refers to "file descriptions", meaning the part of an FD that may be shared between processes. That term is not widely used, so here I use "file descriptor" to refer to the object that can appear in multiple processes' file descriptor tables.)
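The sharing that makes O_NONBLOCK unsafe can be shown within a single process, using dup() to stand in for the copy of the FD that another process would hold. This is a minimal sketch; is_nonblocking() and set_nonblocking() are illustrative helper names, not part of any existing API.

```python
import fcntl
import os


def is_nonblocking(fd):
    """Report whether O_NONBLOCK is currently set, as seen through fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFL) & os.O_NONBLOCK)


def set_nonblocking(fd, enabled):
    """Set or clear O_NONBLOCK through fd using F_GETFL/F_SETFL."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    if enabled:
        flags |= os.O_NONBLOCK
    else:
        flags &= ~os.O_NONBLOCK
    fcntl.fcntl(fd, fcntl.F_SETFL, flags)


r, w = os.pipe()
w_copy = os.dup(w)  # stands in for the same FD held by another process

set_nonblocking(w, True)
seen_via_copy = is_nonblocking(w_copy)  # the copy sees the flag too

set_nonblocking(w_copy, False)          # the "other process" clears it
still_nonblocking = is_nonblocking(w)   # and the original loses it
print(seen_via_copy, still_nonblocking)
```

Because the flag belongs to the shared object rather than to an entry in one process's descriptor table, checking the flag just before write() does not help: it can be cleared again between the check and the write.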
See EventLoopAndFDs.
An fexecve() syscall
Provide an fexecve() syscall, like execve() but with the executable specified by an FD rather than a filename. The difficulty of handling execve() is the main obstacle to implementing a jail based on system call blocking such as PtraceJail.
Ostia implements an fexecve() syscall, but the paper doesn't specify details, such as how the ELF interpreter (usually ld.so) is handled. The ELF interpreter should probably be passed in as another argument as a file descriptor.
glibc provides an fexecve() library call, but it is implemented by calling execve() on a path under /proc, so it would not be usable in a PtraceJail.
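The /proc-based approach can be imitated as follows, which shows why it does not help a syscall-blocking jail: the kernel is still handed a filename. This is a sketch only. run_by_fd() is a hypothetical helper, and it assumes /proc is mounted and that an echo binary can be found on PATH.

```python
import os
import shutil
import subprocess


def run_by_fd(fd, argv):
    """Run the executable open on fd by exec'ing its /proc/self/fd name."""
    return subprocess.run(
        argv,
        executable=f"/proc/self/fd/{fd}",  # execve() still receives a filename
        pass_fds=(fd,),                    # keep fd open in the child
        capture_output=True,
        check=False,
    )


echo_path = shutil.which("echo")  # assumption: an echo binary is on PATH
fd = os.open(echo_path, os.O_RDONLY)
try:
    result = run_by_fd(fd, ["echo", "hello"])
finally:
    os.close(fd)
print(result.returncode, result.stdout)
```

The executable is chosen by FD from the caller's point of view, but the underlying execve() call names a path, so a jail that blocks filename-based execve() would block this as well.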
Longer term: more FD types
Message-based IPC, instead of the stream-based IPC that Unix domain sockets provide. This could provide an invocation model similar to KeyKOS/EROS or Coyotos. This would replace the PlashObjectCapabilityProtocol.
- Removes the need for a bus process to forward messages, which could give better performance. Removes the need to worry about how bus processes are organised.
- Simplifies user space code: there would no longer be a need to worry about how ld.so, libc.so and other code occupying the same process share a connection. This problem appears to be the main obstacle to virtualising access at the kernel ABI level the way Ostia does (see InterceptingSystemCalls): a syscall-wrapper library would need to be thread-safe for threads to share a connection, and that is hard to do without depending on libc and libpthread.
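To illustrate the kind of work a stream-based connection pushes into user space, here is a minimal length-prefixed framing layer over a stream socket. It is a generic sketch and not the wire format of the PlashObjectCapabilityProtocol. The function names and the 4-byte header are assumptions made for the example.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # assumed framing: 4-byte big-endian length


def send_message(sock, payload):
    """Send one message as a length header followed by the payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exactly(sock, n):
    """Read exactly n bytes, since a stream may deliver them in pieces."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    """Receive one message framed by send_message()."""
    (length,) = HEADER.unpack(recv_exactly(sock, HEADER.size))
    return recv_exactly(sock, length)


a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
send_message(a, b"first call")
send_message(a, b"second call")
first = recv_message(b)
second = recv_message(b)
print(first, second)
```

Every piece of code sharing the connection has to agree on this framing, and two threads calling recv_message() on the same socket at once would interleave their reads. A message-based FD type would let the kernel handle both concerns.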
Provide FDs for accessing the internals of processes, such as
- memory maps
- file descriptor tables
