From POSIX to object-capabilities and back again
This is a sketch for a paper.
- Short description of Plash. Comparison with Ostia and Ostia's policy mechanism.
The object interface to files/directories (FsObj interface).
- An important feature is that you can't get a directory's parent. traverse()/get_child() returns an object with less authority than the parent.
- The other major feature is that it never follows symlinks for you, so it should be symlink-safe.
- What kind of abstractions does this enable?
Introduce some proxy objects: FsObjReadOnly, FsObjMountTable etc.
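A minimal sketch of the idea, in Python rather than Plash's own implementation language. The class and method names here (Dir, File, Symlink, ReadOnly, AccessDenied, create_file) are made up for illustration and only loosely mirror FsObj and FsObjReadOnly; the real interface may differ. It shows a directory object with no parent link, a get_child() that hands back symlinks instead of following them, and a read-only proxy that wraps every child it returns.

```python
class AccessDenied(Exception):
    """Raised by the read-only proxy when a mutating method is called."""


class File:
    def __init__(self, data=b""):
        self.data = data

    def read(self):
        return self.data

    def write(self, data):
        self.data = data


class Symlink:
    def __init__(self, target):
        self.target = target

    def readlink(self):
        return self.target


class Dir:
    """Directory object. It knows its children but holds no parent reference."""

    def __init__(self, children=None):
        self._children = dict(children or {})

    def get_child(self, name):
        # Single components only, so ".." cannot be used to climb upwards.
        if "/" in name or name in ("", ".", ".."):
            raise ValueError("not a single component: %r" % (name,))
        # A symlink is returned as an object; it is never followed here.
        return self._children.get(name)

    def list(self):
        return sorted(self._children)

    def create_file(self, name, data=b""):
        self._children[name] = File(data)
        return self._children[name]


class ReadOnly:
    """Proxy exposing only the non-mutating methods of the wrapped object."""

    def __init__(self, obj):
        self._obj = obj

    def get_child(self, name):
        child = self._obj.get_child(name)
        # Wrap children too, so the restriction holds for the whole subtree.
        return None if child is None else ReadOnly(child)

    def list(self):
        return self._obj.list()

    def read(self):
        return self._obj.read()

    def readlink(self):
        return self._obj.readlink()

    def write(self, data):
        raise AccessDenied("read-only")

    def create_file(self, name, data=b""):
        raise AccessDenied("read-only")
```

Python does not enforce the encapsulation (the wrapped object is reachable through _obj), so this only shows the shape of the abstraction. In a real system the proxy would sit behind a process or protection boundary.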
- Implementing object interface on top of POSIX: problems of race conditions
- Using directory FDs. Can use /proc or the *at() calls on newer glibc/Linux versions. On older systems, or where /proc is not available, have to use fchdir().
- connect() on Unix domain sockets
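The directory-FD approach could look roughly like the following sketch, which uses Python's os.open() with dir_fd (an openat() wrapper) and O_NOFOLLOW. It assumes Linux, where opening a symlink with O_NOFOLLOW fails with ELOOP; other systems report a different errno. The function name get_child and its tuple return value are illustrative, not Plash's actual code.

```python
import errno
import os
import stat


def get_child(dir_fd, name):
    """Look up one component relative to dir_fd without following symlinks.

    Returns ("dir", fd), ("file", fd) or ("symlink", target).
    """
    if "/" in name or name in ("", ".", ".."):
        raise ValueError("not a single component: %r" % (name,))
    try:
        # O_NOFOLLOW: refuse to open if the final component is a symlink.
        # O_NONBLOCK: avoid blocking if the entry turns out to be a FIFO.
        fd = os.open(name, os.O_RDONLY | os.O_NOFOLLOW | os.O_NONBLOCK,
                     dir_fd=dir_fd)
    except OSError as exc:
        if exc.errno != errno.ELOOP:  # assumption: Linux errno for a symlink
            raise
        # The entry could be replaced between the two calls; readlink then
        # fails with an error rather than following anything.
        return ("symlink", os.readlink(name, dir_fd=dir_fd))
    # Classify using the FD we actually hold, not a second pathname lookup.
    kind = "dir" if stat.S_ISDIR(os.fstat(fd).st_mode) else "file"
    return (kind, fd)
```

Because every lookup is relative to an FD already held, renaming an ancestor directory in the meantime does not change which object is reached.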
Implementing the POSIX interface on top of the object interface. Done by FsOp.
- How do we get back the ".." (parent) feature of POSIX filenames? Use dir_stacks. Filename lookup algorithm.
- Differences in the resulting semantics: ".." behaves differently when directories are renamed. Easily detectable. Unlikely to matter in practice. Renaming directories while the directory or its pathname is in use tends to break things anyway.
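A sketch of how a dir_stack lookup might work, under the assumption that a dir_stack is simply the list of directory objects from the root down to the current directory. The Dir/File/Symlink classes, the resolve() name and the max_links limit are illustrative and not taken from FsOp.

```python
class Dir:
    def __init__(self, children=None):
        self.children = dict(children or {})


class File:
    pass


class Symlink:
    def __init__(self, target):
        self.target = target


def split(path):
    return [c for c in path.split("/") if c]


def resolve(root, cwd_stack, path, max_links=8):
    """Resolve path; return (dir_stack, object). The stack ends at the
    directory containing the object, or at the object itself if it is a
    directory."""
    stack = [root] if path.startswith("/") else list(cwd_stack)
    todo = split(path)
    links = 0
    while todo:
        name = todo.pop(0)
        if name == ".":
            continue
        if name == "..":
            # ".." pops the stack; directory objects have no parent link.
            if len(stack) > 1:
                stack.pop()
            continue
        child = stack[-1].children.get(name)
        if child is None:
            raise FileNotFoundError(name)
        if isinstance(child, Symlink):
            links += 1
            if links > max_links:
                raise OSError("too many levels of symbolic links")
            if child.target.startswith("/"):
                stack = [root]
            # Splice the link target in front of the remaining components.
            todo = split(child.target) + todo
        elif isinstance(child, Dir):
            stack.append(child)
        elif todo:
            raise NotADirectoryError(name)
        else:
            return stack, child
    return stack, stack[-1]
```

With this scheme ".." means "the directory we came through", so if a directory is moved while a process has it as its cwd, ".." still leads to the old parent, which is where the semantics diverge from POSIX.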
Implementation detail: FsOp is in ServerProcess, not PlashGlibc, but this is only for performance.
- Performance: haven't tried to measure this so far
- Note on file descriptors: These are not virtualised (at least, file FDs are not). We rely heavily on being able to pass FDs between processes. File FDs are not yet revocable.
- Directory FDs: real directory FDs are dangerous. We use dir_stacks instead.
- How glibc handles this: opendir(), fstat(). Dummy FDs. Probably out of scope for this paper.
- Setuid executables: not supported in this model
- why chroot() is normally restricted to root
Limitations:
- Race condition: create file vs. open existing file: FsOp's open() must choose whether to call dir_create_file() or dir_get_child()+file_open(), whereas open() is usually atomic. How do we deal with detectable races in general?
- Performance: Pathname lookup involves a series of round trips/domain crossings when the directory objects are remote. Could improve this by adding a get_by_path() method which traverses multiple levels until a symlink is encountered.
- Directories with "x" but not "r" permission: You can chdir() into them but you can't open() them (even if you don't intend to read the file list). Plash won't let you chdir() into them. Unix/Linux is just inconsistent here.
- Hard-linking vulnerability: e.g. Your ~/.bashrc is linked into /tmp by a conspiring user. This problem is caused by clashing access control models.
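One possible way to handle the create-vs-open race noted above is to retry when the race is detected. The sketch below assumes that the create operation is exclusive (fails if the name already exists), which these notes do not state; the names open_creat, Dir.create_file and max_tries are hypothetical.

```python
class File:
    pass


class Dir:
    def __init__(self):
        self.children = {}

    def get_child(self, name):
        return self.children.get(name)

    def create_file(self, name):
        # Assumption: creation is exclusive and fails if the name exists.
        if name in self.children:
            raise FileExistsError(name)
        self.children[name] = File()
        return self.children[name]


def open_creat(directory, name, excl=False, max_tries=10):
    """Emulate open(O_CREAT) using separate lookup and create operations.

    Returns (object, created).
    """
    for _ in range(max_tries):
        if not excl:
            obj = directory.get_child(name)
            if obj is not None:
                return obj, False
        try:
            return directory.create_file(name), True
        except FileExistsError:
            if excl:
                raise
            # Someone created the file after our lookup: look it up again.
    raise OSError("gave up after repeated create/delete races")
```

The loop can only spin if another party keeps creating and deleting the file in step with us, so a small retry bound seems enough in practice.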
Comparisons:
- Ostia (mentioned above)
- Hurd: Has a similar object-based directory tree interface, but it supports ".." by default. It has a special wrapper for disabling the parent link, which is used by chroot. Filename lookup is done by glibc.
- Linux: Internal VFS interfaces.
- Union mounts and read-only mounts. It is taking a while for these to be added to the Linux kernel. What makes it difficult?
- Mount tables (especially bind mounts). Linux mount tables are not implemented as proxies/wrappers, but as a special layer of the system. Mount tables work on the basis of directory/file identity, not on the basis of pathnames.
- Per-process mount tables. Presumably inspired by Plan 9. Does anyone actually use them? They still require root access. Another problem is that these namespaces are not first-class objects, so they are hard to modify. Investigate proposed hijack() system call (see 1 and 2 from LWN).
- Plan 9
- As with Linux, mount tables have special status in the system, even though filesystems are implemented in user space.
- Plan 9 does not allow directories to be moved/renamed. That makes mount tables a lot simpler. They don't need to work using directory identity, but can be based on pathnames.
- Plan 9 omits symlinks. Very sensible.
- SELinux: blocks FD delegation
- AppArmor: Its policies can be very similar to what Plash allows. However, the implementations are radically different.
AppArmor will show the whole global namespace to processes but then blocks access to things in that namespace, whereas Plash presents a restricted namespace.
AppArmor's implementation relies on being able to map an in-kernel file object back to its pathname. It then performs a check on that pathname (using a regexp?). Mapping back to a pathname is an odd approach. It's potentially racy, and it breaks down in the presence of chroots, bind mounts and hard linking. Lots of people objected to this implementation. I think it required a VFS change.
In contrast, Plash will perform access control at the point where the pathname is looked up in the namespace, or the point where the namespace is constructed. While it allows "policies" to be specified in terms of pathnames, it does not have problems with chroots and bind mounts. In fact, it encourages using those as access control abstractions. (Note that Plash does not have an entity called a "policy"; this is just for comparison with AppArmor.)
Of the objections to AppArmor, it is not clear how much is about pathname-based access control in general and how much about its implementation.
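To make the contrast concrete, here is a rough sketch of access control by namespace construction. Nested dicts stand in for directories and strings for granted objects; build_namespace, lookup and the example paths are all made up for illustration and are not Plash's API.

```python
def build_namespace(grants):
    """Build a fresh tree that contains only the granted objects.

    grants maps absolute pathnames to objects; intermediate directories
    are created as empty dicts.
    """
    root = {}
    for path, obj in grants.items():
        parts = [p for p in path.split("/") if p]
        if not parts:
            raise ValueError("cannot grant at the root in this sketch")
        node = root
        for part in parts[:-1]:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                raise ValueError("grant %r is nested under another grant" % path)
        node[parts[-1]] = obj
    return root


def lookup(namespace, path):
    node = namespace
    for part in [p for p in path.split("/") if p]:
        # No policy check happens here: ungranted names simply do not exist.
        if not isinstance(node, dict) or part not in node:
            raise FileNotFoundError(path)
        node = node[part]
    return node
```

The "policy" is just the set of objects placed in the tree, so there is never a need to map an object back to a pathname.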
Mount tables might deserve a section of their own
Symlinks deserve a section of their own. They create two problems:
- Avoiding following them. It's hard to use POSIX APIs in such a way that you don't follow symlinks accidentally. This isn't limited to Plash. Lots of core tools (such as rm) have to be symlink-careful. The main contribution of Plash's object interface is that it saves you from having to be symlink-careful.
- Implementing them.
- A symlink is interpreted in the namespace of the process looking at it rather than the namespace of the process that created it.
- Conclusion is that symlinks are a bad feature.
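The namespace-dependence of symlinks can be shown with a toy resolver. In this sketch, namespaces are nested dicts, the same directory containing a symlink is attached in two namespaces, and the link resolves to a different object in each. The Symlink class, resolve() and the example paths are illustrative only, and relative link targets are left out to keep it short.

```python
class Symlink:
    def __init__(self, target):
        self.target = target


def resolve(ns_root, path, max_links=8):
    """Resolve an absolute path in the namespace rooted at ns_root."""
    todo = [c for c in path.split("/") if c]
    node = ns_root
    links = 0
    while todo:
        name = todo.pop(0)
        if not isinstance(node, dict) or name not in node:
            raise FileNotFoundError(path)
        child = node[name]
        if isinstance(child, Symlink):
            links += 1
            if links > max_links:
                raise OSError("too many levels of symbolic links")
            if not child.target.startswith("/"):
                raise NotImplementedError("relative targets are omitted here")
            # The target is looked up from the root of the namespace doing
            # the lookup, whatever namespace the link was created in.
            node = ns_root
            todo = [c for c in child.target.split("/") if c] + todo
        else:
            node = child
    return node


# One directory object, visible in two different namespaces.
shared_opt = {"python": Symlink("/usr/bin/python3")}
ns_a = {"opt": shared_opt, "usr": {"bin": {"python3": "interpreter A"}}}
ns_b = {"opt": shared_opt, "usr": {"bin": {"python3": "interpreter B"}}}
```

Here resolve(ns_a, "/opt/python") and resolve(ns_b, "/opt/python") reach different objects through the same link, which is why a symlink cannot be treated as a reference to a particular object.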
Key questions for the usefulness of an interface are: How well does it support abstraction? (Can it be virtualised?) How easy is it to use safely? We should ask those of the POSIX pathname interfaces.
Out of scope:
- Executable objects (leads on from setuid executable issue)
