Using event loops and file descriptors
This page is for notes about how event loops and file descriptors interact on Unix and, specifically, Linux.
Types of file descriptor
- files
- pipes
- TCP sockets (AF_INET)
- Unix domain sockets (AF_UNIX)
- ttys
- other devices, e.g. /dev/null, /dev/zero
- directories (not relevant for this discussion because they are not streams)
End-of-file and error conditions
Some FD types indicate end-of-file/error conditions by setting POLLERR/POLLHUP in the flags returned by poll(). Other FD types do not but will just return EOF from read() (i.e. return zero bytes).
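As a minimal sketch of the first case (Python on Linux assumed; the helper name poll_closed_pipe is made up for illustration), closing the write end of a pipe makes poll() report POLLHUP on the read end, and read() then returns zero bytes:

```python
import os
import select

def poll_closed_pipe():
    """Close a pipe's write end, then poll and read the read end."""
    r, w = os.pipe()
    os.close(w)                      # the only writer goes away
    poller = select.poll()
    poller.register(r, select.POLLIN)
    events = poller.poll(0)          # list of (fd, event mask) pairs
    data = os.read(r, 1024)          # EOF: read() returns zero bytes
    os.close(r)
    return events[0][1], data

mask, data = poll_closed_pipe()
print(bool(mask & select.POLLHUP), data)
```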
read() itself can return the ECONNRESET error ("Connection reset by peer"). This happens when A tries to read from one end of a socket, having sent data across the socket to B, and B has closed its socket without having read the data. This error can occur before read() returns pending data that B has sent before closing its socket, so B's data can be lost. This can occur with TCP and Unix sockets. It cannot occur with pipes since pipes are unidirectional.
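The scenario can be reproduced with a Unix domain socket pair. This is only a sketch: the function name is hypothetical, and whether the read fails with ECONNRESET or just returns EOF may depend on the kernel, so the code reports both outcomes rather than assuming one.

```python
import socket

def read_after_peer_discards():
    """A sends to B; B closes without reading; A then tries to read."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    a.sendall(b"unread request")
    b.close()                        # B never read A's data
    try:
        data = a.recv(1024)
        return "EOF" if data == b"" else "data"
    except ConnectionResetError:     # errno ECONNRESET
        return "ECONNRESET"
    finally:
        a.close()

print(read_after_peer_discards())
```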
ttys can return EOF events multiple times. An EOF can be generated when the user presses Ctrl-D at the terminal.
Buffering and POLLHUP
If we are doing flow control, we read data into a bounded buffer, and we do not want to read when our buffer is full, even if there is data waiting to be read in the FD's buffer. Holding off from reading means that the FD's buffer can fill up so that if the writer keeps writing, it will eventually block.
Normally this is easy to do: When we don't want to read, don't pass POLLIN to poll(), and poll() will not notify us even when there is data to read. This ceases to work when the writer has closed its FD, because poll() can start to report POLLHUP on our FD. Unfortunately, unlike POLLIN, POLLHUP cannot be turned off for an FD registered with poll(). That means that if we do not read the pending data (because our buffer is full), poll() will continually report POLLHUP, leading to a busy loop.
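The following sketch (Python on Linux assumed; hup_with_empty_mask is a made-up name) registers the read end of a pipe with an empty event mask, as we would when our buffer is full. Once the writer has closed its end, every poll() call still returns immediately with POLLHUP:

```python
import os
import select

def hup_with_empty_mask(rounds=3):
    """Poll a hung-up pipe several times while asking for no events."""
    r, w = os.pipe()
    os.write(w, b"pending data")
    os.close(w)                      # writer goes away, data still queued
    poller = select.poll()
    poller.register(r, 0)            # empty mask: we do not want POLLIN
    results = [poller.poll(0) for _ in range(rounds)]
    os.close(r)
    return results

for events in hup_with_empty_mask():
    print([(fd, bool(mask & select.POLLHUP)) for fd, mask in events])
```

An event loop that ignores these events and calls poll() again would spin without ever blocking.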
Possible workarounds:
- Don't pass the FD to poll() at all when we are not ready to read. This doesn't help when the FD is a bidirectional socket, used for both reading and writing, because we might want to ask for POLLOUT even if we don't want POLLIN or POLLHUP. If sockets don't report POLLHUP but (unidirectional) pipes do, this might not matter, though it would require treating sockets and pipes differently.
- Use select(), with which exception events are optional.
- If we receive POLLHUP, read the entire remaining contents of the FD's buffer. This means that we can no longer strictly bound our own buffer: on connection end, we have to extend our buffer by the size of the pipe/socket's buffer. However, the buffer size would still be bounded (assuming that the kernel's FD buffers are always bounded). This would give an observable difference from SSH's behaviour below.
How does SSH handle this? We can test SSH in this situation with the following command:
yes | (head -c200000; echo "done writing" 1>&2) | ssh localhost "sleep 2; wc --bytes"
The writer writes 200,000 bytes, which is enough to fill SSH's buffer but not completely fill the pipe as well, which is confirmed by "done writing" being printed. For 2 seconds, there is no reader, and SSH blocks on a select() call (this is revealed by stracing SSH). SSH later correctly reads the remaining queued data. "wc" confirms that it received it all by printing "200000". SSH handles this case by using select() instead of poll() and by not passing the stdin FD to select() at all when its buffer is full.
Solution
There is not actually a problem with poll() here. Once you have received POLLHUP for an FD, no new data can arrive in the FD's buffer. You can read the remaining buffered data from the FD immediately, or you can fetch it later. If you fetch the data later, this can be triggered by space becoming available in your process's buffer, rather than by receiving events for the FD via poll().
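A sketch of this approach, under the assumption that the FD is a pipe on Linux. The BoundedReader class and its method names are hypothetical, not taken from Plash or SSH. After POLLHUP the FD is unregistered from poll(), and further reads are triggered by the consumer freeing buffer space:

```python
import os
import select

class BoundedReader:
    """Hypothetical reader that never buffers more than `limit` bytes."""

    def __init__(self, fd, poller, limit):
        self.fd, self.poller, self.limit = fd, poller, limit
        self.buf = b""
        self.eof = False
        self.registered = True       # is the FD still known to poll()?
        poller.register(fd, select.POLLIN)

    def _stop_polling(self):
        if self.registered:
            self.poller.unregister(self.fd)
            self.registered = False

    def on_poll_event(self, mask):
        if mask & select.POLLHUP:
            self._stop_polling()     # no new data can arrive from now on
        self._fill()

    def _fill(self):
        space = self.limit - len(self.buf)
        if space > 0 and not self.eof:
            data = os.read(self.fd, space)
            self.buf += data
            if data == b"":
                self.eof = True
                self._stop_polling()
        if self.registered:
            # Flow control: only ask for POLLIN while there is room.
            full = len(self.buf) >= self.limit
            self.poller.modify(self.fd, 0 if full else select.POLLIN)

    def take(self, n):
        """Consumer removes up to n bytes from the buffer."""
        out, self.buf = self.buf[:n], self.buf[n:]
        if self.registered:
            self.poller.modify(self.fd, select.POLLIN)
        else:
            self._fill()             # triggered by buffer space, not poll()
        return out

def demo():
    r, w = os.pipe()
    os.write(w, b"x" * 10)
    os.close(w)
    poller = select.poll()
    reader = BoundedReader(r, poller, limit=4)
    for fd, mask in poller.poll(0):
        reader.on_poll_event(mask)
    idle = poller.poll(0)            # nothing registered: no busy loop
    chunks = []
    while reader.buf:
        chunks.append(reader.take(4))
    os.close(r)
    return idle, b"".join(chunks)
```

In demo(), the buffer never holds more than 4 bytes, yet all 10 bytes are eventually delivered without poll() reporting the FD again.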
Doing IO safely, without blocking
Writing
Ideally we want to be able to write data to an FD with a guarantee that the operation will not block, because blocking would cause denial of service for any other streams that the process might be managing. This turns out to be hard to do in the general case.
With sockets, we can use send() with the MSG_DONTWAIT flag. However, send() returns ENOTSOCK on pipes (despite the man page stating that "With zero flags parameter, send() is equivalent to write()"). It is possible to enable non-blocking mode on pipes by using fcntl() and F_SETFL to set O_NONBLOCK. However, if the file descriptor is shared between processes, there is a race condition. Another process could unset O_NONBLOCK. (See UsefulKernelChanges.)
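For reference, setting O_NONBLOCK on a pipe looks roughly like this in Python (the function names are illustrative). The flag lives on the open file description, so it is shared with every process that holds the same FD, which is where the race condition comes from:

```python
import fcntl
import os

def set_nonblocking(fd):
    """Set O_NONBLOCK, preserving the other file status flags."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

def fill_pipe_without_blocking():
    """Write to a pipe that nobody reads until the kernel says EAGAIN."""
    r, w = os.pipe()
    set_nonblocking(w)
    written = 0
    try:
        while True:
            written += os.write(w, b"x" * 4096)
    except BlockingIOError:          # EAGAIN: the pipe buffer is full
        pass
    finally:
        os.close(r)
        os.close(w)
    return written

print(fill_pipe_without_blocking())
```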
How does SSH deal with this? Like Plash's FD forwarder, SSH must be able to forward data to an arbitrary file descriptor (e.g. forwarding to a stdout pipe), and SSH is event loop based. How does SSH manage to write to a (possibly shared) pipe FD without it blocking other streams, such as X11 forwarding?
We can test SSH with this command:
(strace ssh localhost yes | sleep 999) 2>&1 | tee strace-log
The log includes:
dup(1) = 5
fcntl64(5, F_GETFL) = 0x1 (flags O_WRONLY)
fcntl64(5, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
I have discovered that poll() will not return POLLOUT for a pipe after more than 0xf000 bytes have been written to it. On Linux, the pipe buffer size is 0x10000 (64k), while PIPE_BUF = 0x1000 (4k). PIPE_BUF is the largest amount of data that you can write to a pipe atomically, without interleaving with other writers (this is documented in the pipe(7) man page). So POLLOUT indicates when it is possible to write PIPE_BUF bytes to the buffer without blocking. This does not seem to be documented anywhere. This means you can avoid blocking on writes to pipes by waiting for POLLOUT and then writing PIPE_BUF bytes or fewer at a time.
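This observation can be checked with a short sketch (Python on Linux assumed; write_while_pollout is a made-up name). It writes PIPE_BUF bytes each time poll() reports POLLOUT and stops when POLLOUT is no longer reported. The FD is put into non-blocking mode purely as a safety net, so that a wrong assumption would show up as an exception rather than a hang:

```python
import os
import select

def write_while_pollout():
    """Write PIPE_BUF-sized chunks for as long as poll() reports POLLOUT."""
    r, w = os.pipe()
    os.set_blocking(w, False)        # safety net only, see above
    poller = select.poll()
    poller.register(w, select.POLLOUT)
    chunk = b"x" * select.PIPE_BUF
    total = 0
    while poller.poll(0):            # empty list once POLLOUT stops
        n = os.write(w, chunk)
        assert n == len(chunk)       # writes of <= PIPE_BUF are all-or-nothing
        total += n
    os.close(r)
    os.close(w)
    return total

print(hex(write_while_pollout()))
```

The total printed is the point at which the kernel stopped reporting POLLOUT; the exact value may differ between kernel versions.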
Reading
The man page for select() states that: "Under Linux, select() may report a socket file descriptor as "ready for reading", while nevertheless a subsequent read blocks. This could for example happen when data has arrived but upon examination has wrong checksum and is discarded. There may be other circumstances in which a file descriptor is spuriously reported as ready. Thus it may be safer to use O_NONBLOCK on sockets that should not block."
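Following that advice, a read helper might look like the sketch below. The name read_if_ready and its return convention are assumptions made for illustration: with the socket in non-blocking mode, a spurious readiness report becomes a harmless "nothing yet" result instead of a blocked process.

```python
import socket

def read_if_ready(sock, size=4096):
    """Return data, b"" at EOF, or None if the read would have blocked."""
    try:
        return sock.recv(size)
    except BlockingIOError:          # EAGAIN/EWOULDBLOCK
        return None

a, b = socket.socketpair()
a.setblocking(False)
first = read_if_ready(a)             # nothing has been sent yet
b.sendall(b"hi")
second = read_if_ready(a)
a.close()
b.close()
```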
Checking FD readability and writability
If you use read() on a non-readable FD (e.g. an FD that was opened write-only) or use write() on a non-writable FD, the call returns EBADF. Note that EBADF is overloaded: it also occurs when the FD table slot is empty. (Recent kernel changes have overloaded it further for revoked FDs.)
However, if you wait until poll() returns POLLIN or POLLOUT for the FD before calling read() or write() respectively, you will be waiting forever. poll() will not tell you if the event you asked it to wait for is not relevant for the FD and will never occur (although it will tell you if the FD number is not valid by returning POLLNVAL). You need to check beforehand by using fcntl()/F_GETFL and looking at the FD's access mode.
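A sketch of that check in Python (access_mode is a hypothetical helper name):

```python
import fcntl
import os

def access_mode(fd):
    """Return (readable, writable) according to the FD's access mode."""
    mode = fcntl.fcntl(fd, fcntl.F_GETFL) & os.O_ACCMODE
    readable = mode in (os.O_RDONLY, os.O_RDWR)
    writable = mode in (os.O_WRONLY, os.O_RDWR)
    return readable, writable

r, w = os.pipe()
print(access_mode(r), access_mode(w))
os.close(r)
os.close(w)
```

An event loop could call this when a watch is registered and reject a POLLIN watch on a write-only FD (or a POLLOUT watch on a read-only one) instead of waiting forever.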
select() behaves differently to poll() in this regard. For /dev/null FDs, select() will report an FD as ready-to-read even if the FD was not opened for reading. This behaviour is arguably more sensible, because the purpose of select() and poll() is to determine when a read or write call will not block. Note that SSH uses select() and so behaves correctly for this case although SSH does not check readability/writability using fcntl().
However, select() does not behave consistently, and it does not report a read pipe as ready-to-write. This means SSH will hang in some cases. For example:
$ true | ssh localhost /bin/echo hello 1>&0
<hangs...>
Without SSH involved, this gives:
$ true | /bin/echo hello 1>&0
/bin/echo: write error: Bad file descriptor
Gotchas of glib's event loop
glib treats its FD watches as non-reentrant. This becomes important when using nested event loops. If a glib FD watch handler causes another event loop to be entered, glib will not re-enter the FD watch. The workaround (if necessary) is to register a new FD watch.
glib allows FD watches to omit POLLERR and POLLHUP from their event masks, and it will not deliver these events to the watch if they occur. However, these events are not optional for poll(): it will always deliver them if they occur on an FD that is passed to poll(). This means that naive use of glib FD watches could lead to a busy wait, as poll() never blocks but continually returns POLLERR, while the process does not respond by removing the FD watch.
It is not clear what the lifetime of source IDs is. Can source_remove() be safely called more than once on the same ID, or could the ID be reused for another FD watch?
Testing code based on event loops
It is sometimes difficult to write unit test cases for code that uses an event loop, because:
- With the glib and Twisted event loops, the list of FD watches is global. An FD watch left behind by one unit test can interfere with later test cases, causing problems that are difficult to track down.
- Blocking operations are involved. A test that fails by deadlocking and blocking indefinitely is difficult to debug. Test runners like Python's unittest don't detect wedged tests, so you have to kill the test runner by hand.
- There is a temptation to get around the global event loop problem by fork()ing subprocesses, but that makes the test harder to manage. Firstly, it makes the test non-deterministic. Secondly, it makes it hard to make assertions about the states of the subprocesses.
The event loop framework in plash/python/plash/comms/event_loop.py gets around these problems by making the event loop a first-class object, which must be passed in to any code that wants to register FD watches. The event loop can be run one step at a time, in non-blocking mode (while asserting that blocking would not occur), or it can be run until it would block. All the code is run in the same process, which generally makes it deterministic, though it is still at the mercy of how the kernel implements FD operations. The event loop can be implemented using poll() directly (via Python's select.poll) or using glib. When using glib, it removes all the glib FD watches it had registered on test teardown.
This framework makes it easy to make assertions across interacting components. For example, to test flow control in an FD stream forwarder, we put a big chunk of data into the forwarder's input pipe (while there is no reader for its output pipe), run the loop until it blocks, and assert that the forwarder has only read some of that data into its own buffer. Then we remove data from the forwarder's output pipe, run the loop again until it blocks, and assert that the forwarder read the rest of the data from its input pipe.
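The idea can be sketched as follows. This is not the Plash code: the EventLoop class and the names add_watch, remove_watch, will_block, once_safely and run_awhile are all invented here to illustrate an explicit loop object built on select.poll that can be stepped from a test.

```python
import os
import select

class EventLoop:
    """Hypothetical explicit event loop built on select.poll."""

    def __init__(self):
        self._poller = select.poll()
        self._handlers = {}

    def add_watch(self, fd, mask, handler):
        self._poller.register(fd, mask)
        self._handlers[fd] = handler

    def remove_watch(self, fd):
        self._poller.unregister(fd)
        del self._handlers[fd]

    def will_block(self):
        """True if no watched FD has an event pending right now."""
        return not self._poller.poll(0)

    def once_safely(self):
        """Run one step, asserting that it would not have blocked."""
        ready = self._poller.poll(0)
        assert ready, "event loop would block"
        for fd, mask in ready:
            if fd in self._handlers:   # an earlier handler may remove it
                self._handlers[fd](mask)

    def run_awhile(self):
        """Run steps until the next one would block."""
        while not self.will_block():
            self.once_safely()

def example_test():
    loop = EventLoop()
    r, w = os.pipe()
    got = []

    def on_readable(mask):
        data = os.read(r, 1024)
        if data:
            got.append(data)
        else:
            loop.remove_watch(r)     # EOF: stop watching

    loop.add_watch(r, select.POLLIN, on_readable)
    assert loop.will_block()         # nothing written yet
    os.write(w, b"hello")
    loop.run_awhile()
    assert got == [b"hello"]
    os.close(w)
    loop.run_awhile()                # handler sees EOF and removes the watch
    assert loop.will_block()
    os.close(r)
    return got
```

Because the loop is an ordinary object created inside the test, no watches leak into later tests, and a test that would have blocked fails with an assertion instead of hanging.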
