fork() can deadlock
- Found in: 1.19
CategoryFixed: fixed in 815
Multi-threaded use of fork() in PlashGlibc can cause deadlocks. This can be reproduced by glibc's nptl/tst-fork1 test case (see StoryTest3).
The problem is that the kernel_fork() call can occur while another thread is holding the lock. Since the other thread does not exist in the forked process, the lock is never released in the forked process, and so the next time the lock is claimed a deadlock will occur. The problem (at least in this form) was introduced in 743.
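The failure mode described above can be reproduced in miniature. The following is only a sketch under assumptions: demo_lock, hold_lock() and child_sees_held_lock() are hypothetical names, plain fork() stands in for kernel_fork(), and the child uses pthread_mutex_trylock() so that the demo reports the stuck lock instead of hanging on it.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical lock standing in for the one used by the fork() wrapper. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t lock_taken;   /* posted once the helper thread holds demo_lock */
static sem_t may_release;  /* posted when the helper thread may unlock */

static void *hold_lock(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&demo_lock);
    sem_post(&lock_taken);
    sem_wait(&may_release);
    pthread_mutex_unlock(&demo_lock);
    return NULL;
}

/* Returns 1 if the forked child found demo_lock already held,
   0 if it did not, -1 on error. */
int child_sees_held_lock(void)
{
    pthread_t helper;
    int status = 0;
    int result = -1;

    sem_init(&lock_taken, 0, 0);
    sem_init(&may_release, 0, 0);
    if (pthread_create(&helper, NULL, hold_lock, NULL) != 0)
        return -1;
    sem_wait(&lock_taken);
    /* The helper thread owns the lock at this point. */
    assert(pthread_mutex_trylock(&demo_lock) == EBUSY);

    pid_t pid = fork();
    if (pid == 0) {
        /* Only the forking thread exists in the child, so the copied lock
           has no owner left to release it. A blocking lock would hang. */
        _exit(pthread_mutex_trylock(&demo_lock) == EBUSY ? 1 : 0);
    }

    sem_post(&may_release);
    pthread_join(helper, NULL);
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);
    sem_destroy(&lock_taken);
    sem_destroy(&may_release);
    return result;
}
```

Built with -pthread, child_sees_held_lock() is expected to report that the child inherited the lock in its locked state, even though the parent's helper thread goes on to release its own copy normally.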
This problem is described in the specification for pthread_atfork(). We could register a pthread_atfork() handler to address this problem, but that is not necessary because we are implementing fork() itself! Instead of releasing the lock when doing kernel_fork(), we should keep holding the lock and only release it at the end of the fork() function (where it gets released in the parent and child processes).
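A minimal sketch of the proposed fix follows, assuming a single pthread mutex (comm_lock, a hypothetical name) and using plain fork() in place of kernel_fork(). It is not PlashGlibc's actual code. Note that the glibc comment quoted later on this page suggests that re-initializing the mutex in the child may be safer than unlocking it there.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the lock used by PlashGlibc's fork() wrapper. */
static pthread_mutex_t comm_lock = PTHREAD_MUTEX_INITIALIZER;

/* Sketch: take the lock before forking and release it only afterwards. */
pid_t locked_fork(void)
{
    int rc = pthread_mutex_lock(&comm_lock);
    assert(rc == 0);
    (void)rc;

    /* ... work that needs the lock would go here ... */

    pid_t pid = fork();  /* stands in for kernel_fork() */

    /* Runs in the parent and, if the fork succeeded, in the child too.
       The forking thread holds the lock in both, so each copy is released. */
    pthread_mutex_unlock(&comm_lock);
    return pid;
}

/* Returns 1 if the lock is free in both processes after locked_fork(),
   0 if either process still sees it held, -1 on error. */
int lock_free_after_fork(void)
{
    int status = 0;
    pid_t pid = locked_fork();

    if (pid == 0)
        _exit(pthread_mutex_trylock(&comm_lock) == 0 ? 0 : 1);
    if (pid < 0 || waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    if (pthread_mutex_trylock(&comm_lock) != 0)
        return 0;
    pthread_mutex_unlock(&comm_lock);
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

Because no other thread can acquire comm_lock between the lock and the fork, the child can never inherit it in a state owned by a thread that does not exist.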
Are pthread mutexes usable after a fork()? The pthread_atfork man page (written by Xavier Leroy, presumably for the old LinuxThreads implementation) says: "fork(2) duplicates the whole memory space, including mutexes in their current locking state, but only the calling thread: other threads are not running in the child process. The mutexes are not usable after the fork and must be initialized with pthread_mutex_init in the child process. This is a limitation of the current implementation and might or might not be present in future versions."
It might be possible to replace PlashGlibc's fork() wrapper with pthread_atfork() handlers.
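As an illustration of what the handler-based alternative might look like, here is a sketch using pthread_atfork(). The lock and handler names are hypothetical, and the child handler re-initializes the mutex, following the man page advice quoted above, rather than unlocking it.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the lock that must survive fork(). */
static pthread_mutex_t comm_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_once_t handlers_once = PTHREAD_ONCE_INIT;

/* Runs in the parent before the fork: wait until no thread holds the lock. */
static void before_fork(void) { pthread_mutex_lock(&comm_lock); }

/* Runs in the parent after the fork: release the lock again. */
static void after_fork_parent(void) { pthread_mutex_unlock(&comm_lock); }

/* Runs in the child after the fork: start again with a fresh mutex. */
static void after_fork_child(void) { pthread_mutex_init(&comm_lock, NULL); }

static void register_handlers(void)
{
    int rc = pthread_atfork(before_fork, after_fork_parent, after_fork_child);
    assert(rc == 0);
    (void)rc;
}

/* Registers the handlers once; registering twice would lock twice. */
void install_fork_handlers(void)
{
    pthread_once(&handlers_once, register_handlers);
}

/* Returns 1 if the lock is free in parent and child after an ordinary
   fork(), 0 if either still sees it held, -1 on error. */
int lock_free_after_plain_fork(void)
{
    int status = 0;

    install_fork_handlers();
    pid_t pid = fork();
    if (pid == 0)
        _exit(pthread_mutex_trylock(&comm_lock) == 0 ? 0 : 1);
    if (pid < 0 || waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    if (pthread_mutex_trylock(&comm_lock) != 0)
        return 0;
    pthread_mutex_unlock(&comm_lock);
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

With this approach the fork() wrapper itself would not need to know about the lock, at the cost of relying on atfork handlers being run by whatever fork() implementation is in use.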
glibc's internal locks
How does the rest of glibc deal with this problem? Some cases do not use pthread_atfork. nptl/sysdeps/unix/sysv/linux/fork.c contains special cases for reinitializing the mutexes that glibc uses:
/* Reset the file list. These are recursive mutexes. */
fresetlockfiles ();
/* Reset locks in the I/O code. */
_IO_list_resetlock ();
/* Reset the lock the dynamic loader uses to protect its data. */
__rtld_lock_initialize (GL(dl_load_lock));
malloc/arena.c, on the other hand, uses atfork. It sets up an atfork handler in ptmalloc_init(), called from __libc_malloc_pthread_startup(). The handler contains the following comment:
/* In NPTL, unlocking a mutex in the child process after a fork() is currently unsafe, whereas re-initializing it is safe and does not leak resources. Therefore, a special atfork handler is installed for the child. */
There is a reference to a thread_atfork_static mechanism which is available on Hurd but not on Linux with NPTL (it is #undef'd in nptl/sysdeps/pthread/malloc-machine.h).
What is permitted after fork()?
The specification for fork() states that multi-threaded programs may only use async-signal-safe functions between fork() and execve(). But pthread_mutex_init() and pthread_mutex_unlock() are not async-signal-safe, so this would make pthread_atfork() useless.
One part of that seems to say that while fork() is theoretically async-signal-safe (usable from a signal handler), atfork handlers usually are not, so in practice fork() should not be used from a signal handler. That has very little consequence. Being forbidden from using malloc() between fork() and execve() is a much more serious restriction, although this does not appear to apply in glibc.
This post by Dave Butenhof (Re: EPERM from pthread_mutex_unlock after fork using pthread_atfork()) discusses the issue.
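To make the restriction discussed above concrete, here is a sketch of a child that stays within async-signal-safe calls between fork() and execve(). The function name run_command() and the use of /bin/sh are assumptions made for the example. Everything that might allocate or take a lock is done before the fork.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Runs `command` via /bin/sh in a child process.
   Returns the child's exit status, or -1 on error. */
int run_command(const char *command)
{
    assert(command != NULL);

    /* Built before fork() so the child has nothing to allocate. */
    char *const argv[] = { "sh", "-c", (char *)command, NULL };
    int status = 0;

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: only execve(), write() and _exit() are called here,
           all of which are async-signal-safe. */
        execve("/bin/sh", argv, environ);
        static const char msg[] = "exec failed\n";
        if (write(STDERR_FILENO, msg, sizeof msg - 1) < 0) {
            /* Nothing more can be done safely. */
        }
        _exit(127);
    }
    if (pid < 0 || waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A child written this way never touches malloc() or a pthread mutex, so it cannot trip over a lock that was held by another thread at the time of the fork.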
Test case
tst-fork1 does not use glibc's test-skeleton.c and so it does not time out, which blocks the rest of the test suite from running. Changing it to use test-skeleton.c is easy, but after doing that I can no longer reproduce the deadlock, and I can't explain why that would make a difference.
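The general idea of running a test under a timeout, so that a deadlock fails one test instead of stalling the whole suite, can be sketched as follows. This is not the code in glibc's test-skeleton.c. The function names and the polling approach are assumptions made for the example.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Runs test_fn in a child process. Returns its exit status, -1 on error
   or abnormal exit, or -2 if it was killed after roughly timeout_ms. */
int run_with_timeout(int (*test_fn)(void), int timeout_ms)
{
    assert(timeout_ms >= 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(test_fn());

    const struct timespec tick = { 0, 10 * 1000 * 1000 };  /* 10 ms */
    for (int waited = 0; waited <= timeout_ms; waited += 10) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        if (r < 0)
            return -1;
        nanosleep(&tick, NULL);
    }

    /* Timed out: kill the stuck child and reap it. */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return -2;
}

/* Example test bodies: one that passes, one that never returns. */
int passes(void) { return 0; }
int hangs_forever(void) { for (;;) pause(); }
```

A deadlocked test then shows up as a distinct timeout result, and the remaining tests still get to run.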
Leaking file descriptors
A related problem is that the current implementation can leak file descriptors into the forked processes: in between clone_connection() and kernel_fork(), another fork() call can occur, which would create a child process with a leaked FD. Holding the lock throughout fork() would fix this. Setting the CLOFORK (close-on-fork) bit would also fix this, but that would only be available on newer kernels.
strace logs
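The log for a non-deadlocked thread in nptl/tst-fork1 looks like this (the futex operations that strace leaves undecoded are 0x80 = FUTEX_WAIT|FUTEX_PRIVATE_FLAG and 0x81 = FUTEX_WAKE|FUTEX_PRIVATE_FLAG):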
cloned 5: started pid 18970
cloned 5: set_robust_list(0x428049f0, 0x18 <unfinished ...>
cloned 5: <... set_robust_list resumed> ) = 0
cloned 5: futex(0x2b79a1b2049c, 0x80 /* FUTEX_??? */, 2 <unfinished ...>
cloned 5: <... futex resumed> ) = 0
cloned 5: futex(0x2b79a1b2049c, 0x81 /* FUTEX_??? */, 1 <unfinished ...>
cloned 5: <... futex resumed> ) = 0
cloned 5: futex(0x2b79a1b1e2c0, 0x80 /* FUTEX_??? */, 2 <unfinished ...>
cloned 5: <... futex resumed> ) = 0
cloned 5: sendmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"MSG!\30\0\0\0\0\0\0\0Invk\0\1\0\0\1\0\0\0\2\0\0\0Call"..., 36}], msg_controllen=16, {cmsg_len=16, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, ...}, msg_flags=0}, MSG_NOSIGNAL <unfinished ...>
cloned 5: <... sendmsg resumed> ) = 36
cloned 5: futex(0x2b79a1b1fba0, 0x80 /* FUTEX_??? */, 2 <unfinished ...>
cloned 5: <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
cloned 5: futex(0x2b79a1b1fba0, 0x81 /* FUTEX_??? */, 1 <unfinished ...>
cloned 5: <... futex resumed> ) = 0
cloned 5: recvmsg(3, <unfinished ...>
cloned 5: <... recvmsg resumed> {msg_name(0)=NULL, msg_iov(1)=[{"MSG!\24\0\0\0\0\0\0\0Invk\0\0\0\0\1\0\0\0\1\2\0\0RCap"..., 10240}], msg_controllen=0, msg_flags=0}, 0) = 32
cloned 5: sendmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"MSG!$\0\0\0\0\0\0\0Invk\0\0\0\0\3\0\0\0\2\0\0\0\0\0\0\0"..., 48}], msg_controllen=16, {cmsg_len=16, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, ...}, msg_flags=0}, MSG_NOSIGNAL <unfinished ...>
cloned 5: <... sendmsg resumed> ) = 48
cloned 5: recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"MSG!\20\0\0\0\1\0\0\0Invk\0\0\0\0\0\0\0\0RMkcMSG!"..., 10208}], msg_controllen=24, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {4}}, msg_flags=0}, 0) = 28
cloned 5: sendmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"MSG!\10\0\0\0\0\0\0\0Drop\0\2\0\0", 20}], msg_controllen=16, {cmsg_len=16, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, ...}, msg_flags=0}, MSG_NOSIGNAL) = 20
cloned 5: futex(0x2b79a1b1e2c0, 0x81 /* FUTEX_??? */, 1) = 0
cloned 5: clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x428049e0) = 18973
cloned 5: close(4 <unfinished ...>
cloned 5: <... close resumed> ) = 0
cloned 5: wait4(18973, <unfinished ...>
cloned 5: <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 3}], 0, NULL) = 18973
cloned 5: --- SIGCHLD (Child exited) @ 0 (0) ---
cloned 5: fstat(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
cloned 5: mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaab000
cloned 5: _exit(0) = ?
----
cloned 5: cloned 1: started pid 18973
cloned 5: cloned 1: close(3 <unfinished ...>
cloned 5: cloned 1: <... close resumed> ) = 0
cloned 5: cloned 1: dup2(4, 3) = 3
cloned 5: cloned 1: close(4) = 0
cloned 5: cloned 1: fstat(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
cloned 5: cloned 1: mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b79a1b24000
cloned 5: cloned 1: nanosleep({0, 300000000}, NULL) = 0
cloned 5: cloned 1: exit_group(3) = ?
whereas a deadlocked thread's forked process gets blocked on a futex() syscall:
cloned 4: started pid 18969
cloned 4: set_robust_list(0x420039f0, 0x18 <unfinished ...>
cloned 4: <... set_robust_list resumed> ) = 0
cloned 4: futex(0x2b79a1b1e2c0, 0x80 /* FUTEX_??? */, 2 <unfinished ...>
cloned 4: <... futex resumed> ) = 0
cloned 4: futex(0x2b79a1b2049c, 0x80 /* FUTEX_??? */, 2 <unfinished ...>
cloned 4: <... futex resumed> ) = 0
cloned 4: futex(0x2b79a1b2049c, 0x81 /* FUTEX_??? */, 1 <unfinished ...>
cloned 4: <... futex resumed> ) = 1
cloned 4: sendmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"MSG!\30\0\0\0\0\0\0\0Invk\0\1\0\0\1\0\0\0\2\0\0\0Call"..., 36}], msg_controllen=16, {cmsg_len=16, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, ...}, msg_flags=0}, MSG_NOSIGNAL <unfinished ...>
cloned 4: <... sendmsg resumed> ) = 36
cloned 4: recvmsg(3, <unfinished ...>
cloned 4: <... recvmsg resumed> {msg_name(0)=NULL, msg_iov(1)=[{"MSG!\24\0\0\0\0\0\0\0Invk\0\0\0\0\1\0\0\0\1\2\0\0RCap"..., 10180}], msg_controllen=0, msg_flags=0}, 0) = 32
cloned 4: sendmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"MSG!$\0\0\0\0\0\0\0Invk\0\0\0\0\3\0\0\0\2\0\0\0\0\0\0\0"..., 48}], msg_controllen=16, {cmsg_len=16, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, ...}, msg_flags=0}, MSG_NOSIGNAL) = 48
cloned 4: recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"MSG!\20\0\0\0\1\0\0\0Invk\0\0\0\0\0\0\0\0RMkc\0\0\0\0"..., 10148}], msg_controllen=24, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {4}}, msg_flags=0}, 0) = 28
cloned 4: sendmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"MSG!\10\0\0\0\0\0\0\0Drop\0\2\0\0", 20}], msg_controllen=16, {cmsg_len=16, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, ...}, msg_flags=0}, MSG_NOSIGNAL) = 20
cloned 4: futex(0x2b79a1b1e2c0, 0x81 /* FUTEX_??? */, 1 <unfinished ...>
cloned 4: <... futex resumed> ) = 1
cloned 4: clone( <unfinished ...>
cloned 4: <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x420039e0) = 18972
cloned 4: futex(0x2b79a1b1fba0, 0x81 /* FUTEX_??? */, 1 <unfinished ...>
cloned 4: <... futex resumed> ) = 0
cloned 4: close(4 <unfinished ...>
cloned 4: <... close resumed> ) = 0
cloned 4: wait4(18972, <unfinished ...>
----
cloned 4: cloned 1: started pid 18972
cloned 4: cloned 1: close(3) = 0
cloned 4: cloned 1: dup2(4, 3 <unfinished ...>
cloned 4: cloned 1: <... dup2 resumed> ) = 3
cloned 4: cloned 1: close(4) = 0
cloned 4: cloned 1: futex(0x2b79a1b1e2c0, 0x80 /* FUTEX_??? */, 2 <unfinished ...>
This strace output was formatted with the help of scratch/strace-log.
