Optimize fork bottleneck in WSL
While fork()/clone() is generally very fast on native Linux (~30 µs for a simple process), it's much slower on WSL: 2-5 ms to fork a simple process and ~50 ms to fork bst during a build. With a large number of builders, as may be the case with remote execution, the fork cost may become a significant bottleneck. It also shows up when building the synthetic Debian benchmark project: its build actions are trivial imports, so process creation dominates.
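To reproduce these numbers on a given machine, a micro-benchmark along the following lines may help. It is a minimal sketch, assuming a POSIX host where `os.fork()` is available. It times the fork(), child exit and waitpid() round trip, and lets you inflate the parent's resident memory and file descriptor table to see whether the latency scales with either. The function names and the "ballast" approach are hypothetical illustrations, not part of BuildStream.

```python
import mmap
import os
import time


def measure_fork_latency(iterations: int = 50) -> float:
    """Mean seconds for one fork(), child exit and waitpid() round trip."""
    total = 0.0
    for _ in range(iterations):
        start = time.perf_counter()
        pid = os.fork()
        if pid == 0:
            # Child: exit at once, skipping Python cleanup, so only process
            # creation and teardown are measured.
            os._exit(0)
        os.waitpid(pid, 0)
        total += time.perf_counter() - start
    return total / iterations


def measure_with_ballast(mib: int, extra_fds: int, iterations: int = 50) -> float:
    """Inflate the parent's resident memory and fd table, then measure fork latency."""
    # Hypothetical knob: a large buffer kept alive for the whole measurement.
    ballast = bytearray(mib * 1024 * 1024)
    # Write one byte per page so every page is resident in the parent.
    for offset in range(0, len(ballast), mmap.PAGESIZE):
        ballast[offset] = 1
    fds: list[int] = []
    try:
        # Hypothetical knob: extra open descriptors to grow the fd table.
        for _ in range(extra_fds):
            fds.append(os.open(os.devnull, os.O_RDONLY))
        return measure_fork_latency(iterations)
    finally:
        for fd in fds:
            os.close(fd)


if __name__ == "__main__":
    for mib in (0, 64, 256):
        for extra_fds in (0, 500):
            ms = measure_with_ballast(mib, extra_fds) * 1000
            print(f"{mib:4d} MiB, {extra_fds:4d} extra fds: {ms:.3f} ms per fork")
```

Running the same script natively and under WSL should show whether the WSL overhead is roughly constant or grows with memory size or descriptor count.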
The purpose of this ticket is to explore ideas to reduce this bottleneck in WSL.
- Try to figure out what exactly causes the slowdown from 2-5ms to 50ms. It may be the size of the file descriptor table, the size of the memory mapping, lock contention, or something else. As the WSL implementation is proprietary, it may not be easy to track this down.
- Is it faster to create child processes with the `spawn` or `forkserver` start method instead of `fork`? While Linux (and thus also WSL) doesn't have a syscall for `posix_spawn`, glibc uses `CLONE_VFORK` in its `posix_spawn` implementation, which may be faster than `fork` on WSL. The pickling overhead of the `spawn` method might outweigh the current fork overhead, though.
- Could we fork via a thread pool so that the main thread isn't blocked? This might require a patch to the Python interpreter (or perhaps a C module) to release the GIL during the fork call.