Address the race with command execution and context cancellation reaping it
Our commands can sometimes fail due to context cancellation fail but not surface any error. This can lead to incorrect results as the calling code will see the command as successful when in fact it didn't succeed. This is caused by a race with the reaper goroutine in the command
package that terminates the command on context cancellation. When context is canceled the command
package sends a SIGTERM
, drains stdout
and calls Wait
.
If another goroutine reads from the command between the draining and the Wait
call, it will see empty output as it was already drained. If the actual call was successful, then it will not receive any error either which is an incorrect result.
It's hard to estimate what sort of impact this is having on operations as each operation is different. Some call sites could proceed with incorrect results if they don't do any further parsing on it.
Terminating a command on context cancellation is fine but it's not okay to drain its output and then call wait as that is a racy read. The output is drained as otherwise the command could fail to exit if it blocks on writing to its stdout
since SIGTERM
doesn't just reap the process immediately like SIGKILL
would.
SIGTERM
is currently used as there isn't proper handling in place for stale data left over from killing the command, such as reference locks left by git. Such stale locks can currently block further writes until housekeeping gets around to clean them which can take a long time. SIGTERM
thus allows for the command to clean up.
Once &8911 is in place, we could move straight to killing the commands as it has logic in place to recover from crashes.