master.py/master_state: os.kill throws PermissionError when PID matches an existing process
In file master.py, function master_state:
The test for a stale lock uses
try:
os.kill(pid,0)
except ProcessLookupError:
which works most of the time. I just discovered that it generates a PermissionError when the PID does exist but is owned by a different user (not a total surprise obviously). The problem comes in that it is possible for the lockfile to refer not just to a non-existent PID, but could also refer to a process that does exist, but is taking the process slot that used to be mailman started up while mailman was down.
Here is the scenario in which I encountered this: Linux (Debian 10.9)
There was an unexpected reboot, and it appears mailman did not have a clean shutdown. As the machine came up, Apache HTTPD started up and pre-launched 5 threads. (Linux 4.19 implements threads as light weight processes, so the process ID space is more likely to have an entry than in OSes where threads are thread_id within a PID). By chance Apache HTTPD came up with 2227 as one of its PIDs when the last boot mailman had gotten there first.
The end result was that mailman refused to start and the errors did not really give a lot of information.
Option 1: Stale Lock check could use the more general OSError exception to capture both the scenario of PID does not exist or is owned by someone else. eg
try:
os.kill(pid, 0)
return WatcherState.conflict, lock
except OSError:
# No matching process id.
return WatcherState.stale_lock, lock
Option 2: Stale Lock check could give more details in the failure (eg lockfile name) so that sysadmin can more easily figure out why the unit will not come up.
try:
os.kill(pid, 0)
return WatcherState.conflict, lock
except ProcessLookupError:
# No matching process id.
return WatcherState.stale_lock, lock
except:
print ( "Unknown failure checking lockfile ", lock_file, ":", sys.exc_info()[0])
raise
I am not sure which is better - give more details to allow manual intervention, or risk that mailman has been started as a different user in a previous iteration, or that the thrown exception is not either ProcessLookupError or PermissionError (in which case "catch (ProcessLookupError, PermissionError)".