The master watcher doesn't reliably restart or not restart child processes.
It seemed that the master would restart a child killed by a SIGUSR1, but not one that exited. !858 (merged) was an attempt to fix that, but it seems that there are issues with that and I'm going to revert it. Prior to !858 (merged), the master would restart a child that stopped due to receiving a SIGUSR1 but not another signal. This was correct. However, it should also restart a child that exited, and it wasn't doing that unless coincidently, the exit status had the same numeric value as SIGUSR1, i.e. 10. !858 (merged) was an attempt to fix that, but it seems it creates other problems. This is a log from stopping the systemd mailman service.
May 12 17:23:16 2021 (8603) Master watcher caught SIGTERM. Exiting.
May 12 17:23:16 2021 (8618) lmtp runner caught SIGTERM. Stopping.
May 12 17:23:16 2021 (8619) nntp runner caught SIGTERM. Stopping.
May 12 17:23:16 2021 (8616) command runner caught SIGTERM. Stopping.
May 12 17:23:16 2021 (8617) in runner caught SIGTERM. Stopping.
May 12 17:23:16 2021 (8615) bounces runner caught SIGTERM. Stopping.
May 12 17:23:16 2021 (8624) task runner caught SIGTERM. Stopping.
May 12 17:23:16 2021 (8618) lmtp runner exiting.
May 12 17:23:16 2021 (8616) command runner exiting.
May 12 17:23:16 2021 (8619) nntp runner exiting.
May 12 17:23:16 2021 (8615) bounces runner exiting.
May 12 17:23:16 2021 (8617) in runner exiting.
May 12 17:23:16 2021 (8624) task runner exiting.
May 12 17:23:16 2021 (8621) pipeline runner caught SIGTERM. Stopping.
May 12 17:23:16 2021 (8621) pipeline runner exiting.
May 12 17:23:16 2021 (8626) digest runner caught SIGTERM. Stopping.
May 12 17:23:16 2021 (8626) digest runner exiting.
May 12 17:23:16 2021 (8623) retry runner caught SIGTERM. Stopping.
May 12 17:23:16 2021 (8625) virgin runner caught SIGTERM. Stopping.
May 12 17:23:16 2021 (8623) retry runner exiting.
May 12 17:23:16 2021 (8625) virgin runner exiting.
May 12 17:23:16 2021 (8614) archive runner caught SIGTERM. Stopping.
May 12 17:23:16 2021 (8620) out runner caught SIGTERM. Stopping.
May 12 17:23:16 2021 (8614) archive runner exiting.
May 12 17:23:16 2021 (8620) out runner exiting.
[2021-05-12 17:23:16 -0700] [8622] [INFO] Handling signal: term
[2021-05-12 17:23:16 -0700] [8645] [INFO] Worker exiting (pid: 8645)
[2021-05-12 17:23:16 -0700] [8647] [INFO] Worker exiting (pid: 8647)
May 12 17:23:16 2021 (8603) Master watcher caught SIGTERM. Exiting.
May 12 17:23:16 2021 (8603) Master detected subprocess exit
(pid: 8619, why: SIGNAL 15, class: nntp, slice: 1/1)
May 12 17:23:16 2021 (8603) Master detected subprocess exit
(pid: 8615, why: SIGNAL 15, class: bounces, slice: 1/1)
May 12 17:23:16 2021 (8603) Master detected subprocess exit
(pid: 8614, why: SIGNAL 15, class: archive, slice: 1/1)
May 12 17:23:16 2021 (8603) Master detected subprocess exit
(pid: 8624, why: SIGNAL 15, class: task, slice: 1/1)
May 12 17:23:16 2021 (8603) Master detected subprocess exit
(pid: 8618, why: SIGNAL 15, class: lmtp, slice: 1/1)
May 12 17:23:16 2021 (8603) Master detected subprocess exit
(pid: 8616, why: SIGNAL 15, class: command, slice: 1/1)
May 12 17:23:16 2021 (8603) Master detected subprocess exit
(pid: 8617, why: SIGNAL 15, class: in, slice: 1/1)
May 12 17:23:16 2021 (8603) Master detected subprocess exit
(pid: 8626, why: SIGNAL 15, class: digest, slice: 1/1)
May 12 17:23:16 2021 (8603) Master detected subprocess exit
(pid: 8620, why: SIGNAL 15, class: out, slice: 1/1)
May 12 17:23:16 2021 (8603) Master detected subprocess exit
(pid: 8623, why: SIGNAL 15, class: retry, slice: 1/1)
May 12 17:23:16 2021 (8603) Master detected subprocess exit
(pid: 8621, why: SIGNAL 15, class: pipeline, slice: 1/1)
May 12 17:23:16 2021 (8603) Master detected subprocess exit
(pid: 8625, why: SIGNAL 15, class: virgin, slice: 1/1)
[2021-05-12 17:23:16 -0700] [8622] [INFO] Shutting down: Master
May 12 17:23:16 2021 (8603) Master detected subprocess exit
(pid: 8622, why: EXIT 0, class: rest, slice: 1/1) [restarting]
May 12 17:23:16 2021 (9437) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=rest:0:1']
May 12 17:23:17 2021 (9437) rest runner started.
[2021-05-12 17:23:17 -0700] [9437] [INFO] Starting gunicorn 20.0.4
[2021-05-12 17:23:17 -0700] [9437] [INFO] Listening at: http://127.0.0.1:8001 (9437)
[2021-05-12 17:23:17 -0700] [9437] [INFO] Using worker: sync
[2021-05-12 17:23:17 -0700] [9445] [INFO] Booting worker with pid: 9445
[2021-05-12 17:23:17 -0700] [9446] [INFO] Booting worker with pid: 9446
Note that the master detected most children reported catching SIGTERM and the master sees them exiting with SIGTERM (signal 15), but the rest runner (pid 8622) just exits so the master restarts it. Then ultimately, the systemd service kills the master and the new rest runner and workers, but leaves the master lock set.
May 12 17:24:46 msapiro systemd[1]: mailman3.service: State 'stop-sigterm' timed out. Killing.
May 12 17:24:46 msapiro systemd[1]: mailman3.service: Killing process 8603 (python3) with signal SIGKILL.
May 12 17:24:46 msapiro systemd[1]: mailman3.service: Killing process 9437 (python3) with signal SIGKILL.
May 12 17:24:46 msapiro systemd[1]: mailman3.service: Killing process 9445 (python3) with signal SIGKILL.
May 12 17:24:46 msapiro systemd[1]: mailman3.service: Killing process 9446 (python3) with signal SIGKILL.
May 12 17:24:46 msapiro systemd[1]: mailman3.service: Main process exited, code=killed, status=9/KILL
May 12 17:24:46 msapiro systemd[1]: mailman3.service: Failed with result 'timeout'.
Removing the lock and starting the systemd mailman service produces
May 12 17:34:13 2021 (12778) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/master', '-C', '/opt/mailman/mm/mailman.cfg']
May 12 17:34:14 2021 (12778) [12790] archive:0:1
May 12 17:34:14 2021 (12790) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=archive:0:1']
May 12 17:34:14 2021 (12778) [12791] bounces:0:1
May 12 17:34:14 2021 (12791) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=bounces:0:1']
May 12 17:34:14 2021 (12778) [12792] command:0:1
May 12 17:34:14 2021 (12792) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=command:0:1']
May 12 17:34:14 2021 (12778) [12793] in:0:1
May 12 17:34:14 2021 (12793) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=in:0:1']
May 12 17:34:14 2021 (12778) [12794] lmtp:0:1
May 12 17:34:14 2021 (12794) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=lmtp:0:1']
May 12 17:34:14 2021 (12778) [12795] nntp:0:1
May 12 17:34:14 2021 (12795) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=nntp:0:1']
May 12 17:34:14 2021 (12778) [12796] out:0:1
May 12 17:34:14 2021 (12796) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=out:0:1']
May 12 17:34:14 2021 (12778) [12797] pipeline:0:1
May 12 17:34:14 2021 (12778) [12798] rest:0:1
May 12 17:34:14 2021 (12778) [12799] retry:0:1
May 12 17:34:14 2021 (12778) [12800] task:0:1
May 12 17:34:14 2021 (12797) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=pipeline:0:1']
May 12 17:34:14 2021 (12798) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=rest:0:1']
May 12 17:34:14 2021 (12778) [12801] virgin:0:1
May 12 17:34:14 2021 (12778) [12802] digest:0:1
May 12 17:34:14 2021 (12778) Master started
May 12 17:34:14 2021 (12799) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=retry:0:1']
May 12 17:34:14 2021 (12800) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=task:0:1']
May 12 17:34:14 2021 (12801) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=virgin:0:1']
May 12 17:34:14 2021 (12802) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=digest:0:1']
May 12 17:34:16 2021 (12793) in runner started.
May 12 17:34:16 2021 (12790) archive runner started.
May 12 17:34:16 2021 (12802) digest runner started.
May 12 17:34:16 2021 (12797) pipeline runner started.
May 12 17:34:16 2021 (12795) nntp runner started.
May 12 17:34:16 2021 (12794) LMTP server listening on 127.0.0.1:8024
May 12 17:34:16 2021 (12794) lmtp runner started.
May 12 17:34:17 2021 (12798) rest runner started.
[2021-05-12 17:34:17 -0700] [12798] [INFO] Starting gunicorn 20.0.4
[2021-05-12 17:34:17 -0700] [12798] [INFO] Listening at: http://127.0.0.1:8001 (12798)
[2021-05-12 17:34:17 -0700] [12798] [INFO] Using worker: sync
[2021-05-12 17:34:17 -0700] [12827] [INFO] Booting worker with pid: 12827
[2021-05-12 17:34:17 -0700] [12830] [INFO] Booting worker with pid: 12830
May 12 17:34:17 2021 (12800) task runner started.
May 12 17:34:17 2021 (12800) Task runner evicted 0 expired pendings
May 12 17:34:17 2021 (12800) Task runner deleted 0 orphaned workflows
May 12 17:34:17 2021 (12800) Task runner deleted 0 orphaned requests
May 12 17:34:17 2021 (12800) Task runner deleted 0 orphaned messages
May 12 17:34:17 2021 (12800) Task runner evicted expired cache entries
May 12 17:34:17 2021 (12799) retry runner started.
May 12 17:34:18 2021 (12792) command runner started.
May 12 17:34:18 2021 (12791) bounces runner started.
May 12 17:34:18 2021 (12796) out runner started.
May 12 17:34:19 2021 (12801) virgin runner started.
which is normal except for the stale lock needing to be removed. But now stopping Mailman with
mailman stop
produces
May 12 17:48:38 2021 (12790) archive runner caught SIGTERM. Stopping.
May 12 17:48:38 2021 (12794) lmtp runner caught SIGTERM. Stopping.
May 12 17:48:38 2021 (12778) Master watcher caught SIGTERM. Exiting.
May 12 17:48:38 2021 (12791) bounces runner caught SIGTERM. Stopping.
May 12 17:48:38 2021 (12795) nntp runner caught SIGTERM. Stopping.
May 12 17:48:38 2021 (12792) command runner caught SIGTERM. Stopping.
May 12 17:48:38 2021 (12799) retry runner caught SIGTERM. Stopping.
May 12 17:48:38 2021 (12794) lmtp runner exiting.
May 12 17:48:38 2021 (12790) archive runner exiting.
May 12 17:48:38 2021 (12791) bounces runner exiting.
May 12 17:48:38 2021 (12795) nntp runner exiting.
May 12 17:48:38 2021 (12792) command runner exiting.
May 12 17:48:38 2021 (12799) retry runner exiting.
May 12 17:48:38 2021 (12796) out runner caught SIGTERM. Stopping.
May 12 17:48:38 2021 (12796) out runner exiting.
May 12 17:48:38 2021 (12800) task runner caught SIGTERM. Stopping.
May 12 17:48:38 2021 (12800) task runner exiting.
[2021-05-12 17:48:38 -0700] [12798] [INFO] Handling signal: term
[2021-05-12 17:48:38 -0700] [12830] [INFO] Worker exiting (pid: 12830)
May 12 17:48:38 2021 (12802) digest runner caught SIGTERM. Stopping.
[2021-05-12 17:48:38 -0700] [12827] [INFO] Worker exiting (pid: 12827)
May 12 17:48:38 2021 (12802) digest runner exiting.
May 12 17:48:38 2021 (12801) virgin runner caught SIGTERM. Stopping.
May 12 17:48:38 2021 (12801) virgin runner exiting.
May 12 17:48:38 2021 (12797) pipeline runner caught SIGTERM. Stopping.
May 12 17:48:38 2021 (12797) pipeline runner exiting.
May 12 17:48:38 2021 (12793) in runner caught SIGTERM. Stopping.
May 12 17:48:38 2021 (12793) in runner exiting.
May 12 17:48:38 2021 (12778) Master detected subprocess exit
(pid: 12791, why: EXIT 15, class: bounces, slice: 1/1) [restarting]
May 12 17:48:38 2021 (17473) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=bounces:0:1']
May 12 17:48:38 2021 (12778) Master detected subprocess exit
(pid: 12794, why: EXIT 15, class: lmtp, slice: 1/1) [restarting]
May 12 17:48:38 2021 (17474) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=lmtp:0:1']
May 12 17:48:38 2021 (12778) Master detected subprocess exit
(pid: 12796, why: EXIT 15, class: out, slice: 1/1) [restarting]
May 12 17:48:38 2021 (17475) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=out:0:1']
May 12 17:48:39 2021 (12778) Master detected subprocess exit
(pid: 12790, why: EXIT 15, class: archive, slice: 1/1) [restarting]
May 12 17:48:39 2021 (17476) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=archive:0:1']
May 12 17:48:39 2021 (12778) Master detected subprocess exit
(pid: 12792, why: EXIT 15, class: command, slice: 1/1) [restarting]
May 12 17:48:39 2021 (12778) Master detected subprocess exit
(pid: 12802, why: EXIT 15, class: digest, slice: 1/1) [restarting]
May 12 17:48:39 2021 (12778) Master detected subprocess exit
(pid: 12801, why: EXIT 15, class: virgin, slice: 1/1) [restarting]
May 12 17:48:39 2021 (17478) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=digest:0:1']
May 12 17:48:39 2021 (12778) Master detected subprocess exit
(pid: 12795, why: EXIT 15, class: nntp, slice: 1/1) [restarting]
May 12 17:48:39 2021 (17477) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=command:0:1']
May 12 17:48:39 2021 (17479) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=virgin:0:1']
May 12 17:48:39 2021 (12778) Master detected subprocess exit
(pid: 12799, why: EXIT 15, class: retry, slice: 1/1) [restarting]
May 12 17:48:39 2021 (17480) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=nntp:0:1']
May 12 17:48:39 2021 (12778) Master detected subprocess exit
(pid: 12793, why: EXIT 15, class: in, slice: 1/1) [restarting]
May 12 17:48:39 2021 (17481) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=retry:0:1']
May 12 17:48:39 2021 (12778) Master detected subprocess exit
(pid: 12800, why: EXIT 15, class: task, slice: 1/1) [restarting]
May 12 17:48:39 2021 (17482) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=in:0:1']
May 12 17:48:39 2021 (12778) Master detected subprocess exit
(pid: 12797, why: EXIT 15, class: pipeline, slice: 1/1) [restarting]
May 12 17:48:39 2021 (17485) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=pipeline:0:1']
May 12 17:48:39 2021 (17483) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=task:0:1']
[2021-05-12 17:48:39 -0700] [12798] [INFO] Shutting down: Master
May 12 17:48:39 2021 (12778) Master detected subprocess exit
(pid: 12798, why: EXIT 0, class: rest, slice: 1/1) [restarting]
May 12 17:48:39 2021 (17488) starting: ['/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/python3', '/opt/mailman/mm/venv/bin/runner', '-C', '/opt/mailman/mm/mailman.cfg', '--runner=rest:0:1']
May 12 17:48:41 2021 (17473) bounces runner started.
May 12 17:48:41 2021 (17477) command runner started.
May 12 17:48:41 2021 (17476) archive runner started.
May 12 17:48:41 2021 (17481) retry runner started.
May 12 17:48:41 2021 (17479) virgin runner started.
May 12 17:48:41 2021 (17483) task runner started.
May 12 17:48:41 2021 (17483) Task runner evicted 0 expired pendings
May 12 17:48:41 2021 (17483) Task runner deleted 0 orphaned workflows
May 12 17:48:41 2021 (17483) Task runner deleted 0 orphaned requests
May 12 17:48:41 2021 (17483) Task runner deleted 0 orphaned messages
May 12 17:48:41 2021 (17483) Task runner evicted expired cache entries
May 12 17:48:41 2021 (17488) rest runner started.
[2021-05-12 17:48:41 -0700] [17488] [INFO] Starting gunicorn 20.0.4
[2021-05-12 17:48:41 -0700] [17488] [INFO] Listening at: http://127.0.0.1:8001 (17488)
[2021-05-12 17:48:41 -0700] [17488] [INFO] Using worker: sync
[2021-05-12 17:48:41 -0700] [17509] [INFO] Booting worker with pid: 17509
May 12 17:48:41 2021 (17480) nntp runner started.
[2021-05-12 17:48:41 -0700] [17511] [INFO] Booting worker with pid: 17511
May 12 17:48:41 2021 (17475) out runner started.
May 12 17:48:41 2021 (17485) pipeline runner started.
May 12 17:48:42 2021 (17474) LMTP server listening on 127.0.0.1:8024
May 12 17:48:42 2021 (17474) lmtp runner started.
May 12 17:48:43 2021 (17482) in runner started.
May 12 17:48:43 2021 (17478) digest runner started.
But for some reason, the master sees the SIGTERM as an exit status and so restarts everything. So somehow, os.wait() is reporting the signal as an exit status in this case. This needs further investigation, but in ther mean time, I'm reverting !858 (merged).