Skip to content
Snippets Groups Projects

Fix restarting of the runner processes

Merged Justus Winter requested to merge teythoon/mailman:justus/fix-waiting-for-children into master
All threads resolved!

This series fixes a couple of issuer related to watching over and restarting the runner processes.

Fixes #898 (closed)

Edited by Justus Winter

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Abhilash Raj changed the description

    changed the description

  • Also, you can ignore the diffcov for now, we haven't been able to figure out a good fix for the latest git error.

  • Justus Winter added 1 commit

    added 1 commit

    Compare with previous version

  • This is really old code, I think something like it even existed back in the MM2 days, but I don't remember. SIGUSR1 is explicitly used by the restart command and the logic you pasted is the implementation of restart.

  • Justus Winter added 5 commits

    added 5 commits

    • 952a9d11 - Fix waiting for children.
    • 44a925ca - Fix log messages related to exiting and restarting.
    • b6fead01 - Do not count explicit restarts against the max_restart limit.
    • 35b325f3 - Always restart runner processes.
    • 9f077581 - Gracefully handle a race condition in the signal handlers.

    Compare with previous version

  • Author Contributor

    I reworked the series. It now restarts the runner in most conditions. mailman {stop,restart,reload} work as expected. I did some cleanups along the way. Please review!

  • Justus Winter changed title from Fix waiting for children. to Fix restarting of the runner processes

    changed title from Fix waiting for children. to Fix restarting of the runner processes

  • Justus Winter changed the description

    changed the description

  • Mark Sapiro resolved all threads

    resolved all threads

  • There are coverage issues. You need to run tox -e py37-diffcov locally (or any Python version >=3.6) and add appropriate tests if possible or mark the uncovered lines with # pragma: nocover so diffcov passes. These are the issues:

    Diff Coverage
    Diff: origin/master...HEAD, staged and unstaged changes
    -------------
    src/mailman/bin/master.py (19.2%): Missing lines 280-283,289-293,300,302-305,312,314-318,415
    -------------
    Total:   26 lines
    Missing: 21 lines
    Coverage: 19%
    -------------
  • It does seem that this fixes the issues that caused me to revert !858 (merged) so it looks good for that.

  • Justus Winter added 6 commits

    added 6 commits

    • 9cf5ba05 - Fix waiting for children.
    • daa42119 - Add tests for the master runner's signal handlers.
    • fa57b10c - Fix log messages related to exiting and restarting.
    • 5024bebb - Do not count explicit restarts against the max_restart limit.
    • f337b45e - Gracefully handle a race condition in the signal handlers.
    • 0abd19a6 - Always restart runner processes.

    Compare with previous version

  • I stopped the pipeline because tests were hung. I'm not certain what's hapenning. When I run tox -e py37-diffcov locally, I see multiple failures and errors, but I can't see what they are because tox hangs and I have to interrupt it, however, if I run tox -e py37-diffcov -- mailman.bin.tests.test_master I get no errors and 1 failure

    ======================================================================
    FAIL: test_sighup_handler (mailman.bin.tests.test_master.TestMaster)
    Invokes the SIGHUP handler.
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/home/mark/zzz/mailman/src/mailman/bin/tests/test_master.py", line 178, in test_sighup_handler
        self.assertEqual(len(list(m._kids)), 1 + 1)
    AssertionError: 1 != 2
    
    ----------------------------------------------------------------------
    Ran 9 tests in 14.170s
    
    FAILED (failures=1)

    and if I remove the added tests and run tox -e py37-diffcov, it passes with 3 expected test failures and

    Total:   37 lines
    Missing: 0 lines
    Coverage: 100%

    It is a total mystery to me what's going on

  • It occurred to me that the issue with the errors and hanging tests is because of some side effect from the new tests that isn't being cleared at the termination of the test. I suspect it has something to do with the forked processes from unused_pid().

  • Author Contributor

    Odd indeed. I'll back out that commit to see if it is indeed related to the unused pid thing.

  • Justus Winter added 1 commit

    added 1 commit

    • 2479afb7 - Revert "Gracefully handle a race condition in the signal handlers."

    Compare with previous version

  • Justus Winter added 2 commits

    added 2 commits

    • 0be43741 - Fix the logging paths in the generated test configuration.
    • 02199db6 - fixup! Add tests for the master runner's signal handlers.

    Compare with previous version

  • Justus Winter marked this merge request as draft from teythoon/mailman@02199db6

    marked this merge request as draft from teythoon/mailman@02199db6

  • Justus Winter added 6 commits

    added 6 commits

    • 4db974ce - Fix the logging paths in the generated test configuration.
    • eed68869 - Add tests for the master runner's signal handlers.
    • 00d86f1e - Fix log messages related to exiting and restarting.
    • e69e4c29 - Do not count explicit restarts against the max_restart limit.
    • d81d198a - Gracefully handle a race condition in the signal handlers.
    • ac25fffc - Always restart runner processes.

    Compare with previous version

  • Justus Winter added 1 commit

    added 1 commit

    • 955e65af - Always restart runner processes.

    Compare with previous version

  • Justus Winter added 1 commit

    added 1 commit

    • 9402f42c - ci: Add more safe directories.

    Compare with previous version

  • Justus Winter added 1 commit

    added 1 commit

    • 3dba5ff5 - ci: Add more safe directories.

    Compare with previous version

  • Justus Winter marked this merge request as ready

    marked this merge request as ready

  • Author Contributor

    I suspect it has something to do with the forked processes from unused_pid().

    Good call. I opted to nocover the trivial exception handlers instead. Injecting a dead pid was a dodgy way to simulate that race condition anyway.

    I replaced the one racy sleep with inspecting of the log files instead. Works much nicer, and I also ensure that the runners signal handler is run by peeking at the logs.

    One odd thing is that even though the runners orderly exit using sys.exit when they catch say SIGTERM, the status signaled to the master process sometimes makes WIFSIGNALED true, and sometimes makes WIFEXITED true. I have no explanation for that. To cope with that, I changed the code to no longer consider sig_or_exit.

    All in all I'm very happy with the patch series now. Please have another look.

  • @teythoon Thank you for all your work on this. It's much appreciated.

  • Mark Sapiro started a merge train

    started a merge train

  • Mark Sapiro mentioned in commit 2e355573

    mentioned in commit 2e355573

  • merged

  • Author Contributor

    Thanks for working with me on this patch series and thank you for merging it.

    However, I am very sad that the patch series was squashed into one commit.

    First, I tend to spend time and effort to structure my changes in a way that transforms the source from one consistent state to another in atomic and semantically related changes (as much as possible, of course), in an effort to present the changes to whoever will read them, during the patch review or much later. There is reasoning in my commit messages, around 1500 characters or so. In the squashed commit there are barely 150 characters.

    Second, this misrepresents the work I did. Not only does it misrepresent the effort and care I put into developing it, it also breaks the chain of custody: who can tell whether changes were added or omitted during the squashing? Relatedly, it also stripped all the cryptographic signatures from the commits.

  • However, I am very sad that the patch series was squashed into one commit.

    One of my biggest complaints about git, and something I really miss from bzr, is the notion of a "main line of development". This strongly pushes the git user mentality that squash merges are the best way to maintain main branch hygiene. I had nothing to do with this specific merge, but I probably would have done it the same way. You see this workflow everywhere in git. I long ago gave up trying to fight it. Sigh.

  • To add to what Barry said, yes, we do squash to commits and that is currently the default way of merging the commits.

    Second, this misrepresents the work I did. Not only does it misrepresent the effort and care I put into developing it, it also breaks the chain of custody: who can tell whether changes were added or omitted during the squashing?

    The squashed commit has the reference to the pull request you submitted and the repo will retain all the original commits you created. If you want, you can indeed verify the contents of the squashed commit and the pull request.

    Documents/mm3/core ‹master*› » git show-ref | grep refs/pullreqs/1094
    3dba5ff55715388c3a0b4f04feafc1b9c2ac5cd3 refs/pullreqs/1094
    Documents/mm3/core ‹master*› » git co refs/pullreqs/1094

    Relatedly, it also stripped all the cryptographic signatures from the commits.

    Yes, it did strip it in the squashed commit, because there is no way to retain it. Github actually signs the squashed commit with it's own key, although I question the value in that not knowing how the private key is handled within Github.

    But you can verify the signature on the original commits in Mailman's repo (all original commits are retained):

    Documents/mm3/core ‹3dba5ff55*› » git verify-commit 3dba5ff55715388c3a0b4f04feafc1b9c2ac5cd3                141 ↵
    gpg: Signature made Thu Jan 12 22:04:47 2023 IST
    gpg:                using EDDSA key D1FE45FB978F6B65C4C0B9AA686F55B4AB2B3386
    gpg: key 31855247603831FD: no user ID
    gpg: Total number processed: 1
    gpg: Can't check signature: No public key

    There is reasoning in my commit messages, around 1500 characters or so. In the squashed commit there are barely 150 characters.

    For whatever reason, I used to think that the original commit messages are appended to the squashed commit, but looks like that's not the case. I have updated the Gitlab settings to include commit messages from all the original commits so they aren't hard to find (using what I mentioned above).

    Edited by Abhilash Raj
  • Author Contributor

    I understand that it is possible to verify that no changes were made to my patch series by comparing the squashed version with the original, then to verify the signatures on the original commits.

    I understand that squashing the commits necessarily invalidates any signatures. I agree that what Github does is of little value, even more so since their OpenPGP implementation produces malformed signatures.

    I further agree that including the original commit messages in the squashed commit would be better than what happened here, but the association between message and change would no longer be as obvious.

    My feedback is that all of these issues could be avoided by not squashing the commits, and doing either fast-forward merges or regular merges.

    (Context: I work on an OpenPGP implementation and we have a project going on related to git-based supply chain security, hence my interest.)

Please register or sign in to reply
Loading