Can't unshunt messages when arc is enabled and an initial Authentication-Results header has been added

I've now have a few instances where messages have ended up shunted and when I try to unshunt them, I find each attempt fails with "dkim.ParameterError: No existing chain found on message, cv should be none". This is both a different message to the original error, and materially modifies the message, meaning each further attempt fails with the same error, but the pck file grows. Net result the message doesn't go out to any users (a serious failure).

I recommend we make (at least) the following patch to make this area a bit more forgiving: /usr/local/lib/python3.6/dist-packages/mailman/handlers/arc_sign.py:

diff --git a/mail/mailman/fixes/mailman/arc_sign.py b/mail/mailman/fixes/mailman/arc_sign.py
index e6d76826..dbf01721 100644
--- a/mail/mailman/fixes/mailman/arc_sign.py
+++ b/mail/mailman/fixes/mailman/arc_sign.py
@@ -68,4 +68,8 @@ class ARCSign:
         """See `IHandler`."""

         if config.arc_enabled:
-            sign(msg, msgdata)
+            try:
+                sign(msg, msgdata)
+            except DKIMException:
+                # Ignore the failure, and send out the message
+                log.exception('Sending message without ARC headers: %s', msg['Message-ID'])

Clearly the above is a sticking plaster: so I'll now go into more detail about what's actually provoking the exception, in the hope you will be able to come up with a much more comprehensive fix. The original error was:

May 04 17:57:27 2021 (2812) Failed to unlink/preserve backup file: /opt/mailman/var/queue/in/1620151047.1194808+612bf479ca3ebf85e16b74e7c738702415326d66.bak
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/mailman/core/runner.py", line 158, in _one_iteration
    msg, msgdata = self.switchboard.dequeue(filebase)
  File "/usr/local/lib/python3.6/dist-packages/mailman/core/switchboard.py", line 154, in dequeue
    os.rename(filename, backfile)
FileNotFoundError: [Errno 2] No such file or directory: '/opt/mailman/var/queue/in/1620151047.1194808+612bf479ca3ebf85e16b74e7c738702415326d66.pck' -> '/opt/mailman/var/queue/in/1620151047.1194808+612bf479ca3ebf85e16b74e7c738702415326d66.bak'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/mailman/core/switchboard.py", line 181, in finish
    os.rename(bakfile, psvfile)
FileNotFoundError: [Errno 2] No such file or directory: '/opt/mailman/var/queue/in/1620151047.1194808+612bf479ca3ebf85e16b74e7c738702415326d66.bak' -> '/opt/mailman/var/queue/bad/1620151047.1194808+612bf479ca3ebf85e16b74e7c738702415326d66.psv'
May 04 17:57:28 2021 (2816) Uncaught runner exception: lockf: lock unavailable: /opt/mailman/var/lists/pastoral.staff.teams.carrubbers.org/digest.mmdf
May 04 17:57:28 2021 (2816) Traceback (most recent call last):
  File "/usr/lib/python3.6/mailbox.py", line 2064, in _lock_file
    fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
BlockingIOError: [Errno 11] Resource temporarily unavailable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/mailman/core/runner.py", line 173, in _one_iteration
    self._process_one_file(msg, msgdata)
  File "/usr/local/lib/python3.6/dist-packages/mailman/core/runner.py", line 266, in _process_one_file
    keepqueued = self._dispose(mlist, msg, msgdata)
  File "/usr/local/lib/python3.6/dist-packages/mailman/runners/pipeline.py", line 37, in _dispose
    process(mlist, msg, msgdata, pipeline)
  File "/usr/local/lib/python3.6/dist-packages/mailman/core/pipelines.py", line 50, in process
    handler.process(mlist, msg, msgdata)
  File "/usr/local/lib/python3.6/dist-packages/mailman/handlers/to_digest.py", line 50, in process
    with Mailbox(mailbox_path, create=True) as mbox:
  File "/usr/local/lib/python3.6/dist-packages/mailman/utilities/mailbox.py", line 36, in __enter__
    self.lock()
  File "/usr/lib/python3.6/mailbox.py", line 640, in lock
    _lock_file(self._file)
  File "/usr/lib/python3.6/mailbox.py", line 2068, in _lock_file
    f.name)
mailbox.ExternalClashError: lockf: lock unavailable: /opt/mailman/var/lists/pastoral.staff.teams.carrubbers.org/digest.mmdf
May 04 17:57:28 2021 (2816) SHUNTING: 1620151048.5613232+cd100642f78a01f7e91ff7836f99ba25e1286f18
May 04 17:57:29 2021 (2820) Failed to unlink/preserve backup file: /opt/mailman/var/queue/digest/1620151048.6308672+c8b1cc0b8e58e82bc417cba42b39b88692e3b360.bak
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/mailman/core/switchboard.py", line 183, in finish
    os.unlink(bakfile)
FileNotFoundError: [Errno 2] No such file or directory: '/opt/mailman/var/queue/digest/1620151048.6308672+c8b1cc0b8e58e82bc417cba42b39b88692e3b360.bak'

I am storing the spool on an AWS EFS share, and have more than one MailMan instance pointed to it. I had thought that was going to work, but it appears from the above that this is NOT the case (Would welcome confirmation that I'm correct here). My plan is therefore to continue to store the spool on EFS, but with a separate shard per EC2 host. I'll then have to deal with rescuing orphaned spools.

But my configuration has basically put pressure on the mailman code base, and has flushed out an error path which can hopefully be made more resilient.

The net result is a shunted file. The top of this file has this header (With ##### masking a portion of the actual email address):

Authentication-Results: mailman.planningsuite.com; dkim=fail; arc=none; dmarc=fail (Used From Domain Record) header.from=carrubbers.org policy.dmarc=none
...
...
From: David ##### <david.#####@carrubbers.org>
...
...

When I then unshunt, I expect it to work, or to get the same error, but instead get this:

May 04 21:23:25 2021 (2816) Uncaught runner exception: No existing chain found on message, cv should be none
May 04 21:23:25 2021 (2816) Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/mailman/core/runner.py", line 173, in _one_iteration
    self._process_one_file(msg, msgdata)
  File "/usr/local/lib/python3.6/dist-packages/mailman/core/runner.py", line 266, in _process_one_file
    keepqueued = self._dispose(mlist, msg, msgdata)
  File "/usr/local/lib/python3.6/dist-packages/mailman/runners/pipeline.py", line 37, in _dispose
    process(mlist, msg, msgdata, pipeline)
  File "/usr/local/lib/python3.6/dist-packages/mailman/core/pipelines.py", line 50, in process
    handler.process(mlist, msg, msgdata)
  File "/usr/local/lib/python3.6/dist-packages/mailman/handlers/arc_sign.py", line 71, in process
    sign(msg, msgdata)
  File "/usr/local/lib/python3.6/dist-packages/mailman/handlers/arc_sign.py", line 50, in sign
    standardize=('ARC-Standardize' in msgdata))
  File "/usr/local/lib/python3.6/dist-packages/authheaders/__init__.py", line 321, in sign_message
    return ARC(msg, logger=logger).sign(selector, domain, privkey, srv_id, include_headers=sig_headers, timestamp=timestamp, standardize=standardize)
  File "/usr/local/lib/python3.6/dist-packages/dkim/__init__.py", line 1075, in sign
    raise ParameterError("No existing chain found on message, cv should be none")
dkim.ParameterError: No existing chain found on message, cv should be none
May 04 21:23:25 2021 (2816) SHUNTING: 1620163405.8278499+b6d8718da42aabd048c364e946c2a855f4862800

When I then inspect the new pck file, I find it has grown. The Authentication-Results header has changed, and the From has flipped (again ### masking the actual name and list address):

Authentication-Results: mailman.planningsuite.com; dkim=fail; arc=none; dmarc=fail header.from=carrubbers.org policy.dmarc=none; arc=none; dmarc=fail (Used From Domain Record) header.from=carrubbers.org policy.dmarc=none
...
...
From: "David #### via ####.####@Teams.carrubbers.org" <####.####@teams.carrubbers.org>

With each subsequent unshunt, the Authentication-Results/From/Reply-To headers grows/get-worse. After a few runs it looks like:

Authentication-Results: mailman.planningsuite.com; dkim=fail; arc=none; dmarc=fail header.from=carrubbers.org policy.dmarc=none; arc=none; dmarc=fail header.from=carrubbers.org policy.dmarc=none; arc=none; dmarc=fail header.from=Teams.carrubbers.org policy.dmarc=none; arc=none; dmarc=fail header.from=Teams.carrubbers.org policy.dmarc=none; arc=none; dmarc=fail (Used Org Domain Record) header.from=Teams.carrubbers.org policy.dmarc=none
...
From: "David #### via ####.####--- via ####.####--- via ####.####--- via ####.####@Teams.carrubbers.org" <####.####@teams.carrubbers.org>

Nigel

Assignee Loading
Time tracking Loading