Problem handling incoming LMTP mails
Quack,
I've been migrating a MM2 installation with more traffic than what I experienced in the past and found out I was loosing mails.
Here is an example of mail, sent to MM3 after being scanned and accepted:
Sep 11 03:51:59 mail postfix/pickup[25737]: 13CE941F10: uid=995 from=<duck@redhat.com>
Sep 11 03:51:59 mail postfix/cleanup[25764]: 13CE941F10: message-id=<2d5ec929-5f0d-73a5-7342-84a4aa6dddbc@redhat.com>
Sep 11 03:51:59 mail postfix/qmgr[25738]: 13CE941F10: from=<duck@redhat.com>, size=4270, nrcpt=1 (queue active)
Sep 11 03:51:59 mail postfix/lmtp[25774]: 13CE941F10: to=<infra-private@ovirt.org>, relay=127.0.0.1[127.0.0.1]:8024, delay=4, delays=3.6/0.02/0.05/0.24, dsn=2.0.0, status=sent (250 Ok)
Sep 11 03:51:59 mail postfix/qmgr[25738]: 13CE941F10: removed
Unfortunately in MM3's log file (/var/log/mailman3/smtp.log):
Sep 11 03:51:59 2017 (25625) Peer: ('127.0.0.1', 54462)
Sep 11 03:51:59 2017 (25625) ('127.0.0.1', 54462) handling connection
Sep 11 03:51:59 2017 (25625) ('127.0.0.1', 54462) Data: b'LHLO mail.ovirt.org'
Sep 11 03:51:59 2017 (25625) ('127.0.0.1', 54462) Data: b'MAIL FROM:<duck@redhat.com>'
Sep 11 03:51:59 2017 (25625) ('127.0.0.1', 54462) sender: duck@redhat.com
Sep 11 03:51:59 2017 (25625) ('127.0.0.1', 54462) Data: b'RCPT TO:<infra-private@ovirt.org>'
Sep 11 03:51:59 2017 (25625) ('127.0.0.1', 54462) recip: infra-private@ovirt.org
Sep 11 03:51:59 2017 (25625) ('127.0.0.1', 54462) Data: b'DATA'
Sep 11 03:51:59 2017 (25625) ('127.0.0.1', 54462) Data: b'QUIT'
Sep 11 03:51:59 2017 (25625) ('127.0.0.1', 54462) connection lost
Sep 11 03:51:59 2017 (25625) Connection lost during _handle_client()
As Postfix thinks everything's alright, and MM3 stops processing the mail, then there is no further trace of these impacted mails and they are lost. Other mails are processed properly though and I could see mails sent to the ML members in Postfix's logs:
Sep 11 00:32:47 2017 (11753) <1930607729.358.1505099801948.JavaMail.jenkins@jenkins.phx.ovirt.org> smtp to infra@ovirt.org for 86 recips, completed
in 0.31727075576782227 seconds
Sep 11 00:32:47 2017 (11753) <1930607729.358.1505099801948.JavaMail.jenkins@jenkins.phx.ovirt.org> post to infra@ovirt.org from jenkins@jenkins.phx
.ovirt.org, 4506 bytes
I thought setting posfix/lmtp_skip_quit_response to 'no' would help, but no change.
I looked into the library used to handle LMTP (aiosmtpd) and found possibly related bugs:
- https://github.com/aio-libs/aiosmtpd/issues/127
- https://github.com/aio-libs/aiosmtpd/issues/110
- https://github.com/aio-libs/aiosmtpd/pull/59
The new (1.1) version of the library dropped Python 3.4 support, so I tried to backport the related patches… in vain.
There was an update of python34-aiosmtpd recently, but the problem is in fact older. Other instances do not have this problem at the moment, but this particular instance has much more traffic compared, so it appears quite often so I had to rollback the migration :-/.
So probably the problem may be in this library and not in MM3 directly but I'm not sure and it affects the software badly, so I'm opening this ticket to help sort this out.
Also all these instances were installed with the same Ansible rules. They also have the same versions of the MM3 components:
hyperkitty-selinux-1.1.2-0.1.el7.centos.noarch
hyperkitty-1.1.2-0.1.el7.centos.noarch
mailman3-hyperkitty-1.1.1-0.2.el7.centos.noarch
mailman3-selinux-3.1.1-0.3.el7.centos.noarch
python-django-mailman3-1.1.1-0.2.el7.centos.noarch
mailman3-3.1.1-0.3.el7.centos.noarch
mailman3-hyperkitty-1.1.1-0.2.el7.centos.noarch
python-mailman-client-3.1.1-0.3.el7.centos.noarch
python34-aiosmtpd-1.0-2.el7.noarch
libaio-0.3.109-13.el7.x86_64
Btw I originally got confused by https://github.com/mailman/mailman/ even if I'm not sure how I ended-up there. It would probably help clarifying things with the owner of this namespace to at least have some warning this is a non-official fork.
\_o<