UserNotification in languages with charsets other than UTF-8 can contain "?" instead of the correct character
While migrating my Mailman 2 setup to Mailman 3 and doing a couple of tests before going live, I found a bug related to encoding which is IMHO painful to see nowadays: ? instead of a correct character.
How to reproduce
Environment
Debian Buster (packages from distribution), mailman3 3.2.1-1, mailman3-web 0+20180916-8, python3-django-mailman3 1.2.0-3, python3-django-postorius 1.2.4-1
This Mailman version is not recent, but after looking into the code, the bug seems to be present in later versions as well.
What to do
- Send a message with a UTF-8 quoted-printable encoded subject to the mailing list, either from a list member who is under moderation or from a sender who is not a member of the list.
- The sender gets a notification message "Your message to $LIST awaits moderator approval". The message quotes the subject. The subject was properly decoded from its quoted-printable representation to normal UTF-8, but the notification that is sent is encoded in US-ASCII because the message was sent in English and Mailman defaults English to US-ASCII. Characters such as ä are not available in US-ASCII.
In my test setting, the preferred language of the mailing list is 'en'. The user sending the message (in this case equal to the moderator/admin of the list receiving the notification) has no preferred language set.
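The effect can be reproduced outside of Mailman with the standard library alone. The following is only a minimal sketch: it decodes the subject from the test message and then encodes it to US-ASCII with errors='replace', which is what I assume happens to the quoted subject inside the notification.

```python
from email.header import decode_header, make_header

# Subject header exactly as sent in the original test message.
raw_subject = "=?utf-8?q?Test_auf_Durchl=C3=A4ssigkeit?="

# Decode the RFC 2047 encoded-word into a normal Python string.
subject = str(make_header(decode_header(raw_subject)))

# Encoding to US-ASCII with errors='replace' silently swaps every
# character that is not available in the charset for "?".
ascii_bytes = subject.encode("us-ascii", errors="replace")

print(subject)
print(ascii_bytes.decode("us-ascii"))
```

The second line printed contains "Durchl?ssigkeit", the same damage as in the notification below.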
That's the notification message the sender receives (Received, Return-Path, Delivered-To, various DKIM headers removed):
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Subject: Your message to mailman-admins@openrailwaymap.org awaits moderator approval
From: mailman-admins-bounces@openrailwaymap.org
To: info@openrailwaymap.org
Message-ID: <157995890415.14789.4411382105354409402@buegelfalte.openrailwaymap.org>
Date: Sat, 25 Jan 2020 14:28:24 +0100
Precedence: bulk
X-Mailman-Version: 3.2.1
Your mail to 'mailman-admins@openrailwaymap.org' with the subject
Test auf Durchl?ssigkeit
Is being held until the list moderator can review it for approval.
The message is being held because:
The message comes from a moderated member
Either the message will get posted to the list, or you will receive
notification of the moderator's decision.
The original message to the list was:
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
From: info@openrailwaymap.org
To: mailman-admins@openrailwaymap.org
Reply-To:
Subject: =?utf-8?q?Test_auf_Durchl=C3=A4ssigkeit?=
Message-ID: <157996353560.32608.10424899781661842519.@127.0.0.1>
Date: Sat, 25 Jan 2020 14:45:35 -0000
Hallo,
=C3=BCblicherweise muss man Abonnent einer Mailingliste sein, um Mails an
diese senden zu d=C3=BCrfen. Ist das bei euch auch der Fall?
Viele Gr=C3=BC=C3=9Fe
Eurer Spammer
The bug and its solution
The bug is that Mailman sends messages in US-ASCII although some characters in that message cannot be encoded in US-ASCII. While the templates might not contain such characters, quoted parts of the original message (subject, the message itself etc.) can contain such characters.
There are two solutions:
- Use a better charset if the message cannot be encoded in US-ASCII (or ISO-8859-1 for languages where Mailman uses ISO-8859-1).
- Send everything by default in UTF-8 (my suggestion to keep the code simple).
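The first option could look roughly like the sketch below. pick_charset is a hypothetical helper, not an existing Mailman function; it keeps the language's charset when the text fits and only falls back to UTF-8 otherwise.

```python
def pick_charset(text, preferred="us-ascii", fallback="utf-8"):
    """Return `preferred` if `text` can be encoded in it, else `fallback`.

    Hypothetical helper for illustration only; not part of Mailman.
    """
    try:
        text.encode(preferred)
    except UnicodeEncodeError:
        # At least one character is missing from the preferred charset.
        return fallback
    return preferred


print(pick_charset("Test"))                           # plain ASCII fits
print(pick_charset("Test auf Durchlässigkeit"))       # needs the fallback
print(pick_charset("Durchlässigkeit", "iso-8859-1"))  # ä exists in Latin-1
```

The second option avoids this check entirely, which is why I consider it the simpler one.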
In commit e7e8c1d1, @msapiro changed text.encode(charset) to text.encode(charset, errors='replace'). While this keeps the code from throwing exceptions, it does not fix the issue for end users, and I myself see it as a quick fix, not a proper solution. In #437 (comment 46932629) @msapiro explains that users should be able to search through their mailboxes using grep, and that the Python email library sends UTF-8 messages in Base64 by default.
I myself faced the Base64 issue when I wrote scripts to send emails in Python. However, Python's email library is able to send messages in quoted-printable. Quoted-printable replaces only those characters that need to be replaced. The original message quoted above demonstrates that. A pure ASCII message stays pure ASCII.
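A small sketch with email.charset.Charset illustrates the difference. The sample strings are my own. The text is passed to body_encode() as bytes, because the quoted-printable branch treats a str as a sequence of byte values.

```python
import base64
from email.charset import Charset, QP

# UTF-8 charset with the body encoding switched to quoted-printable.
qp = Charset("utf-8")
qp.body_encoding = QP

# UTF-8 charset with the library default, which is Base64 for the body.
b64 = Charset("utf-8")

ascii_text = "Hello, this is plain ASCII."
german_text = "Viele Grüße"

# Quoted-printable leaves plain ASCII untouched ...
qp_ascii = qp.body_encode(ascii_text.encode("utf-8"))
# ... and escapes only the bytes of the non-ASCII characters.
qp_german = qp.body_encode(german_text.encode("utf-8"))
# Base64 makes even plain ASCII unreadable for grep.
b64_ascii = b64.body_encode(ascii_text.encode("utf-8"))

print(qp_ascii)
print(qp_german)
print(b64_ascii)
```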
The following patch demonstrates how to send UserNotifications in UTF-8 quoted-printable.
diff --git a/src/mailman/email/message.py b/src/mailman/email/message.py
index 3ab348b64..88ec95ad0 100644
--- a/src/mailman/email/message.py
+++ b/src/mailman/email/message.py
@@ -27,6 +27,7 @@ import email
 import email.message
 import email.utils
+from email.charset import Charset, SHORTEST, QP
 from email.header import Header
 from email.mime.multipart import MIMEMultipart
 from mailman.config import config
@@ -134,10 +135,14 @@ class UserNotification(Message):
     def __init__(self, recipients, sender, subject=None, text=None, lang=None):
         Message.__init__(self)
-        charset = (lang.charset if lang is not None else 'us-ascii')
+        cs = (lang.charset if lang is not None else 'utf-8')
+        charset = Charset(cs)
+        charset.header_encoding = SHORTEST
+        # Body encoding cannot be the shortest of quoted-printable or base64.
+        charset.body_encoding = QP
         subject = ('(no subject)' if subject is None else subject)
         if text is not None:
-            self.set_payload(text.encode(charset, errors='replace'), charset)
+            self.set_payload(text.encode(charset.input_charset, errors='replace'), charset)
         self['Subject'] = Header(
             subject, charset, header_name='Subject', errors='replace')
         self['From'] = sender
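The same approach can be tried outside of Mailman with a plain email.message.Message. This is only a sketch: build_notification is a hypothetical stand-in for UserNotification.__init__, and the sample subject and body are taken from the test message above.

```python
from email.charset import Charset, SHORTEST, QP
from email.header import Header
from email.message import Message


def build_notification(subject, text, charset_name="utf-8"):
    """Hypothetical stand-in for UserNotification.__init__ (illustration only)."""
    charset = Charset(charset_name)
    charset.header_encoding = SHORTEST
    charset.body_encoding = QP
    msg = Message()
    # Passing bytes plus a Charset lets the email package apply the
    # quoted-printable transfer encoding and set the MIME headers.
    msg.set_payload(
        text.encode(charset.input_charset, errors="replace"), charset)
    msg["Subject"] = Header(
        subject, charset, header_name="Subject", errors="replace")
    return msg


msg = build_notification("Test auf Durchlässigkeit", "Viele Grüße\n")
print(msg.as_string())
```

The output should carry Content-Transfer-Encoding: quoted-printable, and the body stays readable apart from the escaped umlauts.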
The only issue I see with this approach is that messages containing mainly multibyte characters (e.g. messages with lots of Kanji characters) are longer in quoted-printable encoding than in Base64. That could be addressed by
- either encoding the body with both quoted-printable and Base64 using email.charset.Charset.body_encode(str) and choosing the shorter one,
- or adding a new option to the Mailman configuration that disables quoted-printable for a given language (to be set for Japanese, Chinese etc.).
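The first of these two options could be sketched as follows. shortest_body_encoding is a hypothetical helper, not existing Mailman code; it encodes the body both ways with Charset.body_encode() and reports which transfer encoding is shorter, preferring quoted-printable on a tie because it stays readable.

```python
from email.charset import Charset, QP, BASE64


def shortest_body_encoding(text, charset_name="utf-8"):
    """Return 'quoted-printable' or 'base64', whichever encodes `text` shorter.

    Hypothetical helper for illustration only; not part of Mailman.
    """
    raw = text.encode(charset_name)
    candidates = []
    for encoding, cte in ((QP, "quoted-printable"), (BASE64, "base64")):
        charset = Charset(charset_name)
        charset.body_encoding = encoding
        candidates.append((len(charset.body_encode(raw)), cte))
    # min() returns the first minimal element, so quoted-printable
    # wins when both encodings have the same length.
    return min(candidates, key=lambda item: item[0])[1]


print(shortest_body_encoding(
    "Your message is being held for approval: Test auf Durchlässigkeit"))
print(shortest_body_encoding("日本語のメッセージです"))
```

Mostly-ASCII text ends up as quoted-printable, while text consisting almost entirely of multibyte characters ends up as Base64.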
See also #542 (closed) !445 (closed)