UserNotification in languages with charsets other than UTF-8 can contain "?" instead of the correct character

While migrating my Mailman 2 setup to Mailman 3 and doing a couple of tests before going live, I found a bug related to encoding which is IMHO painful to see nowadays: ? instead of a correct character.

How to reproduce

Environment

Debian Buster (packages from the distribution), mailman3 3.2.1-1, mailman3-web 0+20180916-8, python3-django-mailman3 1.2.0-3, python3-django-postorius 1.2.4-1

This Mailman version is not the latest, but judging from the code, the bug seems to be present in later versions as well.

What to do

Send a message with a UTF-8 quoted-printable encoded subject to the mailing list, either from a list member who is under moderation or from a sender who is not a member of the list.
The sender gets a notification message "Your message to $LIST awaits moderator approval". The notification quotes the subject. The subject was properly decoded from its quoted-printable representation to normal UTF-8, but the notification that is sent is encoded in US-ASCII because it was sent in English and Mailman defaults English to US-ASCII. Characters such as ä are not available in US-ASCII.
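The effect can be reproduced outside Mailman with the standard library alone. The following is a minimal sketch, not Mailman's actual code path: it decodes the subject header from the test message and then encodes it to US-ASCII with errors='replace'. The variable names are only for illustration.

```python
from email.header import decode_header, make_header

# Subject header as it appears in the original message.
raw_subject = "=?utf-8?q?Test_auf_Durchl=C3=A4ssigkeit?="

# Decoding the RFC 2047 encoded word yields the proper Unicode string.
subject = str(make_header(decode_header(raw_subject)))

# Encoding to US-ASCII with errors='replace' substitutes "?" for every
# character that US-ASCII cannot represent.
lossy = subject.encode("us-ascii", errors="replace").decode("us-ascii")

print(subject)  # Test auf Durchlässigkeit
print(lossy)    # Test auf Durchl?ssigkeit
```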

In my test setting, the preferred language of the mailing list is 'en'. The user sending the message (in this case equal to the moderator/admin of the list receiving the notification) has no preferred language set.

That's the notification message the sender receives (Received, Return-Path, Delivered-To, various DKIM headers removed):

MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Subject: Your message to mailman-admins@openrailwaymap.org awaits moderator approval
From: mailman-admins-bounces@openrailwaymap.org
To: info@openrailwaymap.org
Message-ID: <157995890415.14789.4411382105354409402@buegelfalte.openrailwaymap.org>
Date: Sat, 25 Jan 2020 14:28:24 +0100
Precedence: bulk
X-Mailman-Version: 3.2.1

Your mail to 'mailman-admins@openrailwaymap.org' with the subject

    Test auf Durchl?ssigkeit

Is being held until the list moderator can review it for approval.

The message is being held because:

    The message comes from a moderated member

Either the message will get posted to the list, or you will receive
notification of the moderator's decision.

The original message to the list was:

Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
From: info@openrailwaymap.org
To: mailman-admins@openrailwaymap.org
Reply-To: 
Subject: =?utf-8?q?Test_auf_Durchl=C3=A4ssigkeit?=
Message-ID: <157996353560.32608.10424899781661842519.@127.0.0.1>
Date: Sat, 25 Jan 2020 14:45:35 -0000

Hallo,

=C3=BCblicherweise muss man Abonnent einer Mailingliste sein, um Mails an
diese senden zu d=C3=BCrfen. Ist das bei euch auch der Fall?

Viele Gr=C3=BC=C3=9Fe

Eurer Spammer

The bug and its solution

The bug is that Mailman sends messages in US-ASCII although some characters in that message cannot be encoded in US-ASCII. While the templates might not contain such characters, quoted parts of the original message (subject, the message itself etc.) can contain such characters.

There are two solutions:

Use a better charset if the message cannot be encoded in US-ASCII (or ISO-8859-1 for languages where Mailman uses ISO-8859-1).
Send everything by default in UTF-8 (my suggestion to keep the code simple).
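The first option could look roughly like the sketch below. The helper name pick_charset and its parameters are hypothetical and not part of Mailman; the idea is only to keep the language's charset when the text fits and to fall back to UTF-8 otherwise.

```python
def pick_charset(text, preferred="us-ascii", fallback="utf-8"):
    """Return `preferred` if it can represent `text`, else `fallback`.

    Hypothetical helper for illustration; an unknown charset name would
    raise LookupError, which is not handled here.
    """
    try:
        text.encode(preferred)
    except UnicodeEncodeError:
        return fallback
    return preferred


print(pick_charset("Test"))                           # us-ascii
print(pick_charset("Durchlässigkeit"))                # utf-8
print(pick_charset("Durchlässigkeit", "iso-8859-1"))  # iso-8859-1
```

The second option needs no such check, which is why it keeps the code simpler.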

In commit e7e8c1d1, @msapiro changed text.encode(charset) to text.encode(charset, errors='replace'). While this stops the code from throwing exceptions, it does not fix the issue for end users, and I see it as a quick fix rather than a proper solution. In #437 (comment 46932629) @msapiro explains that users should be able to search through their mailboxes using grep and that the Python email library sends UTF-8 messages in Base64 by default.

I faced the Base64 issue myself when I wrote scripts to send emails in Python. However, Python's email library is able to send messages in quoted-printable. Quoted-printable replaces only those characters that need to be replaced. The original message quoted above demonstrates that. A pure ASCII message stays pure ASCII.
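A small standalone sketch of the difference, using only the standard library: passing the charset name gives the default Base64 body, while a Charset object with body_encoding set to QP gives quoted-printable. The build_message helper is a hypothetical name used here for illustration.

```python
from email.charset import QP, Charset
from email.message import Message


def build_message(text, charset):
    """Build a text/plain message whose body uses the given charset."""
    msg = Message()
    # Encoded bytes plus a charset, similar to the call in UserNotification.
    msg.set_payload(text.encode("utf-8"), charset)
    return msg


text = "Test auf Durchlässigkeit\n"

# With only the charset name, the email package picks its default body
# encoding for UTF-8, which is Base64.
default_msg = build_message(text, "utf-8")

# A Charset object with body_encoding overridden selects quoted-printable.
qp_charset = Charset("utf-8")
qp_charset.body_encoding = QP
qp_msg = build_message(text, qp_charset)
ascii_msg = build_message("Plain ASCII text\n", qp_charset)

print(default_msg["Content-Transfer-Encoding"])  # base64
print(qp_msg["Content-Transfer-Encoding"])       # quoted-printable
print(qp_msg.get_payload())     # only the ä is escaped, as =C3=A4
print(ascii_msg.get_payload())  # unchanged ASCII text
```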

The following patch demonstrates how to send UserNotifications in UTF-8 quoted-printable.

diff --git a/src/mailman/email/message.py b/src/mailman/email/message.py
index 3ab348b64..88ec95ad0 100644
--- a/src/mailman/email/message.py
+++ b/src/mailman/email/message.py
@@ -27,6 +27,7 @@ import email
 import email.message
 import email.utils
 
+from email.charset import Charset, SHORTEST, QP
 from email.header import Header
 from email.mime.multipart import MIMEMultipart
 from mailman.config import config
@@ -134,10 +135,14 @@ class UserNotification(Message):
 
     def __init__(self, recipients, sender, subject=None, text=None, lang=None):
         Message.__init__(self)
-        charset = (lang.charset if lang is not None else 'us-ascii')
+        cs = (lang.charset if lang is not None else 'utf-8')
+        charset = Charset(cs)
+        charset.header_encoding = SHORTEST
+        # Body encoding cannot be the shortest of quoted-printable or base64.
+        charset.body_encoding = QP
         subject = ('(no subject)' if subject is None else subject)
         if text is not None:
-            self.set_payload(text.encode(charset, errors='replace'), charset)
+            self.set_payload(text.encode(charset.input_charset, errors='replace'), charset)
         self['Subject'] = Header(
             subject, charset, header_name='Subject', errors='replace')
         self['From'] = sender

The only issue I see with this approach is that messages containing mainly multibyte characters (e.g. messages with lots of Kanji characters) are longer in quoted-printable encoding than in Base64. That could be addressed by

either encoding the body with both quoted-printable and Base64 using email.charset.Charset.body_encode(str) and choosing the shorter one,
or adding a new option to the Mailman configuration that disables quoted-printable for a given language (to be set for Japanese, Chinese, etc.).
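The first of these two ideas could be sketched as follows. The function name shortest_body_encoding is hypothetical and this is not tested against Mailman itself; it only relies on email.charset.Charset.body_encode() and get_body_encoding() from the standard library.

```python
from email.charset import BASE64, QP, Charset


def shortest_body_encoding(text, charset_name="utf-8"):
    """Return (cte_name, encoded_body) for the shorter of QP and Base64.

    Hypothetical helper for illustration. On a tie, quoted-printable
    wins because it is tried first.
    """
    candidates = []
    for encoding in (QP, BASE64):
        charset = Charset(charset_name)
        charset.body_encoding = encoding
        encoded = charset.body_encode(text)
        candidates.append((charset.get_body_encoding(), encoded))
    return min(candidates, key=lambda candidate: len(candidate[1]))


german = "Test auf Durchlässigkeit\n"
japanese = "日本語のテキストです。\n"

print(shortest_body_encoding(german)[0])    # quoted-printable
print(shortest_body_encoding(japanese)[0])  # base64
```

The drawback is that every body is encoded twice, so the configuration option may be the cheaper choice for large messages.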