Subject with UTF-8 2-byte character encoded as 2 words (rfc2047.c)
Issue
A subject "Große wordxxxx word wordxx" (edited in a UTF-8 locale, where "ß" is 2 bytes), entered while composing a message, ends up in the raw message as
Subject: =?iso-8859-1?Q?Gro=DF?= =?iso-8859-1?Q?e?= wordxxxx word wordxx
Although this is technically valid, it has the following consequences:
- The word "Große" is split into 2 encoded blocks.
- My reading of RFC 2047 is that encoded blocks separated by a space (without a newline) are two words, so the space should be displayed to the user as-is. (Mutt doesn't display the space, but that is probably a different story.)
- This causes interoperability problems: the iOS Mail app displays the "ß" as an invalid character. (That is also insane behaviour, since the encoding of the character itself is fine; then again, the form of encoding produced by Mutt is also "special".)
A quick check reveals that _rfc2047_encode_string returns the following (with a line break; the indent is a tab):
=?iso-8859-1?Q?Gro=DF?=
	=?iso-8859-1?Q?e?= wordxxxx word wordxx
This is also technically valid, and because of the line break "Große" is to be considered 1 word. But splitting the encoding into 2 blocks is superfluous and triggers the remaining problems. When the email is finally sent, the line break / tab turns into a space, which IMO is semantically wrong because it changes the original 1 word into 2 words.
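For comparison, this is what a single encoded block for the whole word would look like. It is only a minimal sketch, not Mutt code: q_encode_word is a hypothetical helper, it assumes the text has already been converted to iso-8859-1, and it ignores the length limit for encoded words.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Mutt): Q-encode `len` bytes of text that
   is already in ISO-8859-1 into ONE encoded word.
   Returns 0 on success, -1 if `out` is too small. */
static int q_encode_word(const unsigned char *s, size_t len,
                         char *out, size_t outlen)
{
    static const char hex[] = "0123456789ABCDEF";
    const char *prefix = "=?iso-8859-1?Q?";
    size_t o = strlen(prefix);
    size_t i;

    assert(s != NULL && out != NULL);

    /* Worst case: every byte becomes "=XX", plus "?=" and the NUL. */
    if (outlen < o + 3 * len + 3)
        return -1;

    memcpy(out, prefix, o);
    for (i = 0; i < len; i++) {
        unsigned char c = s[i];

        if (c == ' ') {
            out[o++] = '_';            /* space is written as underscore */
        } else if (c >= 0x7f || c < 0x21 ||
                   c == '=' || c == '?' || c == '_') {
            out[o++] = '=';            /* everything unsafe becomes =XX */
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0f];
        } else {
            out[o++] = (char) c;       /* printable ASCII is kept as-is */
        }
    }
    memcpy(out + o, "?=", 3);          /* copies the terminating NUL too */
    return 0;
}
```

Called with the 5 Latin-1 bytes 'G', 'r', 'o', 0xDF, 'e', this yields =?iso-8859-1?Q?Gro=DFe?=, i.e. one block for the one word.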
First Analysis
In rfc2047.c, function rfc2047_encode, this section (enriched with debug output):
t = t0;
for (;;)
{
  /* Find how much we can encode. */
  n = choose_block (t, t1 - t, col, icode, tocode, &encoder, &wlen);
  dprint(5, (debugfile, "rfc2047en: n(choose_block#1)=%d t1=%d\n", n, t1 - u));
  if (n == t1 - t)
  {
    /* See if we can fit the us-ascii suffix, too. */
    if (col + wlen + (u + ulen - t1) <= ENCWORD_LEN_MAX + 1)
      break;
    n = t1 - t - 1;
    if (icode)
      while (CONTINUATION_BYTE(t[n]))
        --n;
    assert (t + n >= t);
    if (!n)
    {
      /* This should only happen in the really stupid case where the
         only word that needs encoding is one character long, but
         there is too much us-ascii stuff after it to use a single
         encoded word. We add the next word to the encoded region
         and try again. */
      assert (t1 < u + ulen);
      for (t1++; t1 < u + ulen && !HSPACE(*t1); t1++)
        ;
      continue;
    }
    dprint(5, (debugfile, "rfc2047en: n(choose_block#2)=%d t1=%d\n", n, t1 - u));
    n = choose_block (t, n, col, icode, tocode, &encoder, &wlen);
  }
}
produces
[2020-11-05 08:56:56] rfc2047en: n(choose_block#1)=6 t1=6
[2020-11-05 08:56:56] rfc2047en: n(choose_block#2)=5 t1=6
[2020-11-05 08:56:56] rfc2047en: n(choose_block#1)=1 t1=6
So the quoted code selects 5 bytes (= "Groß") for encoding in the first pass and leaves the remaining 1 byte ("e") for a second encoded block.
How to start thinking about solutions? Could it be that counting bytes vs. characters (possibly multi-byte) is mixed up? To me it makes more sense to work at byte level. Assuming that, the part with n = t1 - t - 1 and CONTINUATION_BYTE() (which does no harm in this case) looks wrong here (or I don't understand the length definition for encode_block() correctly).
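To make the byte arithmetic concrete, here is a small standalone sketch of the "drop the last character" step. It is not Mutt code: shrink_by_one_char is a hypothetical helper, the CONTINUATION_BYTE definition is my assumption of what the macro in rfc2047.c does (true for UTF-8 bytes of the form 10xxxxxx), and the n > 0 guard is added here only to keep the sketch safe on arbitrary input.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: same meaning as the macro in rfc2047.c, i.e. true for
   UTF-8 continuation bytes (bit pattern 10xxxxxx). */
#define CONTINUATION_BYTE(c) (((c) & 0xc0) == 0x80)

/* Hypothetical helper mirroring
     n = t1 - t - 1;
     while (CONTINUATION_BYTE(t[n])) --n;
   for a UTF-8 buffer t of `len` bytes. Returns the shortened byte count. */
static size_t shrink_by_one_char(const char *t, size_t len)
{
    size_t n;

    assert(t != NULL && len > 0);
    n = len - 1;                       /* drop the last byte ...          */
    while (n > 0 && CONTINUATION_BYTE((unsigned char) t[n]))
        --n;                           /* ... and back up to a char start */
    return n;
}
```

For "Große" (bytes G r o C3 9F e, length 6) this returns 5: t[5] is the plain ASCII "e", so the loop never runs and the block becomes "Groß", matching the n(choose_block#2)=5 line in the log. Only when the block ends in a multi-byte character does the loop do anything, e.g. for "Groß" (length 5) it returns 3.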