Subject with UTF-8 2-byte character encoded as 2 words (rfc2047.c)

Issue

A subject "Große wordxxxx word wordxx", composed in a UTF-8 locale (where "ß" is 2 bytes), ends up in the raw message as

Subject: =?iso-8859-1?Q?Gro=DF?= =?iso-8859-1?Q?e?= wordxxxx word wordxx

Although this is technically valid, it has these consequences:

  • The word "Große" is split into 2 encoded blocks.
  • I read RFC2047 as saying that encoded blocks separated by a space (without a newline) are two words, so the space should be displayed to the user as-is. (Mutt doesn't display the space, but this is probably a different story.)
  • This causes interoperability problems: the iOS mail app displays the "ß" as an invalid character. (That is also insane behaviour, since the encoding of the character itself is fine. Then again, the form of encoding Mutt produces is also "special".)

A quick check shows that _rfc2047_encode_string returns the following (with a line break; the indent is a tab)

=?iso-8859-1?Q?Gro=DF?=
        =?iso-8859-1?Q?e?= wordxxxx word wordxx

This is also technically valid and, because of the line break, is to be read as 1 word for "Große". But splitting the encoding into 2 blocks is superfluous and triggers the remaining problems. When the email is finally sent, the line break / tab turns into a space, which IMO is semantically wrong because it changes the original 1 word into 2 words.
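For comparison, here is a minimal sketch of what a single encoded block for "Große" would look like. It is not Mutt code: q_encode_word is a hypothetical helper written only for this illustration, it assumes the input has already been converted to iso-8859-1, and it ignores the line length limit for encoded words.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Mutt): Q-encode inlen bytes of an
   already converted string into ONE RFC 2047 encoded word.
   Returns 0 on success, -1 if the output buffer is too small. */
static int q_encode_word(const char *charset, const unsigned char *in,
                         size_t inlen, char *out, size_t outsize)
{
  static const char hex[] = "0123456789ABCDEF";
  size_t o, i;
  int n;

  assert(charset != NULL && in != NULL && out != NULL);

  n = snprintf(out, outsize, "=?%s?Q?", charset);
  if (n < 0 || (size_t) n >= outsize)
    return -1;
  o = (size_t) n;

  for (i = 0; i < inlen; i++)
  {
    unsigned char c = in[i];

    /* Keep room for "=XX", the closing "?=" and the NUL byte. */
    if (o + 6 > outsize)
      return -1;

    if (c == ' ')
      out[o++] = '_';
    else if (c < 0x21 || c >= 0x7f || c == '=' || c == '?' || c == '_')
    {
      out[o++] = '=';
      out[o++] = hex[c >> 4];
      out[o++] = hex[c & 0x0f];
    }
    else
      out[o++] = (char) c;
  }

  if (o + 3 > outsize)
    return -1;
  out[o++] = '?';
  out[o++] = '=';
  out[o] = '\0';
  return 0;
}
```

Called with the 5 iso-8859-1 bytes 'G', 'r', 'o', 0xDF, 'e', this yields =?iso-8859-1?Q?Gro=DFe?= which is the form I would have expected in the Subject header.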

First Analysis

In rfc2047.c, function rfc2047_encode, this section (enriched with debug output)

    t = t0;
    for (;;)
    {
      /* Find how much we can encode. */
      n = choose_block (t, t1 - t, col, icode, tocode, &encoder, &wlen);
      dprint(5, (debugfile, "rfc2047en: n(choose_block#1)=%d t1=%d\n", n, t1 - u));
      if (n == t1 - t)
      {
        /* See if we can fit the us-ascii suffix, too. */
        if (col + wlen + (u + ulen - t1) <= ENCWORD_LEN_MAX + 1)
          break;
        n = t1 - t - 1;
        if (icode)
          while (CONTINUATION_BYTE(t[n]))
            --n;
        assert (t + n >= t);
        if (!n)
        {
          /* This should only happen in the really stupid case where the
             only word that needs encoding is one character long, but
             there is too much us-ascii stuff after it to use a single
             encoded word. We add the next word to the encoded region
             and try again. */
          assert (t1 < u + ulen);
          for (t1++; t1 < u + ulen && !HSPACE(*t1); t1++)
            ;
          continue;
        }
        dprint(5, (debugfile, "rfc2047en: n(choose_block#2)=%d t1=%d\n", n, t1 - u));
        n = choose_block (t, n, col, icode, tocode, &encoder, &wlen);
      }
    }

produces

[2020-11-05 08:56:56] rfc2047en: n(choose_block#1)=6 t1=6
[2020-11-05 08:56:56] rfc2047en: n(choose_block#2)=5 t1=6
[2020-11-05 08:56:56] rfc2047en: n(choose_block#1)=1 t1=6

so the quoted code selects 5 bytes (= "Groß") for encoding.

How to start thinking about solutions? Could it be that counting bytes vs. characters (possibly containing multi-byte chars) is mixed up? To me it makes more sense to work at the byte level. Assuming that, the part with n = t1 - t - 1 and the CONTINUATION_BYTE() loop (which does no harm in this case) looks wrong here (or I don't understand the length definition for encode_block() correctly).
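To make the byte arithmetic concrete, here is a small standalone sketch of the "drop one byte, then back off to a character boundary" step. It is not the Mutt code itself: shorten_region is a hypothetical helper, the CONTINUATION_BYTE definition is my assumption of what the macro in rfc2047.c does, and the n > 0 guard is added here for safety and is not in the quoted loop.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to match the macro in rfc2047.c: true for UTF-8
   continuation bytes, i.e. bytes of the form 10xxxxxx. */
#define CONTINUATION_BYTE(c) (((c) & 0xc0) == 0x80)

/* Hypothetical helper: shrink a region of len bytes starting at t by at
   least one byte. For UTF-8 input, keep shrinking while the first byte
   that would be cut off is a continuation byte, so that no multi-byte
   character is split. */
static size_t shorten_region(const char *t, size_t len, int is_utf8)
{
  size_t n;

  assert(t != NULL && len > 0);

  n = len - 1;                 /* corresponds to n = t1 - t - 1 */
  if (is_utf8)
    while (n > 0 && CONTINUATION_BYTE(t[n]))
      --n;
  return n;
}
```

For the 6 bytes of "Große" in UTF-8 ('G', 'r', 'o', 0xC3, 0x9F, 'e') this returns 5, because t[5] is 'e' and not a continuation byte. That matches the n(choose_block#2)=5 line in the debug output. If the region were only the 5 bytes of "Groß", the result would be 3 ("Gro"), since cutting at 4 would split the "ß".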
