Mbox export produces incorrect MIME messages

As I maintain a tool that programmatically fetches and parses messages from mailman archives, I noticed a few issues with mbox exports from HyperKitty:

  1. HTML texts are provided as attachments rather than as an alternative to the plaintext: Currently, HyperKitty produces a multipart/mixed document for each message that looks like this:

    multipart/mixed
    - text/plain
    - [text/html, Content-Disposition: attachment; filename="attachment.html"]
    - [attachments...]

    This structure causes problems in several clients I tried, including Thunderbird, Windows Mail, and Squeak: the formatted text is not displayed inside the reader but only offered as a separate attachment. Instead, HyperKitty should produce a multipart/alternative document for the text, as described in RFC 1341:

    The multipart/alternative type is syntactically identical to multipart/mixed, but the semantics are different. In particular, each of the parts is an "alternative" version of the same information. User agents should recognize that the content of the various parts are interchangeable. The user agent should either choose the "best" type based on the user's environment and preferences, or offer the user the available alternatives. [1]

    So, the entire message could look like this instead:

    multipart/mixed
    - multipart/alternative
      - text/plain
      - text/html
    - [attachments...]

    See also: https://stackoverflow.com/q/3902455/13994294

  2. Inline documents are not linked correctly: Both text/html and, sometimes, text/plain parts may contain inline documents (mostly inline images). To describe these documents correctly, the affected part should be wrapped in a nested multipart/related structure, as follows, according to RFC 2387 and RFC 2392:

    multipart/related
    - text/plain | text/html
    - [image/png, Content-Disposition: inline; filename="image.png", Content-ID: <foo>]
    - [image/jpeg, Content-Disposition: inline; filename="image.jpeg", Content-ID: <bar>]
    - ...

    Inside the text message, the documents may then be referenced using <img src="cid:foo"> in HTML or [cid:foo] in plaintext.

    Currently, HyperKitty does not even provide these CIDs, making it impossible to reconstruct inline images correctly even when I post-process the downloaded files.

  3. HTML documents do not specify an explicit charset: It seems that all messages are encoded as UTF-8, but not all clients assume that as a default. It would be helpful to always provide charset=utf-8 for these documents.

  4. Email obfuscation (@ -> (at)) breaks contents: While messages that contain @, such as this one, look fine in the web interface, they are broken in the mbox export. Besides hurting readability, this breaks email addresses that people might want to reply to and URLs to other HyperKitty conversations that people might want to open in the browser (see the example message). In some but not all cases, only the HTML version is affected. Is this obfuscation really necessary, particularly in this simple form and when it is not applied on the website itself?
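
To make the structure proposed in items 1 to 3 concrete, here is a minimal Python sketch that builds such a message with the standard library's email package. It is only an illustration, not HyperKitty's actual export code: the addresses, the placeholder image bytes, the attachment, and the helper names build_message and outline are all assumptions made for this example.

```python
from email.message import EmailMessage
from email.utils import make_msgid

# Hypothetical placeholder payload (just the PNG signature); a real export
# would use the stored image bytes.
PNG_BYTES = b"\x89PNG\r\n\x1a\n"


def build_message() -> EmailMessage:
    """Build mixed > alternative > (plain, related > (html, image)) + attachment."""
    msg = EmailMessage()
    msg["Subject"] = "Example"
    msg["From"] = "alice@example.org"  # assumed address
    msg["To"] = "list@example.org"     # assumed address

    # Plain-text version, with an explicit charset (item 3).
    msg.set_content("Hello, see the picture.", charset="utf-8")

    # Content-ID for the inline image; make_msgid returns it wrapped in <...>.
    cid = make_msgid(domain="example.org")

    # Adding an HTML version turns the message into multipart/alternative (item 1).
    msg.add_alternative(
        f'<p>Hello, see the picture: <img src="cid:{cid[1:-1]}"></p>',
        subtype="html",
        charset="utf-8",
    )

    # Wrap the HTML part in multipart/related and attach the image to it (item 2).
    # disposition="inline" is passed explicitly because giving only a filename
    # would default the disposition to "attachment".
    html_part = msg.get_payload()[1]
    html_part.add_related(
        PNG_BYTES,
        maintype="image",
        subtype="png",
        cid=cid,
        disposition="inline",
        filename="image.png",
    )

    # A regular attachment turns the root into multipart/mixed.
    msg.add_attachment(
        b"report contents",
        maintype="application",
        subtype="octet-stream",
        filename="report.bin",
    )
    return msg


def outline(part, depth=0):
    """Return the MIME tree of a message as a list of indented content types."""
    lines = ["  " * depth + part.get_content_type()]
    if part.is_multipart():
        for sub in part.iter_parts():
            lines.extend(outline(sub, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(outline(build_message())))
```

The outline helper can also be pointed at messages parsed from an existing mbox export (for example via the standard mailbox module with the email.policy.default policy) to compare the structure HyperKitty currently emits with the one proposed here.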

Thank you for delivering this great service! It would be awesome if these issues could be solved. I'm happy to answer any questions!

  1. https://www.w3.org/Protocols/rfc1341/7_2_Multipart.html#:~:text=of%20subtype%0A%22mixed%22.-,7.2.3%20%20%20%20%20The%20Multipart%2Falternative,The,-multipart%2Falternative%20type

Edited by Christoph Thiede