Revamp reply email parsing

This issue is an attempt to summarize all the known problems with reply email parsing in one place, so that it is easier to find clues when working on them.

All related issues

  • HTML emails: gitlab-ce#2847, gitlab-ce#3357, gitlab-ce#15545, gitlab-ce#18388, gitlab-ce#23340
  • Inline/bottom replies support: gitlab-ce#3020, gitlab-ce#14805, gitlab-ce#20514
  • Strip signatures: gitlab-ce#3061, gitlab-ce#14786
  • Ignore auto-generated emails: gitlab-ce#18548
  • Incident: https://gitlab.com/gitlab-com/infrastructure/issues/1#note_17599430 , gitlab-ce#24003
  • Incident: https://0xacab.org/riseup/0xacab/issues/11

Challenges

  • Different email clients (e.g. gitlab-ce#18388)
  • Different languages
  • HTML emails
  • Auto-generated emails
  • Signatures
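
For the auto-generated emails challenge, one possible starting point is to look at message headers before parsing the body at all. The sketch below is only an illustration, not what GitLab does: the function name `is_auto_generated` is hypothetical, it relies on the `Auto-Submitted` header from RFC 3834 and on the non-standard but widely used `Precedence` header, and real autoresponders may set neither.

```python
from email import message_from_string
from email.message import Message


def is_auto_generated(msg: Message) -> bool:
    """Heuristic: True if the headers suggest the mail was not written by a person."""
    # RFC 3834: any Auto-Submitted value other than "no" marks an automatic message.
    # The value may carry parameters after ";", so only the first token is compared.
    auto_submitted = (msg.get("Auto-Submitted") or "").split(";")[0].strip().lower()
    if auto_submitted and auto_submitted != "no":
        return True

    # Non-standard header, commonly set by autoresponders and bulk senders.
    # The set of values checked here is an assumption, not an exhaustive list.
    precedence = (msg.get("Precedence") or "").strip().lower()
    return precedence in {"bulk", "junk", "auto_reply"}


if __name__ == "__main__":
    raw = "Auto-Submitted: auto-replied\nSubject: Out of office\n\nI am away."
    print(is_auto_generated(message_from_string(raw)))
```

A header check like this would not catch every vacation responder, so it could only complement, not replace, whatever body-level handling is chosen.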

Suggested solutions

  • Leave markers in the outgoing emails that we can recognize later in the replies (I think Discourse does this, as do a lot of support ticket systems)
  • Keep a list of the quoting formats different email clients use (some clients use | for quoting)
  • Don't use Markdown, just plain text (GitHub does this, but the result can still be very bad. Here's an example of woes)
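
The first two suggestions above could be combined roughly as follows. This is a minimal sketch under stated assumptions, not the approach of any of the libraries referenced in this issue: the marker text `REPLY_MARKER` and the function `extract_reply` are made up for illustration, and only the `>` and `|` quote prefixes are handled.

```python
import re

# Hypothetical marker that outgoing notification emails would contain.
# A real marker would have to survive HTML-to-text conversion and translation.
REPLY_MARKER = "##- Please type your reply above this line -##"

# Quote prefixes to strip: ">" is the usual convention, "|" is used by some clients.
# Caveat: treating "|" as a quote also drops Markdown table rows.
QUOTE_PREFIX = re.compile(r"^\s*[>|]")


def extract_reply(body: str) -> str:
    """Return the part of a plain-text reply that the user presumably wrote."""
    kept = []
    for line in body.splitlines():
        # Everything from the marker line onward is the quoted notification.
        if REPLY_MARKER in line:
            break
        # Skip quoted lines, which keeps inline and bottom replies intact.
        if QUOTE_PREFIX.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


if __name__ == "__main__":
    sample = "> first question\nYes.\n| second question\nNo.\n"
    print(extract_reply(sample))
```

Note what this does not handle: the attribution line that clients insert before the quote ("On ... wrote:") is language-dependent and is left in place, and signatures are not stripped. Those are exactly the cases listed under Challenges.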

Reference implementations

  • https://github.com/github/email_reply_parser
  • https://github.com/discourse/email_reply_trimmer
  • https://github.com/discourse/discourse/commits/master/lib/email/receiver.rb

Stalled efforts

  • https://gitlab.com/gitlab-org/gitlab-ce/commits/adopt-email_reply_trimmer (failed build: https://gitlab.com/gitlab-org/gitlab-ce/pipelines/3869613)

/cc @smcgivern @DouweM @MrChrisW @dblessing

Edited Aug 14, 2020 by 🤖 GitLab Bot 🤖