10x disk usage amplification in Xapian backend
I'm not sure whether this is an issue in HyperKitty or xapian-haystack, but we're seeing pretty horrible disk usage in the Xapian backend compared to the default Whoosh backend. We have about 300k emails on lists.torproject.org, which gives us a ~2-3GiB PostgreSQL database, which in turn ballooned into an 8GB Whoosh index.
For Xapian, however, it's 35GiB!
root@lists-01:~# du -sh /var/lib/mailman3/web/* /var/lib/postgresql/
14M /var/lib/mailman3/web/static
35G /var/lib/mailman3/web/xapian_index
2.8G /var/lib/postgresql/
(The ~8GB Whoosh index is gone, as I deleted it to make room for Xapian; you'll just have to trust me on this or peruse our incident log at https://gitlab.torproject.org/tpo/tpa/team/-/issues/41957#note_3157791)
Anyway, it seems like Xapian takes an order of magnitude more space than the original dataset (35/2.8 = 12.5x). I don't think this is an issue with Xapian itself: I use notmuch to index my mail, and the ratio there is much better, almost the inverse. My mail spool is 22.5GiB including a 4.1GiB Xapian database, so I actually have 18.4GiB of mail that is reduced to 4.1GiB of index, a 4.5x reduction in size (or, to use the same direction as above, about 0.2x the original dataset: a roughly five-fold decrease instead of a twelve-fold increase).
For now, this is less critical than other issues we've encountered with Xapian (#408, which is definitely a deal-breaker): we've just grown the disk space... But it seems kind of ridiculous that 300k mails would take up 35G of disk space, just for the search engine. That's 107kB per email, which doesn't seem like a lot until you remember that Mailman has a 40kB size limit by default and the average email size is probably far below that...
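To make the arithmetic above easy to redo with other numbers, here is a minimal sketch. The helper name `index_stats` is made up for illustration, and it assumes decimal gigabytes (10^9 bytes) and a round 300,000 messages, so the per-email result will not exactly match a figure computed from the real message count.

```shell
# index_stats INDEX_GB DATASET_GB EMAIL_COUNT
# Hypothetical helper: prints the index/dataset size ratio and the
# index bytes per email, treating 1 GB as 10^9 bytes (an assumption).
index_stats() {
    awk -v idx="$1" -v db="$2" -v n="$3" 'BEGIN {
        printf "index/dataset ratio: %.1fx\n", idx / db
        printf "index bytes per email: %.0f\n", idx * 1e9 / n
    }'
}

# Figures from the du output above, with "about 300k" rounded to 300000.
index_stats 35 2.8 300000
```

With those rounded inputs this gives a 12.5x ratio and a little under 117kB of index per email, so in the same ballpark as the figure quoted above; the exact value depends on the real message count and on whether the sizes are GB or GiB.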
For what it's worth, the average email size in my decades-old email spool is 37kB, according to this:
anarcat@angela:~$ find Maildir/ -type f | xargs stat -c %s | awk 'BEGIN { count = 0; sum = 0 } { count++; sum += $1 } END { print sum/count }'
37171.5
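For anyone who wants to run the same measurement on their own spool, here is a variant of that one-liner as a sketch. It assumes GNU find (for `-printf`), and the function name `avg_file_size` is made up for illustration. It skips `xargs`, so folder or file names containing spaces are not a problem.

```shell
# avg_file_size DIR
# Hypothetical helper: prints the mean size in bytes of all regular
# files under DIR, or 0 if there are none. Requires GNU find.
avg_file_size() {
    find "$1" -type f -printf '%s\n' |
        awk '{ count++; sum += $1 } END { if (count) print sum / count; else print 0 }'
}

# Usage: avg_file_size Maildir/
```

This should print the same kind of number as the command above, just without depending on how `xargs` splits file names.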