Optimize mappings (including _source subfields) to reduce total disk space in Elasticsearch
This issue is inspired by these blog posts, which I found while investigating, once again, whether we could turn off the _source field entirely:
- https://www.elastic.co/blog/elasticsearch-storage-the-true-story-2.0
  "We forgot to delete the original string representation of the date/time, which is safe to do if the date/time parsing is successful. This oversight is significant since we discovered the date/time string values made up about 20% of the overall index size in our test data set"
- https://www.elastic.co/blog/filebeat-modiles-access-logs-and-elasticsearch-storage-requirements
  "a substantial 19.7% space saving."
Turning off _source entirely is strongly recommended against because it breaks all sorts of things: syntax highlighting (which we use) and zero-downtime migration between mappings (which we want to use), for a start. See https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
However, it seems like some changes to our mappings - and I'm not sure exactly what they are - could save us quite a bit of space without losing any features.
In particular, I have in mind the committer and author timestamps in the commit document type. Commit documents are numerically dominant, and in the common case their text will be a relatively short commit message (<80 bytes), plus two names (~20 bytes each) and two timestamps (~25 bytes each). If we can remove the text representation of the timestamps, this could save a significant proportion of the total.
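As a rough sanity check on "significant proportion", here is a back-of-the-envelope calculation using only the byte counts quoted above. It is a sketch under stated assumptions: it takes the upper bound of 80 bytes for the message, and it ignores JSON keys, other fields, and compression, all of which will change the real figure.

```python
# Back-of-the-envelope share of commit text taken up by timestamp strings.
# All byte counts are the rough estimates from the issue text, not measurements.
MESSAGE_BYTES = 80    # upper bound for a "relatively short" commit message
NAME_BYTES = 20       # approximate size of one name
TIMESTAMP_BYTES = 25  # approximate size of one timestamp string
PEOPLE = 2            # author and committer


def timestamp_share(message=MESSAGE_BYTES, name=NAME_BYTES,
                    timestamp=TIMESTAMP_BYTES, people=PEOPLE):
    """Return the fraction of the text bytes that are timestamp strings."""
    total = message + people * (name + timestamp)
    return people * timestamp / total


share = timestamp_share()
print(f"timestamps are about {share:.1%} of the text")  # 50 / 170 bytes
```

With these inputs the timestamps account for 50 of 170 bytes, or roughly 29%, which is in the same ballpark as the ~20% figures from the blog posts.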
AIUI, in the timestamp case, we can always get back to the original text from the parsed timestamp anyway, so we lose nothing - and no features stop working, including the possibility of seamless, zero-downtime migration between schemas - if we stop storing the text representation of these fields.
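One possible shape for such a change is the `excludes` option of the `_source` mapping field, which drops the listed fields from the stored `_source` while still indexing them. The sketch below builds an index body as a Python dict. The field names `author.time` and `committer.time` and the helper `build_commit_index_body` are hypothetical placeholders, not our actual mapping, and the typeless layout assumes a recent Elasticsearch version (older versions nest this under a document type). Whether excluded values remain recoverable for a reindex would need verifying against the `_source` documentation before relying on it.

```python
import json

# Hypothetical field paths; the real commit mapping may name these differently.
TIMESTAMP_FIELDS = ["author.time", "committer.time"]


def build_commit_index_body(excluded_fields=TIMESTAMP_FIELDS):
    """Build an index body that keeps timestamps indexed as dates
    but leaves their text out of the stored _source."""
    return {
        "mappings": {
            # Fields listed here are indexed but not kept in _source.
            "_source": {"excludes": list(excluded_fields)},
            "properties": {
                "author": {"properties": {"time": {"type": "date"}}},
                "committer": {"properties": {"time": {"type": "date"}}},
            },
        }
    }


body = build_commit_index_body()
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent as the body of an index creation request; nothing here talks to a cluster.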
Follow-on from this Slack thread: https://gitlab.slack.com/archives/C3TMLK465/p1552581452032100