Elasticsearch: return to using a separate index per document type
Elasticsearch has deprecated parent-child relationships and top-level document types: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/removal-of-types.html
We cannot rely on these features long-term (Valery: there is a transparent replacement, the join datatype). In the short term, using them causes us two problems:
- The index is bloated by an as-yet-unquantified amount, because every document has every field of every type
- Parent-child relationships prevent us from spreading documents evenly across Elasticsearch shards. As a result, some shards end up with far more data than others.
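The shard imbalance in the second point can be illustrated with a small simulation. This is only a sketch under stated assumptions: the shard count, the project sizes, and the document IDs are all made up, and `zlib.crc32` stands in for the murmur3 hash that Elasticsearch actually applies to the routing value. The only behaviour it relies on is that a child document is routed by its parent's ID, so every child lands on the same shard as its parent.

```python
import zlib

NUM_SHARDS = 5  # hypothetical number of primary shards


def shard_for(routing_key, num_shards=NUM_SHARDS):
    """Pick a shard as hash(routing) % number_of_primary_shards.

    crc32 is a stand-in hash for illustration only; Elasticsearch uses murmur3.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards


def distribution(docs, route_by_parent):
    """Count documents per shard for a list of (doc_id, parent_id) pairs."""
    counts = {shard: 0 for shard in range(NUM_SHARDS)}
    for doc_id, parent_id in docs:
        # Parent-child documents are routed by the parent's ID;
        # standalone documents are routed by their own ID.
        key = parent_id if route_by_parent else doc_id
        counts[shard_for(key)] += 1
    return counts


# Hypothetical data set: one very large project and twenty small ones.
docs = [("project_1_blob_%d" % i, "project_1") for i in range(10000)]
for p in range(2, 22):
    docs += [("project_%d_blob_%d" % (p, i), "project_%d" % p) for i in range(50)]

by_parent = distribution(docs, route_by_parent=True)
by_doc = distribution(docs, route_by_parent=False)

print("routed by parent ID:  ", by_parent)
print("routed by document ID:", by_doc)
```

With parent routing, the shard holding the large project receives all of that project's documents, whatever the hash function. Routing by document ID spreads the same documents across all shards.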
We should investigate returning to the old situation of having an index per document type, treating
blobs as separate data types, of course.
Joins, if necessary, can be done in-application or using the new
(Previous discussion related to splitting the repository type up, now obsolete)
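As a rough idea of what an in-application join could look like once each document type lives in its own index: the sketch below takes the hits from two separate searches and groups them by a shared key. The field names (`project_id`, `sha`, `path`) and the function `join_by_project` are hypothetical and chosen only for illustration; they are not taken from the existing mappings.

```python
from collections import defaultdict


def join_by_project(commit_hits, blob_hits, key="project_id"):
    """Attach to each commit hit the blob hits that share its project.

    Both arguments are plain lists of dicts, e.g. the `_source` of hits
    returned by two separate searches (one per index).
    """
    blobs_by_project = defaultdict(list)
    for blob in blob_hits:
        blobs_by_project[blob[key]].append(blob)

    return [
        {"commit": commit, "blobs": blobs_by_project.get(commit[key], [])}
        for commit in commit_hits
    ]


# Hypothetical search results from a commits index and a blobs index.
commit_hits = [
    {"project_id": 1, "sha": "abc123"},
    {"project_id": 2, "sha": "def456"},
]
blob_hits = [
    {"project_id": 1, "path": "README.md"},
    {"project_id": 1, "path": "lib/search.rb"},
    {"project_id": 3, "path": "orphan.txt"},
]

joined = join_by_project(commit_hits, blob_hits)
for row in joined:
    print(row["commit"]["sha"], [b["path"] for b in row["blobs"]])
```

This costs one extra query per joined type, but it places no constraint on how documents are routed to shards.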
The following discussion from !2709 (merged) should be addressed:
I see we have talked about splitting the types in the issue, which makes sense to me: #3011 (comment 37888885)
Per #3011 (comment 37888885), we currently store 'commits' and 'blobs' in Elasticsearch with a
repository. This means commits have all the fields of blobs, and vice-versa. It also complicates querying these document types, and causes bugs.
Can we do this with a data migration? To me, asking our users to reindex all their repositories for this is unreasonable.
A thought: how much extra space do these fields take up per document, even though they're empty?