Elasticsearch: return to using a separate index per document type

Elasticsearch has deprecated parent-child relationships and top-level document types: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/removal-of-types.html

We can't keep using these features long-term (Valery: the join datatype is a transparent replacement for parent-child). In the short term, using them causes us two problems:

  • The index is bloated by an as-yet-unquantified amount, because every document has every field of every type
  • Using parent-child relationships means we can't spread documents evenly across Elasticsearch shards. As a result, some shards end up with far more data than others.
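
To illustrate the shard imbalance, here is a rough simulation, not a measurement of our cluster. Elasticsearch picks a shard by hashing a document's routing value, and child documents are routed by their parent's ID so that they land on the parent's shard. The shard count, the project sizes, the document IDs, and the use of `zlib.crc32` as a stand-in for Elasticsearch's own hash are all assumptions made for the sketch.

```python
import zlib
from collections import Counter

NUM_SHARDS = 5  # hypothetical number of primary shards


def shard_for(routing_key: str) -> int:
    # Stand-in hash for illustration; Elasticsearch uses its own hash function.
    return zlib.crc32(routing_key.encode("utf-8")) % NUM_SHARDS


def distribute(docs, route_by_parent: bool) -> Counter:
    """Count documents per shard for a list of (doc_id, parent_id) pairs."""
    counts = Counter()
    for doc_id, parent_id in docs:
        routing_key = parent_id if route_by_parent else doc_id
        counts[shard_for(routing_key)] += 1
    return counts


# Hypothetical data set: one very large project and twenty small ones.
docs = [(f"blob-0-{i}", "project-0") for i in range(10_000)]
for p in range(1, 21):
    docs += [(f"blob-{p}-{i}", f"project-{p}") for i in range(10)]

parent_routed = distribute(docs, route_by_parent=True)   # parent-child layout
id_routed = distribute(docs, route_by_parent=False)      # each doc routed by its own ID

print("routed by parent:", sorted(parent_routed.items()))
print("routed by own id:", sorted(id_routed.items()))
```

With parent routing, every document of the large project sits on a single shard, so that shard holds at least 10,000 documents regardless of how the small projects fall. Routing each document by its own ID spreads the same documents across all shards.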

We should investigate returning to the old arrangement of one index per document type, treating commits and blobs as separate document types.

Joins, if necessary, can be done in-application or using the new join capability.
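
A minimal sketch of what an in-application join could look like, assuming each index is queried separately and the hits come back as plain dictionaries. The function name and the `id` / `project_id` field names are hypothetical, not taken from our codebase.

```python
from collections import defaultdict


def join_in_application(parents, children, parent_key="id", foreign_key="project_id"):
    """Attach child hits to their parent hits by matching a foreign key.

    `parents` and `children` are lists of dicts, e.g. the sources of hits
    returned by two separate searches against two separate indices.
    """
    by_parent = defaultdict(list)
    for child in children:
        by_parent[child[foreign_key]].append(child)
    # Build new dicts so the original hits are left unmodified.
    return [
        {**parent, "children": by_parent.get(parent[parent_key], [])}
        for parent in parents
    ]


# Hypothetical hits from a "projects" index and a "blobs" index.
projects = [{"id": 1, "name": "gitlab-ee"}, {"id": 2, "name": "empty-project"}]
blobs = [
    {"project_id": 1, "path": "README.md"},
    {"project_id": 1, "path": "lib/search.rb"},
]

joined = join_in_application(projects, blobs)
```

This costs a second query and some memory in the application, but it places no constraint on how documents are routed to shards.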


(Previous discussion related to splitting the repository type up, now obsolete)

The following discussion from !2709 (merged) should be addressed:

Per https://gitlab.com/gitlab-org/gitlab-ee/issues/3011#note_37888885 , we currently store 'commits' and 'blobs' in Elasticsearch with a _type of 'repository'. This means commits have all the fields of blobs, and vice versa. It also complicates querying these document types, and causes bugs.
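
A small sketch of why sharing one type mixes the fields. A single mapping has to cover the union of both field sets, whereas separate types would each carry only their own. The field names below are hypothetical placeholders, not our actual mapping.

```python
# Hypothetical field sets, for illustration only.
BLOB_FIELDS = {"path", "file_name", "content", "language"}
COMMIT_FIELDS = {"sha", "message", "author_name", "committed_date"}


def shared_type_fields():
    """Fields in a single mapping that has to cover both kinds of document."""
    return BLOB_FIELDS | COMMIT_FIELDS


def unused_fields(own_fields):
    """Fields present in the shared mapping that this kind of document never fills."""
    return shared_type_fields() - own_fields


print("shared mapping:", sorted(shared_type_fields()))
print("unused on a commit:", sorted(unused_fields(COMMIT_FIELDS)))
print("unused on a blob:", sorted(unused_fields(BLOB_FIELDS)))
```

In the shared layout, any query for commits also has to exclude blobs explicitly (and the reverse), which is one place the querying complexity comes from.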

Can we do this with a data migration? To me, asking our users to reindex all their repositories for this is unreasonable.

A thought: how much extra space do these fields take up per document, even though they're empty?

/cc @vsizov @smcgivern @victorcete

Edited by Valery Sizov