Make Elasticsearch blob index idempotent

Currently when we index blobs, we store the content in our Elasticsearch index.

If we backfill commits to close an indexing gap after newer commits have already been indexed, we could end up overwriting a blob with stale contents.

Example:

Let's say we have a file, a.txt, that has already been indexed:

a.txt

this is some text
and this is some more text
and even more!

Now suppose we need to index some old commits (because they're missing from the index), and one of those commits contains a.txt with the following older contents:

this is some text
and some more text

If we were to run our Elasticsearch indexer on those commits, the blob content in the index would be replaced with the old content.

We should consider either:

  • using optimistic concurrency control to avoid overwriting newer content, or
  • asking the git repo for the newest version of blobs
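
To illustrate the first option, here is a minimal in-memory sketch of a version-checked write. It is not the indexer's actual code: `BlobIndex`, `index_blob`, and the choice of a commit timestamp as the version number are all hypothetical, and commit timestamps are not guaranteed to increase with history. The check mirrors how Elasticsearch behaves with `version_type=external`, where a write is rejected with a version conflict unless the supplied version is strictly greater than the stored one.

```python
from dataclasses import dataclass


@dataclass
class IndexedBlob:
    content: str
    # Hypothetical version: e.g. the timestamp of the commit that produced the content.
    version: int


class BlobIndex:
    """In-memory stand-in for the blob index, keyed by file path."""

    def __init__(self) -> None:
        self._docs: dict[str, IndexedBlob] = {}

    def index_blob(self, path: str, content: str, version: int) -> bool:
        """Store content only if `version` is newer than what is indexed.

        Returns True if the write was applied, False if it was rejected as stale.
        """
        current = self._docs.get(path)
        if current is not None and version <= current.version:
            return False  # stale or duplicate write: keep the existing content
        self._docs[path] = IndexedBlob(content, version)
        return True

    def get(self, path: str) -> str:
        return self._docs[path].content


index = BlobIndex()
new = "this is some text\nand this is some more text\nand even more!\n"
old = "this is some text\nand some more text\n"

applied_new = index.index_blob("a.txt", new, version=200)
# Backfilling the older commit afterwards must not replace the newer content.
applied_old = index.index_blob("a.txt", old, version=100)
```

With a check like this, the order in which commits are indexed no longer matters, and re-running the same index operation leaves the document unchanged.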

This issue stems from https://gitlab.com/gitlab-org/gitlab-ee/issues/8013 - why does a later index operation completely negate an earlier one?

Edited Mar 07, 2019 by Mario de la Ossa