Elasticsearch rake task with UPDATE_INDEX does not work as documented
Summary
The UPDATE_INDEX parameter does not work as specified in the docs (https://docs.gitlab.com/ee/integration/elasticsearch.html#indexing-large-instances): it ignores old commits.
Steps to reproduce
- Create a project, commit several times to it.
- Enable Elasticsearch indexing.
- Commit again to the project.
- Check the index: it should contain only this last commit.
- Run:
gitlab-rake gitlab:elastic:index_repositories UPDATE_INDEX=true ID_FROM=<project_id> ID_TO=<project_id>
- Check the commits on the index again: only the last one is present; the old ones were not added.
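For the two "check the index" steps, a rough sketch like the following could compare the commits in the repository with the commits Elasticsearch returns for the project. The Elasticsearch URL, the index name `gitlab-production`, and the field names `commit.rid` and `commit.sha` are assumptions on my side, not taken from the docs; they should be checked against the actual index mapping before relying on the result.

```python
import json
import subprocess
import urllib.request

# Assumptions (not from the GitLab docs): adjust to the real instance.
ES_URL = "http://localhost:9200"
INDEX = "gitlab-production"


def build_commit_query(project_id, size=10000):
    """Build a search body that asks for the SHAs of one project's indexed commits.

    The field names commit.rid / commit.sha are assumed, not verified.
    """
    return {
        "size": size,
        "_source": ["commit.sha"],
        "query": {"term": {"commit.rid": str(project_id)}},
    }


def indexed_shas(project_id):
    """Return the set of commit SHAs Elasticsearch holds for the project."""
    body = json.dumps(build_commit_query(project_id)).encode("utf-8")
    req = urllib.request.Request(
        f"{ES_URL}/{INDEX}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        hits = json.load(resp)["hits"]["hits"]
    return {hit["_source"]["commit"]["sha"] for hit in hits}


def repo_shas(repo_path):
    """Return the set of all commit SHAs reachable from any ref in the repository."""
    out = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    return set(out.split())


def missing_from_index(repo, indexed):
    """Return the sorted SHAs that are in the repository but not in the index."""
    return sorted(set(repo) - set(indexed))
```

Calling `missing_from_index(repo_shas(path), indexed_shas(project_id))` before and after the rake task should return an empty list afterwards if UPDATE_INDEX behaves as documented; in our case the old commits would still be listed.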
What is the current bug behavior?
If we enable ES indexing for everything new that is pushed/created on the GitLab instance, only the last commit is indexed. This makes sense, as it only indexes new content, not existing content. The docs (https://docs.gitlab.com/ee/integration/elasticsearch.html#indexing-large-instances) describe the expected behaviour of the UPDATE_INDEX parameter as follows:
As the indexer stores the last commit SHA of every indexed repository in the database,
you can run the indexer with the special parameter UPDATE_INDEX and it will check every project
repository again to make sure that every commit in that repository is indexed.
As this explains, using UPDATE_INDEX should index old commits as well if they were not already indexed. This is not happening: only the last one stays on the index.
We have a big instance with a lot of projects. It is in use all the time, so we cannot take downtime to index everything first and only then enable the indexing of new content.
What is the expected correct behavior?
ALL commits should appear on the index after running an indexing rake task with UPDATE_INDEX.
We are using GitLab 10.5.7-ee.