Use subprocesses for ElasticSearch index jobs
(@vsizov please correct me if I am misrepresenting how ES works at the moment (8.4/8.5))
When a user pushes a new commit (and ElasticSearch is enabled) we let Sidekiq index the changes in the commit and the contents of each (non-binary) blob in the tree. Our ES index rake task loops through all projects and indexes the last commit that was pushed and each blob in the tree at that commit.
What we see happening now on gitlab.com is that the index rake tasks use a lot of memory, even when limited to 1000 projects; 7+ GB is not uncommon.
Considering how badly our long-running Ruby processes leak memory, and how much data is loaded into memory during ES indexing, I think it would be worth it to do the actual indexing in subprocesses. Using fork from a GitLab application process is likely to be fragile; it would probably be more reliable to create a script in bin/ that indexes a repo at a given commit and sends the data to ElasticSearch:
```
bin/index-commit /path/to/repo.git COMMIT-SHA
```
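The caller (a Sidekiq worker or the rake task) could then shell out to that script, so all memory used for indexing is released to the OS when the child exits. A rough sketch, assuming a hypothetical `index_commit` wrapper (the script path and error handling are placeholders, not a final design):

```ruby
require "open3"

# Hypothetical wrapper: run the indexing script in a child process.
# Whatever memory the child allocates (repo data, ES payloads) is
# freed when it exits, so the long-running parent stays small.
def index_commit(repo_path, sha, script: "bin/index-commit")
  stdout, stderr, status = Open3.capture3(script, repo_path, sha)
  raise "indexing #{repo_path}@#{sha} failed: #{stderr}" unless status.success?
  stdout
end
```

Because the child is spawned fresh rather than forked, it does not inherit the parent's heap, which sidesteps the leak problem entirely.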
If we don't do something like this, I foresee Sidekiq processes and index rake tasks blowing up on memory once certain customers start using the ElasticSearch feature in GitLab EE.