Buffer Writes to Elasticsearch and use the Bulk API
Problem
At present we write to Elasticsearch for every change to every model. These very frequent writes cause performance issues at scale.
Before we reach full GitLab.com scale, we will likely need to buffer/batch these updates and use the bulk API.
Our current implementation is also likely inadequate for our largest self-hosted customers.
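For illustration, the current behaviour amounts to something like the sketch below (Python with the official `elasticsearch` client; the actual implementation is Ruby/Sidekiq, and the index name, document id scheme, and client setup here are assumptions, not GitLab's code):

```python
# Hedged illustration of the current pattern: one Elasticsearch write per model
# change, issued at the time of the change. Index and field names are made up.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")


def on_model_changed(model_type: str, model_id: int, attributes: dict) -> None:
    # One HTTP request to Elasticsearch for every single change.
    es.index(index="gitlab", id=f"{model_type}-{model_id}", body=attributes)
```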
Solution
Summarised from #34086 (comment 230326472)
- Updates to models go into sorted sets in Redis rather than being queued as individual Sidekiq jobs.
- A periodic Sidekiq worker runs every minute and pops the top 1k update jobs from the sorted set.
- The 1k update jobs are transformed into a single bulk update request for Elasticsearch.
- Failed updates are retried (see the sketch below).
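A minimal Python sketch of that flow, assuming a single Redis sorted set scored by timestamp and the `elasticsearch` bulk helper; the key name, index name, and serialisation format are illustrative, not the actual implementation:

```python
# Producer buffers changes into a Redis sorted set; a periodic worker pops up
# to 1k entries and sends them to Elasticsearch in one bulk request, pushing
# failures back into the set so a later run retries them.
import json
import time

import redis
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

BUFFER_KEY = "elastic:incremental_updates"  # hypothetical sorted-set key
BATCH_SIZE = 1_000                          # "top 1k update jobs" per run

r = redis.Redis()
es = Elasticsearch("http://localhost:9200")


def buffer_update(model_type: str, model_id: int, doc: dict) -> None:
    """Producer side: record the change in the sorted set instead of
    enqueueing one job per change."""
    member = json.dumps({"type": model_type, "id": model_id, "doc": doc})
    # Score by timestamp so the oldest updates are popped first.
    r.zadd(BUFFER_KEY, {member: time.time()})


def process_buffered_updates() -> None:
    """Consumer side: the periodic worker pops up to BATCH_SIZE entries and
    sends them to Elasticsearch as a single bulk request."""
    entries = r.zpopmin(BUFFER_KEY, BATCH_SIZE)
    if not entries:
        return

    actions, by_id = [], {}
    for member, score in entries:
        update = json.loads(member)
        doc_id = f"{update['type']}-{update['id']}"
        by_id[doc_id] = (member, score)
        actions.append({
            "_op_type": "index",
            "_index": "gitlab",   # illustrative index name
            "_id": doc_id,
            "_source": update["doc"],
        })

    # raise_on_error=False collects per-document failures instead of aborting.
    _ok, errors = bulk(es, actions, raise_on_error=False)

    # Failed updates go back into the sorted set so a later run retries them.
    for error in errors:
        failed_id = list(error.values())[0].get("_id")
        if failed_id in by_id:
            member, score = by_id[failed_id]
            r.zadd(BUFFER_KEY, {member: score})
```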
Likely follow up issues
- Timed-out updates are retried in increasingly smaller batches, similar to #195774 (closed) (see the sketch after this list). This may be preferable to do in a follow-up issue, since running the worker every minute is unlikely to build up large payloads (i.e. we're unlikely to accumulate 1k jobs in a minute until we roll this out more widely).
- #205148 (closed)
- #205178
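A hedged sketch of the shrinking-batch retry idea from the first bullet above, assuming elasticsearch-py 7.x: on a timeout the batch is split in half and each half retried, so oversized payloads shrink until they succeed or a single failing document is isolated. Function names and the timeout value are illustrative:

```python
# Retry timed-out bulk requests in increasingly smaller batches by recursively
# halving the batch. Assumes elasticsearch-py 7.x exception classes.
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConnectionTimeout
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")


def bulk_with_shrinking_batches(actions: list, request_timeout: int = 30) -> None:
    if not actions:
        return
    try:
        # request_timeout is forwarded to the underlying bulk call.
        bulk(es, actions, request_timeout=request_timeout)
    except ConnectionTimeout:
        if len(actions) == 1:
            # A single document still times out: surface the error.
            raise
        mid = len(actions) // 2
        bulk_with_shrinking_batches(actions[:mid], request_timeout)
        bulk_with_shrinking_batches(actions[mid:], request_timeout)
```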
Feature flag
This functionality is gated behind a default-off feature flag: `:elastic_bulk_incremental_updates`. An issue to roll out, and subsequently remove, the feature flag is at #208717 (closed).
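For completeness, the gating behaviour amounts to something like the sketch below; `feature_enabled` and the two write helpers are stand-ins for illustration, not GitLab's actual feature-flag or indexing APIs:

```python
# Minimal sketch of gating the new path behind a default-off flag.
def feature_enabled(name: str) -> bool:
    # Stand-in for the real flag check; the flag defaults to off, so the
    # existing per-change write path keeps running until it is enabled.
    return False


def buffer_update(model_type: str, model_id: int, payload: dict) -> None:
    print(f"buffered {model_type} {model_id} for the next bulk run")


def write_immediately(model_type: str, model_id: int, payload: dict) -> None:
    print(f"wrote {model_type} {model_id} to Elasticsearch straight away")


def handle_model_change(model_type: str, model_id: int, payload: dict) -> None:
    if feature_enabled("elastic_bulk_incremental_updates"):
        buffer_update(model_type, model_id, payload)      # new buffered/bulk path
    else:
        write_immediately(model_type, model_id, payload)  # existing behaviour
```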