Confirm we have capacity in our queues, based on how frequently we're hitting 1000 updates in a minute and on the average payload size. The capacity is 1000 updates per minute, so hitting that cap more than half the time means we need to increase it.
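As a rough way to quantify "more than half the time", here is a minimal sketch; the one-column CSV of per-minute update counts is a hypothetical export from whatever dashboard or query we use for this graph:

```shell
# Count how often the per-minute update count sits at the 1000 cap.
# updates_per_minute.csv is a hypothetical one-column export of those counts.
awk '$1 >= 1000 { full++ } END { printf "at the 1000 cap %.0f%% of the time\n", 100 * full / NR }' updates_per_minute.csv
```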
Increase the cluster size based on the above consideration.
Using repository size as an estimate, we are only increasing overall storage by approximately 160 / 4680 ≈ 3%, and our cluster is only ~20% full, so we should be fine for storage.
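A quick back-of-the-envelope check of those numbers (they come from this issue; this assumes most of the ~20% used space belongs to the index being grown):

```shell
awk 'BEGIN {
  growth = 160 / 4680   # ~3.4% more indexed repository data
  printf "index growth ~%.1f%%, cluster ~%.1f%% full after indexing\n", growth * 100, 20 * (1 + growth)
}'
```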
Mention in #support_gitlab-com: We are expanding our roll-out of Elasticsearch to more bronze customers on GitLab.com. These customers may notice changes in their global searches. Please let us know if you need help investigating any related tickets. Follow progress at https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2209. The most reliable way to know if a group has Elasticsearch enabled is to see the "Advanced search functionality is enabled" indicator at the top right of the search results page.
Create a silence for the alert "The elastic_indexer queue, main stage, has a queue latency outside of SLO" at https://alerts.gprd.gitlab.net/#/silences/new with the matchers env="gprd", type="sidekiq", priority="elasticsearch".
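If scripting the silence is preferable to the web form, something like this should create an equivalent one with amtool; the Alertmanager API URL, comment, and duration here are assumptions to adjust for the change request:

```shell
amtool silence add env="gprd" type="sidekiq" priority="elasticsearch" \
  --alertmanager.url="https://alerts.gprd.gitlab.net" \
  --comment="Expected Sidekiq backlog while rolling out Elasticsearch to bronze (production#2233)" \
  --duration="12h"
```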
Check with the SRE on call in #production: @sre-oncall I would like to roll out this change https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2233. Please let me know if there are any ongoing incidents or any other reason to hold off for the time being. I have also created an alert silence for the sidekiq SLO, since this change will very likely backlog the queues, which is expected: https://alerts.gprd.gitlab.net/#/silences/6cfc1075-b0ea-4754-9289-1d434ba98b87. The overall indexing will probably take around 8 hrs to complete. Note we will also be increasing the number of Elasticsearch sidekiq workers for the duration of this change request.
Scale up sidekiq fleet
Note the disk space and number of documents for the gitlab-production index.
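A minimal way to capture both numbers, assuming $ES_URL (plus any required credentials) points at the production Elasticsearch cluster:

```shell
# Record document count and on-disk size for the gitlab-production index.
curl -s "$ES_URL/_cat/indices/gitlab-production?v&h=index,docs.count,store.size,pri.store.size"
```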
Log in as an admin and create a personal access token with API scope that expires tomorrow.
Invoke the API to increase the rollout percentage:
curl -X PUT -H "Private-Token: $API_TOKEN" -i 'https://gitlab.com/api/v4/elasticsearch_indexed_namespaces/rollout?plan=bronze&percentage=33'
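A variant of the same call that keeps only the response headers, assuming the token from the previous step is exported as $API_TOKEN; GitLab returns the request's correlation ID in the X-Request-Id header, which is what the log check below keys on:

```shell
curl -s -X PUT -H "Private-Token: $API_TOKEN" -D - -o /dev/null \
  'https://gitlab.com/api/v4/elasticsearch_indexed_namespaces/rollout?plan=bronze&percentage=33' \
  | grep -i '^x-request-id'
```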
Note the start time (2020-06-14 22:17 UTC) and update the Due Date in the table.
Look for done jobs with that correlation_id (the X-Request-Id captured above). When indexing finishes, there should be 3 done jobs per project in the group (1x ElasticIndexerWorker, 2x ElasticCommitIndexerWorker).
Scale the sidekiq fleet back down => Skipped per #2209 (comment 360804431), as we will be keeping this new scale.
If memory usage starts growing rapidly, it might get OOM-killed, which would be really bad because we would lose all other queued jobs (for example, scheduled CI jobs). A quick node-level check for OOM kills is sketched below.
If it gets close to SLO levels, the rate of growth should be evaluated and the indexing should potentially be stopped.
If any nodes on this graph are maxed out for a long period of time, correlated with enabling this change, we should disable it. First confirm by shutting down ElasticCommitIndexerWorker that doing so helps, then stop the rollout if the load is clearly correlated.
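If an OOM kill needs to be confirmed directly on an affected node rather than from the graphs, a generic Linux check looks like this (how the node is reached is outside this list):

```shell
# Look for recent kernel OOM-killer activity on the node.
sudo dmesg -T | grep -iE 'out of memory|oom-kill' | tail -n 20
```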
curl -X PUT -H "Private-Token: $API_TOKEN" -i 'https://gitlab.com/api/v4/elasticsearch_indexed_namespaces/rollback?plan=bronze&percentage=5'
Nuclear option: stop sidekiq-elasticsearch completely (note this might leave the ES index broken, since the worker records in the database which objects it has processed). Check the current state with sudo gitlab-ctl status sidekiq-cluster before stopping; see the sketch below.
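For reference, a minimal sequence for that nuclear option on an Omnibus-managed Sidekiq node, assuming the elasticsearch workers run under sidekiq-cluster there:

```shell
# See what sidekiq-cluster is currently running, then stop it entirely.
sudo gitlab-ctl status sidekiq-cluster
sudo gitlab-ctl stop sidekiq-cluster
```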