Add 33% of Bronze customers to Elasticsearch advanced global search rollout
C3
Production Change - Criticality 3

| Change Objective | Add 33% of Bronze customers to the Elasticsearch advanced global search rollout |
|---|---|
| Change Type | Operation |
| Services Impacted | Advanced Search (ES integration), Sidekiq, Redis, Gitaly, PostgreSQL, Elastic indexing cluster |
| Change Team Members | @DylanGriffith @cmiskell |
| Change Severity | C3 |
| Change Reviewer or tested in staging | This has been done before on production #2185 (closed) |
| Dry-run output | - |
| Due Date | 2020-06-14 22:17 UTC |
| Time tracking | |
Detailed steps for the change
Pre-check
- Be aware that this issue will be public so we should not mention customer names
- Estimate sizes of groups using this script: https://gitlab.com/gitlab-org/gitlab/-/issues/211756#script-to-estimate-size
- Confirm we have capacity in our queues based on how frequently we're hitting 1000 updates in a minute and the average payload. Capacity is 1000 updates per minute, so hitting that more than half the time means we need to increase capacity (a rough rails-console sketch for this check follows the size table below).
- Increase cluster size based on the above consideration
  - Using repository size to estimate, we are only increasing overall storage by approximately 160 GB / 4.68 TB ≈ 3%, and our cluster is only ~20% full, so we should be good for storage.

|  | namespaces | projects | repository size | issues | merge requests | comments |
|---|---|---|---|---|---|---|
| Currently in index | 2087 | 129202 | 4.68 TB | 1.78 M | 2.97 M | -1 |
| Added to index | 75 | 2600 | 160 GB | 30.1 K | 72.7 K | -1 |
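The capacity pre-check above is easiest to eyeball from the rails console. The following is a minimal sketch, assuming only the `Elastic::ProcessBookkeepingService.queue_size` call referenced in the Monitoring section; the sample count and the once-a-minute interval are illustrative, not part of the original plan.

```ruby
# Sketch: sample the incremental indexing queue once a minute and report how often
# it is at or above the ~1,000 updates-per-minute processing capacity mentioned above.
# Elastic::ProcessBookkeepingService.queue_size is the call used in the Monitoring
# section; 10 samples spaced 60s apart is an arbitrary choice for illustration.
samples = Array.new(10) do
  size = Elastic::ProcessBookkeepingService.queue_size
  sleep 60
  size
end

at_capacity = samples.count { |size| size >= 1_000 }
puts "#{at_capacity}/#{samples.size} samples at or above the 1,000-update capacity"
puts "more than half the samples are at capacity - consider increasing capacity first" if at_capacity > samples.size / 2
```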
Roll out
- Re-run the size estimate to confirm it hasn't increased significantly since last time: https://gitlab.com/gitlab-org/gitlab/-/issues/211756#script-to-estimate-size
- Mention in #support_gitlab-com: "We are expanding our roll-out of Elasticsearch to more Bronze customers on GitLab.com. These customers may notice changes in their global searches. Please let us know if you need help investigating any related tickets. Follow progress at https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2209. The most reliable way to know if a group has Elasticsearch enabled is to see the 'Advanced search functionality is enabled' indicator at the top right of the search results page."
- Create a silence for the alert "The `elastic_indexer` queue, `main` stage, has a queue latency outside of SLO" at https://alerts.gprd.gitlab.net/#/silences/new with `env="gprd" type="sidekiq" priority="elasticsearch"`
- Check with the SRE on call in #production: "@sre-oncall I would like to roll out this change https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2233. Please let me know if there are any ongoing incidents or any other reason to hold off for the time being. I have also created a relevant alert silence for the Sidekiq SLO, since this will very likely backlog the queues, which is fine: https://alerts.gprd.gitlab.net/#/silences/6cfc1075-b0ea-4754-9289-1d434ba98b87. The overall indexing will probably take around 8 hrs to complete. Note we will also be increasing the number of Elasticsearch Sidekiq workers for the duration of this change request."
- Scale up sidekiq fleet
- Note disk space and number of documents for the `gitlab-production` index
- Log in as admin and get a personal access token with API scope that expires tomorrow
- Invoke the API to add the percentage:
  `curl -X PUT -H "Private-Token: $API_TOKEN" -i 'https://gitlab.com/api/v4/elasticsearch_indexed_namespaces/rollout?plan=bronze&percentage=33'`
- Note start time: 2020-06-14 22:17 UTC and update the Due Date in the table above
- Wait for namespaces to finish indexing
  - Look at [Sidekiq Queue Lengths per Queue](https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1)
  - Find the correlation ID for the original `ElasticNamespaceRolloutWorker` job:
    - Correlation ID: `LJyTRuXXrO9`
  - Look for `done` jobs with that `correlation_id`. When you are finished there should be 3 jobs done (1x `ElasticIndexerWorker`, 2x `ElasticCommitIndexerWorker`) per project in the group (a rough expected-job-count sketch follows the index-size table below).
- Scale back down sidekiq fleet => Skipped per #2209 (comment 360804431) as we will be leaving the new scale in place.
- Note end time: 2020-06-14 23:30 UTC
- Note time taken: 73 mins
- Note increase in index size
- Check for [any failed projects for that `correlation_id`](https://log.gprd.gitlab.net/goto/b6f4877e7db12153019662a64e312791) and manually retry them if necessary (Kibana with the correlation ID)

In the `gitlab-production` index:

|  | Before | After |
|---|---|---|
| Total | 1.5 TB | 1.5 TB |
| Documents | 150.5 M | 153.3 M |
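For the "done jobs per correlation ID" check above, a quick back-of-the-envelope count helps confirm the log query has caught everything. This is only a sketch using the "Added to index" project count from the size-estimate table and the 3-jobs-per-project rule stated above; the numbers are the estimates recorded in this CR, not live values.

```ruby
# Sketch: expected number of `done` jobs for the rollout correlation_id.
# 2,600 is the "Added to index" project count from the size-estimate table above;
# each project produces 3 jobs (1x ElasticIndexerWorker, 2x ElasticCommitIndexerWorker).
projects_added   = 2_600
jobs_per_project = 3

expected_done_jobs = projects_added * jobs_per_project
puts "expect roughly #{expected_done_jobs} done jobs for this correlation_id"
```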
Monitoring
Key metrics to observe
- GitLab admin panel:
- Grafana:
  - Platform triage
  - Sidekiq:
    - Sidekiq SLO dashboard overview
    - Sidekiq Queue Lengths per Queue
      - Expected to climb during initial indexing, with a sudden drop-off once the initial indexing jobs are finished.
    - Sidekiq Inflight Operations by Queue
    - Node Maximum Single Core Utilization per Priority
      - Expected to be 100% during initial indexing
  - Redis-sidekiq:
    - Redis-sidekiq SLO dashboard overview
    - Memory Saturation (a rails-console sketch for spot-checking this follows this list)
      - If memory usage starts growing rapidly it might get OOM killed (which would be really bad because we would lose all other queued jobs, for example scheduled CI jobs).
      - If it gets close to SLO levels, the rate of growth should be evaluated and the indexing should potentially be stopped.
- Incremental updates queue:
  - Chart "Global search incremental indexing queue depth" at https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
  - From the rails console: `Elastic::ProcessBookkeepingService.queue_size`
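For the Redis-sidekiq memory concern above, a quick spot check from the rails console can complement the Grafana saturation panel. This is a minimal sketch, assuming only the standard `Sidekiq.redis` helper and the Redis `INFO memory` command; the 80% warning level is illustrative, not the actual SLO threshold.

```ruby
# Sketch: spot-check redis-sidekiq memory saturation from the rails console.
# Assumes Sidekiq.redis yields a Redis connection and uses the standard INFO memory
# command; the 80% warning level below is illustrative, not the SLO threshold.
Sidekiq.redis do |redis|
  info = redis.info('memory')
  used = info['used_memory'].to_f
  max  = info['maxmemory'].to_f

  if max.positive?
    pct = (used / max * 100).round(1)
    puts "redis-sidekiq memory: #{pct}% of maxmemory"
    puts 'memory growing towards the limit - evaluate growth rate / consider stopping indexing' if pct > 80
  else
    puts "redis-sidekiq used_memory: #{(used / 1024 / 1024).round(1)} MiB (no maxmemory configured)"
  end
end
```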
Other metrics to observe
- Elastic support diagnostics: In the event of any issues (e.g. seeing too many threads like last time in https://support.elastic.co/customers/s/case/5004M00000cL8IJ/cluster-crashing-under-high-load-possibly-due-to-jvm-heap-size-too-large-again) we can grab the support diagnostics per the instructions at https://support.elastic.co/customers/s/article/support-diagnostics
- Grafana:
  - Rails:
  - Postgres:
    - patroni SLO dashboard overview
    - postgresql overview
    - pgbouncer SLO dashboard overview
    - pgbouncer overview
    - "Waiting Sidekiq pgbouncer Connections"
      - If we see this increase to, say, 500 and stay there, we should be concerned and disable indexing at that point (a small sketch of this "sustained above threshold" rule follows this list).
  - Gitaly:
    - Gitaly SLO dashboard overview
    - Gitaly latency
    - Gitaly saturation overview
    - Gitaly single node saturation
      - If any nodes on this graph are maxed out for a long period of time, correlated with enabling this change, we should disable it. We should first confirm, by shutting down `ElasticCommitIndexerWorker`, that doing so helps, and then stop if it's clearly correlated.
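The "increases to ~500 and stays there" criterion above is about sustained saturation rather than a brief spike. The helper below is purely illustrative (not part of the original CR): it encodes that rule for a series of samples read off the pgbouncer chart, with the 500-connection threshold taken from the bullet above and an arbitrary window size.

```ruby
# Illustrative only: decide whether "Waiting Sidekiq pgbouncer Connections" has been
# sustained at/above the threshold, as opposed to a short spike. Samples would be
# read off the Grafana chart; the window size is an arbitrary choice here.
def sustained_above?(samples, threshold: 500, window: 5)
  recent = samples.last(window)
  recent.size == window && recent.all? { |value| value >= threshold }
end

puts sustained_above?([120, 480, 510, 530, 525, 540, 550]) # => true  (disable indexing)
puts sustained_above?([120, 600, 130, 140, 120, 110, 100]) # => false (just a spike)
```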
Rollback steps
- General "ES integration in Gitlab" runbook: https://gitlab.com/gitlab-com/runbooks/blob/master/elastic/doc/elasticsearch-integration-in-gitlab.md
- Rollback to the previous percentage with:
  - `curl -X PUT -H "Private-Token: $API_TOKEN" -i 'https://gitlab.com/api/v4/elasticsearch_indexed_namespaces/rollback?plan=bronze&percentage=5'`
- Nuclear option: stop sidekiq-elasticsearch completely (note this might lead to a broken ES index, as the worker records in the database which objects it has processed): `sudo gitlab-ctl status sidekiq-cluster`
- Nuclear option: disable anything related to ES in the entire GitLab instance (note this does not clear sidekiq queues or kill running jobs): https://gitlab.com/gitlab-com/runbooks/blob/master/elastic/doc/elasticsearch-integration-in-gitlab.md#disabling-es-integration (a rails-console sketch of what this amounts to follows this list)
- Nuclear option: clear the elasticsearch sidekiq queue: https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/large-sidekiq-queue.md#dropping-an-entire-queue
- Nuclear option: stop running sidekiq jobs: https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/large-sidekiq-queue.md#kill-running-jobs-as-opposed-to-removing-jobs-from-a-queue
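The linked runbook is the source of truth for disabling the ES integration; as a rough illustration only, the sketch below shows approximately what that admin-panel toggle amounts to from a rails console. It assumes the integration is controlled by the `elasticsearch_indexing` and `elasticsearch_search` application settings.

```ruby
# Illustrative sketch only - follow the linked runbook for the real procedure.
# Assumes the ES integration is controlled by the elasticsearch_indexing /
# elasticsearch_search application settings toggled from the admin panel.
settings = ApplicationSetting.current

settings.update!(
  elasticsearch_indexing: false, # stop enqueueing new indexing work
  elasticsearch_search:   false  # stop serving search results from Elasticsearch
)

# As with the runbook step, this does not clear sidekiq queues or kill running jobs.
```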
Changes checklist
- Detailed steps and rollback steps have been filled prior to commencing work
- Person on-call has been informed prior to change being rolled out