Add 10% of Bronze customers to Elasticsearch advanced global search rollout
Production Change - Criticality 3 (C3)
| Change Objective | Add 10% of Bronze customers to the Elasticsearch advanced global search rollout |
|---|---|
| Change Type | Operation |
| Services Impacted | Advanced Search (ES integration), Sidekiq, Redis, Gitaly, PostgreSQL, Elastic indexing cluster |
| Change Team Members | @DylanGriffith |
| Change Severity | C3 |
| Change Reviewer or tested in staging | This has been done before on production #1788 (closed) |
| Dry-run output | - |
| Due Date | 2020-04-09 05:22:12 UTC |
| Time tracking | |
## Detailed steps for the change
### Pre-check

- Be aware that this issue will be public, so we should not mention customer names.
- Private conversations found in: https://gitlab.com/gitlab-org/gitlab/issues/208877
- Estimate sizes of groups using this script: https://gitlab.com/gitlab-org/gitlab/-/issues/211756#script-to-estimate-size
- Confirm we have capacity in our queues based on how frequently we're hitting 1000 updates in a minute and the average payload. Capacity is 1000, so hitting that more than half the time means we need to increase capacity.
  - Analysis from https://log.gprd.gitlab.net/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:'2020-04-08T04:45:50.095Z',to:'2020-04-09T04:45:52.723Z'))&_a=(columns:!(json.records_count),index:AW5F1e45qthdGjPJueGO,interval:auto,query:(language:kuery,query:bulk_indexing_start),sort:!(!(json.time,desc))) shows an average of 282 updates per bulk run, so doubling the number of groups (roughly 2 × 282 ≈ 564) stays below the 1000 limit, and we only see 1000 updates about 10% of the time.
- Increase cluster size based on the above consideration. We are only at 10% disk usage right now and this will only increase the total index size by a factor of ~2, so we shouldn't expect to be above 20% disk usage afterwards (see the disk-headroom sketch after the table below).
| | namespaces | projects | repository size | issues | merge requests | comments |
|---|---|---|---|---|---|---|
| Currently in index | 310 | 31214 | 806 GB | 405 K | 606 K | -1 |
| Added to index | 301 | 14010 | 715 GB | 307 K | 414 K | -1 |
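
A minimal sketch for double-checking the disk-headroom claim above via Elasticsearch's cat allocation API. `ES_URL`, `ES_USER`, and `ES_PASS` are placeholders (assumptions), not values from this issue:

```shell
# Hedged sketch: confirm current disk usage on the indexing cluster before the rollout.
# ES_URL / ES_USER / ES_PASS are placeholders for the production Elastic cluster endpoint.
curl -s -u "$ES_USER:$ES_PASS" "$ES_URL/_cat/allocation?v&h=node,disk.percent,disk.used,disk.total"

# Roughly: if disk.percent is ~10% now, doubling the index size should keep us near ~20%.
```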
### Roll out
- Re-run the size estimate to confirm it hasn't increased significantly since last time: https://gitlab.com/gitlab-org/gitlab/-/issues/211756#script-to-estimate-size
- Mention in #support_gitlab-com: "@support-dotcom We are expanding our roll-out of Elasticsearch to more bronze customers on GitLab.com. These customers may notice changes in their global searches. Please let us know if you need help investigating any related tickets. Follow progress at https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1925. The most reliable way to know if a group has Elasticsearch enabled is to see the "Advanced search functionality is enabled" indicator at the top right of the search results page."
- Create a silence for the alert "The `elastic_indexer` queue, `main` stage, has a queue latency outside of SLO" at https://alerts.gprd.gitlab.net/#/silences/new with `env="gprd" type="sidekiq" priority="elasticsearch"` (a hedged `amtool` alternative is sketched at the end of this section).
- Check with the SRE on call in #production: "@sre-oncall I would like to roll out this change https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1925. Please let me know if there are any ongoing incidents or any other reason to hold off for the time being. Please note this may trigger a high CPU alert for sidekiq workers, but since Elasticsearch has a dedicated sidekiq fleet it should not impact any other workers. I have also created a relevant alert silence for the sidekiq SLO, since it will very likely backlog the queues, which is fine: https://alerts.gprd.gitlab.net/#/silences/11be0b54-5e8c-43e2-b76a-ead16dee8803. The overall indexing will probably take around 26 hrs based on doing the same thing yesterday, so I've made the silence last 36 hours to be safe."
- Note disk space and number of documents for the `gitlab-production` index.
- Log in as admin and get a personal access token with API scope that expires tomorrow.
- Invoke the API to add the percentage: `curl -X PUT -H "Private-Token: $API_TOKEN" -i 'https://gitlab.com/api/v4/elasticsearch_indexed_namespaces/rollout?plan=bronze&percentage=15'`
  - Due to gitlab-org/gitlab#213777 (closed) we need to do this via the Rails console instead, with: `ElasticsearchIndexedNamespace.drop_limited_ids_cache!; ElasticNamespaceRolloutWorker.perform_async('bronze', 10, 'rollout'); ElasticsearchIndexedNamespace.drop_limited_ids_cache!` - #1925 (comment 320731446)
- Note start time: 2020-04-09 05:22:12 UTC and update the Due Date in the table.
- Wait for namespaces to finish indexing:
  - Look at `Sidekiq Queue Lengths per Queue`.
  - Find the correlation ID for the original API request: XX
  - Look for `done` jobs with that `correlation_id`. When indexing is finished there should be 3 done jobs (1x `ElasticIndexerWorker`, 2x `ElasticCommitIndexerWorker`) per project in the group (a hedged sketch for estimating the expected total is at the end of this section).
- Note end time: 2020-04-10 10:56 UTC
- Note time taken: ~30 hr
- Note increase in index size.
- Check for [any failed projects for that correlation_id](https://log.gprd.gitlab.net/goto/b6f4877e7db12153019662a64e312791) and manually retry them if necessary.
- Test out searching (a hedged API spot-check is sketched at the end of this section).
In the `gitlab-production` index:

| | Before | After |
|---|---|---|
| Total | 1.1 TB | 2.1 TB |
| Documents | 36.8 M | 64.4 M |
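
For the alert-silence step above, the web UI at https://alerts.gprd.gitlab.net/#/silences/new is the documented route; a hedged alternative using `amtool` (assuming it can reach the same Alertmanager) would look roughly like:

```shell
# Hedged sketch: create the same silence from the command line with amtool.
# The Alertmanager URL and the 36h duration mirror the values mentioned in this issue;
# adjust the author/comment as needed.
amtool silence add \
  --alertmanager.url="https://alerts.gprd.gitlab.net" \
  --author="@DylanGriffith" \
  --comment="Bronze ES rollout - elastic_indexer queue expected to backlog (production#1925)" \
  --duration="36h" \
  env="gprd" type="sidekiq" priority="elasticsearch"
```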
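For the "Look for `done` jobs" step, a rough way to know how many jobs to expect (3 per project). This is a hedged sketch: it assumes `ElasticsearchIndexedNamespace` exposes a `namespace` association, and it counts every opted-in namespace rather than only the newly added ones, so treat the result as an upper bound:

```shell
# Hedged sketch: estimate the expected number of "done" jobs (3 per project) across
# all namespaces currently opted in to Elasticsearch. Run on a Rails node.
sudo gitlab-rails runner '
  projects = 0
  ElasticsearchIndexedNamespace.find_each do |rec|
    projects += rec.namespace.all_projects.count
  end
  puts "projects=#{projects} expected_done_jobs=#{projects * 3}"
'
```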
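For the "Test out searching" step, besides checking the UI indicator, a hedged spot-check via the group search API; `GROUP_ID` and the search term are placeholders for one of the newly rolled-out groups:

```shell
# Hedged sketch: blob-scoped group search only returns results when Elasticsearch
# (Advanced Search) is enabled for the group, so a sensible response is a good sign.
curl -s -H "Private-Token: $API_TOKEN" \
  "https://gitlab.com/api/v4/groups/$GROUP_ID/search?scope=blobs&search=README"
```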
## Monitoring

### Key metrics to observe
- GitLab admin panel:
- Grafana:
  - Platform triage
- Sidekiq:
  - Sidekiq SLO dashboard overview
  - `Sidekiq Queue Lengths per Queue` - expected to climb during initial indexing, with a sudden drop-off once the initial indexing jobs are finished (a hedged console sketch for checking queue depths follows this list).
  - `Sidekiq Inflight Operations by Queue`
  - `Node Maximum Single Core Utilization per Priority` - expected to be 100% during initial indexing
- Redis-sidekiq:
  - Redis-sidekiq SLO dashboard overview
  - `Memory Saturation` - if memory usage starts growing rapidly it might get OOM killed (which would be really bad because we would lose all other queued jobs, for example scheduled CI jobs); a hedged direct memory check is sketched after this list.
    - If it gets close to SLO levels, the rate of growth should be evaluated and the indexing should potentially be stopped.
- Incremental updates queue:
  - Chart `Global search incremental indexing queue depth`: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
  - From the Rails console: `Elastic::ProcessBookkeepingService.queue_size`
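
As a console fallback for the queue-depth charts above, a hedged sketch reusing `Elastic::ProcessBookkeepingService.queue_size` from the item above; the Sidekiq queue names are assumptions based on the worker class names in this issue:

```shell
# Hedged sketch: check Sidekiq queue lengths and the incremental indexing queue depth
# from a Rails node, as an alternative to the Grafana charts.
sudo gitlab-rails runner '
  %w[elastic_indexer elastic_commit_indexer].each do |queue|
    puts "#{queue}: #{Sidekiq::Queue.new(queue).size}"
  end
  puts "incremental bookkeeping queue: #{Elastic::ProcessBookkeepingService.queue_size}"
'
```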
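For the Redis-sidekiq memory saturation item, a hedged sketch of a direct check; the host placeholder is an assumption and the dashboards remain the primary source:

```shell
# Hedged sketch: check memory usage directly on the redis-sidekiq primary.
# REDIS_SIDEKIQ_HOST is a placeholder; add -a if the instance requires auth.
redis-cli -h "$REDIS_SIDEKIQ_HOST" info memory | grep -E 'used_memory_human|maxmemory_human'
```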
### Other metrics to observe
- Elastic support diagnostics: in the event of any issues (e.g. too many threads, as we saw last time in https://support.elastic.co/customers/s/case/5004M00000cL8IJ/cluster-crashing-under-high-load-possibly-due-to-jvm-heap-size-too-large-again), we can grab the support diagnostics per the instructions at https://support.elastic.co/customers/s/article/support-diagnostics
- Grafana:
- Rails:
- Postgres:
  - patroni SLO dashboard overview
  - postgresql overview
  - pgbouncer SLO dashboard overview
  - pgbouncer overview
  - `Waiting Sidekiq pgbouncer Connections`
    - If we see this increase to, say, 500 and stay that way, we should be concerned and disable indexing at that point.
- Gitaly:
  - Gitaly SLO dashboard overview
  - Gitaly latency
  - Gitaly saturation overview
  - `Gitaly single node saturation`
    - If any nodes on this graph are maxed out for a long period of time, correlated with enabling this, we should disable it. We should first confirm by shutting down `ElasticCommitIndexerWorker` that this will help, and then stop if it's clearly correlated.
## Rollback steps

- General "ES integration in GitLab" runbook: https://gitlab.com/gitlab-com/runbooks/blob/master/elastic/doc/elasticsearch-integration-in-gitlab.md
- Rollback to the previous percentage with: `curl -X PUT -H "Private-Token: $API_TOKEN" -i 'https://gitlab.com/api/v4/elasticsearch_indexed_namespaces/rollback?plan=bronze&percentage=5'`
- Nuclear option: stop sidekiq-elasticsearch completely (note this might lead to a broken ES index as the worker updates in the database what objects it has processed): `sudo gitlab-ctl status sidekiq-cluster`
- Nuclear option: disable anything related to ES in the entire GitLab instance (note this does not clear sidekiq queues or kill running jobs): https://gitlab.com/gitlab-com/runbooks/blob/master/elastic/doc/elasticsearch-integration-in-gitlab.md#disabling-es-integration
- Nuclear option: clear the elasticsearch sidekiq queue: https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/large-sidekiq-queue.md#dropping-an-entire-queue (a hedged console sketch follows this list)
- Nuclear option: stop running sidekiq jobs: https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/large-sidekiq-queue.md#kill-running-jobs-as-opposed-to-removing-jobs-from-a-queue
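
The linked runbooks are authoritative; purely as an illustration, dropping the Elasticsearch queues from the console might look roughly like the following. The queue names are assumptions based on the worker class names above, and this discards all queued updates, as the runbook warns:

```shell
# Hedged sketch of the "clear elasticsearch sidekiq queue" nuclear option.
# Follow the linked runbook for the real procedure.
sudo gitlab-rails runner '
  %w[elastic_indexer elastic_commit_indexer].each do |queue|
    q = Sidekiq::Queue.new(queue)
    puts "clearing #{queue} (#{q.size} jobs)"
    q.clear
  end
'
```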
## Changes checklist

- Detailed steps and rollback steps have been filled in prior to commencing work
- Person on-call has been informed prior to change being rolled out