Add 30 more bronze customers to Elasticsearch advanced global search rollout
## Production Change - Criticality 3

| Change Objective | Describe the objective of the change |
|---|---|
| Change Type | Operation |
| Services Impacted | Advanced Search (ES integration), Sidekiq, Redis, Gitaly, PostgreSQL, Elastic indexing cluster |
| Change Team Members | @DylanGriffith @mwasilewski-gitlab @dgruzd |
| Change Severity | C3 |
| Change Reviewer or tested in staging | This was performed many times in the past in both staging and production, for example see: https://gitlab.com/gitlab-com/gl-infra/production/issues/1608 |
| Dry-run output | - |
| Due Date | 2020-03-11 08:44:00 UTC |
| Time tracking | |
## Detailed steps for the change

### Pre-check
- Be aware that this issue will be public, so we should not mention customer names. They should be referred to by number, 1 through 10.
- Top level groups can be found in: https://gitlab.com/gitlab-org/gitlab/issues/208877
- Estimate sizes of the groups: https://gitlab.com/gitlab-org/gitlab/issues/208877#customers using this script gitlab-org&1736 (closed) (see the sketch after this list)
- Increase cluster size based on the above considerations
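
The sizing script in gitlab-org&1736 is the authoritative way to estimate group sizes. Purely as an illustrative sketch (the group path is a placeholder, and it assumes `Group#all_projects` and `ProjectStatistics#repository_size` are available in this GitLab version), repository sizes for a top-level group can be summed from a Rails console:

```ruby
# Illustrative sketch only -- the sizing script in the linked epic is authoritative.
# The group path is a placeholder; Group#all_projects and
# ProjectStatistics#repository_size are assumed to exist in this GitLab version.
group = Group.find_by_full_path('customer-group-1')

repo_bytes = group.all_projects.includes(:statistics).sum do |project|
  project.statistics&.repository_size.to_i
end

puts "#{group.full_path}: #{group.all_projects.count} projects, " \
     "#{(repo_bytes / 1024.0**3).round(1)} GiB of repositories"
```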
### Roll out
- Mention in #support_gitlab-com:
  > We are expanding our roll-out of Elasticsearch to paid customers on GitLab.com. These customers may notice changes in their global searches. Please let us know if you need help investigating any related tickets. Follow progress at https://gitlab.com/gitlab-com/gl-infra/production/issues/1724 and note the customers enabled in https://gitlab.com/gitlab-org/gitlab/issues/208877. The most reliable way to know if a group has Elasticsearch enabled is to see the "Advanced search functionality is enabled" indicator at the top right of the search results page.
- Check with SRE on call in #production:
  > @sre-oncall I would like to roll out this change <LINK>. Please let me know if there are any ongoing incidents or any other reason to hold off for the time being. Please note this will likely trigger a high CPU alert for sidekiq workers, but since Elasticsearch has a dedicated sidekiq fleet it should not impact any other workers.
- Update Elasticsearch `refresh_interval` to `60` (see the first sketch after this list)
- Note disk space and number of documents for the `gitlab-production` index
- Log in as admin, go to Admin > Settings > Integrations > Elasticsearch
- Add the top level groups to indexed namespaces
- Note start time: `2020-03-11 08:44:00 UTC` and update Due Date in the table
- Wait for the namespace to finish indexing (see the Sidekiq queue sketch after this list):
  - Look at Sidekiq Queue Lengths per Queue
  - Look for `ElasticNamespaceIndexerWorker` and find the `correlation_id`
  - Look for `done` jobs with that `correlation_id`. When you are finished there should be 3 jobs done (1x `ElasticIndexerWorker`, 2x `ElasticCommitIndexerWorker`) per project in the group.
- Note end time: `2020-03-12 02:59:00 UTC`
- Note time taken: 18 hr 15m => this is not a good representation, though, because the workers were stopped for a long time due to an ongoing incident
- Note increase in index size
- Check for [any failed projects for that `correlation_id`](https://log.gprd.gitlab.net/goto/b6f4877e7db12153019662a64e312791) and manually retry them if necessary
  - Some failures happened but most were fixed on retry. There were 2 projects in the list that failed multiple times, but both of those ended up being indexed at some point, which I verified with `Project#index_status` for each of them (see the last sketch after this list).
- Test out searching
- Set Elasticsearch `refresh_interval` to `1`
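
The `refresh_interval` changes near the start and end of the list are index-level settings. A minimal sketch of making them with the elasticsearch-ruby client, assuming `ES_URL` points at the production cluster and that the `60`/`1` values in the checklist mean seconds:

```ruby
# Minimal sketch, assuming the `elasticsearch` gem and that ES_URL points at the
# production cluster. A longer refresh interval reduces refresh overhead while
# the initial bulk indexing runs; it is set back to 1s once indexing finishes.
require 'elasticsearch'

client = Elasticsearch::Client.new(url: ENV.fetch('ES_URL'))

client.indices.put_settings(
  index: 'gitlab-production',
  body: { index: { refresh_interval: '60s' } }
)

# After the roll out completes, restore the previous value:
# client.indices.put_settings(index: 'gitlab-production',
#                             body: { index: { refresh_interval: '1s' } })
```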
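
For the "Wait for the namespace to finish indexing" step, queue sizes can also be read from a Rails console in addition to the Grafana dashboards. A sketch; the `elastic_*` queue names are assumptions about this deployment, so check `Sidekiq::Queue.all` for the exact names:

```ruby
# Minimal sketch from a Rails console: report the Elasticsearch-related Sidekiq
# queues. Queue names starting with "elastic" are an assumption; the Grafana
# dashboards listed under Monitoring remain the primary source.
require 'sidekiq/api'

Sidekiq::Stats.new.queues
  .select { |name, _size| name.start_with?('elastic') }
  .each { |name, size| puts format('%-30s %d enqueued', name, size) }

# Age of the oldest job in one of the indexing queues, in seconds
puts Sidekiq::Queue.new('elastic_commit_indexer').latency
```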
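
The `Project#index_status` verification mentioned for the retried projects can be done roughly like this (the project ID is a placeholder, and the `indexed_at`/`last_commit` attributes are assumptions about the `IndexStatus` model):

```ruby
# Sketch of the spot check mentioned above. The project ID is a placeholder and
# the IndexStatus attributes (indexed_at, last_commit) are assumptions.
project = Project.find(1234)

if (status = project.index_status)
  puts "#{project.full_path}: indexed_at=#{status.indexed_at} last_commit=#{status.last_commit}"
else
  puts "#{project.full_path}: no index status recorded yet"
end
```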
In the `gitlab-production` index:

| | Before | After |
|---|---|---|
| Total | 597.7 GB | 1015.7 GB |
| Documents | 11.4 M | 21.3 M |
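
A sketch of how figures like the ones in this table can be read from the index stats API with the elasticsearch-ruby client (`ES_URL` is a placeholder; depending on whether replicas are counted, the totals may differ from the dashboard numbers):

```ruby
# Minimal sketch: read document count and store size for the gitlab-production
# index. ES_URL is a placeholder for the production cluster endpoint.
require 'elasticsearch'

client = Elasticsearch::Client.new(url: ENV.fetch('ES_URL'))

primaries = client.indices.stats(index: 'gitlab-production')
                  .dig('_all', 'primaries')

size_gb = primaries.dig('store', 'size_in_bytes') / 1024.0**3
puts format('documents=%d size=%.1f GB', primaries.dig('docs', 'count'), size_gb)
```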
## Monitoring

### Key metrics to observe
- Gitlab admin panel:
- Grafana:
  - Platform triage
  - Sidekiq:
    - Sidekiq SLO dashboard overview
    - Sidekiq Queue Lengths per Queue
      - Expected to climb during initial indexing and then drop off suddenly once the initial indexing jobs are finished.
    - Sidekiq Inflight Operations by Queue
    - Node Maximum Single Core Utilization per Priority
      - Expected to be 100% during initial indexing
  - Redis-sidekiq:
    - Redis-sidekiq SLO dashboard overview
    - Memory Saturation
      - If memory usage starts growing rapidly it might get OOM killed (which would be really bad because we would lose all other queued jobs, for example scheduled CI jobs).
      - If it gets close to SLO levels, the rate of growth should be evaluated and the indexing should potentially be stopped.
- Incremental updates queue: from the rails console, run `Elastic::ProcessBookkeepingService.queue_size` (see the sketch below)
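
A sketch of watching the incremental-update backlog together with Redis-sidekiq memory from a Rails console (the polling interval and iteration count are illustrative only):

```ruby
# Minimal sketch from a Rails console. Elastic::ProcessBookkeepingService.queue_size
# is the metric named above; the Redis memory figure is a rough proxy for the
# Redis-sidekiq saturation dashboards, not a replacement for them.
3.times do
  backlog   = Elastic::ProcessBookkeepingService.queue_size
  redis_mem = Sidekiq.redis { |conn| conn.info('memory')['used_memory_human'] }
  puts "#{Time.now.utc} incremental_backlog=#{backlog} redis_sidekiq_memory=#{redis_mem}"
  sleep 30
end
```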
### Other metrics to observe
- Elastic support diagnostics: in the event of any issues (e.g. seeing too many threads again, as in https://support.elastic.co/customers/s/case/5004M00000cL8IJ/cluster-crashing-under-high-load-possibly-due-to-jvm-heap-size-too-large-again) we can grab the support diagnostics per the instructions at https://support.elastic.co/customers/s/article/support-diagnostics
- Grafana:
  - Rails:
  - Postgres:
    - patroni SLO dashboard overview
    - postgresql overview
    - pgbouncer SLO dashboard overview
    - pgbouncer overview
    - "Waiting Sidekiq pgbouncer Connections"
      - If we see this increase to, say, 500 and stay that way, we should be concerned and disable indexing at that point
  - Gitaly:
    - Gitaly SLO dashboard overview
    - Gitaly latency
    - Gitaly saturation overview
    - Gitaly single node saturation
      - If any nodes on this graph are maxed out for a long period of time, correlated with enabling this, we should disable it. We should first confirm by shutting down `ElasticCommitIndexerWorker` that it will help, and then stop it if it's clearly correlated.
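
If the Elasticsearch cluster itself shows signs of stress, as in the linked support case, a quick health and thread-pool check can complement the support diagnostics. A sketch with the elasticsearch-ruby client (`ES_URL` is a placeholder):

```ruby
# Minimal sketch: spot-check cluster health and thread pool queues/rejections
# while indexing runs. ES_URL is a placeholder for the production cluster.
require 'elasticsearch'

client = Elasticsearch::Client.new(url: ENV.fetch('ES_URL'))

health = client.cluster.health
puts "status=#{health['status']} pending_tasks=#{health['number_of_pending_tasks']}"

# Rejections on the write/search thread pools are an early warning sign
puts client.cat.thread_pool(v: true, h: 'node_name,name,active,queue,rejected')
```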
## Rollback steps
- General "ES integration in Gitlab" runbook: https://gitlab.com/gitlab-com/runbooks/blob/master/elastic/doc/elasticsearch-integration-in-gitlab.md
- Remove the added namespaces from the admin panel
- Clear the elasticsearch sidekiq queue: https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/large-sidekiq-queue.md#dropping-an-entire-queue
- Stop running sidekiq jobs: https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/large-sidekiq-queue.md#kill-running-jobs-as-opposed-to-removing-jobs-from-a-queue
- Clean up changes made to the ES index (you actually don't need to do anything, src: https://gitlab.com/gitlab-com/runbooks/blob/master/elastic/doc/elasticsearch-integration-in-gitlab.md#cleaning-up-index)
- Set Elasticsearch `refresh_interval` back to `1`
- Nuclear option: stop sidekiq-elasticsearch completely (note this might lead to a broken ES index as the worker updates in the database which objects it has processed): `sudo gitlab-ctl status sidekiq-cluster`
- Nuclear option: disable anything related to ES in the entire Gitlab instance (note this does not clear sidekiq queues or kill running jobs): https://gitlab.com/gitlab-com/runbooks/blob/master/elastic/doc/elasticsearch-integration-in-gitlab.md#disabling-es-integration (see the sketch below)
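
The linked runbook is the canonical procedure for that last option. Purely as a hedged sketch of what "disable anything related to ES" amounts to from a Rails console, assuming the `elasticsearch_indexing` and `elasticsearch_search` application settings exist in this GitLab version:

```ruby
# Hedged sketch only -- follow the linked runbook for the real rollback.
# Assumes the elasticsearch_indexing / elasticsearch_search application settings.
settings = Gitlab::CurrentSettings.current_application_settings
settings.update!(elasticsearch_indexing: false, elasticsearch_search: false)

# Note: this does not clear the dedicated sidekiq queues or kill running jobs;
# that is covered by the "large sidekiq queue" runbook linked above.
```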
## Changes checklist
- Detailed steps and rollback steps have been filled in prior to commencing work
- Person on-call has been informed prior to change being rolled out