Reindex GitLab.com global search cluster

Production Change - Criticality 3 C3

Change Objective	Describe the objective of the change
Change Type	ConfigurationChange
Services Impacted	GitLab.com , Redis SharedState, Redis Sidekiq, Elasticsearch cluster, Sidekiq
Change Team Members	@DylanGriffith @aamarsanaa
Change Criticality	C3
Change Reviewer or tested in staging	Tested in staging at gitlab-com/runbooks!2017 (comment 311596689)
Dry-run output
Due Date	2020-04-01 05:20:26 UTC
Time tracking

Run all the steps on staging
- gitlab-com/runbooks!2017 (comment 311596689)
Run a dry run process in production where all we do is reindex the data to the Destination Cluster
- #1862 (comment 313510680)
- #1862 (comment 314402159)

~~CpFaAeAsQL6SOWYjJs1kLg:1319927~~ => failed with Remote responded with a chunk that was too large. Use a smaller batch size.
Triggered again with batch size of 1000 => AiFJYPk1Rn6eVaHTxVMTag:1198080

~~2020-04-01 05:20:26 UTC~~ => failed with Remote responded with a chunk that was too large. Use a smaller batch size.
Triggered again with batch size of 1000 at 2020-04-01 15:10:46 UTC

Switch GitLab settings to point back to Source Cluster
Ensure any updates that only went to Destination Cluster are replayed against Source Cluster by searching the logs for the updates https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23 and triggering those updates again using ProcessBookkeepingService#track

Gitlab admin panel:
- queue for ElasticIndexerWorker
- queue for ElasticCommitIndexerWorker
Grafana:
- Platform triage
- Sidekiq:
  - Sidekiq SLO dashboard overview
    - Sidekiq Queue Lengths per Queue
      - Expected to climb during initial indexing and sudden drop-off once initial indexing jobs are finished.
    - Sidekiq Inflight Operations by Queue
    - Node Maximum Single Core Utilization per Priority
      - expected to be 100% during initial indexing
- Redis-sidekiq:
  - Redis-sidekiq SLO dashboard overview
    - Memory Saturation
      - If memory usage starts growing rapidly it might get OOM killed (which would be really bad because we would lose all other queued jobs, for example scheduled CI jobs).
      - If it gets close to SLO levels, the rate of growth should be evaluated and the indexing should potentially be stopped.
Incremental updates queue:
- Chart Global search incremental indexing queue depth https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
- From rails console Elastic::ProcessBookkeepingService.queue_size

Elastic support diagnostics: In the event of any issues (eg. like last time seeing too many threads in https://support.elastic.co/customers/s/case/5004M00000cL8IJ/cluster-crashing-under-high-load-possibly-due-to-jvm-heap-size-too-large-again) we can grab the support diagnostics per instructions at https://support.elastic.co/customers/s/article/support-diagnostics
Grafana:
- Rails:
  - Search controller performance
- Postgres:
  - patroni SLO dashboard overview
  - postgresql overview
  - pgbouncer SLO dashboard overview
  - pgbouncer overview
  - "Waiting Sidekiq pgbouncer Connections"
    - If we see this increase to say 500 and stay that way then we should be concerned and disable indexing at that point
- Gitaly:
  - Gitaly SLO dashboard overview
  - Gitaly latency
  - Gitaly saturation overview
  - Gitaly single node saturation
    - If any nodes on this graph are maxed out for a long period of time correlated with enabling this we should disable it. We should first confirm by shutting down ElasticCommitIndexerWorker that it will help and then stop if it's clearly correlated.

If we got past the step "Change settings in GitLab > Admin > Settings > Integrations to point to Destination Cluster" then:
- Switch GitLab settings to point back to Source Cluster in GitLab > Admin > Settings > Integrations
If we got past the step "Re-enable indexing writes":
- Ensure any updates that only went to Destination Cluster are replayed against Source Cluster by searching the logs for the updates https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23 and triggering those updates again using ProcessBookkeepingService#track

If we stopped before "Re-enable indexing writes (start Elasticsearch sidekiq node) sudo gitlab-ctl start sidekiq-cluster" then all you need to do to rollback is:
- "Re-enable indexing writes (start Elasticsearch sidekiq node) sudo gitlab-ctl start sidekiq-cluster"

Edited Apr 02, 2020 by Dylan Griffith