Add 10% of Bronze customers to Elasticsearch advanced global search rollout
Production Change - Criticality 3 (C3)
| Change Objective | Add 10% of Bronze customers to the Elasticsearch advanced global search rollout |
|---|---|
| Change Type | Operation |
| Services Impacted | Advanced Search (ES integration), Sidekiq, Redis, Gitaly, PostgreSQL, Elastic indexing cluster |
| Change Team Members | @DylanGriffith |
| Change Severity | C3 |
| Change Reviewer or tested in staging | This has been done before on production #1788 (closed) |
| Dry-run output | - |
| Due Date | 2020-04-09 05:22:12 UTC |
| Time tracking | |
## Detailed steps for the change
### Pre-check

- Be aware that this issue will be public, so we should not mention customer names.
- Private conversations found in: https://gitlab.com/gitlab-org/gitlab/issues/208877
- Estimate sizes of groups using this script: https://gitlab.com/gitlab-org/gitlab/-/issues/211756#script-to-estimate-size
- Confirm we have capacity in our queues based on how frequently we're hitting 1000 updates in a minute and the average payload. Capacity is 1000, so hitting that more than half the time means we need to increase capacity.
  - Analysis from https://log.gprd.gitlab.net/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:'2020-04-08T04:45:50.095Z',to:'2020-04-09T04:45:52.723Z'))&_a=(columns:!(json.records_count),index:AW5F1e45qthdGjPJueGO,interval:auto,query:(language:kuery,query:bulk_indexing_start),sort:!(!(json.time,desc))) shows an average of 282 updates per bulk run, so doubling the number of groups (roughly 2 × 282 ≈ 564) stays below the 1000 limit, and we only see 1000 updates about 10% of the time.
- Increase cluster size based on the above consideration. We are only at 10% disk usage right now and this will only increase the total index size by a factor of ~2, so we shouldn't expect to be above 20% disk usage afterwards (see the disk-headroom sketch after the table below).
| | namespaces | projects | repository size | issues | merge requests | comments |
|---|---|---|---|---|---|---|
| Currently in index | 310 | 31214 | 806 GB | 405 K | 606 K | -1 |
| Added to index | 301 | 14010 | 715 GB | 307 K | 414 K | -1 |
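
A minimal sketch for double-checking the disk-headroom claim above via Elasticsearch's cat allocation API. `ES_URL`, `ES_USER`, and `ES_PASS` are placeholders (assumptions), not values from this issue:

```shell
# Hedged sketch: confirm current disk usage on the indexing cluster before the rollout.
# ES_URL / ES_USER / ES_PASS are placeholders for the production Elastic cluster endpoint.
curl -s -u "$ES_USER:$ES_PASS" "$ES_URL/_cat/allocation?v&h=node,disk.percent,disk.used,disk.total"

# Roughly: if disk.percent is ~10% now, doubling the index size should keep us near ~20%.
```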
### Roll out
- Re-run the size estimate to confirm it hasn't increased significantly since last time: https://gitlab.com/gitlab-org/gitlab/-/issues/211756#script-to-estimate-size
- Mention in #support_gitlab-com: "@support-dotcom We are expanding our roll-out of Elasticsearch to more bronze customers on GitLab.com. These customers may notice changes in their global searches. Please let us know if you need help investigating any related tickets. Follow progress at https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1925. The most reliable way to know if a group has Elasticsearch enabled is to see the "Advanced search functionality is enabled" indicator at the top right of the search results page."
- Create a silence for the alert "The `elastic_indexer` queue, `main` stage, has a queue latency outside of SLO" at https://alerts.gprd.gitlab.net/#/silences/new with `env="gprd" type="sidekiq" priority="elasticsearch"` (a hedged `amtool` alternative is sketched at the end of this section).
- Check with the SRE on call in #production: "@sre-oncall I would like to roll out this change https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1925. Please let me know if there are any ongoing incidents or any other reason to hold off for the time being. Please note this may trigger a high CPU alert for sidekiq workers, but since Elasticsearch has a dedicated sidekiq fleet it should not impact any other workers. I have also created a relevant alert silence for the sidekiq SLO, since it will very likely backlog the queues, which is fine: https://alerts.gprd.gitlab.net/#/silences/11be0b54-5e8c-43e2-b76a-ead16dee8803. The overall indexing will probably take around 26 hrs based on doing the same thing yesterday, so I've made the silence last 36 hours to be safe."
- Note disk space and number of documents for the `gitlab-production` index.
- Log in as admin and get a personal access token with API scope that expires tomorrow.
- Invoke the API to add the percentage: `curl -X PUT -H "Private-Token: $API_TOKEN" -i 'https://gitlab.com/api/v4/elasticsearch_indexed_namespaces/rollout?plan=bronze&percentage=15'`
  - Due to gitlab-org/gitlab#213777 (closed) we need to do this via the Rails console instead, with: `ElasticsearchIndexedNamespace.drop_limited_ids_cache!; ElasticNamespaceRolloutWorker.perform_async('bronze', 10, 'rollout'); ElasticsearchIndexedNamespace.drop_limited_ids_cache!` - #1925 (comment 320731446)
- Note start time: 2020-04-09 05:22:12 UTC and update the Due Date in the table.
- Wait for namespaces to finish indexing:
  - Look at `Sidekiq Queue Lengths per Queue`.
  - Find the correlation ID for the original API request: XX
  - Look for `done` jobs with that `correlation_id`. When indexing is finished there should be 3 done jobs (1x `ElasticIndexerWorker`, 2x `ElasticCommitIndexerWorker`) per project in the group (a hedged sketch for estimating the expected total is at the end of this section).
- Note end time: 2020-04-10 10:56 UTC
- Note time taken: ~30 hr
- Note increase in index size.
- Check for [any failed projects for that correlation_id](https://log.gprd.gitlab.net/goto/b6f4877e7db12153019662a64e312791) and manually retry them if necessary.
- Test out searching (a hedged API spot-check is sketched at the end of this section).
In the `gitlab-production` index:

| | Before | After |
|---|---|---|
| Total | 1.1 TB | 2.1 TB |
| Documents | 36.8 M | 64.4 M |
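
For the alert-silence step above, the web UI at https://alerts.gprd.gitlab.net/#/silences/new is the documented route; a hedged alternative using `amtool` (assuming it can reach the same Alertmanager) would look roughly like:

```shell
# Hedged sketch: create the same silence from the command line with amtool.
# The Alertmanager URL and the 36h duration mirror the values mentioned in this issue;
# adjust the author/comment as needed.
amtool silence add \
  --alertmanager.url="https://alerts.gprd.gitlab.net" \
  --author="@DylanGriffith" \
  --comment="Bronze ES rollout - elastic_indexer queue expected to backlog (production#1925)" \
  --duration="36h" \
  env="gprd" type="sidekiq" priority="elasticsearch"
```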
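For the "Look for `done` jobs" step, a rough way to know how many jobs to expect (3 per project). This is a hedged sketch: it assumes `ElasticsearchIndexedNamespace` exposes a `namespace` association, and it counts every opted-in namespace rather than only the newly added ones, so treat the result as an upper bound:

```shell
# Hedged sketch: estimate the expected number of "done" jobs (3 per project) across
# all namespaces currently opted in to Elasticsearch. Run on a Rails node.
sudo gitlab-rails runner '
  projects = 0
  ElasticsearchIndexedNamespace.find_each do |rec|
    projects += rec.namespace.all_projects.count
  end
  puts "projects=#{projects} expected_done_jobs=#{projects * 3}"
'
```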
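For the "Test out searching" step, besides checking the UI indicator, a hedged spot-check via the group search API; `GROUP_ID` and the search term are placeholders for one of the newly rolled-out groups:

```shell
# Hedged sketch: blob-scoped group search only returns results when Elasticsearch
# (Advanced Search) is enabled for the group, so a sensible response is a good sign.
curl -s -H "Private-Token: $API_TOKEN" \
  "https://gitlab.com/api/v4/groups/$GROUP_ID/search?scope=blobs&search=README"
```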
## Monitoring

### Key metrics to observe
- GitLab admin panel:
- Grafana:
  - Platform triage
- Sidekiq:
  - Sidekiq SLO dashboard overview
  - `Sidekiq Queue Lengths per Queue` - expected to climb during initial indexing, with a sudden drop-off once the initial indexing jobs are finished (a hedged console sketch for checking queue depths follows this list).
  - `Sidekiq Inflight Operations by Queue`
  - `Node Maximum Single Core Utilization per Priority` - expected to be 100% during initial indexing
- Redis-sidekiq:
  - Redis-sidekiq SLO dashboard overview
  - `Memory Saturation` - if memory usage starts growing rapidly it might get OOM killed (which would be really bad because we would lose all other queued jobs, for example scheduled CI jobs); a hedged direct memory check is sketched after this list.
    - If it gets close to SLO levels, the rate of growth should be evaluated and the indexing should potentially be stopped.
- Incremental updates queue:
  - Chart `Global search incremental indexing queue depth`: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
  - From the Rails console: `Elastic::ProcessBookkeepingService.queue_size`
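
As a console fallback for the queue-depth charts above, a hedged sketch reusing `Elastic::ProcessBookkeepingService.queue_size` from the item above; the Sidekiq queue names are assumptions based on the worker class names in this issue:

```shell
# Hedged sketch: check Sidekiq queue lengths and the incremental indexing queue depth
# from a Rails node, as an alternative to the Grafana charts.
sudo gitlab-rails runner '
  %w[elastic_indexer elastic_commit_indexer].each do |queue|
    puts "#{queue}: #{Sidekiq::Queue.new(queue).size}"
  end
  puts "incremental bookkeeping queue: #{Elastic::ProcessBookkeepingService.queue_size}"
'
```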
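For the Redis-sidekiq memory saturation item, a hedged sketch of a direct check; the host placeholder is an assumption and the dashboards remain the primary source:

```shell
# Hedged sketch: check memory usage directly on the redis-sidekiq primary.
# REDIS_SIDEKIQ_HOST is a placeholder; add -a if the instance requires auth.
redis-cli -h "$REDIS_SIDEKIQ_HOST" info memory | grep -E 'used_memory_human|maxmemory_human'
```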
### Other metrics to observe
- Elastic support diagnostics: in the event of any issues (e.g. too many threads, as we saw last time in https://support.elastic.co/customers/s/case/5004M00000cL8IJ/cluster-crashing-under-high-load-possibly-due-to-jvm-heap-size-too-large-again), we can grab the support diagnostics per the instructions at https://support.elastic.co/customers/s/article/support-diagnostics
- Grafana:
- Rails:
- Postgres:
  - patroni SLO dashboard overview
  - postgresql overview
  - pgbouncer SLO dashboard overview
  - pgbouncer overview
  - `Waiting Sidekiq pgbouncer Connections`
    - If we see this increase to, say, 500 and stay that way, we should be concerned and disable indexing at that point.
- Gitaly:
  - Gitaly SLO dashboard overview
  - Gitaly latency
  - Gitaly saturation overview
  - `Gitaly single node saturation`
    - If any nodes on this graph are maxed out for a long period of time, correlated with enabling this, we should disable it. We should first confirm by shutting down `ElasticCommitIndexerWorker` that this will help, and then stop if it's clearly correlated.
## Rollback steps

- General "ES integration in GitLab" runbook: https://gitlab.com/gitlab-com/runbooks/blob/master/elastic/doc/elasticsearch-integration-in-gitlab.md
- Rollback to the previous percentage with: `curl -X PUT -H "Private-Token: $API_TOKEN" -i 'https://gitlab.com/api/v4/elasticsearch_indexed_namespaces/rollback?plan=bronze&percentage=5'`
- Nuclear option: stop sidekiq-elasticsearch completely (note this might lead to a broken ES index as the worker updates in the database what objects it has processed): `sudo gitlab-ctl status sidekiq-cluster`
- Nuclear option: disable anything related to ES in the entire GitLab instance (note this does not clear sidekiq queues or kill running jobs): https://gitlab.com/gitlab-com/runbooks/blob/master/elastic/doc/elasticsearch-integration-in-gitlab.md#disabling-es-integration
- Nuclear option: clear the elasticsearch sidekiq queue: https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/large-sidekiq-queue.md#dropping-an-entire-queue (a hedged console sketch follows this list)
- Nuclear option: stop running sidekiq jobs: https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/large-sidekiq-queue.md#kill-running-jobs-as-opposed-to-removing-jobs-from-a-queue
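
The linked runbooks are authoritative; purely as an illustration, dropping the Elasticsearch queues from the console might look roughly like the following. The queue names are assumptions based on the worker class names above, and this discards all queued updates, as the runbook warns:

```shell
# Hedged sketch of the "clear elasticsearch sidekiq queue" nuclear option.
# Follow the linked runbook for the real procedure.
sudo gitlab-rails runner '
  %w[elastic_indexer elastic_commit_indexer].each do |queue|
    q = Sidekiq::Queue.new(queue)
    puts "clearing #{queue} (#{q.size} jobs)"
    q.clear
  end
'
```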
## Changes checklist

- Detailed steps and rollback steps have been filled in prior to commencing work
- Person on-call has been informed prior to change being rolled out