# Upgrade Global Search Elasticsearch cluster `prod-gitlab-com indexing-20200330` to `7.15.1`

**Production Change**

## Change Summary
We want to upgrade the Global Search Elasticsearch cluster `prod-gitlab-com indexing-20200330` to the latest version of Elasticsearch, `7.15.1`. We will upgrade the `com-gitlab-staging indexing-20200406` staging cluster first to verify.

CI has been updated to 7.14.2 in gitlab-org/gitlab!72651 (merged). Follow-up issue to update CI to 7.15.1 once it's available on Docker Hub: gitlab-org/gitlab#343447 (closed).
## Change Details
- Services Impacted - ~"Service::Search"
- Change Technician - @dgruzd @terrichu @john-mason (and SRE with access TBD)
- Change Reviewer - @dgruzd
- Time tracking - 60 minutes (changes) + 360 minutes (rollback)
- Downtime Component - No downtime is required, since we use the ES rolling upgrade; however, indexing will be paused during the upgrade.
## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 45
- [ ] Set label ~"change::in-progress" on this issue
- [ ] Confirm the new ES version works in CI with a passing pipeline: https://gitlab.com/gitlab-org/gitlab/-/pipelines/392036368
- [ ] Pause indexing in staging: GitLab > Admin > Settings > General > Advanced Search (an API sketch for this toggle follows this list)
- [ ] Wait 2 mins for queues to drain
- [ ] Add a new comment `test comment` to an issue and verify that the Elasticsearch queue increases in the graph
- [ ] Take a snapshot of the staging cluster (`cloud-snapshot-2021.10.27-svsrxahltvc4ltvzn57yfw`)
- [ ] In the Elastic Cloud UI, upgrade the staging cluster `com-gitlab-staging indexing-20200406` to version `7.15.1`
Just for this upgrade, we would like to perform a practice run of a restore in the staging cluster:
- [ ] Restore an older version of Elasticsearch from the snapshot
  - [ ] Update the credentials in GitLab > Admin > Settings > General > Advanced Search to point to the new cluster created from the restore
- [ ] Go to staging and test that searches in the gitlab-org group still work and return results. We should not unpause indexing, since that could result in data loss
  - [ ] Update the credentials in GitLab > Admin > Settings > General > Advanced Search to point to the original upgraded cluster
  - [ ] Go to staging and test that searches in the gitlab-org group still work and return results
- [ ] Unpause indexing in staging: GitLab > Admin > Settings > General > Advanced Search
- [ ] Add a comment to an issue with the text `test comment 3` and then search for that comment. Note that indexing and refreshing of the ES index can take up to 2 minutes to complete before the results show up.
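The pause/unpause steps above use the admin UI. The same toggle is exposed as the `elasticsearch_pause_indexing` application setting, so it can also be flipped over the REST API. A minimal sketch, assuming placeholder `GITLAB_URL` and admin-scoped `GITLAB_TOKEN` environment variables (neither is part of this plan):

```python
import os

import requests

GITLAB_URL = os.environ["GITLAB_URL"]      # placeholder, e.g. the staging instance base URL
GITLAB_TOKEN = os.environ["GITLAB_TOKEN"]  # placeholder admin personal access token


def set_pause_indexing(paused: bool) -> None:
    """Flip the Advanced Search 'Pause Elasticsearch indexing' setting."""
    resp = requests.put(
        f"{GITLAB_URL}/api/v4/application/settings",
        headers={"PRIVATE-TOKEN": GITLAB_TOKEN},
        params={"elasticsearch_pause_indexing": str(paused).lower()},
    )
    resp.raise_for_status()
    print("elasticsearch_pause_indexing =", resp.json()["elasticsearch_pause_indexing"])


set_pause_indexing(True)  # pause before snapshotting; call with False to unpause
```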
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 45
- [ ] Contact the SRE on call and ask for permission to proceed
- [ ] Pause indexing in production: GitLab > Admin > Settings > General > Advanced Search
- [ ] Wait 2 mins for queues to drain
- [ ] Add a new comment to an issue and verify that the Elasticsearch queue increases in the graph
- [ ] Take a snapshot of the production cluster
- [ ] In the Elastic Cloud UI, upgrade the production cluster `prod-gitlab-com indexing-20200330` to version `7.15.1`
- [ ] Wait until the rolling upgrade is complete (a verification sketch follows this list)
- [ ] Verify that there are no errors in the Kibana logs
- [ ] Test that searches in the gitlab-org group still work and return results
- [ ] Unpause indexing in production: GitLab > Admin > Settings > General > Advanced Search
- [ ] Wait until the Sidekiq Queues (Global Search) have caught up
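Once the Elastic Cloud UI reports the upgrade finished, it can be double-checked against the cluster itself: every node should report the new version and cluster health should be back to green. A minimal sketch, assuming a placeholder `ES_URL` that carries the cluster endpoint and credentials:

```python
import os

import requests

# Placeholder: Elastic Cloud endpoint including credentials,
# e.g. "https://user:pass@<cluster-id>.<region>.gcp.cloud.es.io:9243"
ES_URL = os.environ["ES_URL"]

# After a rolling upgrade, every node should report the target version.
nodes = requests.get(
    f"{ES_URL}/_cat/nodes", params={"h": "name,version", "format": "json"}
).json()
for node in nodes:
    print(node["name"], node["version"])
assert all(n["version"] == "7.15.1" for n in nodes), "some nodes still on the old version"

# Cluster health should be green with no shards left to move.
health = requests.get(f"{ES_URL}/_cluster/health").json()
print("status:", health["status"], "relocating:", health["relocating_shards"])
assert health["status"] == "green"
```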
### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5
- [ ] Add a comment to an issue (issue TBD) with the text `searchablecomment20211019` and then search for that comment (a search API sketch follows this list). Note that indexing and refreshing of the ES index can take up to 2 minutes to complete before the results show up.
- [ ] Search for a commit that was added after indexing was paused
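The comment check can also be scripted against the GitLab search API; the `notes` scope is backed by Advanced Search, so a hit for the test comment proves post-upgrade indexing works end to end. A sketch, reusing the placeholder `GITLAB_URL`/`GITLAB_TOKEN` variables from the earlier snippet:

```python
import os
import time

import requests

GITLAB_URL = os.environ["GITLAB_URL"]
GITLAB_TOKEN = os.environ["GITLAB_TOKEN"]


def search(scope: str, term: str) -> list:
    resp = requests.get(
        f"{GITLAB_URL}/api/v4/search",
        headers={"PRIVATE-TOKEN": GITLAB_TOKEN},
        params={"scope": scope, "search": term},
    )
    resp.raise_for_status()
    return resp.json()


# Indexing plus the ES index refresh can take ~2 minutes, so poll instead of failing fast.
for attempt in range(12):
    if search("notes", "searchablecomment20211019"):
        print("test comment is searchable; indexing has caught up")
        break
    time.sleep(10)
else:
    print("comment still not searchable after 2 minutes; investigate before closing the change")
```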
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 360
- [ ] If the upgrade completed but something is not working, we can restore an older version of Elasticsearch from the snapshot, then update the credentials in GitLab > Admin > Settings > General > Advanced Search to point to the new cluster (a restore sketch follows).
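For reference, the snapshot inspection and restore can also be driven through the Elasticsearch snapshot API (on Elastic Cloud the restore is normally performed from the UI into a freshly created deployment). A sketch, assuming Elastic Cloud's default `found-snapshots` repository name, a placeholder `ES_URL` for the cluster being restored into, and the staging snapshot name from the pre-change steps as an illustration; the index pattern is also an assumption:

```python
import os

import requests

ES_URL = os.environ["ES_URL"]  # placeholder: endpoint of the cluster to restore into
REPO = "found-snapshots"       # Elastic Cloud's default snapshot repository (assumption)
SNAPSHOT = "cloud-snapshot-2021.10.27-svsrxahltvc4ltvzn57yfw"  # illustrative name

# Inspect the snapshot first; its state should be SUCCESS before restoring.
info = requests.get(f"{ES_URL}/_snapshot/{REPO}/{SNAPSHOT}").json()
print(info["snapshots"][0]["state"])

# Restore the search indices; the "gitlab-*" pattern is an assumption here.
resp = requests.post(
    f"{ES_URL}/_snapshot/{REPO}/{SNAPSHOT}/_restore",
    json={"indices": "gitlab-*", "include_global_state": False},
)
resp.raise_for_status()
print(resp.json())
```

After the restore completes, the credentials step above repoints GitLab at the restored cluster.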
## Monitoring

### Key metrics to observe
- Metric: Search overview metrics
  - Location: https://dashboards.gitlab.net/d/search-main/search-overview?orgId=1
  - What changes to this metric should prompt a rollback: Flatline of RPS
- Metric: Search controller performance
  - Location: https://dashboards.gitlab.net/d/web-rails-controller/web-rails-controller?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-controller=SearchController&var-action=show
  - What changes to this metric should prompt a rollback: Massive spike in latency
- Metric: Search Sidekiq indexing queues (Sidekiq Queues (Global Search))
  - Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
  - What changes to this metric should prompt a rollback: Queues not draining (a query sketch follows this list)
- Metric: Search Sidekiq in-flight jobs
  - Location: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?orgId=1&from=now-30m&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=elasticsearch
  - What changes to this metric should prompt a rollback: No jobs in flight

Elastic Cloud outages: https://status.elastic.co/#past-incidents
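The queue-draining signal can also be checked from a terminal through the Prometheus HTTP API behind those dashboards. A sketch, assuming a reachable `PROM_URL` placeholder and that the queue-length gauge is called `sidekiq_queue_size` with a `name` label; both the metric and label names are assumptions, so verify them against the dashboard's panel queries first:

```python
import os

import requests

PROM_URL = os.environ["PROM_URL"]  # placeholder: Prometheus base URL behind the dashboards

# Assumed metric and label names; confirm against the Sidekiq dashboard panels.
QUERY = 'sum(sidekiq_queue_size{name=~"elastic.*"})'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
result = resp.json()["data"]["result"]
backlog = float(result[0]["value"][1]) if result else 0.0
print(f"Global Search Sidekiq backlog: {backlog:.0f} jobs")
# A backlog that grows or plateaus after unpausing indexing is the rollback trigger.
```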
## Summary of infrastructure changes
- [ ] Does this change introduce new compute instances? No
- [ ] Does this change re-size any existing compute instances? No
- [ ] Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? No
## Changes checklist
- [ ] This issue has a criticality label (e.g. ~C1, ~C2, ~C3, ~C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed; cases include DB changes) prior to change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.