# Upgrade Global Search Elasticsearch cluster `prod-gitlab-com indexing-20200330` to 8.2.0

## Production Change

### Change Summary

We want to upgrade the Global Search Elasticsearch cluster `prod-gitlab-com indexing-20200330` to the latest version of Elasticsearch, 8.2.0. We will upgrade the `com-gitlab-staging indexing-20200406` staging cluster first to verify.
### Change Details

- **Services Impacted** - Search
- **Change Technician** - @john-mason (and an SRE with access, TBD)
- **Change Reviewer** - @dgruzd, @terrichu
- **Time tracking** - 120 minutes (changes) + 360 minutes (rollback)
- **Downtime Component** - No downtime is required, since the Elasticsearch rolling-upgrade procedure is used; however, indexing will be paused during the upgrade.
### Detailed steps for the change

#### Change Steps - steps to take to execute the change

**Estimated Time to Complete (mins)** - 240 minutes (4 hours)
#### Phase 0: prep

- [ ] Ensure the monitoring cluster is v8, per the Elastic upgrade instructions: https://www.elastic.co/guide/en/elastic-stack/current/upgrading-elastic-stack.html
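The version check can be scripted; a minimal sketch, assuming the monitoring cluster's version string has already been fetched (in practice from the cluster root endpoint, `GET /`, whose JSON body includes `version.number`):

```shell
# Sketch: assert the monitoring cluster is on a v8 release before proceeding.
# VERSION is hard-coded for illustration; in a real check it would be read from
# the cluster's root endpoint.
VERSION="8.2.0"
MAJOR="${VERSION%%.*}"   # major version: everything before the first dot
if [ "$MAJOR" -ge 8 ]; then
  echo "monitoring cluster OK: v${VERSION}"
else
  echo "monitoring cluster must be upgraded first (found v${VERSION})" >&2
  exit 1
fi
```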
#### Phase 1: staging upgrade

- [ ] Set label ~"change::in-progress": `/label ~change::in-progress`

In Staging:

- [ ] Pause indexing: https://staging.gitlab.com/admin/application_settings/advanced_search
- [ ] Create snapshot
- [ ] Create v8 deployment called `staging-gitlab-com indexing-<CURRENT_DATE>`
  - [ ] Tick the box **Restore snapshot data**
  - [ ] Select the staging deployment in the **Restore from** dropdown
  - [ ] Select `8.2.0` in the **Version** dropdown
  - [ ] Ensure there is enough capacity for staging: at least 120 GB storage | 4 GB RAM | up to 2.5 vCPU, across two zones
- [ ] Store the username and password of the new v8 cluster in the 1Password vault
- [ ] Make a local copy of the current Advanced Search settings
  - Endpoint, username, password
- [ ] Ensure the new cluster is added to the monitoring cluster
- [ ] Change Advanced Search settings:
  - [ ] `elasticsearch_url` endpoint to the staging v8 endpoint
  - [ ] `elasticsearch_username` to the staging v8 user
  - [ ] `elasticsearch_password` to the staging v8 password
  - [ ] Save changes
- [ ] Test read code paths
  - Code:
  - Notes:
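One way to script a read-path check (a sketch, not the team's actual test plan) is to query the search API and require a non-empty result. `search_api` is a hypothetical stub here so the flow can be shown without network access; the real call would be a `curl` against the GitLab search API with a token that has `read_api` scope:

```shell
# Hypothetical read-path smoke test. search_api is a stub standing in for the
# real call, which would be something like:
#   curl -sf -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" \
#     "https://staging.gitlab.com/api/v4/search?scope=projects&search=gitlab"
search_api() {
  echo '[{"id":1,"name":"gitlab"}]'   # canned response for illustration only
}

response="$(search_api 'https://staging.gitlab.com/api/v4/search?scope=projects&search=gitlab')"
if [ -n "$response" ] && [ "$response" != "[]" ]; then
  echo "read path OK"
else
  echo "read path returned no results" >&2
fi
```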
- [ ] Resume indexing and wait for the queue to drain
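Waiting for the drain can be automated; a sketch where `queue_size` is a stub — in a real run it would shell out to something like `sudo gitlab-rails runner 'puts Elastic::ProcessBookkeepingService.queue_size'` (an assumption: verify that helper name against the running GitLab version):

```shell
# Sketch: poll until the incremental indexing queue drains. queue_size is a stub
# that returns 0 so the loop exits immediately in this illustration.
queue_size() { echo 0; }

while [ "$(queue_size)" -gt 0 ]; do
  sleep 30   # re-check every 30 seconds
done
echo "indexing queue drained"
```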
- [ ] Test write code paths
  - Code:
  - Notes:
#### Phase 2: production upgrade

In Production:

- [ ] Add a silence via https://alerts.gitlab.net/#/silences/new with a matcher on each of the following alert names (link the comment field in each silence back to this Change Request issue's URL):
  - `alertname="SearchServiceElasticsearchIndexingTrafficAbsent"`
  - `alertname="gitlab_search_indexing_queue_backing_up"`
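If the UI is unavailable, the same silences can be sketched from the CLI with `amtool` (assumption: `amtool` is installed and configured to talk to the gitlab.net Alertmanager). The commands are only printed here as a dry run; the issue URL is a placeholder:

```shell
# Dry run: build and print the amtool commands that would create both silences.
CHANGE_ISSUE_URL="<link to this Change Request issue>"   # placeholder
cmds=""
for alert in SearchServiceElasticsearchIndexingTrafficAbsent \
             gitlab_search_indexing_queue_backing_up; do
  cmds="${cmds}amtool silence add alertname=${alert} --comment='ES 8.2.0 upgrade: ${CHANGE_ISSUE_URL}' --duration=4h
"
done
printf '%s' "$cmds"
```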
- [ ] Pause indexing:

  ```ruby
  Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: true)
  ```

- [ ] Create snapshot
- [ ] Create v8 deployment called `prod-gitlab-com indexing-<CURRENT_DATE>`
  - [ ] Tick the box **Restore snapshot data**
  - [ ] Select the production deployment in the **Restore from** dropdown
  - [ ] Select `8.2.0` in the **Version** dropdown
  - [ ] Ensure there is enough capacity for production: at least 13.13 TB storage | 448 GB RAM | 69 vCPU, across two zones
- [ ] Store the username and password of the new v8 cluster in the 1Password vault
-
Change Elastic password offline from zoom -
Make local copy of current Advanced Search settings -
Endpoint, username, password
-
-
Change Advanced Search settings ApplicationSetting.current.update(elasticsearch_url: ELASTIC_URL, elasticsearch_username: ELASTIC_USER, elasticsearch_password: ELASTIC_PASSWORD)
-
elasticsearch_url
endpoint to production v8 endpoint -
elasticsearch_user
to production v8 user -
elasticsearch_password
to production v8 password -
Save changes
-
-
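A quick post-change sanity check, sketched as a dry run: print the endpoint the application now uses and confirm it is the v8 endpoint. The command is echoed rather than executed because it must run on a node where `gitlab-rails` is available:

```shell
# Dry run: echo (rather than execute) the console command that prints the
# Advanced Search endpoint currently in use.
check_cmd='sudo gitlab-rails runner "puts Gitlab::CurrentSettings.elasticsearch_url"'
echo "$check_cmd"
```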
- [ ] Test read code paths
  - Code:
  - Notes:
- [ ] Resume indexing and wait for the queue to drain:

  ```ruby
  Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: false)
  ```
- [ ] Test write code paths
  - Code:
  - Notes:
#### Phase 3: cleanup

- [ ] Remove any insecure copies of staging credentials
  - [ ] v7 cluster
  - [ ] v8 cluster
- [ ] Remove any insecure copies of production credentials
  - [ ] v7 cluster
  - [ ] v8 cluster
- [ ] Delete the staging v7 cluster
- [ ] Delete the production v7 cluster
  - You do not need to manually delete snapshots, because both clusters have SLM policies: https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-lifecycle-management.html. Additionally, keeping the snapshots may prove useful in the event of an unexpected restore.
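Before deleting the v7 deployments, the SLM policies can be confirmed on each cluster via the `GET _slm/policy` API. A sketch, as a dry run (the requests are printed rather than sent; endpoints and credentials are placeholders):

```shell
# Dry run: print the request that would list SLM policies on a cluster.
slm_check() {
  echo curl -s -u "\$ES_USER:\$ES_PASS" "$1/_slm/policy?pretty"
}

slm_check "https://staging-v8.example.es.io:9243"   # placeholder endpoint
slm_check "https://prod-v8.example.es.io:9243"      # placeholder endpoint
```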
- [ ] Change settings for the Elasticsearch exporter to point to the new cluster endpoints: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15761
- [ ] Change settings for the Mr. Metric Man checks (elastic cluster + elastic deployment)
- [ ] Change CI Elasticsearch to 8.2.0: gitlab-org/gitlab!88608 (merged)
- [ ] Update GDK to the same version: gitlab-org/gitlab-development-kit!2573 (merged)
- [ ] Set label ~"change::complete": `/label ~change::complete`
- [ ] Do a little dance 💃
### Rollback

**Rollback steps** - steps to be taken in the event of a need to roll back this change

**Estimated Time to Complete (mins)** - 360 minutes

- [ ] Change Advanced Search settings back to the original v7 cluster
- [ ] Resume indexing
- [ ] Set label ~"change::aborted": `/label ~change::aborted`
### Monitoring

**Key metrics to observe**

- Metric: Metric Name
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Describe Changes
### Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
### Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity1 or severity2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.