# Upgrade Global Search Elasticsearch cluster `prod-gitlab-com indexing-20200330` to 8.2.0

## Production Change

### Change Summary

We want to upgrade the Global Search Elasticsearch cluster `prod-gitlab-com indexing-20200330` to the latest version of Elasticsearch, 8.2.0. We will upgrade the `com-gitlab-staging indexing-20200406` staging cluster first to verify.
### Change Details

- **Services Impacted** - Search
- **Change Technician** - @john-mason (and an SRE with access, TBD)
- **Change Reviewer** - @dgruzd, @terrichu
- **Time tracking** - 120 minutes (changes) + 360 minutes (rollback)
- **Downtime Component** - No downtime is required, since the Elasticsearch rolling-upgrade procedure is used; however, indexing will be paused during the upgrade.
### Detailed steps for the change

#### Change Steps - steps to take to execute the change

**Estimated Time to Complete (mins)** - 240 minutes (4 hours)
#### Phase 0: prep

- [ ] Ensure the monitoring cluster is v8, per the Elastic upgrade instructions: https://www.elastic.co/guide/en/elastic-stack/current/upgrading-elastic-stack.html
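The version check can be scripted; a minimal sketch, assuming the monitoring cluster's version string has already been fetched (in practice from the cluster root endpoint, `GET /`, whose JSON body includes `version.number`):

```shell
# Sketch: assert the monitoring cluster is on a v8 release before proceeding.
# VERSION is hard-coded for illustration; in a real check it would be read from
# the cluster's root endpoint.
VERSION="8.2.0"
MAJOR="${VERSION%%.*}"   # major version: everything before the first dot
if [ "$MAJOR" -ge 8 ]; then
  echo "monitoring cluster OK: v${VERSION}"
else
  echo "monitoring cluster must be upgraded first (found v${VERSION})" >&2
  exit 1
fi
```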
#### Phase 1: staging upgrade

- [ ] Set label ~"change::in-progress": `/label ~change::in-progress`

In Staging:

- [ ] Pause indexing: https://staging.gitlab.com/admin/application_settings/advanced_search
- [ ] Create snapshot
- [ ] Create v8 deployment called `staging-gitlab-com indexing-<CURRENT_DATE>`
  - [ ] Tick the box **Restore snapshot data**
  - [ ] Select the staging deployment in the **Restore from** dropdown
  - [ ] Select `8.2.0` in the **Version** dropdown
  - [ ] Ensure there is enough capacity for staging: at least 120 GB storage | 4 GB RAM | up to 2.5 vCPU, across two zones
- [ ] Store the username and password of the new v8 cluster in the 1Password vault
- [ ] Make a local copy of the current Advanced Search settings
  - Endpoint, username, password
- [ ] Ensure the new cluster is added to the monitoring cluster
- [ ] Change Advanced Search settings:
  - [ ] `elasticsearch_url` endpoint to the staging v8 endpoint
  - [ ] `elasticsearch_username` to the staging v8 user
  - [ ] `elasticsearch_password` to the staging v8 password
  - [ ] Save changes
- [ ] Test read code paths
  - Code:
  - Notes:
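One way to script a read-path check (a sketch, not the team's actual test plan) is to query the search API and require a non-empty result. `search_api` is a hypothetical stub here so the flow can be shown without network access; the real call would be a `curl` against the GitLab search API with a token that has `read_api` scope:

```shell
# Hypothetical read-path smoke test. search_api is a stub standing in for the
# real call, which would be something like:
#   curl -sf -H "PRIVATE-TOKEN: $PRIVATE_TOKEN" \
#     "https://staging.gitlab.com/api/v4/search?scope=projects&search=gitlab"
search_api() {
  echo '[{"id":1,"name":"gitlab"}]'   # canned response for illustration only
}

response="$(search_api 'https://staging.gitlab.com/api/v4/search?scope=projects&search=gitlab')"
if [ -n "$response" ] && [ "$response" != "[]" ]; then
  echo "read path OK"
else
  echo "read path returned no results" >&2
fi
```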
- [ ] Resume indexing and wait for the queue to drain
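Waiting for the drain can be automated; a sketch where `queue_size` is a stub — in a real run it would shell out to something like `sudo gitlab-rails runner 'puts Elastic::ProcessBookkeepingService.queue_size'` (an assumption: verify that helper name against the running GitLab version):

```shell
# Sketch: poll until the incremental indexing queue drains. queue_size is a stub
# that returns 0 so the loop exits immediately in this illustration.
queue_size() { echo 0; }

while [ "$(queue_size)" -gt 0 ]; do
  sleep 30   # re-check every 30 seconds
done
echo "indexing queue drained"
```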
- [ ] Test write code paths
  - Code:
  - Notes:
#### Phase 2: production upgrade

In Production:

- [ ] Add a silence via https://alerts.gitlab.net/#/silences/new with a matcher on each of the following alert names (link the comment field in each silence back to this Change Request issue's URL):
  - `alertname="SearchServiceElasticsearchIndexingTrafficAbsent"`
  - `alertname="gitlab_search_indexing_queue_backing_up"`
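If the UI is unavailable, the same silences can be sketched from the CLI with `amtool` (assumption: `amtool` is installed and configured to talk to the gitlab.net Alertmanager). The commands are only printed here as a dry run; the issue URL is a placeholder:

```shell
# Dry run: build and print the amtool commands that would create both silences.
CHANGE_ISSUE_URL="<link to this Change Request issue>"   # placeholder
cmds=""
for alert in SearchServiceElasticsearchIndexingTrafficAbsent \
             gitlab_search_indexing_queue_backing_up; do
  cmds="${cmds}amtool silence add alertname=${alert} --comment='ES 8.2.0 upgrade: ${CHANGE_ISSUE_URL}' --duration=4h
"
done
printf '%s' "$cmds"
```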
- [ ] Pause indexing:

  ```ruby
  Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: true)
  ```

- [ ] Create snapshot
- [ ] Create v8 deployment called `prod-gitlab-com indexing-<CURRENT_DATE>`
  - [ ] Tick the box **Restore snapshot data**
  - [ ] Select the production deployment in the **Restore from** dropdown
  - [ ] Select `8.2.0` in the **Version** dropdown
  - [ ] Ensure there is enough capacity for production: at least 13.13 TB storage | 448 GB RAM | 69 vCPU, across two zones
- [ ] Store the username and password of the new v8 cluster in the 1Password vault
-
Change Elastic password offline from zoom -
Make local copy of current Advanced Search settings -
Endpoint, username, password
-
-
Change Advanced Search settings ApplicationSetting.current.update(elasticsearch_url: ELASTIC_URL, elasticsearch_username: ELASTIC_USER, elasticsearch_password: ELASTIC_PASSWORD)
-
elasticsearch_url
endpoint to production v8 endpoint -
elasticsearch_user
to production v8 user -
elasticsearch_password
to production v8 password -
Save changes
-
-
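A quick post-change sanity check, sketched as a dry run: print the endpoint the application now uses and confirm it is the v8 endpoint. The command is echoed rather than executed because it must run on a node where `gitlab-rails` is available:

```shell
# Dry run: echo (rather than execute) the console command that prints the
# Advanced Search endpoint currently in use.
check_cmd='sudo gitlab-rails runner "puts Gitlab::CurrentSettings.elasticsearch_url"'
echo "$check_cmd"
```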
- [ ] Test read code paths
  - Code:
  - Notes:
- [ ] Resume indexing and wait for the queue to drain:

  ```ruby
  Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: false)
  ```
- [ ] Test write code paths
  - Code:
  - Notes:
#### Phase 3: cleanup

- [ ] Remove any insecure copies of staging credentials
  - [ ] v7 cluster
  - [ ] v8 cluster
- [ ] Remove any insecure copies of production credentials
  - [ ] v7 cluster
  - [ ] v8 cluster
- [ ] Delete the staging v7 cluster
- [ ] Delete the production v7 cluster
  - You do not need to manually delete snapshots, because both clusters have SLM policies: https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-lifecycle-management.html. Additionally, keeping the snapshots may prove useful in the event of an unexpected restore.
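Before deleting the v7 deployments, the SLM policies can be confirmed on each cluster via the `GET _slm/policy` API. A sketch, as a dry run (the requests are printed rather than sent; endpoints and credentials are placeholders):

```shell
# Dry run: print the request that would list SLM policies on a cluster.
slm_check() {
  echo curl -s -u "\$ES_USER:\$ES_PASS" "$1/_slm/policy?pretty"
}

slm_check "https://staging-v8.example.es.io:9243"   # placeholder endpoint
slm_check "https://prod-v8.example.es.io:9243"      # placeholder endpoint
```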
- [ ] Change settings for the Elasticsearch exporter to point to the new cluster endpoints: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15761
- [ ] Change settings for the Mr. Metric Man checks (elastic cluster + elastic deployment)
- [ ] Change CI Elasticsearch to 8.2.0: gitlab-org/gitlab!88608 (merged)
- [ ] Update GDK to the same version: gitlab-org/gitlab-development-kit!2573 (merged)
- [ ] Set label ~"change::complete": `/label ~change::complete`
- [ ] Do a little dance 💃
### Rollback

**Rollback steps** - steps to be taken in the event of a need to roll back this change

**Estimated Time to Complete (mins)** - 360 minutes

- [ ] Change Advanced Search settings back to the original v7 cluster
- [ ] Resume indexing
- [ ] Set label ~"change::aborted": `/label ~change::aborted`
### Monitoring

**Key metrics to observe**

- Metric: Metric Name
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Describe Changes
### Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
### Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity1 or severity2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.