# Upgrade Global Search Elasticsearch cluster `prod-gitlab-com indexing-20200330` to `7.15.1`

**Production Change**

## Change Summary
We want to upgrade the Global Search Elasticsearch cluster `prod-gitlab-com indexing-20200330` to the latest version of Elasticsearch, `7.15.1`. We will upgrade the `com-gitlab-staging indexing-20200406` staging cluster first to verify.

CI has been updated to 7.14.2 in gitlab-org/gitlab!72651 (merged). Follow-up issue to update CI to 7.15.1 once it's available on Docker Hub: gitlab-org/gitlab#343447 (closed).
## Change Details
- Services Impacted - ~"Service::Search"
- Change Technician - @dgruzd @terrichu @john-mason (and SRE with access TBD)
- Change Reviewer - @dgruzd
- Time tracking - 60 minutes (changes) + 360 minutes (rollback)
- Downtime Component - No downtime is required, since we use the ES rolling upgrade; however, indexing will be paused during the upgrade.
## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 45
- [ ] Set label ~"change::in-progress" on this issue
- [ ] Confirm the new ES version works in CI with a passing pipeline: https://gitlab.com/gitlab-org/gitlab/-/pipelines/392036368
- [ ] Pause indexing in staging: GitLab > Admin > Settings > General > Advanced Search (an API sketch for this toggle follows this list)
- [ ] Wait 2 mins for queues to drain
- [ ] Add a new comment `test comment` to an issue and verify that the Elasticsearch queue increases in the graph
- [ ] Take a snapshot of the staging cluster (`cloud-snapshot-2021.10.27-svsrxahltvc4ltvzn57yfw`)
- [ ] In the Elastic Cloud UI, upgrade the staging cluster `com-gitlab-staging indexing-20200406` to version `7.15.1`
Just for this upgrade, we would like to perform a practice run of a restore in the staging cluster:
- [ ] Restore an older version of Elasticsearch from the snapshot
  - [ ] Update the credentials in GitLab > Admin > Settings > General > Advanced Search to point to the new cluster created from the restore
- [ ] Go to staging and test that searches in the gitlab-org group still work and return results. We should not unpause indexing, since that could result in data loss
  - [ ] Update the credentials in GitLab > Admin > Settings > General > Advanced Search to point to the original upgraded cluster
  - [ ] Go to staging and test that searches in the gitlab-org group still work and return results
- [ ] Unpause indexing in staging: GitLab > Admin > Settings > General > Advanced Search
- [ ] Add a comment to an issue with the text `test comment 3` and then search for that comment. Note that indexing and refreshing of the ES index can take up to 2 minutes to complete before the results show up.
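The pause/unpause steps above use the admin UI. The same toggle is exposed as the `elasticsearch_pause_indexing` application setting, so it can also be flipped over the REST API. A minimal sketch, assuming placeholder `GITLAB_URL` and admin-scoped `GITLAB_TOKEN` environment variables (neither is part of this plan):

```python
import os

import requests

GITLAB_URL = os.environ["GITLAB_URL"]      # placeholder, e.g. the staging instance base URL
GITLAB_TOKEN = os.environ["GITLAB_TOKEN"]  # placeholder admin personal access token


def set_pause_indexing(paused: bool) -> None:
    """Flip the Advanced Search 'Pause Elasticsearch indexing' setting."""
    resp = requests.put(
        f"{GITLAB_URL}/api/v4/application/settings",
        headers={"PRIVATE-TOKEN": GITLAB_TOKEN},
        params={"elasticsearch_pause_indexing": str(paused).lower()},
    )
    resp.raise_for_status()
    print("elasticsearch_pause_indexing =", resp.json()["elasticsearch_pause_indexing"])


set_pause_indexing(True)  # pause before snapshotting; call with False to unpause
```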
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 45
- [ ] Contact the SRE on call and ask for permission to proceed
- [ ] Pause indexing in production: GitLab > Admin > Settings > General > Advanced Search
- [ ] Wait 2 mins for queues to drain
- [ ] Add a new comment to an issue and verify that the Elasticsearch queue increases in the graph
- [ ] Take a snapshot of the production cluster
- [ ] In the Elastic Cloud UI, upgrade the production cluster `prod-gitlab-com indexing-20200330` to version `7.15.1`
- [ ] Wait until the rolling upgrade is complete (a verification sketch follows this list)
- [ ] Verify that there are no errors in the Kibana logs
- [ ] Test that searches in the gitlab-org group still work and return results
- [ ] Unpause indexing in production: GitLab > Admin > Settings > General > Advanced Search
- [ ] Wait until the Sidekiq Queues (Global Search) have caught up
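Once the Elastic Cloud UI reports the upgrade finished, it can be double-checked against the cluster itself: every node should report the new version and cluster health should be back to green. A minimal sketch, assuming a placeholder `ES_URL` that carries the cluster endpoint and credentials:

```python
import os

import requests

# Placeholder: Elastic Cloud endpoint including credentials,
# e.g. "https://user:pass@<cluster-id>.<region>.gcp.cloud.es.io:9243"
ES_URL = os.environ["ES_URL"]

# After a rolling upgrade, every node should report the target version.
nodes = requests.get(
    f"{ES_URL}/_cat/nodes", params={"h": "name,version", "format": "json"}
).json()
for node in nodes:
    print(node["name"], node["version"])
assert all(n["version"] == "7.15.1" for n in nodes), "some nodes still on the old version"

# Cluster health should be green with no shards left to move.
health = requests.get(f"{ES_URL}/_cluster/health").json()
print("status:", health["status"], "relocating:", health["relocating_shards"])
assert health["status"] == "green"
```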
### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5
- [ ] Add a comment to an issue (issue TBD) with the text `searchablecomment20211019` and then search for that comment (a search API sketch follows this list). Note that indexing and refreshing of the ES index can take up to 2 minutes to complete before the results show up.
- [ ] Search for a commit that was added after indexing was paused
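The comment check can also be scripted against the GitLab search API; the `notes` scope is backed by Advanced Search, so a hit for the test comment proves post-upgrade indexing works end to end. A sketch, reusing the placeholder `GITLAB_URL`/`GITLAB_TOKEN` variables from the earlier snippet:

```python
import os
import time

import requests

GITLAB_URL = os.environ["GITLAB_URL"]
GITLAB_TOKEN = os.environ["GITLAB_TOKEN"]


def search(scope: str, term: str) -> list:
    resp = requests.get(
        f"{GITLAB_URL}/api/v4/search",
        headers={"PRIVATE-TOKEN": GITLAB_TOKEN},
        params={"scope": scope, "search": term},
    )
    resp.raise_for_status()
    return resp.json()


# Indexing plus the ES index refresh can take ~2 minutes, so poll instead of failing fast.
for attempt in range(12):
    if search("notes", "searchablecomment20211019"):
        print("test comment is searchable; indexing has caught up")
        break
    time.sleep(10)
else:
    print("comment still not searchable after 2 minutes; investigate before closing the change")
```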
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 360
- [ ] If the upgrade completed but something is not working, we can restore an older version of Elasticsearch from the snapshot, then update the credentials in GitLab > Admin > Settings > General > Advanced Search to point to the new cluster (a restore sketch follows).
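For reference, the snapshot inspection and restore can also be driven through the Elasticsearch snapshot API (on Elastic Cloud the restore is normally performed from the UI into a freshly created deployment). A sketch, assuming Elastic Cloud's default `found-snapshots` repository name, a placeholder `ES_URL` for the cluster being restored into, and the staging snapshot name from the pre-change steps as an illustration; the index pattern is also an assumption:

```python
import os

import requests

ES_URL = os.environ["ES_URL"]  # placeholder: endpoint of the cluster to restore into
REPO = "found-snapshots"       # Elastic Cloud's default snapshot repository (assumption)
SNAPSHOT = "cloud-snapshot-2021.10.27-svsrxahltvc4ltvzn57yfw"  # illustrative name

# Inspect the snapshot first; its state should be SUCCESS before restoring.
info = requests.get(f"{ES_URL}/_snapshot/{REPO}/{SNAPSHOT}").json()
print(info["snapshots"][0]["state"])

# Restore the search indices; the "gitlab-*" pattern is an assumption here.
resp = requests.post(
    f"{ES_URL}/_snapshot/{REPO}/{SNAPSHOT}/_restore",
    json={"indices": "gitlab-*", "include_global_state": False},
)
resp.raise_for_status()
print(resp.json())
```

After the restore completes, the credentials step above repoints GitLab at the restored cluster.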
## Monitoring

### Key metrics to observe
- Metric: Search overview metrics
  - Location: https://dashboards.gitlab.net/d/search-main/search-overview?orgId=1
  - What changes to this metric should prompt a rollback: Flatline of RPS
- Metric: Search controller performance
  - Location: https://dashboards.gitlab.net/d/web-rails-controller/web-rails-controller?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-controller=SearchController&var-action=show
  - What changes to this metric should prompt a rollback: Massive spike in latency
- Metric: Search Sidekiq indexing queues (Sidekiq Queues (Global Search))
  - Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
  - What changes to this metric should prompt a rollback: Queues not draining (a query sketch follows this list)
- Metric: Search Sidekiq in-flight jobs
  - Location: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?orgId=1&from=now-30m&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=elasticsearch
  - What changes to this metric should prompt a rollback: No jobs in flight

Elastic Cloud outages: https://status.elastic.co/#past-incidents
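The queue-draining signal can also be checked from a terminal through the Prometheus HTTP API behind those dashboards. A sketch, assuming a reachable `PROM_URL` placeholder and that the queue-length gauge is called `sidekiq_queue_size` with a `name` label; both the metric and label names are assumptions, so verify them against the dashboard's panel queries first:

```python
import os

import requests

PROM_URL = os.environ["PROM_URL"]  # placeholder: Prometheus base URL behind the dashboards

# Assumed metric and label names; confirm against the Sidekiq dashboard panels.
QUERY = 'sum(sidekiq_queue_size{name=~"elastic.*"})'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
result = resp.json()["data"]["result"]
backlog = float(result[0]["value"][1]) if result else 0.0
print(f"Global Search Sidekiq backlog: {backlog:.0f} jobs")
# A backlog that grows or plateaus after unpausing indexing is the rollback trigger.
```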
## Summary of infrastructure changes
- [ ] Does this change introduce new compute instances? No
- [ ] Does this change re-size any existing compute instances? No
- [ ] Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? No
## Changes checklist
- [ ] This issue has a criticality label (e.g. ~C1, ~C2, ~C3, ~C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed; cases include DB changes) prior to change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.