Upgrade Global Search `prod-gitlab-com indexing-20200330` Elasticsearch cluster to `7.9.2`
Production Change
Change Summary
We want to upgrade prod-gitlab-com indexing-20200330 to the latest version of Elasticsearch. We want at least 7.7 to benefit from some performance improvements, but since 7.9 is already out we may as well upgrade to the latest version.
We will upgrade the com-gitlab-staging indexing-20200406 cluster first to verify on staging.
The new version will additionally be verified in CI first: gitlab-org/gitlab!44547 (merged)
Change Details
- Services Impacted - Global search
- Change Technician - @DylanGriffith (and SRE with access TBD)
- Change Criticality - C3
- Change Type - changeunscheduled, changescheduled
- Change Reviewer - DRI for the review of this change
- Due Date - Date and time (in UTC) for the execution of the change
- Time tracking - Time, in minutes, needed to execute all change steps, including rollback
- Downtime Component - No. The ES rolling upgrade should not require downtime
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Confirm the new ES version works in CI with a passing pipeline: gitlab-org/gitlab!44547 (merged)
- In the Elastic Cloud UI, upgrade com-gitlab-staging indexing-20200406 to the latest 7.x version
- Wait until the rolling upgrade is complete
- Go to Staging and test that searches in the gitlab-org group still work and return results
- Add a comment to some issue with the text searchablecomment2801 and then search for that comment. Note that indexing and refreshing of the ES index can take up to 2 minutes to complete before the result shows up.
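The "wait until the rolling upgrade is complete" step can also be checked against the cluster API rather than only in the Elastic Cloud UI. The following is a minimal sketch, assuming the JSON responses of Elasticsearch's `GET /_nodes` and `GET /_cluster/health` endpoints. The names `upgrade_complete` and `fetch_json`, the base URL, and the authorization header are illustrative placeholders and not part of this change plan. Treating any status other than green as "not finished" is a conservative assumption.

```python
import json
import urllib.request


def upgrade_complete(nodes_info: dict, health: dict, target_version: str) -> bool:
    """Return True when every node reports target_version and the cluster is green.

    nodes_info is the parsed body of GET /_nodes and health is the parsed body
    of GET /_cluster/health.
    """
    versions = {node["version"] for node in nodes_info["nodes"].values()}
    # During a rolling upgrade the set contains both the old and new version.
    return versions == {target_version} and health["status"] == "green"


def fetch_json(base_url: str, path: str, auth_header: str) -> dict:
    """GET a JSON document from the cluster.

    base_url and auth_header are placeholders supplied by the caller,
    for example the staging cluster endpoint and its credentials.
    """
    request = urllib.request.Request(
        base_url + path, headers={"Authorization": auth_header}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

A caller would pass `fetch_json(url, "/_nodes", auth)` and `fetch_json(url, "/_cluster/health", auth)` into `upgrade_complete` together with the version shown in the Elastic Cloud UI, and repeat the check until it returns `True`.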
Change Steps - steps to take to execute the change
- Pause indexing in GitLab > Admin > Settings > General > Advanced Search
- Wait 2 mins for queues to drain
- Take a snapshot of the cluster
- In the Elastic Cloud UI, upgrade prod-gitlab-com indexing-20200330 to the latest 7.x version
- Wait until the rolling upgrade is complete
- Test that searches in the gitlab-org group still work and return results
- Unpause indexing in GitLab > Admin > Settings > General > Advanced Search
- Wait until the "Sidekiq Queues (Global Search)" have caught up
- Add a comment to this issue with the text searchablecomment2801 and then search for that comment. Note that indexing and refreshing of the ES index can take up to 2 minutes to complete before the result shows up.
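Because a new comment can take up to 2 minutes to become searchable, the final verification is easier to run as a poll with a deadline than as a single search. The helper below is a minimal sketch of that idea. `wait_until`, its parameters, and the 5 second interval are hypothetical, and the `check` callable (for example, a function that searches for searchablecomment2801 and reports whether a result came back) is left to the caller.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call check() repeatedly until it returns True or timeout_s has elapsed.

    Returns True as soon as check() succeeds, or False once the deadline has
    passed. sleep and clock are injectable so the helper can be exercised
    without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The default timeout of 120 seconds mirrors the 2 minute indexing and refresh delay noted in the last step. A `False` result means the comment did not appear in time and the monitoring dashboards should be consulted before deciding on a rollback.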
Post-Change Steps - steps to take to verify the change
None
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- If the upgrade completed but something is not working, we can restore an older version of Elasticsearch from the snapshot into a new cluster, then update the credentials in GitLab > Admin > Settings > General > Advanced Search to point to this new cluster.
Monitoring
Key metrics to observe
- Metric: Search overview metrics
- Location: https://dashboards.gitlab.net/d/search-main/search-overview?orgId=1
- What changes to this metric should prompt a rollback: Flatline of RPS
- Metric: Search controller performance
- Location: https://dashboards.gitlab.net/d/web-rails-controller/web-rails-controller?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-controller=SearchController&var-action=show
- What changes to this metric should prompt a rollback: Massive spike in latency
- Metric: Search sidekiq indexing queues
- Location: https://dashboards.gitlab.net/d/sidekiq-queue-detail/sidekiq-queue-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-queue=elastic_commit_indexer&var-queue=cronjob:elastic_index_bulk_cron&var-queue=cronjob:elastic_index_initial_bulk_cron&var-queue=elastic_delete_project
- What changes to this metric should prompt a rollback: High error rates
Summary of infrastructure changes
- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? No
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.
Edited by Dylan Griffith