Partial reindex production Elasticsearch to debug failures
Production Change
Change Summary
Trigger a reindex of some of the data in our Production global search Elasticsearch cluster to debug gitlab-org/gitlab#233348 (closed). Elastic support has asked in https://support.elastic.co/customers/s/case/5004M00000eAtbN whether we can trigger this so that they can debug:
is there any reindex operation that could cause this issue planned for the short term, or could you trigger it? If you can reproduce it in a non-production environment it would be great.
The reason for the ask is so you can let us know well in advance when it will happen, and the Cloud team will be ready to investigate live what's happening during the reindex - not only inspecting the hosts logs, but also metrics, status of connectivity, etc. From the Elasticsearch point of view there does not seem to be any problem, so we would like to investigate the issue at a lower level.
Change Details
- Services Impacted - Elasticsearch cluster prod-gitlab-com indexing-20200330. It will likely lead to a small CPU increase for the duration of reindexing (up to 48 hrs). It will likely increase storage usage on the cluster by no more than 20% as I will only reindex issues.
- Change Technician - @DylanGriffith
- Change Criticality - C4
- Change Type - changescheduled
- Change Reviewer - @dgruzd
- Due Date - 2020-08-19 07:00 UTC
- Time tracking -
- Downtime Component - 0
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
- Test out steps on staging
Estimated Time to Complete (mins) - 1hr
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 48hr (mostly just waiting for the reindex to finish)
- Confirm the cluster storage is less than 50% full
- Let SRE on call know that we are triggering the re-index in #production: @sre-oncall please note we are triggering a partial reindex of our production Elasticsearch global search cluster which will re-index some of our production global search index to another index in the same cluster using the Elasticsearch reindex API. This is for debugging purposes to allow Elastic support to (hopefully) debug why our reindexes failed last time. https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2530
- In any console set CLUSTER_URL and confirm that it is the expected cluster with expected indices:
  - curl $CLUSTER_URL/_cat/indices
- Create new destination index gitlab-production-issues-only-reindex with correct settings/mappings from rails console: Gitlab::Elastic::Helper.new.create_empty_index(with_alias: false, options: { index_name: 'gitlab-production-issues-only-reindex' })
- Set index settings in destination index to optimize for writes: curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-issues-only-reindex/_settings"
- Trigger re-index from source index gitlab-production-202007270000 to destination index gitlab-production-issues-only-reindex for issues only: curl -H 'Content-Type: application/json' -d '{ "source": { "index": "gitlab-production-202007270000", "query": { "match": { "type": "issue" } } }, "dest": { "index": "gitlab-production-issues-only-reindex" } }' -X POST "$CLUSTER_URL/_reindex?slices=180&wait_for_completion=false&scroll=1h"
- Note the returned task ID from the above: 9Tg2b3gxS9Kg80cQP-_eQA:13125340
- Note the time when the task started: 2020-08-19 07:00:45 UTC
- Wait for the task to finish. You can track it with: curl $CLUSTER_URL/_tasks/$TASK_ID
- When it is finished, add a comment to this issue with the output from curl $CLUSTER_URL/_tasks/$TASK_ID
- Create new destination index gitlab-production-20-percent-reindex with correct settings/mappings from rails console: Gitlab::Elastic::Helper.new.create_empty_index(with_alias: false, options: { index_name: 'gitlab-production-20-percent-reindex' })
- Set index settings in destination index to optimize for writes: curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-20-percent-reindex/_settings"
- Trigger re-index from source index gitlab-production-202007270000 to destination index gitlab-production-20-percent-reindex, capped by max_docs at 136780589 documents (about 20% of the source index, going by the destination index name; no type filter is applied): curl -H 'Content-Type: application/json' -d '{ "max_docs": 136780589, "source": { "index": "gitlab-production-202007270000" }, "dest": { "index": "gitlab-production-20-percent-reindex" } }' -X POST "$CLUSTER_URL/_reindex?slices=180&wait_for_completion=false&scroll=1h"
- Note the returned task ID from the above: 9Tg2b3gxS9Kg80cQP-_eQA:13198975
- Note the time when the task started: 2020-08-19 07:31 UTC
- Wait for the task to finish. You can track it with: curl $CLUSTER_URL/_tasks/$TASK_ID
- When it is finished, add a comment to this issue with the output from curl $CLUSTER_URL/_tasks/$TASK_ID
- Note the time the task finished: 2020-08-19 12:00:55 UTC
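The change steps above lend themselves to a few small guards. What follows is a minimal sketch, not part of the executed change. Every function name is hypothetical. The per-node check on the output of `_cat/allocation?h=disk.percent` is a stricter stand-in for "cluster storage less than 50% full", and the 60-second polling interval is an arbitrary choice.

```shell
# All helper names below are hypothetical; defining them runs nothing.

# Reads one disk.percent value per line on stdin and fails if any node
# is at or above the limit (default 50).
storage_below_limit() {
  limit="${1:-50}"
  while read -r pct; do
    case "$pct" in
      ''|*[!0-9]*) continue ;;  # skip blank or non-numeric rows
    esac
    if [ "$pct" -ge "$limit" ]; then
      echo "node at ${pct}% is at or above the ${limit}% limit" >&2
      return 1
    fi
  done
  return 0
}

# Crude typo guard for a one-line request body: checks that the number of
# opening and closing braces match. It does not account for braces inside
# string values, so it is not a JSON validator.
braces_balanced() {
  opens=$(( $(printf '%s' "$1" | tr -cd '{' | wc -c) ))
  closes=$(( $(printf '%s' "$1" | tr -cd '}' | wc -c) ))
  [ "$opens" -eq "$closes" ]
}

# Succeeds when the task JSON on stdin reports "completed": true.
task_completed() {
  grep -Eq '"completed"[[:space:]]*:[[:space:]]*true'
}

# Polls the tasks API until the given task completes, then prints the
# finish time in UTC. Assumes CLUSTER_URL is already set.
wait_for_task() {
  task_id="$1"
  until curl -s "$CLUSTER_URL/_tasks/$task_id" | task_completed; do
    sleep 60
  done
  date -u '+finished at %Y-%m-%d %H:%M:%S UTC'
}

# Integer number of documents corresponding to a percentage of a total.
max_docs_for() {
  total="$1"
  percent="$2"
  echo $(( total * percent / 100 ))
}
```

Possible usage: `curl -s "$CLUSTER_URL/_cat/allocation?h=disk.percent" | storage_below_limit 50` before starting, and `wait_for_task "$TASK_ID"` after each reindex is triggered. A task that reports completion has not necessarily succeeded, so the task output should still be checked for failures. If max_docs was meant to be 20% of the source index, `max_docs_for 683902945 20` reproduces 136780589; that total is back-computed from the max_docs value and is not a figure stated in this issue.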
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5m
- Delete the new index gitlab-production-issues-only-reindex: curl -XDELETE $CLUSTER_URL/gitlab-production-issues-only-reindex
- Delete the new index gitlab-production-20-percent-reindex: curl -XDELETE $CLUSTER_URL/gitlab-production-20-percent-reindex
- Confirm both are gone: curl $CLUSTER_URL/_cat/indices
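Checking the index listing by eye is easy to get wrong on a cluster with many indices. A minimal sketch of an automated check follows, assuming the listing comes from `_cat/indices?h=index` with one index name per line; the function name is hypothetical.

```shell
# Hypothetical helper: reads index names (one per line) on stdin and fails
# if any of the names passed as arguments is still present.
indices_absent() {
  # Strip padding whitespace; index names cannot contain spaces.
  listing=$(tr -d ' \t\r')
  for name in "$@"; do
    if printf '%s\n' "$listing" | grep -Fxq -- "$name"; then
      echo "index still exists: $name" >&2
      return 1
    fi
  done
  return 0
}

# Usage against the cluster (not run here):
#   curl -s "$CLUSTER_URL/_cat/indices?h=index" | indices_absent \
#     gitlab-production-issues-only-reindex gitlab-production-20-percent-reindex
```

Matching whole lines with `grep -Fx` avoids a false positive from an index whose name merely contains one of the deleted names.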
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5m
- Cancel the reindex (repeat for each task ID that is still running): curl -XPOST $CLUSTER_URL/_tasks/$TASK_ID/_cancel
- Delete the new index gitlab-production-issues-only-reindex: curl -XDELETE $CLUSTER_URL/gitlab-production-issues-only-reindex
- Delete the new index gitlab-production-20-percent-reindex: curl -XDELETE $CLUSTER_URL/gitlab-production-20-percent-reindex
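Because this change starts two reindex tasks, a rollback may need to cancel more than one task ID. A minimal sketch of building the cancel requests for several IDs follows; the function name and the variable names in the usage comment are hypothetical, and CLUSTER_URL is assumed to be set as in the change steps.

```shell
# Hypothetical helper: prints the task-cancel URL for each task ID given.
cancel_urls() {
  for task_id in "$@"; do
    printf '%s/_tasks/%s/_cancel\n' "$CLUSTER_URL" "$task_id"
  done
}

# To actually cancel (not run here):
#   cancel_urls "$ISSUES_TASK_ID" "$PERCENT_TASK_ID" | while read -r url; do
#     curl -XPOST "$url"
#   done
```

Printing the URLs first gives a chance to review them before anything is sent to the production cluster.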
Monitoring
Key metrics to observe
- Elasticsearch monitoring
  - We may wish to cancel if CPU is consistently pegged, as this could affect the search experience
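As a complement to the dashboards, CPU can also be sampled from the cluster itself. A minimal sketch follows, assuming the output of `_cat/nodes?h=name,cpu` (node name and CPU percentage per line); the function name and the 90% threshold are assumptions, not values from this change.

```shell
# Hypothetical helper: reads "name cpu" lines on stdin and prints the names
# of nodes whose CPU percentage is at or above the threshold (default 90).
hot_nodes() {
  threshold="${1:-90}"
  awk -v t="$threshold" '$2 ~ /^[0-9]+$/ && $2 + 0 >= t + 0 { print $1 }'
}

# Usage against the cluster (not run here):
#   curl -s "$CLUSTER_URL/_cat/nodes?h=name,cpu" | hot_nodes 90
```

A single sample only shows a momentary value; judging whether CPU is consistently pegged would require repeating the check over several minutes.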
Summary of infrastructure changes
- Does this change introduce new compute instances? NO
- Does this change re-size any existing compute instances? NO
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? NO
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.