Partial reindex production Elasticsearch to debug failures

Production Change

Change Summary

Trigger a reindex of some of the data in our production global search Elasticsearch cluster to debug gitlab-org/gitlab#233348 (closed). Elastic support has asked in https://support.elastic.co/customers/s/case/5004M00000eAtbN whether we can trigger this so they can debug:

is there any reindex operation that could cause this issue planned for the short term, or could you trigger it? If you can reproduce it in a non-production environment it would be great.

The reason for the ask is so you can let us know well in advance when it will happen, and the Cloud team will be ready to investigate live what's happening during the reindex - not only inspecting the hosts logs, but also metrics, status of connectivity, etc. From the Elasticsearch point of view there does not seem to be any problem, so we would like to investigate the issue at a lower level.

Change Details

  1. Services Impacted - Elasticsearch cluster prod-gitlab-com indexing-20200330. It will likely lead to a small CPU increase for the duration of the reindexing (up to 48 hrs). It will likely increase storage usage on the cluster by no more than 20%, as I will only reindex issues.
  2. Change Technician - @DylanGriffith
  3. Change Criticality - C4
  4. Change Type - changescheduled
  5. Change Reviewer - @dgruzd
  6. Due Date - 2020-08-19 07:00 UTC
  7. Time tracking -
  8. Downtime Component - 0

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

  1. Test out steps on staging

Estimated Time to Complete (mins) - 1hr

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 48hr (mostly just waiting for the reindex to finish)

  1. Confirm the cluster storage is less than 50% full (one way to spot-check this is sketched after this list)
  2. Let SRE on call know that we are triggering the re-index in #production: @sre-oncall please note we are triggering a partial reindex of our production Elasticsearch global search cluster which will re-index some of our production global search index to another index in the same cluster using the Elasticsearch reindex API. This is for debugging purposes to allow Elastic support to (hopefully) debug why our reindexes failed last time. https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2530
  3. In any console set CLUSTER_URL and confirm that it is the expected cluster with expected indices:
    1. curl $CLUSTER_URL/_cat/indices
  4. Create new destination index gitlab-production-issues-only-reindex with correct settings/mappings from rails console:
    1. Gitlab::Elastic::Helper.new.create_empty_index(with_alias: false, options: { index_name: 'gitlab-production-issues-only-reindex' })
  5. Set index settings in destination index to optimize for writes:
    1. curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-issues-only-reindex/_settings"
  6. Trigger re-index from source index gitlab-production-202007270000 to destination index gitlab-production-issues-only-reindex for issues only
    1. curl -H 'Content-Type: application/json' -d '{ "source": { "index": "gitlab-production-202007270000", "query": { "match": { "type": "issue" } } }, "dest": { "index": "gitlab-production-issues-only-reindex" } }' -X POST "$CLUSTER_URL/_reindex?slices=180&wait_for_completion=false&scroll=1h"
  7. Note the returned task ID from the above: 9Tg2b3gxS9Kg80cQP-_eQA:13125340
  8. Note the time when the task started: 2020-08-19 07:00:45 UTC
  9. Wait for the task to finish (a polling sketch appears after this list). You can track it with:
    • curl $CLUSTER_URL/_tasks/$TASK_ID
  10. When it is finished add a comment to this issue with the output from curl $CLUSTER_URL/_tasks/$TASK_ID
  11. Create new destination index gitlab-production-20-percent-reindex with correct settings/mappings from rails console:
    1. Gitlab::Elastic::Helper.new.create_empty_index(with_alias: false, options: { index_name: 'gitlab-production-20-percent-reindex' })
  12. Set index settings in destination index to optimize for writes:
    1. curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-20-percent-reindex/_settings"
  13. Trigger re-index from source index gitlab-production-202007270000 to destination index gitlab-production-20-percent-reindex, capped at 136780589 documents (roughly 20% of the source index, hence the destination index name)
    1. curl -H 'Content-Type: application/json' -d '{ "max_docs": 136780589, "source": { "index": "gitlab-production-202007270000" }, "dest": { "index": "gitlab-production-20-percent-reindex" } }' -X POST "$CLUSTER_URL/_reindex?slices=180&wait_for_completion=false&scroll=1h"
  14. Note the returned task ID from the above: 9Tg2b3gxS9Kg80cQP-_eQA:13198975
  15. Note the time when the task started: 2020-08-19 07:31 UTC
  16. Wait for the task to finish. You can track it with:
    • curl $CLUSTER_URL/_tasks/$TASK_ID
  17. When it is finished add a comment to this issue with the output from curl $CLUSTER_URL/_tasks/$TASK_ID
  18. Note the time the task finished: 2020-08-19 12:00:55 UTC
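
The commands below are a minimal sketch of the spot checks referenced in steps 1, 9, and 16, assuming CLUSTER_URL and TASK_ID are exported in the shell as described above. The 60-second polling interval is arbitrary, and the grep relies on the default compact (non-pretty) JSON output of the task API; neither is part of the official runbook.

    # Step 1: per-node disk usage, to confirm the cluster is under ~50% full
    curl -s "$CLUSTER_URL/_cat/allocation?v"

    # Steps 9 and 16: poll the task API until Elasticsearch reports the reindex task as completed
    while ! curl -s "$CLUSTER_URL/_tasks/$TASK_ID" | grep -q '"completed":true'; do
      sleep 60
    done

    # Steps 10 and 17: capture the final task output to paste into this issue
    curl -s "$CLUSTER_URL/_tasks/$TASK_ID"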

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5m

  • Delete the new index gitlab-production-issues-only-reindex
    • curl -XDELETE $CLUSTER_URL/gitlab-production-issues-only-reindex
  • Delete the new index gitlab-production-20-percent-reindex
    • curl -XDELETE $CLUSTER_URL/gitlab-production-20-percent-reindex
  • Confirm it is gone
    • curl $CLUSTER_URL/_cat/indices
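
As a quick supplement to scanning the full _cat/indices output, one possible check (a sketch, assuming CLUSTER_URL is still set) is to grep for the two temporary index names; a non-zero grep exit status means neither index remains:

    # Prints nothing and exits non-zero once both temporary indices have been deleted
    curl -s "$CLUSTER_URL/_cat/indices" | grep -E 'issues-only-reindex|20-percent-reindex'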

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5m

  • Cancel the reindex
    • curl -XPOST $CLUSTER_URL/_tasks/$TASK_ID/_cancel
  • Delete the new index gitlab-production-issues-only-reindex
    • curl -XDELETE $CLUSTER_URL/gitlab-production-issues-only-reindex
  • Delete the new index gitlab-production-20-percent-reindex
    • curl -XDELETE $CLUSTER_URL/gitlab-production-20-percent-reindex

Monitoring

Key metrics to observe

  • Elasticsearch monitoring
    • We may wish to cancel if CPU is consistently pegged, as this could affect the search experience (a quick command-line CPU check is sketched below)
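
A possible supplement to the Elasticsearch monitoring dashboards, sketched under the assumption that CLUSTER_URL is set as in the change steps, is to spot-check per-node CPU and load from the command line:

    # Per-node CPU percentage and load averages during the reindex
    curl -s "$CLUSTER_URL/_cat/nodes?v&h=name,cpu,load_1m,load_5m"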

Summary of infrastructure changes

  • Does this change introduce new compute instances? NO
  • Does this change re-size any existing compute instances? NO
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? NO

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
  • There are currently no active incidents.