Partial reindex production Elasticsearch to debug failures

Production Change

Change Summary

Trigger a reindex of some of the data in our production global search Elasticsearch cluster to debug gitlab-org/gitlab#233348 (closed). Elastic support has asked in https://support.elastic.co/customers/s/case/5004M00000eAtbN whether we can trigger this so they can debug:

is there any reindex operation that could cause this issue planned for the short term, or could you trigger it? If you can reproduce it in a non-production environment it would be great.

The reason for the ask is so you can let us know well in advance when it will happen, and the Cloud team will be ready to investigate live what's happening during the reindex - not only inspecting the hosts logs, but also metrics, status of connectivity, etc. From the Elasticsearch point of view there does not seem to be any problem, so we would like to investigate the issue at a lower level.

Change Details

  1. Services Impacted - Elasticsearch cluster prod-gitlab-com indexing-20200330. It will likely lead to a small CPU increase for the duration of the reindexing (up to 48 hrs). It will likely increase storage usage on the cluster by no more than 20%, as I will only reindex issues.
  2. Change Technician - @DylanGriffith
  3. Change Criticality - C4
  4. Change Type - changescheduled
  5. Change Reviewer - @dgruzd
  6. Due Date - 2020-08-19 07:00 UTC
  7. Time tracking -
  8. Downtime Component - 0

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

  1. Test out steps on staging

Estimated Time to Complete (mins) - 1hr

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 48hr (mostly just waiting for the reindex to finish)

  1. Confirm the cluster storage is less than 50% full (one way to spot-check this is sketched after this list)
  2. Let SRE on call know that we are triggering the re-index in #production: @sre-oncall please note we are triggering a partial reindex of our production Elasticsearch global search cluster which will re-index some of our production global search index to another index in the same cluster using the Elasticsearch reindex API. This is for debugging purposes to allow Elastic support to (hopefully) debug why our reindexes failed last time. https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2530
  3. In any console set CLUSTER_URL and confirm that it is the expected cluster with expected indices:
    1. curl $CLUSTER_URL/_cat/indices
  4. Create new destination index gitlab-production-issues-only-reindex with correct settings/mappings from rails console:
    1. Gitlab::Elastic::Helper.new.create_empty_index(with_alias: false, options: { index_name: 'gitlab-production-issues-only-reindex' })
  5. Set index settings in destination index to optimize for writes:
    1. curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-issues-only-reindex/_settings"
  6. Trigger re-index from source index gitlab-production-202007270000 to destination index gitlab-production-issues-only-reindex for issues only
    1. curl -H 'Content-Type: application/json' -d '{ "source": { "index": "gitlab-production-202007270000", "query": { "match": { "type": "issue" } } }, "dest": { "index": "gitlab-production-issues-only-reindex" } }' -X POST "$CLUSTER_URL/_reindex?slices=180&wait_for_completion=false&scroll=1h"
  7. Note the returned task ID from the above: 9Tg2b3gxS9Kg80cQP-_eQA:13125340
  8. Note the time when the task started: 2020-08-19 07:00:45 UTC
  9. Wait for the task to finish (a polling sketch appears after this list). You can track it with:
    • curl $CLUSTER_URL/_tasks/$TASK_ID
  10. When it is finished add a comment to this issue with the output from curl $CLUSTER_URL/_tasks/$TASK_ID
  11. Create new destination index gitlab-production-20-percent-reindex with correct settings/mappings from rails console:
    1. Gitlab::Elastic::Helper.new.create_empty_index(with_alias: false, options: { index_name: 'gitlab-production-20-percent-reindex' })
  12. Set index settings in destination index to optimize for writes:
    1. curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-20-percent-reindex/_settings"
  13. Trigger re-index from source index gitlab-production-202007270000 to destination index gitlab-production-20-percent-reindex, capped at 136780589 documents (roughly 20% of the source index, hence the destination index name)
    1. curl -H 'Content-Type: application/json' -d '{ "max_docs": 136780589, "source": { "index": "gitlab-production-202007270000" }, "dest": { "index": "gitlab-production-20-percent-reindex" } }' -X POST "$CLUSTER_URL/_reindex?slices=180&wait_for_completion=false&scroll=1h"
  14. Note the returned task ID from the above: 9Tg2b3gxS9Kg80cQP-_eQA:13198975
  15. Note the time when the task started: 2020-08-19 07:31 UTC
  16. Wait for the task to finish. You can track it with:
    • curl $CLUSTER_URL/_tasks/$TASK_ID
  17. When it is finished add a comment to this issue with the output from curl $CLUSTER_URL/_tasks/$TASK_ID
  18. Note the time the task finished: 2020-08-19 12:00:55 UTC
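
The commands below are a minimal sketch of the spot checks referenced in steps 1, 9, and 16, assuming CLUSTER_URL and TASK_ID are exported in the shell as described above. The 60-second polling interval is arbitrary, and the grep relies on the default compact (non-pretty) JSON output of the task API; neither is part of the official runbook.

    # Step 1: per-node disk usage, to confirm the cluster is under ~50% full
    curl -s "$CLUSTER_URL/_cat/allocation?v"

    # Steps 9 and 16: poll the task API until Elasticsearch reports the reindex task as completed
    while ! curl -s "$CLUSTER_URL/_tasks/$TASK_ID" | grep -q '"completed":true'; do
      sleep 60
    done

    # Steps 10 and 17: capture the final task output to paste into this issue
    curl -s "$CLUSTER_URL/_tasks/$TASK_ID"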

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5m

  • Delete the new index gitlab-production-issues-only-reindex
    • curl -XDELETE $CLUSTER_URL/gitlab-production-issues-only-reindex
  • Delete the new index gitlab-production-20-percent-reindex
    • curl -XDELETE $CLUSTER_URL/gitlab-production-20-percent-reindex
  • Confirm it is gone
    • curl $CLUSTER_URL/_cat/indices
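
As a quick supplement to scanning the full _cat/indices output, one possible check (a sketch, assuming CLUSTER_URL is still set) is to grep for the two temporary index names; a non-zero grep exit status means neither index remains:

    # Prints nothing and exits non-zero once both temporary indices have been deleted
    curl -s "$CLUSTER_URL/_cat/indices" | grep -E 'issues-only-reindex|20-percent-reindex'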

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5m

  • Cancel the reindex
    • curl -XPOST $CLUSTER_URL/_tasks/$TASK_ID/_cancel
  • Delete the new index gitlab-production-issues-only-reindex
    • curl -XDELETE $CLUSTER_URL/gitlab-production-issues-only-reindex
  • Delete the new index gitlab-production-20-percent-reindex
    • curl -XDELETE $CLUSTER_URL/gitlab-production-20-percent-reindex

Monitoring

Key metrics to observe

  • Elasticsearch monitoring
    • We may wish to cancel if CPU is consistently pegged, as this could affect the search experience (a quick command-line CPU check is sketched below)
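
A possible supplement to the Elasticsearch monitoring dashboards, sketched under the assumption that CLUSTER_URL is set as in the change steps, is to spot-check per-node CPU and load from the command line:

    # Per-node CPU percentage and load averages during the reindex
    curl -s "$CLUSTER_URL/_cat/nodes?v&h=name,cpu,load_1m,load_5m"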

Summary of infrastructure changes

  • Does this change introduce new compute instances? NO
  • Does this change re-size any existing compute instances? NO
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? NO

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
  • There are currently no active incidents.