Add 20% of Bronze customers to Elasticsearch advanced global search rollout

Production Change - Criticality 3 C3

Change Objective: Describe the objective of the change
Change Type: Operation
Services Impacted: Advanced Search (ES integration), Sidekiq, Redis, Gitaly, PostgreSQL, Elastic indexing cluster
Change Team Members: @DylanGriffith
Change Severity: C3
Change Reviewer or tested in staging: This has been done before on production #1925 (closed)
Dry-run output: -
Due Date: 2020-04-30 00:02:40 UTC
Time tracking:

Detailed steps for the change

Pre-check

  • Be aware that this issue will be public so we should not mention customer names
  • Estimate sizes of groups using this script: https://gitlab.com/gitlab-org/gitlab/-/issues/211756#script-to-estimate-size
  • Confirm we have capacity in our queues, based on how frequently we're hitting 1000 updates in a minute and on the average payload. Capacity is 1000 updates per minute, so hitting that limit more than half the time means we need to increase capacity.
  • Increase cluster size based on above consideration
    • Estimating from repository size, we are increasing overall storage by only about 730 GB / 2310 GB = 32%, and our cluster is only ~25% full, so we should have enough storage.
                    namespaces  projects  repository size  issues  merge requests  comments
Currently in index  937         64948     2.31 TB          969 K   1.45 M          -1
Added to index      313         15235     730 GB           224 K   422 K           -1
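
The two pre-check estimates above can be sketched in a few lines of Python. This is a minimal illustration, not the script linked above. It assumes index storage grows in proportion to repository size, and the per-minute sample values are made up for the example; the function names are hypothetical.

```python
# Figures from the pre-check table (repository sizes in GB).
CURRENT_REPO_GB = 2310   # 2.31 TB currently in the index
ADDED_REPO_GB = 730      # repositories being added by this rollout
CLUSTER_FULL = 0.25      # cluster is ~25% full


def projected_fullness(current_gb, added_gb, fullness):
    """Return (relative growth, projected fullness).

    Assumption: index storage grows in proportion to repository size.
    """
    growth = added_gb / current_gb
    return growth, fullness * (1 + growth)


def needs_more_queue_capacity(updates_per_minute, capacity=1000):
    """True if more than half of the sampled minutes hit the capacity."""
    if not updates_per_minute:
        return False
    saturated = sum(1 for n in updates_per_minute if n >= capacity)
    return saturated / len(updates_per_minute) > 0.5


growth, projected = projected_fullness(CURRENT_REPO_GB, ADDED_REPO_GB, CLUSTER_FULL)
print(f"storage growth: {growth:.0%}, projected cluster fullness: {projected:.0%}")

# Hypothetical per-minute update counts, for illustration only.
sample = [1000, 640, 1000, 310, 980]
print("increase queue capacity:", needs_more_queue_capacity(sample))
```

Under that proportionality assumption, a cluster that is ~25% full would end up roughly a third full, which leaves plenty of headroom.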

Roll out

  1. Re-run the size estimate to confirm it hasn't increased significantly since last time https://gitlab.com/gitlab-org/gitlab/-/issues/211756#script-to-estimate-size
  2. Mention in #support_gitlab-com: @support-dotcom We are expanding our roll-out of Elasticsearch to more bronze customers on GitLab.com. These customers may notice changes in their global searches. Please let us know if you need help investigating any related tickets. Follow progress at https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2012. The most reliable way to know if a group has Elasticsearch enabled is to see the "Advanced search functionality is enabled" indicator at the top right of the search results page.
  3. Create silence on the alert for "The elastic_indexer queue, main stage, has a queue latency outside of SLO" at https://alerts.gprd.gitlab.net/#/silences/new with env="gprd" type="sidekiq" priority="elasticsearch" (copy from https://alerts.gprd.gitlab.net/#/silences/36298cce-d887-42c8-bfe4-3ffcc3b3d91d)
  4. Check with SRE on call in #production: @sre-oncall I would like to roll out this change https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2012. Please let me know if there are any ongoing incidents or any other reason to hold off for the time being. Please note this may trigger a high CPU alert for sidekiq workers but since Elasticsearch has a dedicated sidekiq fleet it should not impact any other workers. I have also created a relevant alert silence for sidekiq SLO since it will very likely backlog the queues which is fine https://alerts.gprd.gitlab.net/#/silences/36298cce-d887-42c8-bfe4-3ffcc3b3d91d . The overall indexing will probably take around 24 hrs to complete.
  5. Note disk space and number of documents for the gitlab-production index
  6. Log in as admin and create a personal access token with API scope that expires tomorrow
  7. Invoke API to add percentage
    • curl -X PUT -H "Private-Token: $API_TOKEN" -i 'https://gitlab.com/api/v4/elasticsearch_indexed_namespaces/rollout?plan=bronze&percentage=20'
  8. Note start time: 2020-04-30 00:02:40 UTC and update Due Date in table
  9. Wait for namespaces to finish indexing
    1. Look at Sidekiq Queue Lengths per Queue
    2. Find correlation ID for original ElasticNamespaceRolloutWorker job:
      • Correlation ID: oOCDBZmGwd
    3. Look for done jobs with that correlation_id. When indexing has finished, there should be 3 done jobs (1 x ElasticIndexerWorker, 2 x ElasticCommitIndexerWorker) per project in the group.
  10. Note end time: 2020-05-01 02:11:27 UTC
  11. Note time taken: ~26 hr
  12. Note increase in index size
  13. Check for any failed projects for that correlation_id (https://log.gprd.gitlab.net/goto/b6f4877e7db12153019662a64e312791) and manually retry them if necessary
  14. Test out searching
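
The completion check in step 9 (3 done jobs per project) could be automated roughly as follows. This is only a sketch: the `(project_id, worker_class)` pairs are a hypothetical export of the done jobs logged under the rollout's correlation ID, not the actual log schema, and the sample data is made up.

```python
from collections import Counter

# Expected done jobs per project, as stated in step 9.
EXPECTED = {"ElasticIndexerWorker": 1, "ElasticCommitIndexerWorker": 2}


def incomplete_projects(done_jobs, project_ids):
    """Return the project IDs whose done-job counts are below EXPECTED.

    done_jobs: iterable of (project_id, worker_class) pairs (hypothetical
    shape, assumed to be filtered to one correlation ID already).
    """
    counts = Counter(done_jobs)
    return sorted(
        pid for pid in project_ids
        if any(counts[(pid, cls)] < n for cls, n in EXPECTED.items())
    )


# Made-up sample: project 1 is fully indexed, project 2 lacks one commit job.
jobs = [
    (1, "ElasticIndexerWorker"),
    (1, "ElasticCommitIndexerWorker"),
    (1, "ElasticCommitIndexerWorker"),
    (2, "ElasticIndexerWorker"),
    (2, "ElasticCommitIndexerWorker"),
]
print(incomplete_projects(jobs, [1, 2]))

# With 15235 projects added, the total number of done jobs to expect is:
print(sum(EXPECTED.values()) * 15235)
```

Any project ID this returns is a candidate for the failed-project check and manual retry in step 13.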
In gitlab-production index  Before  After
Total                       2.8 TB  3.6 TB
Documents                   81.1 M  99.2 M
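
The time taken and the index growth recorded above can be derived from the noted values. A minimal sketch using only the timestamps and figures in this issue (the helper names are illustrative):

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M:%S"


def parse_utc(stamp):
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp as UTC."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


def growth(before, after):
    """Relative increase from before to after."""
    return (after - before) / before


# Start and end times noted in steps 8 and 10.
start = parse_utc("2020-04-30 00:02:40")
end = parse_utc("2020-05-01 02:11:27")
elapsed_hours = (end - start).total_seconds() / 3600
print(f"time taken: {elapsed_hours:.1f} hr")

# Before/after figures for the gitlab-production index.
print(f"index size growth: {growth(2.8, 3.6):.0%}")      # TB
print(f"document count growth: {growth(81.1, 99.2):.0%}")  # millions
```

The measured growth in index size is close to the ~32% estimated from repository size during the pre-check.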

Monitoring

Key metrics to observe

Other metrics to observe

Rollback steps

Changes checklist

  • Detailed steps and rollback steps have been filled prior to commencing work
  • Person on-call has been informed prior to change being rolled out
Edited by Dylan Griffith (ex GitLab)