
Implement a percentage-based rollout for Elasticsearch on GitLab.com

Problem

We believe rolling out Elasticsearch to all GitLab.com projects will mean indexing and searching a very large volume of data. An all-or-nothing rollout may not be safe, since it would not give us enough time to react to problems, scale out our infrastructure, or make indexing and searching more efficient. It would also take a lot of manual effort to repeatedly enable and roll back as we learn about each new scaling challenge.

Solution

Roll out to a percentage of groups at a time, starting with Gold groups.

This will require some changes to GitLab to support it in a sensible way.

We can currently limit the groups that are indexed and searched in Elasticsearch, but this feature has the following problems:

  1. It likely does not handle very large numbers of groups in the list (it was only designed for a few groups), so the admin UI will probably break or time out when there are hundreds or thousands of groups in the list. It may also hurt performance in other parts of the code that check this list.
  2. It was intended to be used by enabling indexing for a set of groups and then waiting for indexing to finish before enabling search for them.
  3. Apart from clicking through the UI or writing one-off scripts for the console, there is no controlled way to roll this out to large numbers of groups.

Extend this logic of rolling out to groups

In order to solve the above problems we'll want to:

  1. Adapt this feature so that the admin UI does not display all the groups that are part of the rollout when their number exceeds some sensible limit (e.g. 20)
  2. Ensure that this logic scales sensibly everywhere it is used when there are thousands of groups in the rollout
  3. Set an extra boolean index_statuses.records_initially_indexed indicating that we've finished Elastic::IndexRecordService#initial_import_project for the given project
  4. Update our logic in use_elasticsearch? to ensure that all projects within the current scope (i.e. all projects in the group, or just this project for project search) have index_statuses.records_initially_indexed set to true
  5. Create a script that can be run from the Rails console to enable large numbers of groups at a time
  6. Ensure that you can remove groups from the rollout without data loss or bugs, so that they stop being indexed and searched in case any part of the system starts to become overloaded
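
As a rough illustration of points 3 and 4, the scope check could look something like the plain-Ruby sketch below. `IndexStatus` here is a hypothetical stand-in struct rather than the real model, and `search_ready?` is an invented helper name; the actual `use_elasticsearch?` logic in GitLab will differ and would query the database instead of an in-memory array.

```ruby
# Hypothetical stand-in for illustration only; not GitLab's real model.
IndexStatus = Struct.new(:project_id, :records_initially_indexed)

# Returns true only when every project in the search scope has an index
# status with records_initially_indexed set to true.
def search_ready?(project_ids, index_statuses)
  return false if project_ids.empty?

  indexed_ids = index_statuses
    .select(&:records_initially_indexed)
    .map(&:project_id)

  # Any project without a completed initial import blocks the whole scope.
  (project_ids - indexed_ids).empty?
end

statuses = [
  IndexStatus.new(1, true),
  IndexStatus.new(2, true),
  IndexStatus.new(3, false) # initial import still running
]

group_scope_ready = search_ready?([1, 2, 3], statuses) # group search over projects 1..3
project_scope_ready = search_ready?([1], statuses)     # project search for project 1
```

In this sketch the group-scoped search falls back to non-Elasticsearch search until project 3 finishes its initial import, while the project-scoped search for project 1 can already use Elasticsearch.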

TODO

  1. Hide projects/namespaces when there are more than 50 in the admin UI
  2. Store status as index_statuses.records_initially_indexed after indexing all DB records is completed for a project. (Skipped per !20760 (comment 258510946).)
  3. Determine whether or not to use Elasticsearch based on a non-empty SHA in IndexStatus and also IndexStatus#records_initially_indexed? => true. (Skipped per !20760 (comment 258510946).)
  4. --- Assign review ---
  5. Validate how the different features behave when there are 100,000 namespaces and 100,000 projects enabled
    1. What queries are happening when loading the search page scoped to one of those groups?
    2. What queries are happening when loading the search page scoped to a different group that is not enabled?
    3. What queries are happening when loading the search page scoped to one of those projects?
    4. What queries are happening when loading the search page scoped to a different project that is not enabled?
  6. --- Merge ---
  7. Add an admin-only API to trigger the rollout a percentage at a time: the caller sends the desired rollout percentage. We first check whether the current percentage is already greater than or equal to this and do nothing if so (idempotent). Otherwise we grab the next set of ids (ordered by id, which will help us later work out which ones have been enabled; we should also log them) and enable Elasticsearch for them.
  8. --- Merge ---
  9. Add support for rolling back the percentage via the admin API (we should order elasticsearch_indexed_namespaces by created_at in reverse here so that we disable only the most recently enabled ones).
  10. --- Assign review ---
  11. Validate that it's safe to remove something from the rollout, make some changes to that project, then re-add it to the rollout. Is it idempotent?
  12. --- Merge ---
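
The rollout and rollback behaviour described in items 7 and 9 can be sketched in plain Ruby as follows. This is only an in-memory model under stated assumptions: `RolloutSketch` and its method names are hypothetical, percentages are assumed to be integers, and the order in which ids are appended stands in for `created_at` on elasticsearch_indexed_namespaces. A real implementation would read and write that table and sit behind the admin API.

```ruby
# Hypothetical in-memory model of the percentage rollout; illustration only.
class RolloutSketch
  attr_reader :enabled_ids

  def initialize(all_namespace_ids)
    @all_ids = all_namespace_ids.sort # candidates are always taken in id order
    @enabled_ids = []                 # kept in the order namespaces were enabled
  end

  # Number of namespaces that should be enabled at an integer percentage.
  def target_count(percentage)
    @all_ids.size * percentage / 100
  end

  # Idempotent: does nothing if we are already at or above the target.
  # Returns the newly enabled ids so the caller can log them.
  def rollout(percentage)
    missing = target_count(percentage) - @enabled_ids.size
    return [] if missing <= 0

    newly_enabled = (@all_ids - @enabled_ids).first(missing)
    @enabled_ids.concat(newly_enabled)
    newly_enabled
  end

  # Disables the most recently enabled namespaces first.
  # Returns the ids that were removed from the rollout.
  def rollback(percentage)
    excess = @enabled_ids.size - target_count(percentage)
    return [] if excess <= 0

    @enabled_ids.pop(excess)
  end
end

sketch = RolloutSketch.new((1..10).to_a)
first_batch = sketch.rollout(30)  # enables the three lowest ids
repeat_call = sketch.rollout(30)  # already at 30%, so nothing changes
second_batch = sketch.rollout(50) # enables the next two ids
rolled_back = sketch.rollback(20) # removes the three most recently enabled ids
```

Taking candidates in id order keeps each batch predictable, and returning the affected ids from both methods gives the caller something concrete to log.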
Edited by Mark Chao