Add search index pruning

John Mason requested to merge jm-search-curation-pruner into master

What does this MR do and why?

Automatically trims read-only indices that search curation has rolled over, by reindexing their documents into the current write index. This keeps all of our indices roughly within the sizing guidelines that we recommend (https://docs.gitlab.com/ee/integration/advanced_search/elasticsearch.html#tuning).

When bloated read-only indices are present, Search::IndexPruningWorker continuously reschedules itself to reindex their documents into the current write index. When there are no bloated read-only indices, Search::IndexPruningWorker checks again every 30 minutes.
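
The scheduling rule described above can be sketched as a small decision function. This is only an illustration, not the worker's actual code: the method name next_run_delay, the idle_interval parameter, and the choice of a zero-second delay to model "continuously" are all assumptions made for the example.

```ruby
# Hypothetical sketch of the rescheduling decision; not the real
# Search::IndexPruningWorker implementation.
#
# bloated_indices: stand-in for the list of bloated read-only indices.
# idle_interval:   assumed polling interval in seconds (30 minutes).
# Returns the delay in seconds before the next run.
def next_run_delay(bloated_indices, idle_interval: 30 * 60)
  # Work remains: run again right away (assumed to mean a zero delay).
  return 0 unless bloated_indices.empty?

  # Nothing to prune: fall back to the periodic check.
  idle_interval
end

puts next_run_delay([])
puts next_run_delay([{ index: "gitlab-development-notes-20230204-2003" }])
```

Keeping the decision in a pure function like this makes the two cases easy to test without an Elasticsearch cluster.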

Note: this only addresses the case where rolled-over indices are too big. Rolled-over indices can also become too small over time, but that will be addressed in another iteration.

The changes here are behind the feature flag search_index_pruning_worker.

Screenshots or screen recordings

Curator + Pruner results from running locally in a development environment. Indices are roughly the same size.

GET _cat/indices/gitlab-development-*?v&s=index

How to set up and validate locally

  1. (Optional) Start tailing the advanced search log file in another pane: tail -f log/elasticsearch.log
  2. Ensure you have a rolled-over index locally by checking GET _cat/indices/gitlab-development*?v. You should have one large read-only index and one almost empty write index.
    • If you don't have any rolled-over indices yet, run this in the console: Gitlab::Search::IndexCurator.curate(dry_run: false) (this ignores the curation feature flags)
  3. Verify that Gitlab::CurrentSettings.search_pruning_max_docs is 100 for your local dev environment.
  4. Manually trigger the pruner (example below).
  5. There should now be 100 fewer documents in the read-only index and 100 more in the write index.

Pruning is done in reverse alphabetical order on index_name. In my local environment, the notes index gets pruned first:

Before
health status index                                           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   gitlab-development-20230204-2003                xczKTZJHTMes4a-03Zx71w   5   1         98            0    117.5kb        117.5kb
yellow open   gitlab-development-commits-20230204-2003        YyKwDGvcR6iG9f4cXFgZxg   5   1          0            0        1kb            1kb
yellow open   gitlab-development-issues-20230204-2003         I6y5XrhUT3Ct4zEIhMTawA   5   1        461            0    155.6kb        155.6kb
yellow open   gitlab-development-issues-20230204-2004         9UOD3domS6CY9BsEG_fP_w   5   1          0            0        1kb            1kb
yellow open   gitlab-development-merge_requests-20230204-2003 J8EosgbPRbKUhndJ_GYaOw   5   1        141            0    105.2kb        105.2kb
yellow open   gitlab-development-merge_requests-20230204-2004 V12_1H45SdiKm_rTAH1ppg   5   1          0            0        1kb            1kb
yellow open   gitlab-development-migrations                   aOjdQ2EWQ72lokYmDGqoCw   1   1         31            0      6.2kb          6.2kb
yellow open   gitlab-development-notes-20230204-2003          uHna-nu5Ss6wJeh8svLxLg   5   1        937            0    154.6kb        154.6kb
yellow open   gitlab-development-notes-20230204-2004          dwx4oT0XSQig6i4bwEbaqA   5   1          0            0        1kb            1kb
yellow open   gitlab-development-users-20230204-2003          kz6tzHs8SEa6o7ycqy_-tw   5   1         47            0     61.9kb         61.9kb
pry(main)> p = ::Gitlab::Search::Curation::Pruner.new(curator_settings: {ignore_patterns: []}, max: Gitlab::CurrentSettings.search_pruning_max_docs)
=> #<Gitlab::Search::Curation::Pruner:0x0000000138fed3f8
 @curator=
  #<Gitlab::Search::IndexCurator:0x0000000138fed3d0
   @settings={:dry_run=>true, :debug=>false, :force=>false, :max_shard_size_gb=>1, :max_docs_denominator=>100, :min_docs_before_rollover=>50, :max_docs_shard_count=>5, :ignore_patterns=>[], :include_patterns=>[], :index_pattern=>"gitlab-development*"}>,
 @debug=false,
 @max=100,
 @pct=0.2>
pry(main)> p.bloated_readonly_indices
=> [{:reasons=>["too many docs"], :info=>{"health"=>"yellow", "status"=>"open", "index"=>"gitlab-development-notes-20230204-2003", "uuid"=>"uHna-nu5Ss6wJeh8svLxLg", "pri"=>"5", "rep"=>"1", "docs.count"=>"937", "docs.deleted"=>"0", "store.size"=>"0", "pri.store.size"=>"0"}},
 {:reasons=>["too many docs"], :info=>{"health"=>"yellow", "status"=>"open", "index"=>"gitlab-development-merge_requests-20230204-2003", "uuid"=>"J8EosgbPRbKUhndJ_GYaOw", "pri"=>"5", "rep"=>"1", "docs.count"=>"141", "docs.deleted"=>"0", "store.size"=>"0", "pri.store.size"=>"0"}},
 {:reasons=>["too many docs"], :info=>{"health"=>"yellow", "status"=>"open", "index"=>"gitlab-development-issues-20230204-2003", "uuid"=>"I6y5XrhUT3Ct4zEIhMTawA", "pri"=>"5", "rep"=>"1", "docs.count"=>"461", "docs.deleted"=>"0", "store.size"=>"0", "pri.store.size"=>"0"}}]
pry(main)> p.prune(p.bloated_readonly_indices.first)
=> true
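
The order of the bloated_readonly_indices result above is consistent with a reverse alphabetical sort on the index name. A minimal sketch of that ordering, using the three index names from the session; the helper name pruning_order is hypothetical and the Pruner may well implement the ordering differently:

```ruby
# Hypothetical helper; illustrates reverse alphabetical ordering only.
def pruning_order(index_names)
  index_names.sort.reverse
end

bloated = [
  "gitlab-development-issues-20230204-2003",
  "gitlab-development-notes-20230204-2003",
  "gitlab-development-merge_requests-20230204-2003"
]

# Prints notes first, then merge_requests, then issues.
puts pruning_order(bloated)
```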

Refresh the notes index with POST gitlab-development-notes*/_refresh, then check the sizes with GET _cat/indices/gitlab-development-notes*?v

After
health status index                                  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   gitlab-development-notes-20230204-2003 uHna-nu5Ss6wJeh8svLxLg   5   1        837            0    154.6kb        154.6kb
yellow open   gitlab-development-notes-20230204-2004 dwx4oT0XSQig6i4bwEbaqA   5   1        100            0        1kb            1kb
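
As a sanity check on the numbers above, a single pruning pass can be modeled as moving at most max documents from the read-only index to the write index. This sketch only does the arithmetic and makes no Elasticsearch calls; the method name expected_counts and its keyword arguments are made up for the example.

```ruby
# Hypothetical model of one pruning pass; counts only.
def expected_counts(read_only_docs:, write_docs:, max:)
  # Never move more documents than the read-only index holds.
  moved = [read_only_docs, max].min
  { read_only: read_only_docs - moved, write: write_docs + moved }
end

result = expected_counts(read_only_docs: 937, write_docs: 0, max: 100)
puts result[:read_only] # 837, matching the read-only notes index above
puts result[:write]     # 100, matching the write notes index above
```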

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by John Mason
