Add repository shard size calculation to rake task

What does this MR do and why?

This change improves how GitLab estimates the optimal number of shards (data partitions) for its search clusters. Previously, the system only calculated shard recommendations for database-backed content like issues and merge requests, requiring users to run a separate command for repository data like code and wikis.

The update consolidates this into a single command that provides shard size recommendations for all types of searchable content, including repositories. It also simplifies the documentation: the guidance no longer distinguishes between database and repository indices, and it adds information about replica configuration when shard allocation awareness is in use.

The technical implementation adds logic to calculate repository-based shard sizes using storage statistics with different multipliers for different content types (code, commits, wikis), and updates the output format to be more consistent and informative. This makes it easier for administrators to properly configure their search infrastructure without needing to understand the underlying technical differences between content types.

Screenshots or screen recordings

Here are a few scenarios to evaluate whether the recommended shard counts make sense:

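| Repository size | Wiki repository size | Code (main index) shards | Wiki index shards | Commits index shards |
|-----------------|----------------------|--------------------------|-------------------|----------------------|
| 10GB            | 1GB                  | 5                        | 5                 | 5                    |
| 50GB            | 2GB                  | 5                        | 5                 | 5                    |
| 100GB           | 5GB                  | 5                        | 5                 | 5                    |
| 500GB           | 50GB                 | 8                        | 5                 | 5                    |
| 1TB             | 100GB                | 17                       | 10                | 10                   |
| 10TB            | 500GB                | 170                      | 50                | 102                  |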
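
To make the scenarios above easier to sanity-check, here is a minimal Ruby sketch of a size-to-shards calculation. It is not the code from this MR: the method name `estimate_repository_shards`, the `GB_PER_SHARD` ratios, and the `MIN_SHARDS` floor are all hypothetical, chosen only because they reproduce the figures in the scenarios when 1TB is read as 1024GB. The actual multipliers are defined in the MR diff and may be expressed differently.

```ruby
# Hypothetical sketch, NOT the implementation from this MR.
# GB of source data per shard, inferred from the scenarios above.
GB_PER_SHARD = {
  code: 60,     # main index, sized from repository storage
  commits: 100, # commits index, sized from repository storage
  wikis: 10     # wiki index, sized from wiki repository storage
}.freeze

# Assumed floor: no scenario above recommends fewer than 5 shards.
MIN_SHARDS = 5

# Returns a Hash mapping each index type to a recommended shard count.
# Both sizes are whole gigabytes.
def estimate_repository_shards(repository_gb:, wiki_gb:)
  source_gb = { code: repository_gb, commits: repository_gb, wikis: wiki_gb }

  GB_PER_SHARD.to_h do |index, gb_per_shard|
    # Integer division rounds down; clamp the result to the minimum.
    [index, [source_gb.fetch(index) / gb_per_shard, MIN_SHARDS].max]
  end
end

p estimate_repository_shards(repository_gb: 1024, wiki_gb: 100)
```

With these assumed ratios, a 1TB repository with 100GB of wiki data yields 17 shards for code and 10 each for commits and wikis, matching the 1TB scenario. Anything under roughly 300GB of repository data stays at the 5-shard floor.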

How to set up and validate locally

  • Enable advanced search.
  • Run the rake task:
      bundle exec rake gitlab:elastic:estimate_shard_sizes

Example output:

Using database and storage statistics to estimate shard counts and approximate document counts. Approximate document counts are not available for repository data. This estimate does not take into account advanced search indexing restrictions, see https://gdk.test:3443/help/integration/advanced_search/elasticsearch.md#limit-the-amount-of-namespace-and-project-data-to-index. For single-node cluster recommendations, see https://gdk.test:3443/help/integration/advanced_search/elasticsearch.md#number-of-elasticsearch-shards.

The approximate document counts, recommended shard size, and replica size for each index are:

- gitlab-development-issues:
  recommended shards: 5
  recommended replicas: 1
  document count: 613

- gitlab-development-notes:
  recommended shards: 5
  recommended replicas: 1
  document count: 1,248

- gitlab-development-merge_requests:
  recommended shards: 5
  recommended replicas: 1
  document count: 141

- gitlab-development-epics:
  recommended shards: 5
  recommended replicas: 1
  document count: 3

- gitlab-development-users:
  recommended shards: 5
  recommended replicas: 1
  document count: 573

- gitlab-development-projects:
  recommended shards: 5
  recommended replicas: 1
  document count: 58,841

- gitlab-development-work_items:
  recommended shards: 5
  recommended replicas: 1
  document count: 613

- gitlab-development-vulnerabilities:
  recommended shards: 5
  recommended replicas: 1
  document count: 540

- gitlab-development:
  recommended shards: 5
  recommended replicas: 1

- gitlab-development-commits:
  recommended shards: 5
  recommended replicas: 1

- gitlab-development-wikis:
  recommended shards: 5
  recommended replicas: 1

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Terri Chu
