Add repository shard size calculation to rake task
What does this MR do and why?
This change improves how GitLab estimates the optimal number of shards (data partitions) for its search clusters. Previously, the system only calculated shard recommendations for database-backed content like issues and merge requests, requiring users to run a separate command for repository data like code and wikis.
The update consolidates this into a single command that now provides shard size recommendations for all types of searchable content, including repositories. It also simplifies the documentation by removing the previous distinction between database and repository indices, and adds new information about replica configuration when using shard allocation awareness.
The technical implementation adds logic to calculate repository-based shard sizes using storage statistics with different multipliers for different content types (code, commits, wikis), and updates the output format to be more consistent and informative. This makes it easier for administrators to properly configure their search infrastructure without needing to understand the underlying technical differences between content types.
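As a rough illustration of the approach described above, the following Ruby sketch scales a storage statistic by a per-content-type multiplier and divides by a target shard size. It is not the code from this MR: the method names, the multiplier values, and the 50 GB target are placeholder assumptions, and the 5-shard minimum is inferred only from the scenarios listed in this description. It does not reproduce the numbers in the scenario table.

```ruby
# Hypothetical sketch, not the actual rake task implementation.
# All constants below are illustrative assumptions.

# Assumed floor: the scenarios in this MR never show fewer than 5 shards.
MIN_SHARDS = 5

# Assumed target size per shard, in GB (placeholder value).
TARGET_SHARD_SIZE_GB = 50.0

# Placeholder multipliers that map raw storage size to estimated index size.
# The real multipliers live in the MR diff and are not repeated here.
ILLUSTRATIVE_MULTIPLIERS = { code: 0.5, commits: 0.25, wikis: 0.5 }.freeze

# Estimate a shard count from a storage size and a multiplier.
def estimate_shards(storage_gb, multiplier,
                    target_shard_size_gb: TARGET_SHARD_SIZE_GB,
                    min_shards: MIN_SHARDS)
  estimated_index_gb = storage_gb * multiplier
  # Round up to whole shards, then apply the minimum.
  [(estimated_index_gb / target_shard_size_gb).ceil, min_shards].max
end

# Code and commits are derived from repository storage; wikis from wiki storage.
def estimate_repository_shards(repository_gb:, wiki_gb:)
  {
    code: estimate_shards(repository_gb, ILLUSTRATIVE_MULTIPLIERS[:code]),
    commits: estimate_shards(repository_gb, ILLUSTRATIVE_MULTIPLIERS[:commits]),
    wikis: estimate_shards(wiki_gb, ILLUSTRATIVE_MULTIPLIERS[:wikis])
  }
end
```

Calling `estimate_repository_shards(repository_gb: 10, wiki_gb: 1)` with these placeholder values returns the minimum for every index, which matches the general shape of the small-instance scenarios. Larger inputs depend entirely on the assumed multipliers and target shard size.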
References
- Related to Advanced Search: Estimate the number of shards ... (#348452 - closed)
- Internal discussion #348452 (comment 2668372829)
Screenshots or screen recordings
The following scenarios help evaluate whether the recommended shard counts make sense. The three index columns show the recommended number of shards for each index:
| Repository size | Wiki repository size | code (main index) | wiki index | commits index |
|---|---|---|---|---|
| 10GB | 1GB | 5 | 5 | 5 |
| 50GB | 2GB | 5 | 5 | 5 |
| 100GB | 5GB | 5 | 5 | 5 |
| 500GB | 50GB | 8 | 5 | 5 |
| 1TB | 100GB | 17 | 10 | 10 |
| 10TB | 500GB | 170 | 50 | 102 |
How to set up and validate locally
- Enable advanced search.
- Run the rake task:
bundle exec rake gitlab:elastic:estimate_shard_sizes
Example output:
Using database and storage statistics to estimate shard counts and approximate document counts. Approximate document counts are not available for repository data. This estimate does not take into account advanced search indexing restrictions, see https://gdk.test:3443/help/integration/advanced_search/elasticsearch.md#limit-the-amount-of-namespace-and-project-data-to-index. For single-node cluster recommendations, see https://gdk.test:3443/help/integration/advanced_search/elasticsearch.md#number-of-elasticsearch-shards.
The approximate document counts, recommended shard size, and replica size for each index are:
- gitlab-development-issues:
recommended shards: 5
recommended replicas: 1
document count: 613
- gitlab-development-notes:
recommended shards: 5
recommended replicas: 1
document count: 1,248
- gitlab-development-merge_requests:
recommended shards: 5
recommended replicas: 1
document count: 141
- gitlab-development-epics:
recommended shards: 5
recommended replicas: 1
document count: 3
- gitlab-development-users:
recommended shards: 5
recommended replicas: 1
document count: 573
- gitlab-development-projects:
recommended shards: 5
recommended replicas: 1
document count: 58,841
- gitlab-development-work_items:
recommended shards: 5
recommended replicas: 1
document count: 613
- gitlab-development-vulnerabilities:
recommended shards: 5
recommended replicas: 1
document count: 540
- gitlab-development:
recommended shards: 5
recommended replicas: 1
- gitlab-development-commits:
recommended shards: 5
recommended replicas: 1
- gitlab-development-wikis:
recommended shards: 5
recommended replicas: 1
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.