# Analysis of where our Elasticsearch storage is used most
## What

Analyze the current production index to assess the largest contributors to index size. For example, it may be useful to understand the relative impact on index size of:
- commits
- code
- issues
- merge requests
- comments
It's tricky to assess this fully because all of these are stored in a single Elasticsearch index, so they actually share the same Lucene segments under the hood; still, document counts and sizes of the `_source` may be helpful indicators. To gain a deeper understanding we may need to run experiments where we index a whole project with and without its comments, but this would require exporting a realistic project from these groups, and the results may not extrapolate well depending on how much variation there is between different types of projects.
## End goal
See if we can use this information to better forecast how much storage other groups will need, for the purposes of https://gitlab.com/gitlab-org/gitlab/issues/118571#note_268217520 . Also find out if there is any low-hanging fruit for massively reducing storage size. For example:
- we may wish to stop indexing comments if we find they contribute massively to total storage but add less value than global code search does
- we may wish to reduce the maximum indexed file size if limiting it to 100 KB could massively reduce index size compared to the current 1 MB
## Plan

- Export all the data locally
- Store things in separate indices and compare index sizes. This is interesting for two reasons: it's possible we'll use less storage when things are stored in separate indices, since different objects have different fields (I'm not 100% sure it helps, but it would be interesting to know), and the relative sizes of all the indices are an interesting data point in themselves.
  - Re-index everything for each data type
  - Delete by query from each index the things that should not be in that index
  - Force merge all indices (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html)
- Count size as you delete data
  - Force merge the index
  - Get statistics
  - Delete all notes
  - Force merge the index again
  - Calculate the notes' contribution to index size
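The delete-and-measure loop above can be sketched against a test cluster. Everything here is an assumption, not the exact commands used: the host, the index name `gitlab-production`, and the `type` field used to select documents.

```python
import json
import urllib.request

# Assumed host and index name; adjust to the cluster under test.
ES = "http://localhost:9200"
INDEX = "gitlab-production"


def es(method, path, body=None):
    """Minimal Elasticsearch HTTP helper using only the standard library."""
    req = urllib.request.Request(
        f"{ES}{path}",
        method=method,
        data=json.dumps(body).encode() if body is not None else None,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def measure():
    """Primary store size (bytes) and doc count from the index stats API."""
    stats = es("GET", f"/{INDEX}/_stats")["_all"]["primaries"]
    return stats["store"]["size_in_bytes"], stats["docs"]["count"]


def delete_and_measure(doc_type):
    """Delete one document type, force merge so the deleted docs are actually
    purged from the Lucene segments, then measure again."""
    es("POST", f"/{INDEX}/_delete_by_query?conflicts=proceed",
       {"query": {"term": {"type": doc_type}}})
    es("POST", f"/{INDEX}/_forcemerge?max_num_segments=1")
    return measure()


def human_gb(size_in_bytes):
    """Format a byte count as GB for the comparison table."""
    return f"{size_in_bytes / 1024 ** 3:.2f} GB"
```

The force merge between measurements matters: without it, deleted documents still occupy space in the segments and the stats would under-report the reduction.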
| | Size | Docs | Proportion of total | Size reduced | Docs reduced |
|---|---|---|---|---|---|
| Initial | 26.28 GB | 4.51 M | | | |
| Delete wiki blobs | 26.28 GB | 4.51 M | 0.00% | 744.06 KB | 0.01 K |
| Delete snippets | 26.22 GB | 4.51 M | 0.23% | 60.12 MB | 7.29 K |
| Delete merge requests | 26.12 GB | 4.38 M | 0.38% | 99.14 MB | 127.37 K |
| Delete issues | 25.67 GB | 4.17 M | 1.71% | 449.20 MB | 210.11 K |
| Delete commits | 24.19 GB | 2.37 M | 5.64% | 1.48 GB | 1.80 M |
| Delete notes | 23.09 GB | 375.91 K | 4.18% | 1.10 GB | 2.00 M |
| Delete blobs | 1.52 MB | 4.05 K | 87.85% | 23.09 GB | 371.86 K |
| Delete milestones | 1.16 MB | 1.93 K | 0.00% | 356.67 KB | 2.12 K |
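The derived columns follow mechanically from successive stats snapshots. A small hypothetical helper shows the arithmetic, e.g. for the commits step: 25.67 − 24.19 = 1.48 GB removed, which is 1.48 / 26.28 ≈ 5.6% of the initial index size.

```python
def deletion_impact(snapshots):
    """Given ordered (label, size, docs) snapshots, the first being the
    initial state, compute each deletion step's reduction and its share of
    the *initial* index size. Units are whatever the snapshots use (GB here)."""
    initial_size = snapshots[0][1]
    rows = []
    for (_, prev_size, prev_docs), (label, size, docs) in zip(snapshots, snapshots[1:]):
        rows.append({
            "step": label,
            "size_reduced": prev_size - size,
            "docs_reduced": prev_docs - docs,
            "proportion_of_total": (prev_size - size) / initial_size,
        })
    return rows
```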
## Does `best_compression` make much of a difference?

It seemed to make things worse.
## Correlate repository size to index size

- Look at the rollout of all previous groups and approximate the relationship
| | Repository size (GB) | Index size added (primary) (GB) | Ratio of index size added / repository size | Docs added (million) |
|---|---|---|---|---|
| Customer 1 | 0.703 | 2.55 | 3.627311522 | 0.1 |
| Customer 2 | 21.1 | 28.55 | 1.353080569 | 1.4 |
| Customer 3 | 1.46 | 3.4 | 2.328767123 | 0.1 |
| Customer 4 | 0.783 | 1 | 1.277139208 | 0.1 |
| Customer 5 | 94.1 | 144 | 1.530286929 | 3.2 |
| GitLab Org | 11 | 21.2 | 1.927272727 | 2.5 |
| GitLab Com | 78 | 7.6 | 0.09743589744 | 2.3 |
| Average: | | | 1.734470568 | |
| TRIMMEAN: | | | 1.683309311 | |
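The Average and TRIMMEAN rows can be reproduced from the ratio column. A sketch, assuming the spreadsheet's TRIMMEAN trimmed the single highest and lowest ratio before averaging (which reproduces its value for these 7 data points):

```python
# Ratio of index size added / repository size, from the table above.
ratios = [3.627311522, 1.353080569, 2.328767123, 1.277139208,
          1.530286929, 1.927272727, 0.09743589744]

# Plain mean over all groups.
average = sum(ratios) / len(ratios)

# Trimmed mean: drop the min and max outliers, then average the rest.
trimmed = sorted(ratios)[1:-1]
trimmean = sum(trimmed) / len(trimmed)

print(f"{average:.9f} {trimmean:.9f}")  # 1.734470568 1.683309311
```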
## In summary
The correlation is all over the place, but on average the index is 1.7 times the repo size. Customer 2 is probably the most representative, with a factor of 1.3.
There are rounding errors in the others because I only tracked increases in whole GB, and with such small changes the numbers are within the margin of error of Elasticsearch's fluctuations due to merges, so I wouldn't trust this data all that much.
You can see from the comparison of GitLab Org vs. GitLab Com that repository size can be hugely inflated relative to index size when a repo has many large files that will never end up in the index, because we skip binary files and files larger than 1 MB.
Also worth noting that this is primary size, so with replication in our cluster we actually use twice as much storage as described above.
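Putting the summary numbers together, a hypothetical forecasting helper (the 1.7 ratio and the single-replica doubling come from this analysis only; the function name and defaults are made up):

```python
def forecast_cluster_storage_gb(repo_size_gb, ratio=1.7, replicas=1):
    """Rough estimate of total cluster storage for a group: primary index
    size as a multiple of repository size, times (1 + replicas) to account
    for replication."""
    primary_gb = repo_size_gb * ratio
    return primary_gb * (1 + replicas)


# e.g. a group with 100 GB of repositories -> roughly 340 GB of cluster storage
print(forecast_cluster_storage_gb(100))
```

As the GitLab Com row shows, this can badly overestimate groups whose repositories are dominated by binary or >1 MB files, so treat it as an upper bound rather than a prediction.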
Google sheet with the analysis: https://docs.google.com/spreadsheets/d/14FhaebcZ0iyLNHiCQyMihSebdgv-3i0vySFo3ihPuss/edit#gid=0