
Analysis of where most of our Elasticsearch storage is used

What

Analyse the current production index and try to assess what the largest contributions to index size are. For example, it may be useful to understand the relative impact on index size of:

  • commits
  • code
  • issues
  • merge requests
  • comments

It's tricky to assess this fully, considering that all of these are stored in a single Elasticsearch index and as such they actually share the same Lucene indices under the hood, but document counts and the sizes of the _source fields may be helpful indicators. To gain deeper understanding we may need to try some experiments where we index a whole project with and without its comments, but this would require exporting a realistic project from these groups, and the results may not extrapolate well depending on how much variation there is between different types of projects.
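
As a first pass, something like the following could pull the total size and per-type document counts. This is only a sketch: it assumes a local copy of the index named gitlab-production and a keyword-mapped `type` field on each document, so adjust both to whatever the real mapping uses.

```python
import requests

ES_URL = "http://localhost:9200"  # local copy of the production index
INDEX = "gitlab-production"       # assumed index name

# Overall primary store size and document count.
stats = requests.get(f"{ES_URL}/{INDEX}/_stats/store,docs").json()
primaries = stats["_all"]["primaries"]
print("total size (bytes):", primaries["store"]["size_in_bytes"])
print("total docs:", primaries["docs"]["count"])

# Per-type document counts, assuming each document carries a
# keyword-mapped `type` field (blob, commit, issue, merge_request, note, ...).
body = {"size": 0, "aggs": {"by_type": {"terms": {"field": "type", "size": 20}}}}
resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=body).json()
for bucket in resp["aggregations"]["by_type"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```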

End goal

See if we can use this information to better forecast how much storage we expect other groups to use, for the purpose of https://gitlab.com/gitlab-org/gitlab/issues/118571#note_268217520 . Also find out whether there is any low-hanging fruit for massively reducing storage size. For example:

  • we may wish to stop indexing comments if we find that they contribute massively to total storage but don't add as much value as global code search
  • we may wish to reduce the maximum size of indexed files if we see that limiting it to 100 KB could massively reduce index size compared to the current 1 MB limit (see the sketch below)
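
As a rough way to gauge the second idea before running any experiments, something like this could count blob documents above 100 KB and sum their raw sizes. Note that `blob.size` is a hypothetical field name and may not exist in the real mapping; the `type` field is assumed as above.

```python
import requests

ES_URL = "http://localhost:9200"
INDEX = "gitlab-production"  # assumed index name

# Count blob documents above 100 KB and sum their raw sizes. `blob.size`
# is a hypothetical field name; substitute whatever the real mapping
# stores file size under (if anything).
body = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"type": "blob"}},                    # assumed type field
                {"range": {"blob.size": {"gt": 100 * 1024}}},  # hypothetical field
            ]
        }
    },
    "aggs": {"bytes_over_limit": {"sum": {"field": "blob.size"}}},
}
resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=body).json()
print("blobs over 100 KB:", resp["hits"]["total"])
print("raw bytes in those blobs:", resp["aggregations"]["bytes_over_limit"]["value"])
```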

Plan

  1. Export all the data locally
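
One possible way to do the export is reindex-from-remote into a local cluster. A minimal sketch, assuming the local cluster lists the production host in its reindex.remote.whitelist setting; the host and index names below are placeholders.

```python
import requests

LOCAL = "http://localhost:9200"
INDEX = "gitlab-production"  # assumed index name

# Pull the production index into a local cluster with reindex-from-remote.
# The remote host must be listed in the local cluster's
# `reindex.remote.whitelist` setting; the host below is a placeholder.
body = {
    "source": {
        "remote": {"host": "https://production-es.example.com:9200"},
        "index": INDEX,
    },
    "dest": {"index": INDEX},
}
resp = requests.post(f"{LOCAL}/_reindex?wait_for_completion=false", json=body)
print(resp.json())  # returns a task id that can be polled via the _tasks API
```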

Store things in separate indices and compare index sizes

This can be interesting for two reasons. First, I believe it's possible we'll use less storage if things are stored in separate indices, since different object types have different fields (I'm not 100% sure it helps, but it would be interesting to know). Second, the relative sizes of all the indices are an interesting data point in their own right.

  1. Re-index everything for each data type
  2. Use delete-by-query to remove from each index everything that should not be in it
  3. Force merge all indices https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html
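
A rough sketch of these three steps, with the same assumed index name and `type` field values as above:

```python
import requests

ES_URL = "http://localhost:9200"
SOURCE = "gitlab-production"  # assumed index name
TYPES = ["blob", "commit", "issue", "merge_request", "note",
         "milestone", "wiki_blob", "snippet"]  # assumed `type` values

for t in TYPES:
    dest = f"{SOURCE}-{t}"
    # 1. Copy the whole index into a per-type index.
    requests.post(f"{ES_URL}/_reindex",
                  json={"source": {"index": SOURCE}, "dest": {"index": dest}})
    # 2. Delete everything that should not be in this index.
    requests.post(f"{ES_URL}/{dest}/_delete_by_query?conflicts=proceed",
                  json={"query": {"bool": {"must_not": [{"term": {"type": t}}]}}})
    # 3. Force merge so the deleted documents are purged from disk.
    requests.post(f"{ES_URL}/{dest}/_forcemerge?max_num_segments=1")
    stats = requests.get(f"{ES_URL}/{dest}/_stats/store,docs").json()
    p = stats["_all"]["primaries"]
    print(dest, p["store"]["size_in_bytes"], "bytes,", p["docs"]["count"], "docs")
```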

Count size as you delete data

  1. Force merge index https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html
  2. Get statistics
  3. Delete all notes
  4. Force merge index https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html
  5. Calculate notes contribution to index size
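
A sketch of this measurement loop, again assuming the index name and `type` field values; the table that follows shows the results measured on the real index.

```python
import requests

ES_URL = "http://localhost:9200"
INDEX = "gitlab-production"  # assumed index name
# Deletion order matching the table below; `type` values are assumed.
ORDER = ["wiki_blob", "snippet", "merge_request", "issue",
         "commit", "note", "blob", "milestone"]

def merged_stats():
    # Force merge first so deleted documents stop counting toward size.
    requests.post(f"{ES_URL}/{INDEX}/_forcemerge?max_num_segments=1")
    stats = requests.get(f"{ES_URL}/{INDEX}/_stats/store,docs").json()
    p = stats["_all"]["primaries"]
    return p["store"]["size_in_bytes"], p["docs"]["count"]

size, docs = merged_stats()
print(f"initial: {size} bytes, {docs} docs")
for t in ORDER:
    requests.post(f"{ES_URL}/{INDEX}/_delete_by_query?conflicts=proceed",
                  json={"query": {"term": {"type": t}}})
    new_size, new_docs = merged_stats()
    print(f"delete {t}s: -{size - new_size} bytes, -{docs - new_docs} docs")
    size, docs = new_size, new_docs
```
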
Step                    Size after  Docs after  Proportion of initial total  Size reduced  Docs reduced
Initial                 26.28 GB    4.51 M      -                            -             -
Delete wiki blobs       26.28 GB    4.51 M      0.00%                        744.06 KB     0.01 K
Delete snippets         26.22 GB    4.51 M      0.23%                        60.12 MB      7.29 K
Delete merge requests   26.12 GB    4.38 M      0.38%                        99.14 MB      127.37 K
Delete issues           25.67 GB    4.17 M      1.71%                        449.20 MB     210.11 K
Delete commits          24.19 GB    2.37 M      5.64%                        1.48 GB       1.80 M
Delete notes            23.09 GB    375.91 K    4.18%                        1.10 GB       2.00 M
Delete blobs            1.52 MB     4.05 K      87.85%                       23.09 GB      371.86 K
Delete milestones       1.16 MB     1.93 K      0.00%                        356.67 KB     2.12 K

Does best_compression make much of a difference?

It seemed to make things worse.
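
For reference, index.codec is a static setting, so testing it means creating a fresh index with best_compression and reindexing into it. A minimal sketch, with the same assumed index name as above:

```python
import requests

ES_URL = "http://localhost:9200"
SOURCE = "gitlab-production"  # assumed index name
DEST = f"{SOURCE}-best-compression"

# `index.codec` is a static setting, so it has to be set at index creation
# (or on a closed index); create a fresh index and reindex into it.
requests.put(f"{ES_URL}/{DEST}",
             json={"settings": {"index.codec": "best_compression"}})
requests.post(f"{ES_URL}/_reindex",
              json={"source": {"index": SOURCE}, "dest": {"index": DEST}})
requests.post(f"{ES_URL}/{DEST}/_forcemerge?max_num_segments=1")

for idx in (SOURCE, DEST):
    p = requests.get(f"{ES_URL}/{idx}/_stats/store").json()["_all"]["primaries"]
    print(idx, p["store"]["size_in_bytes"], "bytes")
```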

Correlate repository size to index size

  • Look at the rollout of all previous groups and approximate the relationship
Group        Repository size (GB)  Index size added (primary) (GB)  Ratio of index size added / repository size  Docs added (million)
Customer 1   0.703                 2.55                             3.627311522                                  0.1
Customer 2   21.1                  28.55                            1.353080569                                  1.4
Customer 3   1.46                  3.4                              2.328767123                                  0.1
Customer 4   0.783                 1                                1.277139208                                  0.1
Customer 5   94.1                  144                              1.530286929                                  3.2
GitLab Org   11                    21.2                             1.927272727                                  2.5
GitLab Com   78                    7.6                              0.09743589744                                2.3

Average of ratios:  1.734470568
TRIMMEAN of ratios: 1.683309311
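
For reproducibility, the average and TRIMMEAN above can be recomputed from the table; the TRIMMEAN here corresponds to dropping the single highest and lowest ratio before averaging.

```python
# Ratios of index size added to repository size, from the table above.
ratios = {
    "Customer 1": 2.55 / 0.703,
    "Customer 2": 28.55 / 21.1,
    "Customer 3": 3.4 / 1.46,
    "Customer 4": 1 / 0.783,
    "Customer 5": 144 / 94.1,
    "GitLab Org": 21.2 / 11,
    "GitLab Com": 7.6 / 78,
}

values = sorted(ratios.values())
average = sum(values) / len(values)
trimmed = sum(values[1:-1]) / len(values[1:-1])  # drop lowest and highest
print(f"average:  {average:.9f}")   # 1.734470568
print(f"trimmean: {trimmed:.9f}")   # 1.683309311
```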

In Summary

The ratio is all over the place, but on average the index adds 1.7 times the repository size. Customer 2 is probably the most representative, with a factor of 1.3.

There are some rounding errors in the others because I only tracked whole-GB increases, and with such small changes the numbers are within the margin of error of Elasticsearch's size fluctuations due to merges, so I wouldn't trust this data all that much.

You can see from the comparison of GitLab Org vs. GitLab Com that repository size can hugely overstate the expected index size when a repo contains many large files that will never end up in the index, since we skip binary files and files larger than 1 MB.

Also worth noting: these figures are primary storage size, so with replication in our cluster we actually use twice as much storage as described above.

Google sheet with analysis

https://docs.google.com/spreadsheets/d/14FhaebcZ0iyLNHiCQyMihSebdgv-3i0vySFo3ihPuss/edit#gid=0
