
Analysis of where most of our Elasticsearch storage is used

What

Analyse the current production index and try to assess what the largest contributions to index size are. For example, it may be useful to understand the relative impact on index size of:

  • commits
  • code
  • issues
  • merge requests
  • comments

It's tricky to assess this fully, considering that all of these are stored in a single Elasticsearch index and as such they actually share the same Lucene indices under the hood, but document counts and the sizes of the _source fields may be helpful indicators. To gain deeper understanding we may need to try some experiments where we index a whole project with and without its comments, but this would require exporting a realistic project from these groups, and the results may not extrapolate well depending on how much variation there is between different types of projects.
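
As a first pass, something like the following could pull the total size and per-type document counts. This is only a sketch: it assumes a local copy of the index named gitlab-production and a keyword-mapped `type` field on each document, so adjust both to whatever the real mapping uses.

```python
import requests

ES_URL = "http://localhost:9200"  # local copy of the production index
INDEX = "gitlab-production"       # assumed index name

# Overall primary store size and document count.
stats = requests.get(f"{ES_URL}/{INDEX}/_stats/store,docs").json()
primaries = stats["_all"]["primaries"]
print("total size (bytes):", primaries["store"]["size_in_bytes"])
print("total docs:", primaries["docs"]["count"])

# Per-type document counts, assuming each document carries a
# keyword-mapped `type` field (blob, commit, issue, merge_request, note, ...).
body = {"size": 0, "aggs": {"by_type": {"terms": {"field": "type", "size": 20}}}}
resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=body).json()
for bucket in resp["aggregations"]["by_type"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```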

End goal

See if we can use this information to better forecast how much storage we expect other groups to use, for the purpose of https://gitlab.com/gitlab-org/gitlab/issues/118571#note_268217520 . Also find out whether there is any low-hanging fruit for massively reducing storage size. For example:

  • we may wish to stop indexing comments if we find that they contribute massively to total storage but don't add as much value as global code search
  • we may wish to reduce the maximum size of indexed files if we see that limiting it to 100 KB could massively reduce index size compared to the current 1 MB limit (see the sketch below)
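
As a rough way to gauge the second idea before running any experiments, something like this could count blob documents above 100 KB and sum their raw sizes. Note that `blob.size` is a hypothetical field name and may not exist in the real mapping; the `type` field is assumed as above.

```python
import requests

ES_URL = "http://localhost:9200"
INDEX = "gitlab-production"  # assumed index name

# Count blob documents above 100 KB and sum their raw sizes. `blob.size`
# is a hypothetical field name; substitute whatever the real mapping
# stores file size under (if anything).
body = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"type": "blob"}},                    # assumed type field
                {"range": {"blob.size": {"gt": 100 * 1024}}},  # hypothetical field
            ]
        }
    },
    "aggs": {"bytes_over_limit": {"sum": {"field": "blob.size"}}},
}
resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=body).json()
print("blobs over 100 KB:", resp["hits"]["total"])
print("raw bytes in those blobs:", resp["aggregations"]["bytes_over_limit"]["value"])
```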

Plan

  1. Export all the data locally
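
One possible way to do the export is reindex-from-remote into a local cluster. A minimal sketch, assuming the local cluster lists the production host in its reindex.remote.whitelist setting; the host and index names below are placeholders.

```python
import requests

LOCAL = "http://localhost:9200"
INDEX = "gitlab-production"  # assumed index name

# Pull the production index into a local cluster with reindex-from-remote.
# The remote host must be listed in the local cluster's
# `reindex.remote.whitelist` setting; the host below is a placeholder.
body = {
    "source": {
        "remote": {"host": "https://production-es.example.com:9200"},
        "index": INDEX,
    },
    "dest": {"index": INDEX},
}
resp = requests.post(f"{LOCAL}/_reindex?wait_for_completion=false", json=body)
print(resp.json())  # returns a task id that can be polled via the _tasks API
```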

Store things in separate indices and compare index sizes

This can be interesting for two reasons. First, I believe it's possible we'll use less storage if things are stored in separate indices, since different object types have different fields (I'm not 100% sure it helps, but it would be interesting to know). Second, the relative sizes of all the indices are an interesting data point in their own right.

  1. Re-index everything for each data type
  2. Use delete-by-query to remove from each index everything that should not be in it
  3. Force merge all indices https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html
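
A rough sketch of these three steps, with the same assumed index name and `type` field values as above:

```python
import requests

ES_URL = "http://localhost:9200"
SOURCE = "gitlab-production"  # assumed index name
TYPES = ["blob", "commit", "issue", "merge_request", "note",
         "milestone", "wiki_blob", "snippet"]  # assumed `type` values

for t in TYPES:
    dest = f"{SOURCE}-{t}"
    # 1. Copy the whole index into a per-type index.
    requests.post(f"{ES_URL}/_reindex",
                  json={"source": {"index": SOURCE}, "dest": {"index": dest}})
    # 2. Delete everything that should not be in this index.
    requests.post(f"{ES_URL}/{dest}/_delete_by_query?conflicts=proceed",
                  json={"query": {"bool": {"must_not": [{"term": {"type": t}}]}}})
    # 3. Force merge so the deleted documents are purged from disk.
    requests.post(f"{ES_URL}/{dest}/_forcemerge?max_num_segments=1")
    stats = requests.get(f"{ES_URL}/{dest}/_stats/store,docs").json()
    p = stats["_all"]["primaries"]
    print(dest, p["store"]["size_in_bytes"], "bytes,", p["docs"]["count"], "docs")
```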

Count size as you delete data

  1. Force merge index https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html
  2. Get statistics
  3. Delete all notes
  4. Force merge index https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html
  5. Calculate notes contribution to index size
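
A sketch of this measurement loop, again assuming the index name and `type` field values; the table that follows shows the results measured on the real index.

```python
import requests

ES_URL = "http://localhost:9200"
INDEX = "gitlab-production"  # assumed index name
# Deletion order matching the table below; `type` values are assumed.
ORDER = ["wiki_blob", "snippet", "merge_request", "issue",
         "commit", "note", "blob", "milestone"]

def merged_stats():
    # Force merge first so deleted documents stop counting toward size.
    requests.post(f"{ES_URL}/{INDEX}/_forcemerge?max_num_segments=1")
    stats = requests.get(f"{ES_URL}/{INDEX}/_stats/store,docs").json()
    p = stats["_all"]["primaries"]
    return p["store"]["size_in_bytes"], p["docs"]["count"]

size, docs = merged_stats()
print(f"initial: {size} bytes, {docs} docs")
for t in ORDER:
    requests.post(f"{ES_URL}/{INDEX}/_delete_by_query?conflicts=proceed",
                  json={"query": {"term": {"type": t}}})
    new_size, new_docs = merged_stats()
    print(f"delete {t}s: -{size - new_size} bytes, -{docs - new_docs} docs")
    size, docs = new_size, new_docs
```
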
Step                    Size after  Docs after  Proportion of initial total  Size reduced  Docs reduced
Initial                 26.28 GB    4.51 M      -                            -             -
Delete wiki blobs       26.28 GB    4.51 M      0.00%                        744.06 KB     0.01 K
Delete snippets         26.22 GB    4.51 M      0.23%                        60.12 MB      7.29 K
Delete merge requests   26.12 GB    4.38 M      0.38%                        99.14 MB      127.37 K
Delete issues           25.67 GB    4.17 M      1.71%                        449.20 MB     210.11 K
Delete commits          24.19 GB    2.37 M      5.64%                        1.48 GB       1.80 M
Delete notes            23.09 GB    375.91 K    4.18%                        1.10 GB       2.00 M
Delete blobs            1.52 MB     4.05 K      87.85%                       23.09 GB      371.86 K
Delete milestones       1.16 MB     1.93 K      0.00%                        356.67 KB     2.12 K

Does best_compression make much of a difference?

It seemed to make things worse.
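
For reference, index.codec is a static setting, so testing it means creating a fresh index with best_compression and reindexing into it. A minimal sketch, with the same assumed index name as above:

```python
import requests

ES_URL = "http://localhost:9200"
SOURCE = "gitlab-production"  # assumed index name
DEST = f"{SOURCE}-best-compression"

# `index.codec` is a static setting, so it has to be set at index creation
# (or on a closed index); create a fresh index and reindex into it.
requests.put(f"{ES_URL}/{DEST}",
             json={"settings": {"index.codec": "best_compression"}})
requests.post(f"{ES_URL}/_reindex",
              json={"source": {"index": SOURCE}, "dest": {"index": DEST}})
requests.post(f"{ES_URL}/{DEST}/_forcemerge?max_num_segments=1")

for idx in (SOURCE, DEST):
    p = requests.get(f"{ES_URL}/{idx}/_stats/store").json()["_all"]["primaries"]
    print(idx, p["store"]["size_in_bytes"], "bytes")
```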

Correlate repository size to index size

  • Look at the rollout of all previous groups and approximate the relationship
Group        Repository size (GB)  Index size added (primary) (GB)  Ratio of index size added / repository size  Docs added (million)
Customer 1   0.703                 2.55                             3.627311522                                  0.1
Customer 2   21.1                  28.55                            1.353080569                                  1.4
Customer 3   1.46                  3.4                              2.328767123                                  0.1
Customer 4   0.783                 1                                1.277139208                                  0.1
Customer 5   94.1                  144                              1.530286929                                  3.2
GitLab Org   11                    21.2                             1.927272727                                  2.5
GitLab Com   78                    7.6                              0.09743589744                                2.3

Average of ratios:  1.734470568
TRIMMEAN of ratios: 1.683309311
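
For reproducibility, the average and TRIMMEAN above can be recomputed from the table; the TRIMMEAN here corresponds to dropping the single highest and lowest ratio before averaging.

```python
# Ratios of index size added to repository size, from the table above.
ratios = {
    "Customer 1": 2.55 / 0.703,
    "Customer 2": 28.55 / 21.1,
    "Customer 3": 3.4 / 1.46,
    "Customer 4": 1 / 0.783,
    "Customer 5": 144 / 94.1,
    "GitLab Org": 21.2 / 11,
    "GitLab Com": 7.6 / 78,
}

values = sorted(ratios.values())
average = sum(values) / len(values)
trimmed = sum(values[1:-1]) / len(values[1:-1])  # drop lowest and highest
print(f"average:  {average:.9f}")   # 1.734470568
print(f"trimmean: {trimmed:.9f}")   # 1.683309311
```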

In Summary

The ratio is all over the place, but on average the index adds 1.7 times the repository size. Customer 2 is probably the most representative, with a factor of 1.3.

There are some rounding errors in the others because I only tracked whole-GB increases, and with such small changes the numbers are within the margin of error of Elasticsearch's size fluctuations due to merges, so I wouldn't trust this data all that much.

You can see from the comparison of GitLab Org vs. GitLab Com that repository size can hugely overstate the expected index size when a repo contains many large files that will never end up in the index, since we skip binary files and files larger than 1 MB.

Also worth noting: these figures are primary storage size, so with replication in our cluster we actually use twice as much storage as described above.

Google sheet with analysis

https://docs.google.com/spreadsheets/d/14FhaebcZ0iyLNHiCQyMihSebdgv-3i0vySFo3ihPuss/edit#gid=0
