Reduce unnecessary duplication in the Elasticsearch commit index
Currently, Elasticsearch indexing treats all repositories as separate and has no knowledge of fork relationships. That means each commit in a repository is indexed again for every one of its forks.
This is a lot of redundant data to store: gitlab-ce has almost 3,000 forks, each with 50,000 commits in master, and every one of those commits is indexed and stored once per fork.
In https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/1480, we make use of some novel Elasticsearch functionality: an indexed array of integers.
Can we modify our Elasticsearch index so that commits go into a common index, with each commit document carrying an integer array of the IDs of the projects that contain it? Would we be able to atomically update that array? Would querying based on its contents be efficient if it contains thousands of entries?
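
To make the question concrete, here is a rough, untested sketch of the two request bodies this would probably involve, built as plain Python dicts. The field names `project_ids` and `message` and both helper functions are hypothetical. It assumes a version of Elasticsearch with Painless scripting, where the Update API accepts `script` and `upsert`; older versions name the script key `inline` rather than `source`. It says nothing about whether the update is atomic enough for us or whether the filter stays fast with thousands of IDs, which are still the open questions.

```python
import json


def add_project_update_body(project_id):
    """Body for an Update API request that adds one project ID to a
    commit document's `project_ids` array (hypothetical field name)."""
    return {
        "script": {
            "lang": "painless",
            # Only append the ID if it is not already in the array.
            "source": (
                "if (!ctx._source.project_ids.contains(params.project_id)) "
                "{ ctx._source.project_ids.add(params.project_id) }"
            ),
            "params": {"project_id": project_id},
        },
        # Used when the commit document does not exist yet. A real upsert
        # would also need the commit's own fields (sha, message, ...).
        "upsert": {"project_ids": [project_id]},
    }


def commits_in_project_query(project_id, text):
    """Search body: full-text match on the commit message (hypothetical
    `message` field), restricted to commits that list `project_id`."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"message": text}},
                # A term filter on an array field matches if any element equals the value.
                "filter": {"term": {"project_ids": project_id}},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(add_project_update_body(42), indent=2))
    print(json.dumps(commits_in_project_query(42, "fix typo"), indent=2))
```

With this shape, forking a project would mean one scripted update per commit to add the new project ID, rather than one new document per commit.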
/cc @smcgivern @vsizov