Consider replacing SHA term matching on ngrams with prefix search

Summary

Currently, SHAs are indexed using ngrams from 5 to 40 characters. This means that each SHA is split into roughly 35 separate terms, which takes up a lot of storage. SHAs are usually unique from 4-5 characters on, so a simple prefix search should be sufficiently fast and as effective as ngrams with term matching.

Improvements

Replace the ngram analyzer with a lowercased keyword field and use a prefix search in the code. This reduced the index size by ~13% in our tests.
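
As a rough illustration of what a lowercased keyword field could look like, here is a minimal sketch that builds an index body as a Python dict. The field name `sha` and the normalizer name `sha_normalizer` are assumptions made for this example; the mapping actually used in the tests is the attached JSON file, which may differ.

```python
import json

# Hypothetical field name, chosen only for this sketch.
SHA_FIELD = "sha"


def sha_keyword_index_body() -> dict:
    """Build an index body that stores SHAs as a single lowercased keyword term."""
    return {
        "settings": {
            "analysis": {
                "normalizer": {
                    # Hypothetical normalizer name; lowercases the whole value
                    # without splitting it into ngrams.
                    "sha_normalizer": {
                        "type": "custom",
                        "filter": ["lowercase"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                SHA_FIELD: {
                    "type": "keyword",
                    "normalizer": "sha_normalizer",
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(sha_keyword_index_body(), indent=2))
```

With a mapping along these lines, each SHA is stored as one term instead of one term per ngram, which is where the storage saving would come from.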

Risks

GitLab's search needs to issue a prefix search instead of a term match query. This will be a bit slower than term matching and increases the complexity of the client, as the fields containing a SHA need to be queried differently from the rest.
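
The extra client complexity could look roughly like the sketch below, which picks a query type per field. The `SHA_FIELDS` set, the `build_field_query` helper, and the use of a `match` query for non-SHA fields are assumptions for illustration, not GitLab's actual search code.

```python
# Hypothetical set of fields that hold a SHA.
SHA_FIELDS = {"sha"}


def build_field_query(field: str, value: str) -> dict:
    """Return an Elasticsearch query clause for one field.

    SHA fields get a prefix query; every other field keeps a match query.
    """
    if field in SHA_FIELDS:
        # Lowercase on the client so the prefix lines up with the
        # lowercased keyword term stored in the index.
        return {"prefix": {field: {"value": value.lower()}}}
    return {"match": {field: {"query": value}}}


if __name__ == "__main__":
    print(build_field_query("sha", "A1B2C3D"))
    print(build_field_query("title", "Fix bug"))
```

Lowercasing the input on the client is a defensive choice here, so the sketch does not depend on whether the server normalizes prefix queries.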

Elasticsearch mapping used for our tests: replace_sha_ngrams_with_keyword.json

Relates to #3327 (closed)

Edited May 31, 2022 by 🤖 GitLab Bot 🤖