Change more Elasticsearch indexes to keyword type

What does this MR do?

Related to #213035 (closed).

The Elasticsearch keyword type "is used for structured content such as IDs, email addresses, hostnames, status codes, zip codes, or tags". This type is preferred over the current text type for the fields below because the text type takes up more storage.

The text type analyzes the value as though it were human-readable text (i.e. it splits the value into words) and indexes each word separately in the inverted index. As such, the text type will usually take up more space in the inverted index and should only be used when you need to search for individual words in the text.
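The difference can be sketched with a toy inverted index. This is only an illustration: `simple_tokenize` is a hypothetical, deliberately simplified stand-in (lowercase, split on non-alphanumeric characters) and not the analyzer Elasticsearch or GitLab actually uses, and the branch names are made up.

```python
import re
from collections import defaultdict


def simple_tokenize(value):
    """Hypothetical stand-in for a text analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", value.lower()) if t]


def build_index(docs, as_keyword):
    """Map each indexed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        # keyword: the whole value is one term; text: one term per word
        terms = [value] if as_keyword else simple_tokenize(value)
        for term in terms:
            index[term].add(doc_id)
    return dict(index)


# Made-up branch names, for illustration only.
branches = {
    1: "feature/add-keyword-mapping",
    2: "feature/fix-keyword-search",
}

text_index = build_index(branches, as_keyword=False)
keyword_index = build_index(branches, as_keyword=True)

print(sorted(text_index))     # individual words
print(sorted(keyword_index))  # whole branch names
```

In this toy model the text-style index holds one entry per distinct word, while the keyword-style index holds one entry per distinct value. Term count is only a rough proxy for disk usage, but it shows why long, multi-word values such as branch names are the more expensive case.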

For each of the fields changed here, splitting into words adds no value and possibly makes certain searches incorrect. In local testing this change appears to save ~4% disk storage.
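In mapping terms, the change amounts to swapping the field type. The fragment below is a hypothetical, trimmed-down sketch written as Python dicts; the real GitLab mappings contain more fields and options, and only the field names are taken from this MR.

```python
# Hypothetical, simplified mapping fragments (not the actual GitLab mapping).
FIELDS = ["state", "merge_status", "target_branch", "source_branch"]


def mapping_for(field_type):
    """Build a minimal Elasticsearch-style mapping with every field set to field_type."""
    return {"properties": {name: {"type": field_type} for name in FIELDS}}


before = mapping_for("text")     # analyzed: values are split into words
after = mapping_for("keyword")   # not analyzed: each value is one term

print(after["properties"]["target_branch"])
```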

Impact for each field

As per #213035 (comment 439629162) here is the reasoning on a per field basis:

  1. state/merge_status => We only do exact matches against this for filtering. Each value is a single word, so changing to keyword won't make any difference.
  2. target_branch/source_branch => These are not used in any searches today, so there is no risk in changing the index options. Changing these to keyword should give a decent storage improvement, as branch names can be quite long and composed of many words.
  3. merge_status => This is not used in any searches today, so there is no risk in changing the index options. The values appear to be things like can_be_merged/cannot_be_merged/unchecked, which implies to me that it should be a keyword anyway: splitting these by word would produce wrong results if we ever did filter on it. The change will also save some storage.
  4. commit.(committer/author).email => This is used in commit searches today, and it's hard to know exactly how it might be used by our current users. Users will lose some behaviour if they were searching for partial email addresses before. For example, today you can search for dyl.griffith and you will find commits authored by my email address, which starts with dyl.griffith. After this change to keyword you'd need to search for the entire exact email address, or you could use the prefix search dyl.griffith* instead. However, since wildcards (prefix searches) can only be used at the end of a word, you will not be able to search for griffith alone after this change.
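The behaviour change for email fields can be modelled roughly as follows. This is a simplified sketch, assuming a keyword field only matches the whole stored value or a prefix of it when the query ends in `*`; the function name and the email address are hypothetical, and it does not model the actual query syntax GitLab sends to Elasticsearch.

```python
def keyword_search(stored, query):
    """Toy model of matching against a keyword field.

    A query ending in '*' is treated as a prefix search;
    anything else must equal the stored value exactly.
    """
    if query.endswith("*"):
        return stored.startswith(query[:-1])
    return stored == query


# Made-up address, for illustration only.
email = "dyl.griffith@example.com"

for query in ["dyl.griffith@example.com", "dyl.griffith*", "dyl.griffith", "griffith", "griffith*"]:
    print(query, keyword_search(email, query))
```

Under this model the full address and the trailing-wildcard prefix match, while a bare partial address or a fragment from the middle of the address does not.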

Screenshots (strongly suggested)

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team


Edited by Dylan Griffith