Prepare partial non-latin index for issues

Heinrich Lee Yu requested to merge 364556-prepare-indexes-again into master

What does this MR do and why?

This is a second attempt at !93002 (merged).

The previous index was reverted because the regex I used was incorrect. The index is meant to match rows that contain non-Latin text, but it mistakenly excluded Russian text: I had assumed the Unicode ranges listed in https://en.wikipedia.org/wiki/Latin_script_in_Unicode were contiguous, and they are not. Cyrillic script (U+0400-U+04FF) sits between them.

I have now updated the ranges to include only the Latin blocks listed on that page. As noted in !92739 (comment 1031755206), some blocks are excluded because they are only used in historical languages, which also keeps the regex from getting too long.

I also excluded U+1D00-U+1DBF because there are some Cyrillic characters in that range. These characters are not used in normal English text anyway. If an issue does contain them, it will be searched using this trigram index rather than the full-text one, which should still be fine.
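
To illustrate the difference between a contiguous range and a list of separate Latin blocks, here is a minimal Python sketch. The two patterns and the helper `is_non_latin` are hypothetical and chosen only for illustration; they are not the regex used in this MR or in !93002, and the actual index predicate is evaluated by PostgreSQL, not Python.

```python
import re

# Hypothetical ranges, for illustration only (not the MR's actual regex).
# A single contiguous range also swallows Cyrillic (U+0400-U+04FF).
CONTIGUOUS = re.compile(r'[\u0000-\u218F]*')
# Separate Latin-oriented ranges that skip Cyrillic and U+1D00-U+1DBF.
LATIN_ONLY = re.compile(r'[\u0000-\u02FF\u1E00-\u1EFF\u2070-\u218F]*')


def is_non_latin(text, pattern=LATIN_ONLY):
    """Return True if text has at least one character outside the pattern's ranges."""
    return pattern.fullmatch(text) is None


if __name__ == "__main__":
    for sample in ["Fix login bug", "café", "Привет", "日本語"]:
        print(sample, is_non_latin(sample, CONTIGUOUS), is_non_latin(sample, LATIN_ONLY))
```

Under these assumed ranges, Russian text is treated as Latin by the contiguous pattern and as non-Latin by the split pattern, which is the behaviour the corrected index needs.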

This also means the index will be a little larger than before, but we still save a considerable amount of storage. In the comparison below, each non-latin index is roughly 15% of the size of the corresponding full trigram index:

gitlabhq_dblab=# SELECT pg_size_pretty(pg_total_relation_size('index_issues_on_title_trigram')), pg_size_pretty(pg_total_relation_size('index_issues_on_title_trigram_non_latin'));
 pg_size_pretty | pg_size_pretty
----------------+----------------
 8431 MB        | 1295 MB
(1 row)

gitlabhq_dblab=# SELECT pg_size_pretty(pg_total_relation_size('index_issues_on_description_trigram')), pg_size_pretty(pg_total_relation_size('index_issues_on_description_trigram_non_latin'));
 pg_size_pretty | pg_size_pretty
----------------+----------------
 53 GB          | 8286 MB
(1 row)

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #364556 (closed)

Edited by Heinrich Lee Yu