Skip to content

Update full-text regex to strip out strings with @

Heinrich Lee Yu requested to merge fix-tsvector-timeout into master

What does this MR do and why?

Fixes timeouts seen in gitlab-com/gl-infra/production#6528 (closed)

The exact query that timed out is:

INSERT INTO issue_search_data (project_id, issue_id, search_vector, created_at, updated_at)
SELECT
  project_id,
  id,
  setweight(to_tsvector('english', LEFT(title, 255)), 'A') || setweight(to_tsvector('english', LEFT(REGEXP_REPLACE(description, '[A-Za-z0-9+/]{50,}', ' ', 'g'), 1048576)), 'B'),
  NOW(),
  NOW()
FROM issues
WHERE issues.id IN (SELECT "issues"."id" FROM "issues" WHERE "issues"."id" BETWEEN 89570056 AND 89599068 AND "issues"."id" >= 89590728 AND "issues"."id" < 89590948)
ON CONFLICT DO NOTHING

It's actually the to_tsvector part that's slow and can be reproduced with:

SELECT to_tsvector('english', REPEAT('@t1', 5000));

It gets much slower as the string gets longer. More investigation details in https://gitlab.slack.com/archives/C3NBYFJ6N/p1647415206447629

We haven't really figured out why it is slow but we suspect it may be related to parsing of emails because it only happens with a combination of @, letters, and numbers.

This won't affect indexing of email addresses because those would include a . and that won't match the regex. And to_tsvector is actually fast when there's a period.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Heinrich Lee Yu

Merge request reports