Populate canonical emails

charlie ablett requested to merge 205383-canonical-migration into master

What does this MR do?

Continuation of !27722 (merged).

Background migration to generate a canonical email based on the user's primary email.

Canonical means the Agent part of the email address (the local part, before the `@`) with every `.` removed and everything from the first `+` onward dropped.

Scoped to Gmail, since Gmail ignores `.` and anything after `+` in the Agent part, so all variations of an address arrive in the same inbox.
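The rule described above can be sketched as a small Ruby helper. This is an illustration only, not the code in this MR: the method name `canonical_gmail` is hypothetical, and lowercasing the address and returning `nil` for non-Gmail or malformed input are assumptions not stated in the description.

```ruby
# Hypothetical helper (not from this MR): derive a canonical Gmail address.
# Assumptions: input is lowercased, and anything that is not a well-formed
# gmail.com address yields nil.
def canonical_gmail(email)
  # Split on the last "@" so only the final segment is treated as the domain.
  local, at, domain = email.strip.downcase.rpartition('@')
  return nil if at.empty? || local.empty?
  return nil unless domain == 'gmail.com'

  # Drop everything from the first "+" onward, then remove every ".".
  local = local.split('+', 2).first.to_s.delete('.')
  return nil if local.empty?

  "#{local}@#{domain}"
end
```

With this sketch, `canonical_gmail('user.name+tag@gmail.com')` and `canonical_gmail('username@gmail.com')` both map to the same canonical value, while an address on any other domain returns `nil`.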

According to the query below, there are 2611059 *@gmail.com addresses on gitlab.com.

If the migration processes 1,000 rows per minute, the roughly 2.6 million Gmail addresses take about 2,600 minutes, or about 43 hours (just under 2 days).
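The estimate can be reproduced with a few lines of Ruby. This is a rough sketch: the row count comes from the query plan estimate, the one-batch-of-1,000-rows-per-minute cadence is the assumption stated above, and the method name `migration_estimate` is hypothetical.

```ruby
# Hypothetical helper: rough duration estimate for a batched migration.
# Assumes exactly one batch is processed per minute.
def migration_estimate(total_rows, batch_size)
  batches = (total_rows.to_f / batch_size).ceil # the last batch may be partial
  minutes = batches                             # one batch per minute
  hours = minutes / 60.0
  { batches: batches, minutes: minutes, hours: hours, days: hours / 24.0 }
end

estimate = migration_estimate(2_611_059, 1_000)
puts format('%<batches>d batches, %<hours>.1f hours, %<days>.2f days', estimate)
```

Using the exact row count rather than the rounded 2.6 million shifts the result by only a few minutes, so the "just under 2 days" figure holds either way.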

DB query plan for all users with `@gmail.com` domains

Query: `EXPLAIN SELECT * FROM users WHERE email LIKE '%gmail.com%' ORDER BY users.id ASC LIMIT 1 OFFSET 100;`
Limit  (cost=24.26..24.50 rows=1 width=1202) (actual time=5.004..5.005 rows=1 loops=1)
  Buffers: shared hit=156 read=5
  I/O Timings: read=1.695
  ->  Index Scan using users_pkey on users  (cost=0.43..622171.36 rows=2611059 width=1202) (actual time=0.188..4.994 rows=101 loops=1)
        Filter: ((email)::text ~~ '%gmail.com%'::text)
        Rows Removed by Filter: 58
        Buffers: shared hit=156 read=5
        I/O Timings: read=1.695
Planning time: 12.575 ms
Execution time: 5.100 ms

Related to https://gitlab.com/gitlab-org/gitlab/issues/205383

Screenshots

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team

Closes #205383

Edited by charlie ablett