Skip to content

Speed up UpdateAllMirrorsWorker query on GitLab.com

Sean McGivern requested to merge speed-up-mirrors-check-query into master

What does this MR do?

Speed up UpdateAllMirrorsWorker query on GitLab.com

On GitLab.com, we check for the license on a project-by-project basis, as different namespaces can have different licenses. Pull mirroring is a Bronze feature, meaning it should only be available to:

  1. Paid projects.
  2. Public projects.

Previously, we offered pull mirroring for free even on private projects. After we stopped offering that, we started checking the plans in the query to fetch mirrors, to avoid fetching lots of projects we were going to do nothing with.

However, this query got slower over time, because we order by next_execution_timestamp, and next_execution_timestamp is stuck at some point before 2020-03-28 for these free private mirrors until the project is made public, or paid for. That means that the database has to do a large amount of work discarding those projects that we know we will never process.

For GitLab.com ONLY, we work around this by explicitly filtering on next_execution_timestamp > '2020-03-28'. This makes the query go from taking several seconds, to around 150 milliseconds.

In the long term, we should look to remove this workaround and simplify the query: #216783 (closed)

For #216252 (closed).

Query timings

Note that because we're constantly processing mirrors on GitLab.com, you'll never get the same results from two executions of the same query. First, let's check in a Ruby console:

# Current
UpdateAllMirrorsWorker.new.send(:pull_mirrors_batch, freeze_at: Time.now, batch_size: 500, offset_at: nil).count
# D, [2020-05-06T13:06:40.714017 #30186] DEBUG -- :    (2453.9ms)  SELECT COUNT(*) FROM (SELECT 1 AS one FROM "projects" INNER JOIN project_mirror_data import_state ON import_state.project_id = projects.id INNER JOIN namespaces AS root_namespaces ON root_namespaces.id = (WITH RECURSIVE "base_and_ancestors" AS ((SELECT "namespaces".* FROM "namespaces" WHERE (id = projects.namespace_id))
# UNION
# (SELECT "namespaces".* FROM "namespaces", "base_and_ancestors" WHERE "namespaces"."id" = "base_and_ancestors"."parent_id")) SELECT "id" FROM "base_and_ancestors" AS "namespaces" WHERE "namespaces"."parent_id" IS NULL) LEFT JOIN gitlab_subscriptions ON gitlab_subscriptions.namespace_id = root_namespaces.id LEFT JOIN plans ON plans.id = gitlab_subscriptions.hosted_plan_id WHERE "projects"."archived" = FALSE AND "projects"."mirror" = TRUE AND "import_state"."status" NOT IN ('scheduled', 'started') AND (import_state.next_execution_timestamp <= '2020-05-06 13:06:38.238744') AND (import_state.retry_count <= 14) AND (plans.name IN ('bronze','silver','gold','early_adopter') OR projects.visibility_level = 20) ORDER BY import_state.next_execution_timestamp LIMIT 500) subquery_for_count /*application:console*/
# => 282

# New
UpdateAllMirrorsWorker.new.send(:pull_mirrors_batch, freeze_at: Time.now, batch_size: 500, offset_at: Time.utc(2020, 3, 28)).count
# D, [2020-05-06T13:07:29.339924 #30186] DEBUG -- :    (135.4ms)  SELECT COUNT(*) FROM (SELECT 1 AS one FROM "projects" INNER JOIN project_mirror_data import_state ON import_state.project_id = projects.id INNER JOIN namespaces AS root_namespaces ON root_namespaces.id = (WITH RECURSIVE "base_and_ancestors" AS ((SELECT "namespaces".* FROM "namespaces" WHERE (id = projects.namespace_id))
# UNION
# (SELECT "namespaces".* FROM "namespaces", "base_and_ancestors" WHERE "namespaces"."id" = "base_and_ancestors"."parent_id")) SELECT "id" FROM "base_and_ancestors" AS "namespaces" WHERE "namespaces"."parent_id" IS NULL) LEFT JOIN gitlab_subscriptions ON gitlab_subscriptions.namespace_id = root_namespaces.id LEFT JOIN plans ON plans.id = gitlab_subscriptions.hosted_plan_id WHERE "projects"."archived" = FALSE AND "projects"."mirror" = TRUE AND "import_state"."status" NOT IN ('scheduled', 'started') AND (import_state.next_execution_timestamp <= '2020-05-06 13:07:29.198020') AND (import_state.retry_count <= 14) AND (import_state.next_execution_timestamp > '2020-03-28 00:00:00') AND (plans.name IN ('bronze','silver','gold','early_adopter') OR projects.visibility_level = 20) ORDER BY import_state.next_execution_timestamp LIMIT 500) subquery_for_count /*application:console*/
# => 276

And query plans:

Before: https://explain.depesz.com/s/18cr
                                                                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=357.35..445148.74 rows=500 width=758) (actual time=543.520..2497.460 rows=498 loops=1)
   ->  Nested Loop Left Join  (cost=357.35..1197735.79 rows=1346 width=758) (actual time=543.518..2497.387 rows=498 loops=1)
         Join Filter: (plans.id = gitlab_subscriptions.hosted_plan_id)
         Rows Removed by Join Filter: 275420
         Filter: (((plans.name)::text = ANY ('{bronze,silver,gold,early_adopter}'::text[])) OR (projects.visibility_level = 20))
         Rows Removed by Filter: 54582
         ->  Nested Loop Left Join  (cost=357.35..1197475.62 rows=1897 width=762) (actual time=231.271..2416.670 rows=55080 loops=1)
               ->  Nested Loop  (cost=356.91..1196582.11 rows=1897 width=762) (actual time=231.258..2000.251 rows=55080 loops=1)
                     ->  Nested Loop  (cost=0.99..520604.17 rows=1897 width=758) (actual time=231.121..949.211 rows=55080 loops=1)
                           ->  Index Scan using index_mirror_data_on_next_execution_and_retry_count on project_mirror_data import_state  (cost=0.56..152838.83 rows=148151 width=12) (actual time=0.587..263.207 rows=75641 loops=1)
                                 Index Cond: ((next_execution_timestamp <= now()) AND (retry_count <= 14))
                                 Filter: ((status)::text <> ALL ('{scheduled,started}'::text[]))
                                 Rows Removed by Filter: 98
                           ->  Index Scan using projects_pkey on projects  (cost=0.43..2.47 rows=1 width=750) (actual time=0.008..0.009 rows=1 loops=75641)
                                 Index Cond: (id = import_state.project_id)
                                 Filter: ((NOT archived) AND mirror)
                                 Rows Removed by Filter: 0
                     ->  Index Only Scan using namespaces_pkey on namespaces root_namespaces  (cost=355.92..356.33 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=55080)
                           Index Cond: (id = (SubPlan 2))
                           Heap Fetches: 4028
                           SubPlan 2
                             ->  CTE Scan on base_and_ancestors namespaces_2  (cost=353.47..355.49 rows=1 width=4) (actual time=0.015..0.016 rows=1 loops=55080)
                                   Filter: (parent_id IS NULL)
                                   Rows Removed by Filter: 0
                                   CTE base_and_ancestors
                                     ->  Recursive Union  (cost=0.43..353.47 rows=101 width=334) (actual time=0.010..0.014 rows=1 loops=55080)
                                           ->  Index Scan using namespaces_pkey on namespaces  (cost=0.43..3.45 rows=1 width=334) (actual time=0.008..0.008 rows=1 loops=55080)
                                                 Index Cond: (id = projects.namespace_id)
                                           ->  Nested Loop  (cost=0.43..34.80 rows=10 width=334) (actual time=0.002..0.002 rows=0 loops=70469)
                                                 ->  WorkTable Scan on base_and_ancestors  (cost=0.00..0.20 rows=10 width=4) (actual time=0.000..0.000 rows=1 loops=70469)
                                                 ->  Index Scan using namespaces_pkey on namespaces namespaces_1  (cost=0.43..3.45 rows=1 width=334) (actual time=0.002..0.002 rows=0 loops=70469)
                                                       Index Cond: (id = base_and_ancestors.parent_id)
               ->  Index Scan using index_gitlab_subscriptions_on_namespace_id on gitlab_subscriptions  (cost=0.43..0.46 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=55080)
                     Index Cond: (namespace_id = root_namespaces.id)
         ->  Materialize  (cost=0.00..4.09 rows=6 width=11) (actual time=0.000..0.000 rows=6 loops=55080)
               ->  Seq Scan on plans  (cost=0.00..4.06 rows=6 width=11) (actual time=0.007..0.008 rows=6 loops=1)
 Planning time: 9.597 ms
 Execution time: 2498.059 ms
(38 rows)
After: https://explain.depesz.com/s/D69X
                                                                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=357.35..76740.62 rows=76 width=758) (actual time=39.528..79.398 rows=229 loops=1)
   ->  Nested Loop Left Join  (cost=357.35..76740.62 rows=76 width=758) (actual time=39.528..79.364 rows=229 loops=1)
         Join Filter: (plans.id = gitlab_subscriptions.hosted_plan_id)
         Rows Removed by Join Filter: 3428
         Filter: (((plans.name)::text = ANY ('{bronze,silver,gold,early_adopter}'::text[])) OR (projects.visibility_level = 20))
         Rows Removed by Filter: 455
         ->  Nested Loop Left Join  (cost=357.35..76722.10 rows=107 width=762) (actual time=0.283..78.337 rows=684 loops=1)
               ->  Nested Loop  (cost=356.91..76671.70 rows=107 width=762) (actual time=0.255..63.073 rows=684 loops=1)
                     ->  Nested Loop  (cost=0.99..38543.27 rows=107 width=758) (actual time=0.198..48.306 rows=684 loops=1)
                           ->  Index Scan using index_mirror_data_on_next_execution_and_retry_count on project_mirror_data import_state  (cost=0.56..10472.44 rows=8338 width=12) (actual time=0.042..30.453 rows=2203 loops=1)
                                 Index Cond: ((next_execution_timestamp <= now()) AND (next_execution_timestamp > '2020-03-28 00:00:00+00'::timestamp with time zone) AND (retry_count <= 14))
                                 Filter: ((status)::text <> ALL ('{scheduled,started}'::text[]))
                                 Rows Removed by Filter: 29
                           ->  Index Scan using projects_pkey on projects  (cost=0.43..3.36 rows=1 width=750) (actual time=0.008..0.008 rows=0 loops=2203)
                                 Index Cond: (id = import_state.project_id)
                                 Filter: ((NOT archived) AND mirror)
                                 Rows Removed by Filter: 1
                     ->  Index Only Scan using namespaces_pkey on namespaces root_namespaces  (cost=355.92..356.33 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=684)
                           Index Cond: (id = (SubPlan 2))
                           Heap Fetches: 146
                           SubPlan 2
                             ->  CTE Scan on base_and_ancestors namespaces_2  (cost=353.47..355.49 rows=1 width=4) (actual time=0.017..0.018 rows=1 loops=684)
                                   Filter: (parent_id IS NULL)
                                   Rows Removed by Filter: 1
                                   CTE base_and_ancestors
                                     ->  Recursive Union  (cost=0.43..353.47 rows=101 width=334) (actual time=0.010..0.016 rows=2 loops=684)
                                           ->  Index Scan using namespaces_pkey on namespaces  (cost=0.43..3.45 rows=1 width=334) (actual time=0.008..0.008 rows=1 loops=684)
                                                 Index Cond: (id = projects.namespace_id)
                                           ->  Nested Loop  (cost=0.43..34.80 rows=10 width=334) (actual time=0.002..0.003 rows=0 loops=1089)
                                                 ->  WorkTable Scan on base_and_ancestors  (cost=0.00..0.20 rows=10 width=4) (actual time=0.000..0.000 rows=1 loops=1089)
                                                 ->  Index Scan using namespaces_pkey on namespaces namespaces_1  (cost=0.43..3.45 rows=1 width=334) (actual time=0.002..0.002 rows=0 loops=1089)
                                                       Index Cond: (id = base_and_ancestors.parent_id)
               ->  Index Scan using index_gitlab_subscriptions_on_namespace_id on gitlab_subscriptions  (cost=0.43..0.46 rows=1 width=8) (actual time=0.021..0.022 rows=1 loops=684)
                     Index Cond: (namespace_id = root_namespaces.id)
         ->  Materialize  (cost=0.00..4.09 rows=6 width=11) (actual time=0.000..0.000 rows=6 loops=684)
               ->  Seq Scan on plans  (cost=0.00..4.06 rows=6 width=11) (actual time=0.006..0.008 rows=6 loops=1)
 Planning time: 9.975 ms
 Execution time: 79.867 ms
(38 rows)

The key note from the query plans is the number of loops in 'CTE base_and_ancestors' - #216252 (comment 334514544) demonstrates why this is dropping so much.

Edited by Toon Claes

Merge request reports