Split CI minutes resets into different workers (!29017)

Fabio Pitino requested to merge split-clear-shared-runners-minutes-worker into master Apr 07, 2020

What does this MR do?

In #213223 (comment 317200857) we noticed that ClearSharedRunnersMinutesWorker was being killed by the Sidekiq Memory Killer because it ran for a very long time and consumed a lot of memory due to the amount of data being processed.

This worker does not scale because it processes an ever-increasing number of namespaces and projects.

In this MR I've taken a different approach.

1. ClearSharedRunnersMinutesWorker runs as a cron job on the 1st of every month.
2. Based on the total number of namespaces, it pre-batches the work by ID range and enqueues one Ci::BatchResetMinutesWorker per ID range. GitLab.com currently has almost 7M namespaces, so rather than having 1 worker deal with 7M updates, each worker handles a constant batch of 100,000 records. As of today, that means about 70 Ci::BatchResetMinutesWorker jobs in total.
3. Each Ci::BatchResetMinutesWorker performs the existing logic of Namespace#reset_ci_minutes(ids) in batches of 1000.
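
The pre-batching step can be illustrated with a minimal sketch. This is not the code from the MR: `id_ranges` is a hypothetical helper, the maximum ID of 7,000,000 is a rounded stand-in for the "almost 7M" figure, and the way each range is handed to the worker is an assumption.

```ruby
# Hypothetical sketch: split the namespace ID space into fixed-size ranges,
# one range per Ci::BatchResetMinutesWorker job.
BATCH_SIZE = 100_000

def id_ranges(max_id, batch_size = BATCH_SIZE)
  (1..max_id).step(batch_size).map do |from|
    to = [from + batch_size - 1, max_id].min # the last range may be shorter
    [from, to]
  end
end

ranges = id_ranges(7_000_000)
puts ranges.size       # 70
puts ranges[1].inspect # [100001, 200000]

# In the real cron worker each range would be enqueued as a separate job;
# the exact arguments the worker takes are an assumption here.
```

The cost of each job stays bounded by the range size, so growth in the number of namespaces adds more jobs rather than making a single job longer.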

Other notable changes:

One of the reasons the CI minutes were not fully processed was the 1-hour exclusive lease taken by ClearSharedRunnersMinutesWorker. Sidekiq Memory Killer killed and restarted the worker after 29 minutes of running, and the restarted worker exited immediately (and successfully) because the lease was still held. As a result, the worker was never retried.

In this MR I've also:

not used `include CronjobQueue` in Ci::BatchResetMinutesWorker, as it disables retries. We want to be able to retry a batch if it fails for some reason (e.g. SQL timeouts)
not used the exclusive lease in the new strategy, so that if the worker is killed and retried it can resume processing immediately
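
The lease problem described above can be reproduced with a toy model. This is only a sketch: `ToyLease`, `run_with_lease` and the lease key are hypothetical names, and the in-memory hash stands in for GitLab's actual lease implementation, which is not shown here.

```ruby
# Toy in-memory lease, NOT GitLab's real exclusive lease.
class ToyLease
  def initialize
    @expires_at = {}
  end

  # Returns true if the lease was obtained, false if it is still held.
  def try_obtain(key, ttl:, now:)
    return false if @expires_at[key] && @expires_at[key] > now

    @expires_at[key] = now + ttl
    true
  end
end

# A worker that exits "successfully" without doing anything when the lease is taken.
def run_with_lease(lease, now:)
  return :skipped unless lease.try_obtain('reset_minutes', ttl: 3600, now: now)

  :processed
end

lease = ToyLease.new
first_run = run_with_lease(lease, now: 0)       # starts processing, killed after ~29 min
restarted = run_with_lease(lease, now: 29 * 60) # lease still held for another 31 min
puts first_run # processed
puts restarted # skipped
```

Because the skipped run returns normally, Sidekiq has no failure to retry, which is why the remaining namespaces were never processed.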

Feature flag

This new logic is switched on by default, behind the ci_parallel_minutes_reset feature flag.

Query plans

To give an idea of the overall load that ClearSharedRunnersMinutesWorker generates, here are the query plans for what happens in each batch. Each Ci::BatchResetMinutesWorker receives a range of 100,000 IDs to process.

Let's consider the 2nd instance of Ci::BatchResetMinutesWorker (processing IDs from 100,001 to 200,000).

1. Inside `Namespace.reset_ci_minutes_for_batch!` we use `each_batch` to process the 100,000 namespaces in further batches of `1000`

SELECT "namespaces"."id" FROM "namespaces" 
WHERE "namespaces"."id" BETWEEN 100001 AND 200000 AND "namespaces"."id" >= 100001 
ORDER BY "namespaces"."id" ASC LIMIT 1 OFFSET 1000

/chatops run explain SELECT "namespaces"."id" FROM "namespaces" WHERE "namespaces"."id" BETWEEN 100001 AND 200000 AND "namespaces"."id" >= 100001 ORDER BY "namespaces"."id" ASC LIMIT 1 OFFSET 1000

Limit  (cost=189.60..189.79 rows=1 width=4) (actual time=1.686..1.687 rows=1 loops=1)
  Buffers: shared hit=194 read=1
  I/O Timings: read=0.016
  ->  Index Only Scan using namespaces_pkey on namespaces  (cost=0.43..17415.18 rows=92060 width=4) (actual time=0.090..1.622 rows=1001 loops=1)
        Index Cond: ((id >= 100001) AND (id <= 200000) AND (id >= 100001))
        Heap Fetches: 674
        Buffers: shared hit=194 read=1
        I/O Timings: read=0.016
Planning time: 2.378 ms
Execution time: 1.718 ms
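
The `LIMIT 1 OFFSET 1000` query above looks up the ID at which the next sub-batch starts, so each sub-batch can then be addressed as `id >= start AND id < stop`. The following in-memory sketch imitates that pattern on a sorted array of IDs. It is a simplification written for illustration, not the `each_batch` implementation, and `each_id_batch` is a hypothetical name.

```ruby
# Simplified imitation of keyset batching over a sorted list of IDs.
def each_id_batch(sorted_ids, of: 1000)
  start = sorted_ids.first
  while start
    # Equivalent of: WHERE id >= start ORDER BY id ASC LIMIT 1 OFFSET `of`
    remaining = sorted_ids.select { |id| id >= start }
    stop = remaining[of] # nil when no further batch exists

    # Equivalent of: WHERE id >= start AND id < stop
    batch = stop ? remaining.take(of) : remaining
    yield batch
    start = stop
  end
end

# Gaps in the ID sequence mimic deleted rows.
ids = (100_001..103_500).reject { |id| (id % 7).zero? }
each_id_batch(ids, of: 1000) do |batch|
  puts "#{batch.first}..#{batch.last} (#{batch.size} ids)"
end
```

Looking up only the boundary ID keeps the bookkeeping query cheap: it is served by an index-only scan on the primary key, as the plan above shows.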

2. Then we execute `Namespace.recalculate_extra_shared_runners_minutes_limits!(namespaces)`

UPDATE "namespaces" 
SET extra_shared_runners_minutes_limit = GREATEST((namespaces.shared_runners_minutes_limit + namespaces.extra_shared_runners_minutes_limit) - ROUND(namespace_statistics.shared_runners_seconds / 60.0), 0) 
FROM namespace_statistics 
WHERE "namespaces"."id" BETWEEN 100001 AND 200000 
  AND "namespaces"."id" >= 100001 AND "namespaces"."id" < 101001 
  AND (namespaces.shared_runners_minutes_limit > 0) 
  AND (namespaces.extra_shared_runners_minutes_limit > 0) 
  AND (namespace_statistics.namespace_id = namespaces.id) 
  AND (namespace_statistics.shared_runners_seconds > (namespaces.shared_runners_minutes_limit * 60));

ModifyTable on public.namespaces  (cost=0.86..1905.35 rows=1 width=354) (actual time=417.958..417.959 rows=0 loops=1)
   Buffers: shared hit=66 read=104 dirtied=1
   I/O Timings: read=416.707
   ->  Nested Loop  (cost=0.86..1905.35 rows=1 width=354) (actual time=417.955..417.955 rows=0 loops=1)
         Buffers: shared hit=66 read=104 dirtied=1
         I/O Timings: read=416.707
         ->  Index Scan using namespaces_pkey on public.namespaces  (cost=0.43..1900.88 rows=1 width=336) (actual time=417.952..417.952 rows=0 loops=1)
               Index Cond: ((namespaces.id >= 100001) AND (namespaces.id <= 200000) AND (namespaces.id >= 100001) AND (namespaces.id < 101001))
               Filter: ((namespaces.shared_runners_minutes_limit > 0) AND (namespaces.extra_shared_runners_minutes_limit > 0))
               Rows Removed by Filter: 937
               Buffers: shared hit=66 read=104 dirtied=1
               I/O Timings: read=416.707
         ->  Index Scan using index_namespace_statistics_on_namespace_id on public.namespace_statistics  (cost=0.42..4.45 rows=1 width=14) (actual time=0.000..0.000 rows=0 loops=0)
               Index Cond: (namespace_statistics.namespace_id = namespaces.id)
               Filter: (namespace_statistics.shared_runners_seconds > (namespaces.shared_runners_minutes_limit * 60))
               Rows Removed by Filter: 0
Time: 418.864 ms
  - planning: 0.700 ms
  - execution: 418.164 ms
    - I/O read: 416.707 ms
    - I/O write: 0.000 ms
Shared buffers:
  - hits: 66 (~528.00 KiB) from the buffer pool
  - reads: 104 (~832.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 1 (~8.00 KiB)
  - writes: 0
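
To make the `SET` expression in the UPDATE above concrete, here is a small worked example in Ruby. The method name `recalculated_extra_minutes` and the sample numbers are made up for illustration. The guard clause mirrors the `WHERE` conditions of the query: rows that do not match keep their current value.

```ruby
# Mirrors: GREATEST((limit + extra) - ROUND(used_seconds / 60.0), 0)
# applied only where limit > 0, extra > 0 and used_seconds > limit * 60.
def recalculated_extra_minutes(limit:, extra:, used_seconds:)
  return extra unless limit > 0 && extra > 0 && used_seconds > limit * 60

  [(limit + extra) - (used_seconds / 60.0).round, 0].max
end

# 2000 monthly minutes + 500 extra, 2300 minutes used: 200 extra minutes remain.
puts recalculated_extra_minutes(limit: 2000, extra: 500, used_seconds: 138_000) # 200

# Usage exceeded both quotas: the extra minutes are clamped at 0.
puts recalculated_extra_minutes(limit: 2000, extra: 500, used_seconds: 180_000) # 0

# Usage stayed within the monthly limit: the row is not touched.
puts recalculated_extra_minutes(limit: 2000, extra: 500, used_seconds: 60_000)  # 500
```

In other words, extra minutes are only consumed by the part of the usage that exceeds the monthly limit.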

3. Then `Namespace.reset_shared_runners_seconds!(namespaces)` which resets minutes for the namespaces and related projects

UPDATE "namespace_statistics" 
SET "shared_runners_seconds" = 0, "shared_runners_seconds_last_reset" = '2020-04-10 07:39:54.487884' 
WHERE "namespace_statistics"."namespace_id" IN (
    SELECT "namespaces"."id" FROM "namespaces" WHERE "namespaces"."id" BETWEEN 100001 AND 200000 AND "namespaces"."id" >= 100001 AND "namespaces"."id" < 101001
  ) 
  AND "namespace_statistics"."shared_runners_seconds" != 0;

UPDATE "project_statistics" 
SET "shared_runners_seconds" = 0, "shared_runners_seconds_last_reset" = '2020-04-10 07:39:54.489526' 
WHERE "project_statistics"."namespace_id" IN (
    SELECT "namespaces"."id" FROM "namespaces" WHERE "namespaces"."id" BETWEEN 100001 AND 200000 AND "namespaces"."id" >= 100001 AND "namespaces"."id" < 101001
  )
  AND "project_statistics"."shared_runners_seconds" != 0;

ModifyTable on public.namespace_statistics  (cost=0.86..5672.88 rows=6 width=32) (actual time=88.383..88.384 rows=0 loops=1)
   Buffers: shared hit=2996 read=71 dirtied=12
   I/O Timings: read=80.028
   ->  Nested Loop  (cost=0.86..5672.88 rows=6 width=32) (actual time=10.935..75.227 rows=5 loops=1)
         Buffers: shared hit=2982 read=67 dirtied=7
         I/O Timings: read=67.303
         ->  Index Scan using namespaces_pkey on public.namespaces  (cost=0.43..1896.08 rows=960 width=10) (actual time=0.018..0.611 rows=937 loops=1)
               Index Cond: ((namespaces.id >= 100001) AND (namespaces.id <= 200000) AND (namespaces.id >= 100001) AND (namespaces.id < 101001))
               Buffers: shared hit=170
         ->  Index Scan using index_namespace_statistics_on_namespace_id on public.namespace_statistics  (cost=0.42..3.92 rows=1 width=14) (actual time=0.079..0.079 rows=0 loops=937)
               Index Cond: (namespace_statistics.namespace_id = namespaces.id)
               Filter: (namespace_statistics.shared_runners_seconds <> 0)
               Rows Removed by Filter: 0
               Buffers: shared hit=2812 read=67 dirtied=7
               I/O Timings: read=67.303
Time: 89.057 ms
  - planning: 0.612 ms
  - execution: 88.445 ms
    - I/O read: 80.028 ms
    - I/O write: 0.000 ms
Shared buffers:
  - hits: 2996 (~23.40 MiB) from the buffer pool
  - reads: 71 (~568.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 12 (~96.00 KiB)
  - writes: 0

ModifyTable on public.project_statistics  (cost=0.87..32719.46 rows=105 width=96) (actual time=3288.946..3288.946 rows=0 loops=1)
   Buffers: shared hit=3055 read=2827 dirtied=216
   I/O Timings: read=3201.249
   ->  Nested Loop  (cost=0.87..32719.46 rows=105 width=96) (actual time=246.081..3258.271 rows=11 loops=1)
         Buffers: shared hit=3009 read=2807 dirtied=200
         I/O Timings: read=3171.649
         ->  Index Scan using namespaces_pkey on public.namespaces  (cost=0.43..1896.08 rows=960 width=10) (actual time=0.018..2.231 rows=937 loops=1)
               Index Cond: ((namespaces.id >= 100001) AND (namespaces.id <= 200000) AND (namespaces.id >= 100001) AND (namespaces.id < 101001))
               Buffers: shared hit=170
         ->  Index Scan using index_project_statistics_on_namespace_id on public.project_statistics  (cost=0.43..32.10 rows=1 width=74) (actual time=3.370..3.473 rows=0 loops=937)
               Index Cond: (project_statistics.namespace_id = namespaces.id)
               Filter: (project_statistics.shared_runners_seconds <> 0)
               Rows Removed by Filter: 3
               Buffers: shared hit=2839 read=2807 dirtied=200
               I/O Timings: read=3171.649
Time: 3.290 s
  - planning: 0.573 ms
  - execution: 3.289 s
    - I/O read: 3.201 s
    - I/O write: 0.000 ms
Shared buffers:
  - hits: 3055 (~23.90 MiB) from the buffer pool
  - reads: 2827 (~22.10 MiB) from the OS file cache, including disk I/O
  - dirtied: 216 (~1.70 MiB)
  - writes: 0

4. Finally `Namespace.reset_ci_minutes_notifications!(namespaces)`

UPDATE "namespaces"
SET "last_ci_minutes_notification_at" = NULL, "last_ci_minutes_usage_notification_level" = NULL 
WHERE "namespaces"."id" BETWEEN 100001 AND 200000 AND "namespaces"."id" >= 100001 AND "namespaces"."id" < 101001;

ModifyTable on public.namespaces  (cost=0.43..1896.08 rows=960 width=348) (actual time=5323.079..5323.079 rows=0 loops=1)
   Buffers: shared hit=47399 read=4794 dirtied=3819
   I/O Timings: read=5069.302
   ->  Index Scan using namespaces_pkey on public.namespaces  (cost=0.43..1896.08 rows=960 width=348) (actual time=0.031..4.259 rows=937 loops=1)
         Index Cond: ((namespaces.id >= 100001) AND (namespaces.id <= 200000) AND (namespaces.id >= 100001) AND (namespaces.id < 101001))
         Buffers: shared hit=170
Time: 5.323 s
  - planning: 0.254 ms
  - execution: 5.323 s
    - I/O read: 5.069 s
    - I/O write: 0.000 ms
Shared buffers:
  - hits: 47399 (~370.30 MiB) from the buffer pool
  - reads: 4794 (~37.50 MiB) from the OS file cache, including disk I/O
  - dirtied: 3819 (~29.80 MiB)
  - writes: 0
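
As a rough back-of-the-envelope estimate, the execution times reported above can be added up per sub-batch of 1000 and multiplied by the 100 sub-batches in a range of 100,000 IDs. This is an extrapolation from single measurements that were dominated by I/O reads, so it assumes every sub-batch costs about the same; it is not a measured worker runtime.

```ruby
# Execution times (in seconds) taken from the plans above, one per query.
timings_seconds = {
  each_batch_boundary_select: 0.001718,
  recalculate_extra_minutes: 0.418164,
  reset_namespace_statistics: 0.088445,
  reset_project_statistics: 3.289,
  reset_notifications: 5.323
}

per_sub_batch = timings_seconds.values.sum
sub_batches = 100_000 / 1_000 # sub-batches per Ci::BatchResetMinutesWorker
per_worker = per_sub_batch * sub_batches

puts format('%.2f s per sub-batch, ~%.1f min per worker', per_sub_batch, per_worker / 60)
# 9.12 s per sub-batch, ~15.2 min per worker
```

Under these assumptions the two UPDATEs on project_statistics and on the notification columns account for most of the time.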

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
[-] Tested in all supported browsers
[-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

[-] Label as security and @ mention @gitlab-com/gl-security/appsec
[-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
[-] Security reports checked/validated by a reviewer from the AppSec team

Edited Apr 16, 2020 by Mayra Cabrera
