Avoid conflicts between ArchiveTracesCronWorker and ArchiveTraceWorker (!31376) · Merge requests · GitLab.org / GitLab FOSS

Shinya Maeda requested to merge avoid-race-condition-of-archive-trace-cron-worker into master Aug 01, 2019

What does this MR do?

ArchiveTraceWorker runs after a pipeline job finishes, as part of the generic job lifecycle. It archives the live trace and stores the trace data in permanent storage. This process can fail when an external service is not operational, for example during an S3 incident. In that case the live trace stays intact, and archiving via ArchiveTraceWorker is not attempted again.

ArchiveTracesCronWorker runs periodically as a cron worker to archive live traces. Its main purpose is to rescue stale live traces that ArchiveTraceWorker could not archive. It runs once per hour so that no unarchived traces are left behind.

The problem is that ArchiveTraceWorker and ArchiveTracesCronWorker can run simultaneously and cause a race condition. We suspect that this race condition can cause production data loss. This MR avoids it by explicitly targeting only stale live traces in ArchiveTracesCronWorker.
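To illustrate the change in targeting, here is a minimal plain-Ruby sketch rather than the actual ActiveRecord scopes. The `Build` struct, the two helper methods, and the one-hour `STALE_AFTER_SECONDS` threshold are assumptions made for illustration; the MR description does not state the real threshold. The two filters mirror the conditions visible in the query plans in this description: status in success/failed/canceled before, `finished_at` older than a cutoff after.

```ruby
# Plain-Ruby stand-in for Ci::Build rows; not the real model.
Build = Struct.new(:id, :status, :finished_at, :live_trace, keyword_init: true)

# Assumption: illustrative threshold only, not taken from the MR.
STALE_AFTER_SECONDS = 60 * 60

FINISHED_STATUSES = %w[success failed canceled].freeze

# Before: every finished build that still has a live trace is a candidate,
# including one that ArchiveTraceWorker may be archiving at this moment.
def finished_with_live_trace(builds)
  builds.select { |b| b.live_trace && FINISHED_STATUSES.include?(b.status) }
end

# After: only builds that finished before the cutoff are candidates,
# so a build that just finished is left to ArchiveTraceWorker.
def with_stale_live_trace(builds, now: Time.now)
  cutoff = now - STALE_AFTER_SECONDS
  builds.select { |b| b.live_trace && b.finished_at && b.finished_at < cutoff }
end

now = Time.utc(2019, 8, 1, 16, 0, 0)
builds = [
  Build.new(id: 1, status: "success", finished_at: now - 10,   live_trace: true),
  Build.new(id: 2, status: "failed",  finished_at: now - 7200, live_trace: true),
  Build.new(id: 3, status: "success", finished_at: now - 7200, live_trace: false),
  Build.new(id: 4, status: "running", finished_at: nil,        live_trace: true)
]

puts finished_with_live_trace(builds).map(&:id).inspect
puts with_stale_live_trace(builds, now: now).map(&:id).inspect
```

In this sketch, build 1 finished ten seconds ago and is picked up only by the "before" filter, which is the overlap window where both workers could touch the same trace.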

Query Plan for `Ci::Build.with_stale_live_trace.find_each(batch_size: 100)`

Before

 Limit  (cost=6718.91..6719.16 rows=100 width=1388) (actual time=10.097..10.123 rows=100 loops=1)
   Buffers: shared hit=3671
   ->  Sort  (cost=6718.91..6720.77 rows=742 width=1388) (actual time=10.096..10.113 rows=100 loops=1)
         Sort Key: ci_builds.id
         Sort Method: top-N heapsort  Memory: 172kB
         Buffers: shared hit=3671
         ->  Nested Loop  (cost=2243.47..6690.56 rows=742 width=1388) (actual time=3.255..9.283 rows=615 loops=1)
               Buffers: shared hit=3668
               ->  HashAggregate  (cost=2242.90..2252.54 rows=964 width=4) (actual time=3.229..3.327 rows=622 loops=1)
                     Group Key: ci_build_trace_chunks.build_id
                     Buffers: shared hit=555
                     ->  Seq Scan on public.ci_build_trace_chunks  (cost=0.00..2238.32 rows=1832 width=4) (actual time=0.007..2.786 rows=1179 loops=1)
                           Buffers: shared hit=555
               ->  Index Scan using ci_builds_pkey on public.ci_builds  (cost=0.57..4.59 rows=1 width=1388) (actual time=0.009..0.009 rows=1 loops=622)
                     Index Cond: (ci_builds.id = ci_build_trace_chunks.build_id)
                     Filter: (((ci_builds.type)::text = 'Ci::Build'::text) AND ((ci_builds.status)::text = ANY ('{success,failed,canceled}'::text[])))
                     Rows Removed by Filter: 0
                     Buffers: shared hit=3113
 Planning time: 7.027 ms
 Execution time: 10.188 ms
 Total Cost: 6720.77
 Buffers Hit: 3671
 Buffers Written: 0
 Buffers Read: 0

After

 Limit  (cost=6717.63..6717.88 rows=100 width=1388) (actual time=9.521..9.552 rows=100 loops=1)
   Buffers: shared hit=3671
   ->  Sort  (cost=6717.63..6719.48 rows=740 width=1388) (actual time=9.519..9.536 rows=100 loops=1)
         Sort Key: ci_builds.id
         Sort Method: top-N heapsort  Memory: 172kB
         Buffers: shared hit=3671
         ->  Nested Loop  (cost=2243.47..6689.35 rows=740 width=1388) (actual time=3.050..8.745 rows=615 loops=1)
               Buffers: shared hit=3668
               ->  HashAggregate  (cost=2242.90..2252.54 rows=964 width=4) (actual time=3.020..3.141 rows=622 loops=1)
                     Group Key: ci_build_trace_chunks.build_id
                     Buffers: shared hit=555
                     ->  Seq Scan on public.ci_build_trace_chunks  (cost=0.00..2238.32 rows=1832 width=4) (actual time=0.008..2.595 rows=1179 loops=1)
                           Buffers: shared hit=555
               ->  Index Scan using ci_builds_pkey on public.ci_builds  (cost=0.57..4.59 rows=1 width=1388) (actual time=0.008..0.009 rows=1 loops=622)
                     Index Cond: (ci_builds.id = ci_build_trace_chunks.build_id)
                     Filter: ((ci_builds.finished_at < '2019-08-01 15:46:08.23754'::timestamp without time zone) AND ((ci_builds.type)::text = 'Ci::Build'::text))
                     Rows Removed by Filter: 0
                     Buffers: shared hit=3113
 Planning time: 7.061 ms
 Execution time: 9.619 ms
 Total Cost: 6719.48
 Buffers Hit: 3671
 Buffers Written: 0
 Buffers Read: 0
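
Both plans end in a sort on `ci_builds.id` followed by a limit of 100 rows, which is the shape of the first batch that `find_each(batch_size: 100)` fetches: rows in primary-key order, one batch at a time. The following plain-Ruby sketch shows that batching idea on an in-memory array. `batches_by_id` is a hypothetical helper written for this illustration, not a Rails API.

```ruby
# Walk rows in ascending id order, batch_size rows at a time.
# Each batch starts after the last id seen in the previous batch.
def batches_by_id(rows, batch_size:)
  batches = []
  last_id = nil
  loop do
    remaining = last_id ? rows.select { |r| r[:id] > last_id } : rows
    batch = remaining.sort_by { |r| r[:id] }.first(batch_size)
    break if batch.empty?

    batches << batch
    last_id = batch.last[:id]
  end
  batches
end

rows = [5, 3, 9, 1, 7].map { |id| { id: id } }
batches_by_id(rows, batch_size: 2).each do |batch|
  puts batch.map { |r| r[:id] }.inspect
end
```

The plans shown here correspond to the first batch only, so they carry no `id >` condition.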

Does this MR meet the acceptance criteria?

Conformity

Changelog entry for user-facing changes, or community contribution. Check the link for other scenarios.
[-] Documentation created/updated or follow-up review issue created
Code review guidelines
Merge request performance guidelines
Style guides
Database guides
Separation of EE specific content

Performance and testing

Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
[-] Tested in all supported browsers

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

[-] Label as security and @ mention @gitlab-com/gl-security/appsec
[-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
[-] Security reports checked/validated by a reviewer from the AppSec team

Edited Aug 20, 2019 by Shinya Maeda
