Avoid conflicts between ArchiveTracesCronWorker and ArchiveTraceWorker

What does this MR do?

ArchiveTraceWorker runs after a pipeline job finishes, as part of the generic job lifecycle. It archives the live trace and stores the trace data in permanent storage. This process might fail when external services are not operational, for example when S3 has an incident. In that case the live trace stays intact, and ArchiveTraceWorker will not attempt to archive it again.
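
The failure mode described above can be illustrated with a minimal plain-Ruby sketch. Everything here is hypothetical: `archive_trace`, `StorageUnavailable`, and the hash-based build are stand-ins for illustration, not GitLab's actual classes or API.

```ruby
# Hypothetical error raised when the permanent storage (for example S3) is down.
class StorageUnavailable < StandardError; end

# Tries to move a live trace into permanent storage.
# `store` is any callable that persists the data and may raise StorageUnavailable.
# On failure the build is left untouched, so the live trace stays intact.
def archive_trace(build, store)
  store.call(build[:live_trace])
  build[:live_trace] = nil
  build[:archived] = true
  true
rescue StorageUnavailable
  false
end
```

When the store raises, nothing retries the archive for that build, which is the gap the cron worker is meant to cover.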

ArchiveTracesCronWorker runs periodically as a cron worker for archiving live traces. Its main purpose is to rescue stale live traces that ArchiveTraceWorker could not archive. It runs once per hour so that no unarchived traces are left behind.

The problem is that ArchiveTraceWorker and ArchiveTracesCronWorker could run simultaneously and cause a race condition. We suspect that this race condition causes potential data loss in production. We avoid it by explicitly targeting only stale live traces in ArchiveTracesCronWorker.
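
As a rough illustration of the change in selection logic, here is a plain-Ruby sketch that mirrors the two filters visible in the query plans in this description (a finished status before, a `finished_at` cutoff after). The `Build` struct, both method names, and especially `STALE_AFTER_SECONDS` are assumptions made for the example; the description does not state the real threshold, and this is not the actual ActiveRecord scope.

```ruby
# Hypothetical, simplified stand-in for a CI build record; not GitLab's model.
Build = Struct.new(:id, :status, :finished_at, :has_live_trace, keyword_init: true)

# Assumed staleness threshold, for illustration only.
STALE_AFTER_SECONDS = 12 * 60 * 60

FINISHED_STATUSES = %w[success failed canceled].freeze

# Before: every finished build with a live trace is selected, including one
# that finished a moment ago and may still be handled by the per-job worker.
def finished_with_live_trace(builds)
  builds.select { |b| b.has_live_trace && FINISHED_STATUSES.include?(b.status) }
end

# After: only builds that finished before the cutoff are selected, so the
# cron worker no longer competes for freshly finished builds.
def stale_with_live_trace(builds, now: Time.now)
  cutoff = now - STALE_AFTER_SECONDS
  builds.select { |b| b.has_live_trace && b.finished_at && b.finished_at < cutoff }
end
```

With a build that finished one minute ago and another that finished 13 hours ago, the first method returns both while the second returns only the older one, which is the separation between the two workers that this MR aims for.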

Query Plan for Ci::Build.with_stale_live_trace.find_each(batch_size: 100)

Before

 Limit  (cost=6718.91..6719.16 rows=100 width=1388) (actual time=10.097..10.123 rows=100 loops=1)
   Buffers: shared hit=3671
   ->  Sort  (cost=6718.91..6720.77 rows=742 width=1388) (actual time=10.096..10.113 rows=100 loops=1)
         Sort Key: ci_builds.id
         Sort Method: top-N heapsort  Memory: 172kB
         Buffers: shared hit=3671
         ->  Nested Loop  (cost=2243.47..6690.56 rows=742 width=1388) (actual time=3.255..9.283 rows=615 loops=1)
               Buffers: shared hit=3668
               ->  HashAggregate  (cost=2242.90..2252.54 rows=964 width=4) (actual time=3.229..3.327 rows=622 loops=1)
                     Group Key: ci_build_trace_chunks.build_id
                     Buffers: shared hit=555
                     ->  Seq Scan on public.ci_build_trace_chunks  (cost=0.00..2238.32 rows=1832 width=4) (actual time=0.007..2.786 rows=1179 loops=1)
                           Buffers: shared hit=555
               ->  Index Scan using ci_builds_pkey on public.ci_builds  (cost=0.57..4.59 rows=1 width=1388) (actual time=0.009..0.009 rows=1 loops=622)
                     Index Cond: (ci_builds.id = ci_build_trace_chunks.build_id)
                     Filter: (((ci_builds.type)::text = 'Ci::Build'::text) AND ((ci_builds.status)::text = ANY ('{success,failed,canceled}'::text[])))
                     Rows Removed by Filter: 0
                     Buffers: shared hit=3113
 Planning time: 7.027 ms
 Execution time: 10.188 ms
 Total Cost: 6720.77
 Buffers Hit: 3671
 Buffers Written: 0
 Buffers Read: 0

After

 Limit  (cost=6717.63..6717.88 rows=100 width=1388) (actual time=9.521..9.552 rows=100 loops=1)
   Buffers: shared hit=3671
   ->  Sort  (cost=6717.63..6719.48 rows=740 width=1388) (actual time=9.519..9.536 rows=100 loops=1)
         Sort Key: ci_builds.id
         Sort Method: top-N heapsort  Memory: 172kB
         Buffers: shared hit=3671
         ->  Nested Loop  (cost=2243.47..6689.35 rows=740 width=1388) (actual time=3.050..8.745 rows=615 loops=1)
               Buffers: shared hit=3668
               ->  HashAggregate  (cost=2242.90..2252.54 rows=964 width=4) (actual time=3.020..3.141 rows=622 loops=1)
                     Group Key: ci_build_trace_chunks.build_id
                     Buffers: shared hit=555
                     ->  Seq Scan on public.ci_build_trace_chunks  (cost=0.00..2238.32 rows=1832 width=4) (actual time=0.008..2.595 rows=1179 loops=1)
                           Buffers: shared hit=555
               ->  Index Scan using ci_builds_pkey on public.ci_builds  (cost=0.57..4.59 rows=1 width=1388) (actual time=0.008..0.009 rows=1 loops=622)
                     Index Cond: (ci_builds.id = ci_build_trace_chunks.build_id)
                     Filter: ((ci_builds.finished_at < '2019-08-01 15:46:08.23754'::timestamp without time zone) AND ((ci_builds.type)::text = 'Ci::Build'::text))
                     Rows Removed by Filter: 0
                     Buffers: shared hit=3113
 Planning time: 7.061 ms
 Execution time: 9.619 ms
 Total Cost: 6719.48
 Buffers Hit: 3671
 Buffers Written: 0
 Buffers Read: 0

Does this MR meet the acceptance criteria?

Conformity

Performance and testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team
Edited by Shinya Maeda