Ci::ArchiveTracesCronWorker does not scale for big backlogs

On 2020-12-14 there was a GCP outage that caused object storage to become inaccessible.

During this time, we (@cmiskell) noticed an increase in memory usage on redis-persistent:

[image: dashboard showing redis-persistent memory usage increase]

Looking at what was writing into redis, we noticed this was mostly coming from the trace endpoint:

[image: log visualization of writes into redis by endpoint]

Normally, when a CI job finishes, we schedule a Ci::ArchiveTraceWorker to store these traces in object storage and remove them from redis. However, since object storage was unavailable, this job failed during that time:

[image: failing Ci::ArchiveTraceWorker jobs (link to raw logs)]

The error is swallowed by Ci::ArchiveTraceService, which means these failed jobs were never retried with backoff later.
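A minimal sketch of this failure mode (the class and method names below are hypothetical stand-ins, not GitLab's actual code): when the service rescues the exception internally and only records it, the surrounding Sidekiq worker completes "successfully" and Sidekiq's retry-with-backoff machinery is never triggered.

```ruby
# Hypothetical sketch of the error-swallowing pattern. Because #execute
# rescues the failure and returns a result object instead of re-raising,
# no exception ever reaches the worker, so Sidekiq schedules no retry.
class ArchiveTraceServiceSketch
  Result = Struct.new(:success, :error_message)

  def execute(job)
    job.archive_trace! # raises if object storage is down
    Result.new(true, nil)
  rescue StandardError => e
    # Swallowed: logged/recorded, but never re-raised to the worker.
    Result.new(false, e.message)
  end
end

# Simulate a build whose trace archival fails because object storage is out.
FailingJob = Struct.new(:id) do
  def archive_trace!
    raise "object storage unavailable"
  end
end

result = ArchiveTraceServiceSketch.new.execute(FailingJob.new(1))
# result.success is false, but no exception escaped, so the worker "succeeded".
```

Re-raising inside the rescue block (after recording the error) would let Sidekiq's normal retry handling take over.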

We have Ci::ArchiveTracesCronWorker, which is supposed to deal with any leftover traces. But that worker iterates over all jobs with leftover traces in a single run, which means it hardly ever finishes:

[image: log visualization of Ci::ArchiveTracesCronWorker runs]
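The shape of the problem can be sketched as follows (this is an illustrative model, not the actual worker): all work happens inside one `#perform` call with no checkpointing, so if the run is interrupted partway through, everything not yet processed waits for the next full pass.

```ruby
require "set"

# Illustrative model of a cron worker that walks the whole backlog in one
# run. The `budget` argument simulates the worker being killed or timing
# out after processing only part of the backlog.
class ArchiveTracesCronWorkerSketch
  def initialize(stale_build_ids)
    @stale_build_ids = stale_build_ids
    @archived = Set.new
  end

  attr_reader :archived

  def perform(budget)
    @stale_build_ids.each do |id|
      # Simulate the single long run dying mid-way through the backlog.
      return :interrupted if @archived.size >= budget
      @archived << id
    end
    :done
  end
end

worker = ArchiveTracesCronWorkerSketch.new((1..26_000).to_a)
status = worker.perform(5_000)
# status is :interrupted; 21,000 builds remain unarchived with no record
# of progress, so the next run starts over from the same full backlog.
```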

At the time of writing, there were ~26,000 builds with "stale traces":

```sql
gitlabhq_production=> SELECT COUNT("ci_builds".*) FROM "ci_builds" WHERE "ci_builds"."type" = 'Ci::Build' AND (EXISTS (SELECT 1 FROM "ci_build_trace_chunks" WHERE (ci_builds.id = ci_build_trace_chunks.build_id))) AND ("ci_builds"."status" IN ('success','failed','canceled')) AND ci_builds.finished_at < '2020-12-16 00:00:00';
 count
-------
 26811
(1 row)
```

Most of these seem to be from around the time of the outage:

```sql
gitlabhq_production=> SELECT COUNT("ci_builds".*) FROM "ci_builds" WHERE "ci_builds"."type" = 'Ci::Build' AND (EXISTS (SELECT 1 FROM "ci_build_trace_chunks" WHERE (ci_builds.id = ci_build_trace_chunks.build_id))) AND ("ci_builds"."status" IN ('success','failed','canceled')) AND ci_builds.finished_at BETWEEN '2020-12-14 11:00:00' AND '2020-12-14 13:00:00';
 count
-------
 26004
(1 row)
```

Proposal

  1. Let Ci::ArchiveTraceService raise its errors so the jobs can be retried. We could also increase the number of times this job is retried so it backs off for long enough. Looking at the table in the Sidekiq documentation, setting the number of retries to 8 would have made all jobs outlast this particular GCP outage.

  2. Make Ci::ArchiveTracesCronWorker stop processing all rows inside a single worker. Perhaps we could use another LimitedCapacity::Worker for this? Or we could use the cron worker to schedule more ArchiveTraceWorker jobs? In either case, we'll need to keep track of the archiving state somehow.
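As a rough sanity check on proposal 1 (assuming Sidekiq's documented default backoff of approximately `(retry_count ** 4) + 15` seconds, ignoring the small random jitter Sidekiq adds on top), we can total up how long 8 retries keep a failing job alive:

```ruby
# Approximation of Sidekiq's default retry delay for a given retry count.
# Assumption: (count ** 4) + 15 seconds; the random jitter term is omitted.
def sidekiq_delay(retry_count)
  (retry_count ** 4) + 15
end

# Cumulative deterministic delay across 8 retries (counts 0..7).
cumulative = (0...8).sum { |count| sidekiq_delay(count) }

puts cumulative                    # 4796 seconds
puts (cumulative / 60.0).round(1)  # ~79.9 minutes between first failure and last retry
```

So 8 retries span roughly 80 minutes of deterministic delay (more with jitter), and since jobs started failing at various points during the outage window, each job only needs to back off past the remainder of the outage, not its full duration.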