Ci::ArchiveTracesCronWorker does not scale for big backlogs
On 2020-12-14 there was a GCP outage that caused object storage to become inaccessible.
During this time, we (@cmiskell) noticed an increase in memory usage on redis-persistent.
Looking at what was writing into Redis, we noticed this was mostly coming from the trace endpoint.
Normally, when a CI job finishes, we schedule a `Ci::ArchiveTraceWorker` to store these traces in object storage and remove them from Redis. However, since object storage was unavailable, this job kept failing during that time.
The error is swallowed by `Ci::ArchiveTraceService`, which means these failed jobs are not retried with backoff later.
We have `Ci::ArchiveTracesCronWorker`, which is supposed to deal with any leftover traces. But that worker iterates over all of the jobs with leftover traces in a single run, which means it hardly ever finishes.
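A minimal sketch (not GitLab's actual code) of why the swallowed error matters: if the worker rescues the exception, Sidekiq sees a successful job and never retries it with backoff, whereas letting the exception propagate makes the job fail and be retried.

```ruby
# Illustrative only: two ways a worker can react to an archival failure.

def archive_swallowing(archive)
  archive.call
  :archived
rescue StandardError
  # error swallowed: the Sidekiq job "succeeds" and is never retried
  :failed_silently
end

def archive_raising(archive)
  archive.call
  :archived
  # no rescue: the exception propagates, the Sidekiq job fails,
  # and Sidekiq retries it later with exponential backoff
end
```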
At the time of writing, there were ~26000 builds with "stale traces".
```sql
gitlabhq_production=> SELECT COUNT("ci_builds".*) FROM "ci_builds" WHERE "ci_builds"."type" = 'Ci::Build' AND (EXISTS (SELECT 1 FROM "ci_build_trace_chunks" WHERE (ci_builds.id = ci_build_trace_chunks.build_id))) AND ("ci_builds"."status" IN ('success','failed','canceled')) AND ci_builds.finished_at < '2020-12-16 00:00:00';
 count
-------
 26811
(1 row)
```
Most of these seem to be from around the time of the outage:
```sql
gitlabhq_production=> SELECT COUNT("ci_builds".*) FROM "ci_builds" WHERE "ci_builds"."type" = 'Ci::Build' AND (EXISTS (SELECT 1 FROM "ci_build_trace_chunks" WHERE (ci_builds.id = ci_build_trace_chunks.build_id))) AND ("ci_builds"."status" IN ('success','failed','canceled')) AND ci_builds.finished_at BETWEEN '2020-12-14 11:00:00' AND '2020-12-14 13:00:00';
 count
-------
 26004
(1 row)
```
## Proposal

- Let `Ci::ArchiveTraceService` raise its errors so the jobs can be retried. We could also raise the number of times this job is retried so it backs off long enough. Looking at the table in the Sidekiq documentation, setting the number of retries to 8 would have made all jobs outlast this particular GCP outage.
- Make `Ci::ArchiveTracesCronWorker` not process all rows inside a single worker. Perhaps we could use another `LimitedCapacity::Worker` for this? Or we could use that cron worker to schedule more `ArchiveTraceWorker` jobs? In either case, we'll need to keep track of the state of the archiving somehow.
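The second option could look roughly like the sketch below: the cron worker only fans out one per-build archive job at a time, in bounded batches, instead of archiving everything inline. `fan_out`, `BATCH_SIZE`, and the enqueue callback are hypothetical names for illustration, not GitLab's actual API.

```ruby
# Hypothetical sketch: schedule per-build archive jobs in bounded
# batches so a single cron run stays short even with a ~26k backlog.
BATCH_SIZE = 100

def fan_out(stale_build_ids, enqueue)
  scheduled = 0
  stale_build_ids.each_slice(BATCH_SIZE) do |batch|
    # in GitLab this could be a Sidekiq bulk enqueue; here we just
    # invoke a callback once per build id
    batch.each { |build_id| enqueue.call(build_id) }
    scheduled += batch.size
  end
  scheduled
end
```

Tracking which builds are already scheduled (the "state of the archiving") would still need to live somewhere, e.g. a column or a Redis set, so repeated cron runs don't double-enqueue.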