Ci::PipelineSuccessUnlockArtifactsWorker times out updating ci_job_artifacts table

From https://log.gprd.gitlab.net/goto/609ba7a0-55c7-11ed-8d37-e9a2f393ea2a, we see that Ci::PipelineSuccessUnlockArtifactsWorker occasionally hits a statement timeout updating the ci_job_artifacts table. This happened 82 times in the past week:

The SQL query:

UPDATE "ci_job_artifacts" SET "locked" = $1 WHERE "ci_job_artifacts"."job_id" IN (SELECT "ci_builds"."id" FROM "ci_builds" WHERE "ci_builds"."type" = $2 AND "ci_builds"."commit_id" IN ($3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63, $64, $65, $66, $67, $68, $69, $70, $71, $72, $73, $74, $75, $76, $77, $78, $79, $80, $81, $82, $83, $84, $85, $86, $87, $88, $89, $90, $91, $92, $93, $94, $95, $96, $97, $98, $99, $100, $101, $102)) RETURNING ("ci_job_artifacts"."id")

As seen above, this query updates job artifact rows that belong to 102 pipelines (ci_build.commit_id). On large project with a lot of jobs and job artifacts such as gitlab-org/gitlab, this could mean updating more than 10000 rows in one go.

This query runs in the following transaction.

        ::Ci::Pipeline.transaction do
          unlocked_pipelines = unlock_pipelines(ci_ref, before_pipeline)
          unlocked_job_artifacts = unlock_job_artifacts(unlocked_pipelines) # query that times out

          results[:unlocked_pipeline_artifacts] += unlock_pipeline_artifacts(unlocked_pipelines)
        end

As a result of the timeout, the transaction would be rolled back.

Proposal

Instead of unlocking all pipelines that need to be unlocked and all job artifacts and all pipeline artifacts belonging to those pipelines in a single transaction, we could distribute the workload to unlocking individual pipeline by putting them into a queue.

The UnlockArtifactsService where everything currently happens will change to only select pipeline IDs that need to be unlocked and place these pipeline IDs into a queue.

A worker UnlockPipelineAndArtifactsWorker will pick up each pipeline ID individually and proceed to unlock the pipeline, its job artifacts and its pipeline artifacts.

This would result in smaller queries that operate on a single pipeline and a small subset of job artifacts and pipeline artifacts each time.

Proposed implementation

Refer to #406255 (closed)

Edited Apr 12, 2023 by Albert