Skip to content

Introduce Ci::UpdateLockedUnknownArtifactsWorker

drew stachon requested to merge new-worker-for-artifacts-backlog into master

What does this MR do and why?

This merge request introduces a new worker and service specifically to slowly work through the backlog of ci_job_artifact records that are expired and unlocked in the ci_pipelines table, but are locked = unknown in the ci_job_artifacts table.

We have a separate worker and service to handle the removal of job artifact records that are directly marked as unlocked. So this worker only queries expired artifacts where locked = unknown, and uses the older, slower query to mark them as locked (to be unlocked by some future event) or discover that they are unlocked and remove them on the spot.

The backlog of unknown-status records does not grow so long as the ci_update_unlocked_job_artifacts feature flag is enabled, which has been the case most of the time for the past few months, and is the case right now. Under these conditions, we expect this worker to eventually get through all artifacts with an unknown locked status and start reporting 0s. At this point, with the ci_update_unlocked_job_artifacts flag removed and the new code path in production, we can delete this worker.

We introduce two feature flags to control the operation of this worker:

  • ci_job_artifacts_backlog_work: a flag placed specifically for turning on and off the section of code that queries expired but not-yet-removed artifacts.
  • ci_job_artifacts_backlog_large_loop_limit: a flag allowing us to start at a maximum of 10,000 (10% of prior limit) expirations per worker execution, and scale up to a 50,000 maximum (50% of prior limit) if we find the performance to be acceptable.

How to set up and validate locally

The work being done here cannot be fully validated locally. I rebuilt the queries and the application logic that sends them with the advice and consent of @msmiley who was very helpful in explaining what was bad before and how to avoid creating similar inefficiency.

Query execution summary

1. Fetch all Build ids from ci_job_artifact rows with locked: 2 (unknown)

# Ruby
unknown_status_build_ids = safely_ordered_ci_job_artifacts_locked_unknown_relation.pluck_job_id.uniq

def safely_ordered_ci_job_artifacts_locked_unknown_relation
  Ci::JobArtifact.expired_before(@start_at).artifact_unknown.limit(BATCH_SIZE).order_expired_asc
end
-- SQL
SELECT "ci_job_artifacts"."job_id" FROM "ci_job_artifacts" WHERE "ci_job_artifacts"."expire_at" < '2022-03-16 21:24:46.641892' AND "ci_job_artifacts"."locked" = 2 ORDER BY "ci_job_artifacts"."expire_at" ASC LIMIT 100

Query Plan: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/9287/commands/32814

This query, wrapped up in #safely_ordered_ci_job_artifacts_locked_unknown_relation, is used consistently throughout the service to ensure we make the same well-understood queries every time.

2. Select all build ids from unknown-locked-status list that are marked as locked in the ci_pipelines table

# Ruby
locked_pipeline_build_ids = ::Ci::Build.with_pipeline_locked_artifacts.id_in(unknown_status_build_ids).pluck_primary_key
-- SQL
SELECT "ci_builds"."id" FROM "ci_builds" INNER JOIN "ci_pipelines" "pipeline" ON "pipeline"."id" = "ci_builds"."commit_id" WHERE "ci_builds"."type" = 'Ci::Build' AND "pipeline"."locked" = 1 AND "ci_builds"."id" IN (4, 3)

Query Plan: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/9271/commands/32773

3. Update job artifacts to be locked in their own attribute as well as in the pipelines table

From a earlier version of this merge request, we removed the ORDER and LIMIT clause from the UPDATE query to improve performance.

expired_locked_unknown_artifacts.for_job_ids(build_ids).update_all(locked: locked_value)

I grabbed 100 build ids from Sisense to try to get a sense of the largest queries needing to be sent at one time. We'll never fetch more than 100 build ids at once.

UPDATE "ci_job_artifacts" SET "locked" = 1 WHERE "ci_job_artifacts"."id" IN (SELECT "ci_job_artifacts"."id" FROM "ci_job_artifacts" WHERE "ci_job_artifacts"."expire_at" < '2022-03-28 21:27:58.758459' AND "ci_job_artifacts"."locked" = 2 AND "ci_job_artifacts"."job_id" IN (2245335633,2245400669,2245355408,2245383733,2245429296,2245383705,2245291765,2245390864,2245383675,2245425944,2245426605,2245427956,2245379522,2245418110,2245428076,2245426427,2245399314,2245346201,2245418713,2245389227,2245391538,2245413515,2244897250,2245288103,2245383747,2245390862,2245414005,2242462052,2245383724,2245426438,2245192017,2245428446,2245383738,2245418455,2245413453,2245253692,2245430018,2245427087,2245212870,2245427300,2245428441,2245429641,2245426646,2245412098,2245383744,2245419267,2245426629,2245376735,2245160023,2245386784,2245423395,2245410776,2245387496,2245378629,2245294692,2244883987,2245316736,2245383730,2245429499,2245330785,2245431075,2245416367,2245382159,2245383712,2244923173,2245374348,2245366942,2245411787,2245429106,2245429112,2245399702,2245413951,2245400754,2245329055,2245383569,2245429165,2245414796,2245318365,2245292724,2245416886,2245427291,2245419109,2245406234,2245420603,2245196618,2245421345,2245402192,2245421043,2245409256,2245425861,2245395526,2245427104,2245429500,2245422726,2245384296,2245283787,2245430945,2245385628,2245421930,2245428434))

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/9463/commands/33561 (an improvement from the original query plan)

For monitoring this query once we turn the worker on, I'll be watching charts in Thanos of the the number of dead tuples on ci_job_artifacts to make sure we stay under 1m. Spikes into the hundreds of thousands are normal, and can be easily handled by our current autovacuum configuration. I'll also be monitoring the number of tuple reads on each of the table indexes to make sure we're not degrading performance elsewhere on the table.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by drew stachon

Merge request reports