Optimize CI::JobArtifact querying from ExpireBuildArtifactsWorker

drew stachon requested to merge throttle-artifact-expiration into master

What does this MR do and why?

This MR introduces three optimizations to the ExpireBuildArtifactsWorker, along with extra feature flags for more granular control of it. In general, I recommend reading this MR commit by commit.

The first two commits introduce new feature flags that will allow us to re-enable the worker safely. At first, all the flags will be off.

  • ci_job_artifacts_backlog_work: a flag specifically for toggling the section of code that queries expired but not-yet-removed artifacts.
  • ci_job_artifacts_large_loop_limit: a flag that lets us start at a maximum of 10,000 expirations per worker execution (10% of the prior limit) and scale up to a 50,000 maximum (50% of the prior limit) if we find the performance to be acceptable. A sketch of how these flags might gate the worker follows this list.
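
For illustration, the flag-gating inside the worker might look roughly like this. This is a sketch only: the constant and method names are assumptions, not the literal diff; Feature.enabled? is GitLab's standard flag check.

# Sketch only: SMALL_LOOP_LIMIT, LARGE_LOOP_LIMIT, and #loop_limit are
# hypothetical names, not necessarily those used in the diff.
SMALL_LOOP_LIMIT = 10_000
LARGE_LOOP_LIMIT = 50_000

def loop_limit
  # ci_job_artifacts_large_loop_limit raises the per-execution cap from
  # 10% to 50% of the prior 100,000 limit.
  if Feature.enabled?(:ci_job_artifacts_large_loop_limit)
    LARGE_LOOP_LIMIT
  else
    SMALL_LOOP_LIMIT
  end
end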

For the query optimizations, see the long commit messages on the first and second commits that introduce them. The descriptions there are good, and the diffs will be far more comprehensible.

By introducing these flags, we can re-enable ci_destroy_unlocked_job_artifacts and stop the accumulation of new job artifacts that cannot be expired via the older, less performant three-table-join process. Separating this improvement from the backlog-focused changes is a useful step beyond our first attempt.

How to set up and validate locally

The work being done here cannot be fully validated locally. I rebuilt the queries, and the application logic that sends them, with the advice and consent of @msmiley, who was very helpful in explaining what was bad before and how to avoid creating similar inefficiencies.

Once we've restarted the ExpireBuildArtifactsWorker and are expiring newly created artifacts efficiently going forward (we were doing this before the incident), we'll start working on the backlog again. We have two feature flags that we'll use in tandem to roll out this change in production:

  • ci_job_artifacts_backlog_work Enables the code that queries the artifacts backlog, i.e. those ci_job_artifact records where ci_job_artifacts.locked = unknown.

  • ci_job_artifacts_large_loop_limit Changes the loop limit that caps the number of records our worker will touch in a single execution from 10,000 to 50,000. (A console sketch for both flags follows this list.)
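
For reference, these can be toggled from a Rails console with GitLab's standard Feature API (in practice the rollout would typically go through our usual ChatOps process, but the effect is the same):

# Enable backlog querying first and observe query performance.
Feature.enable(:ci_job_artifacts_backlog_work)

# Only once that looks healthy, raise the loop limit to 50,000.
Feature.enable(:ci_job_artifacts_large_loop_limit)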

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

  • I have evaluated the MR acceptance checklist for this MR.
  • Maintainer: Please coordinate synchronously with @drew about merging this MR. It should be a carefully monitored deployment.

Original commit log:

8dbb38a4

Add order and limit to JobArtifact query

3ead7cab

Add reusable query to DestroyAllExpiredService

Reuse the same query to ensure that we apply ORDER and LIMIT

Before:
UPDATE "ci_job_artifacts" SET "locked" = 2
  WHERE "ci_job_artifacts"."job_id" IN (1, 2, 3)

After:
UPDATE "ci_job_artifacts" SET "locked" = 2
  WHERE "ci_job_artifacts"."id"
    IN (
      SELECT "ci_job_artifacts"."id" FROM "ci_job_artifacts"
        WHERE "ci_job_artifacts"."job_id"
          IN (1, 2, 3)
        LIMIT 100
    )
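
In ActiveRecord terms, the fix amounts to pointing the UPDATE at a limited subselect rather than at the raw job_id list. A rough sketch of the pattern, where expired_artifacts stands in for the actual reusable relation (names here are illustrative, not the literal diff):

# The reusable relation is built once and shared, so every statement
# derived from it inherits the same ORDER and LIMIT.
expired_artifacts = Ci::JobArtifact.where(job_id: job_ids).limit(100)

# Generates WHERE id IN (SELECT id FROM ... LIMIT 100), as shown above.
Ci::JobArtifact.where(id: expired_artifacts.select(:id)).update_all(locked: 2)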

438860cc

Granular logging in ExpireBuildArtifactsWorker

fa20e02f

Destroy unlocked job artifacts without updating


By updating rows in ci_job_artifacts only to delete them on the next
loop cycle, we create twice as many dead tuples as necessary in the
table. Artifacts locked by ci_pipelines.locked will be updated and
left in place, but the others will get passed directly into the
DestroyBatchService to be removed. The dead tuples created by the
update are not necessary.
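
A hypothetical sketch of the resulting flow (pipeline_locked? and the DestroyBatchService signature are assumptions here, not the literal diff):

# Split the batch: artifacts held by a locked pipeline are only marked,
# everything else is destroyed immediately.
locked, unlocked = artifacts.partition { |artifact| artifact.pipeline_locked? }

# Artifacts held by a locked pipeline are updated in place, as before
# (the value mirrors the UPDATE shown earlier in this MR).
Ci::JobArtifact.where(id: locked.map(&:id)).update_all(locked: 2)

# The rest skip the intermediate UPDATE, and its extra dead tuple, and
# go straight to removal.
DestroyBatchService.new(Ci::JobArtifact.where(id: unlocked.map(&:id))).execute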

8c4ac42d

Optimize Ci::JobArtifact query


First, we need to remove the DISTINCT clause from this query. Using
it forced Postgres to de-duplicate the whole table before applying
the limit, which is bad. Also, using DISTINCT implicitly queried for
the lowest job_id values first, which created an enormous pile of
dead tuples at the end of the index that we were explicitly starting
our query from. This is catastrophic for index performance on
replicas, but not on the primary, which can use LP_DEAD hints to skip
dead index tuples. On replicas, performance remains degraded until we
VACUUM.

Second, we apply an explicit order by expiration to prevent us from
accidentally querying rows in an order that correlates with job_id.
By ensuring that we query rows that are generally well-distributed
across the table, we prevent large blocks of dead index tuples that
need to be scanned before we can come up with 100 live tuples to
return.
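
In scope terms, the shape of the change is roughly the following, assuming the scope filters on the expire_at column of ci_job_artifacts (the exact scope in the diff may differ):

# Before (approximate): DISTINCT forced de-duplication of the whole set
# before LIMIT and implicitly favored the lowest job_id values.
Ci::JobArtifact.where('expire_at < ?', Time.current).distinct.limit(100)

# After (approximate): no DISTINCT, and an explicit ORDER BY expiration
# so each batch is well-distributed across the table.
Ci::JobArtifact.where('expire_at < ?', Time.current).order(:expire_at).limit(100)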
