
Restore optimized newly expired artifacts removal

drew stachon requested to merge expired-artifact-fast-removal into master

What does this MR do and why?

This merge request restores the performant Ci::JobArtifact removal process that was introduced in !72406 (merged). That code removes expired Ci::JobArtifact records without referencing any other tables, and is therefore able to remove them at a much faster pace.

Prior to gitlab-com/gl-infra/production#5952 (closed), we were expiring artifacts at a rate of 1-3k per execution. In this merge request, I've added a feature flag ci_artifact_fast_removal_large_loop_limit that switches the limit from 1_000 to 10_000 records per execution of the worker.

In practice, these limits mean:

  • 1_000: Continue expiring artifacts in a performant fashion, but slow down to a value we know to be well within safe operational bounds.
  • 10_000: Expire all newly created artifacts as they expire, but we're setting a speed limit at 10,000 records that we don't expect to hit. If we see a big influx and decide we need to slow down to vacuum the table, this will prevent dead tuples from accumulating at an unanticipated rate.
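The flag-controlled switch between the two limits can be sketched as follows. This is an illustrative sketch, not the actual worker code: the constant and method names are assumptions, and the boolean argument stands in for GitLab's `Feature.enabled?` check on ci_artifact_fast_removal_large_loop_limit.

```ruby
# Illustrative sketch: the worker picks its per-execution record limit
# based on whether the large-loop-limit feature flag is enabled.
SMALL_LOOP_LIMIT = 1_000
LARGE_LOOP_LIMIT = 10_000

# `large_limit_enabled` stands in for
# Feature.enabled?(:ci_artifact_fast_removal_large_loop_limit)
def loop_limit(large_limit_enabled)
  large_limit_enabled ? LARGE_LOOP_LIMIT : SMALL_LOOP_LIMIT
end

puts loop_limit(false) # => 1000, the safe default while the flag is off
puts loop_limit(true)  # => 10000, the speed limit once the flag is on
```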

Note: Since the worker has been turned off for a few weeks, we expect to see the 10_000 record limit hit in the first executions of the worker. This should drop off and return to 1-3k per execution.

How to set up and validate locally

The behavior we want is laid out in the modified spec files. Read and run them to see the intended behavior.

Unfortunately the real spirit of what we're achieving here is not demonstrable outside of a production environment, since it involves throttling the write speed to the production ci_job_artifacts table.

Queries and plans

Here's the full list of queries made by an execution of this worker, with a query plan below for the query that we run on a loop continuously to load records from ci_job_artifacts. For context, we did this in a sustainable way before and we don't expect performance issues from this code.

SELECT "ci_job_artifacts"."project_id", "ci_job_artifacts"."file_type", "ci_job_artifacts"."size", "ci_job_artifacts"."created_at", "ci_job_artifacts"."updated_at", "ci_job_artifacts"."expire_at", "ci_job_artifacts"."file", "ci_job_artifacts"."file_store", "ci_job_artifacts"."file_sha256", "ci_job_artifacts"."file_format", "ci_job_artifacts"."file_location", "ci_job_artifacts"."id", "ci_job_artifacts"."job_id", "ci_job_artifacts"."locked" FROM "ci_job_artifacts" WHERE "ci_job_artifacts"."expire_at" < '2021-12-10 02:58:18.560624' AND "ci_job_artifacts"."locked" = 0 LIMIT 100

SELECT "projects"."id", "projects"."name", "projects"."path", "projects"."description", "projects"."created_at", "projects"."updated_at", "projects"."creator_id", "projects"."namespace_id", "projects"."last_activity_at", "projects"."import_url", "projects"."visibility_level", "projects"."archived", "projects"."avatar", "projects"."merge_requests_template", "projects"."star_count", "projects"."merge_requests_rebase_enabled", "projects"."import_type", "projects"."import_source", "projects"."approvals_before_merge", "projects"."reset_approvals_on_push", "projects"."merge_requests_ff_only_enabled", "projects"."issues_template", "projects"."mirror", "projects"."mirror_user_id", "projects"."shared_runners_enabled", "projects"."runners_token", "projects"."build_coverage_regex", "projects"."build_allow_git_fetch", "projects"."build_timeout", "projects"."mirror_trigger_builds", "projects"."pending_delete", "projects"."public_builds", "projects"."last_repository_check_failed", "projects"."last_repository_check_at", "projects"."only_allow_merge_if_pipeline_succeeds", "projects"."has_external_issue_tracker", "projects"."repository_storage", "projects"."repository_read_only", "projects"."request_access_enabled", "projects"."has_external_wiki", "projects"."ci_config_path", "projects"."lfs_enabled", "projects"."description_html", "projects"."only_allow_merge_if_all_discussions_are_resolved", "projects"."repository_size_limit", "projects"."printing_merge_request_link_enabled", "projects"."auto_cancel_pending_pipelines", "projects"."service_desk_enabled", "projects"."cached_markdown_version", "projects"."delete_error", "projects"."last_repository_updated_at", "projects"."disable_overriding_approvers_per_merge_request", "projects"."storage_version", "projects"."resolve_outdated_diff_discussions", "projects"."remote_mirror_available_overridden", "projects"."only_mirror_protected_branches", "projects"."pull_mirror_available_overridden", "projects"."jobs_cache_index", 
"projects"."external_authorization_classification_label", "projects"."mirror_overwrites_diverged_branches", "projects"."pages_https_only", "projects"."external_webhook_token", "projects"."packages_enabled", "projects"."merge_requests_author_approval", "projects"."pool_repository_id", "projects"."runners_token_encrypted", "projects"."bfg_object_map", "projects"."detected_repository_languages", "projects"."merge_requests_disable_committers_approval", "projects"."require_password_to_approve", "projects"."emails_disabled", "projects"."max_pages_size", "projects"."max_artifacts_size", "projects"."remove_source_branch_after_merge", "projects"."marked_for_deletion_at", "projects"."marked_for_deletion_by_user_id", "projects"."autoclose_referenced_issues", "projects"."suggestion_commit_message", "projects"."project_namespace_id" FROM "projects" WHERE "projects"."id" = 78

SELECT "routes".* FROM "routes" WHERE "routes"."source_type" = 'Project' AND "routes"."source_id" = 78 

SELECT "project_statistics".* FROM "project_statistics" WHERE "project_statistics"."project_id" = 78

DELETE FROM "security_findings" WHERE "security_findings"."id" IN (SELECT "security_findings"."id" FROM "security_findings" INNER JOIN "security_scans" ON "security_scans"."id" = "security_findings"."scan_id" WHERE "security_scans"."build_id" = 108)

SAVEPOINT active_record_2

INSERT INTO "ci_deleted_objects" ("file_store","store_dir","file","pick_up_at") VALUES (1, '34/9c/349c41201b62db851192665c504b350ff98c6b45fb62a8a2161f78b6534d8de9/2021_12_10/108/1190', 'ci_build_artifacts.zip', '2021-12-09 00:00:00') ON CONFLICT  DO NOTHING RETURNING "id"

DELETE FROM "ci_job_artifacts" WHERE "ci_job_artifacts"."id" = 1190

RELEASE SAVEPOINT active_record_2

SELECT 1 AS one FROM "geo_nodes" LIMIT 1

The queries above make up one iteration, run when there are expired artifacts with associated files in Object Storage. After that, we start from the top with:

SELECT "ci_job_artifacts"."project_id", ... WHERE "ci_job_artifacts"."expire_at" < '2021-12-10 02:58:18.560624' AND "ci_job_artifacts"."locked" = 0 LIMIT 100

either 9 or 99 more times.
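The overall loop can be sketched as follows. This is a simplified in-memory model, not the actual service code: the method name and batch bookkeeping are assumptions. The `shift(BATCH_SIZE)` call stands in for the `LIMIT 100` SELECT above, and each removed record corresponds to the savepoint block that inserts into ci_deleted_objects and deletes from ci_job_artifacts.

```ruby
# Illustrative sketch: remove expired artifacts in batches of 100 until
# the per-execution loop limit is reached or no expired records remain.
BATCH_SIZE = 100

def destroy_expired(artifacts, loop_limit)
  removed = 0
  while removed < loop_limit
    # Stands in for: SELECT ... WHERE expire_at < now AND locked = 0 LIMIT 100
    batch = artifacts.shift(BATCH_SIZE)
    break if batch.empty?
    # Each record: INSERT INTO ci_deleted_objects, then DELETE the artifact
    removed += batch.size
  end
  removed
end

# With loop_limit = 1_000 the batch SELECT fires at most 10 times per
# execution; with loop_limit = 10_000 it fires at most 100 times.
puts destroy_expired(Array.new(2_500), 1_000) # => 1000
```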

Fetching expired unlocked artifacts

Once per loop iteration, we execute the following query. This means it fires either 10 or 100 times per execution of the worker, which runs every seven minutes.

SELECT "ci_job_artifacts"."project_id", "ci_job_artifacts"."file_type", "ci_job_artifacts"."size", "ci_job_artifacts"."created_at", "ci_job_artifacts"."updated_at", "ci_job_artifacts"."expire_at", "ci_job_artifacts"."file", "ci_job_artifacts"."file_store", "ci_job_artifacts"."file_sha256", "ci_job_artifacts"."file_format", "ci_job_artifacts"."file_location", "ci_job_artifacts"."id", "ci_job_artifacts"."job_id", "ci_job_artifacts"."locked" FROM "ci_job_artifacts" WHERE "ci_job_artifacts"."expire_at" < '2021-12-10 02:43:56.582765' AND "ci_job_artifacts"."locked" = 0 LIMIT 100

Execution plan:

 Limit  (cost=0.57..103.82 rows=100 width=140) (actual time=6.241..56.041 rows=100 loops=1)
   Buffers: shared hit=4 read=40 dirtied=5
   I/O Timings: read=54.185 write=0.000
   ->  Index Scan using ci_job_artifacts_expire_at_unlocked_idx on public.ci_job_artifacts  (cost=0.57..27576730.60 rows=26708556 width=140) (actual time=6.239..56.013 rows=100 loops=1)
         Index Cond: (ci_job_artifacts.expire_at < '2021-12-10 02:43:56.582765+00'::timestamp with time zone)
         Buffers: shared hit=4 read=40 dirtied=5
         I/O Timings: read=54.185 write=0.000

Time: 60.483 ms
  - planning: 4.370 ms
  - execution: 56.113 ms
    - I/O read: 54.185 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 4 (~32.00 KiB) from the buffer pool
  - reads: 40 (~320.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 5 (~40.00 KiB)
  - writes: 0

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Deployment:

  1. Merge this code, deploy to production
  2. Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!1393 (merged) and deploy to production workers
  3. Turn on the flag tracked in https://gitlab.com/gitlab-org/gitlab/-/issues/348786 to re-enable the DestroyAllExpiredService #348786 (closed)
  4. Turn ci_destroy_unlocked_job_artifacts on to use the more efficient expiration logic on ci_job_artifact records with known locked values.
  5. Observe worker execution with SMALL_LOOP_LIMIT: Kibana, Thanos
  6. Turn on ci_artifact_fast_removal_large_loop_limit Rollout: #347151 (closed)
  7. Observe worker execution with LARGE_LOOP_LIMIT: Kibana, Thanos
  8. Move on to merging and deploying !76509 (merged)

Note: The Thanos links are to the incident time window for reference. To observe current behavior, update the time window.
