Track event type that led to queueing online GC tasks

Context

Online GC, as described in the specification, operates on top of two database tables that act as review queues: gc_manifest_review_queue and gc_blob_review_queue, one for manifests and another for blobs, respectively.

There are multiple API events that may lead to dangling artifacts (manifests and blobs), and therefore a review task is queued in response to each of these events.

For example, a manifest may become dangling when a tag is deleted through the API. In this case, a task is queued to check that the manifest the tag was pointing to still has at least one other tag (or another manifest) referencing it; otherwise, it should be deleted.
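The tag deletion example can be sketched in Go as follows. This is only an illustration under simplifying assumptions: the `refs` type, its fields, and the `deleteTag` and `isDangling` helpers are hypothetical in-memory stand-ins, not the registry's actual data model or review code.

```go
package main

import "fmt"

// refs is a hypothetical in-memory stand-in for the registry database.
type refs struct {
	tags    map[string]string   // tag name -> manifest digest
	parents map[string][]string // manifest digest -> manifests referencing it
}

// deleteTag removes a tag and returns the digest it pointed to, so that
// the caller can queue that manifest for review.
func (r *refs) deleteTag(name string) (digest string, ok bool) {
	digest, ok = r.tags[name]
	delete(r.tags, name)
	return digest, ok
}

// isDangling reports whether no tag and no other manifest references
// the given manifest any more.
func (r *refs) isDangling(digest string) bool {
	for _, d := range r.tags {
		if d == digest {
			return false
		}
	}
	return len(r.parents[digest]) == 0
}

func main() {
	store := &refs{
		tags: map[string]string{
			"latest": "sha256:aaa",
			"v1":     "sha256:aaa",
			"old":    "sha256:bbb",
		},
		parents: map[string][]string{},
	}

	// Each deletion stands for a queued review task for the affected manifest.
	for _, tag := range []string{"v1", "old"} {
		digest, ok := store.deleteTag(tag)
		if !ok {
			continue
		}
		fmt.Printf("tag %q deleted: manifest %s dangling=%t\n",
			tag, digest, store.isDangling(digest))
	}
}
```

In this sketch, deleting `v1` leaves its manifest referenced by `latest`, so the review keeps it, while deleting `old` leaves its manifest with no references, so the review would delete it.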

Problem

Right now we have no visibility into which event led to a task being queued. Knowing which event queued a given task (and potentially caused a later artifact deletion) would make debugging and analysis easier, and would also let us collect additional metrics.

Solution

  1. Add a new event (text) column to the GC queue tables (release N).
  2. Update GC trigger functions (on the database) to fill this column when inserting (or updating on conflict*) rows in these tables (release N).
  3. Update GC workers (on the application) to log the type of event for each processed task (release N+1).
  4. Update integration tests in registry/datastore/gc_integration_test.go so that the value of the new event column is properly validated.
  5. Update registry_gc_runs_total Prometheus metric to include the event type of each task (release N+1). There are only 7 types of events (more on that later), so cardinality should not be a problem.
  6. Add a NOT NULL constraint to the new event column and update models accordingly.
  7. Expand Grafana dashboards to include metrics about the new dangling and event labels.

* As described in the specification, some GC triggers have an ON CONFLICT DO UPDATE clause. This is needed because different events may lead to multiple attempts to queue a task for the exact same manifest or blob. Therefore, we will also update the event of existing tasks in case of conflict. This guarantees that the value of event is the latest event that led to queueing a task and not the first one.
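The "latest event wins" behaviour on conflict can be modelled with a small Go sketch. It is an assumption-laden illustration, not the trigger code: the `queue` type, its string key, and the `enqueue` helper are hypothetical, and the real queue tables carry more columns than the event.

```go
package main

import "fmt"

// queue is a hypothetical model of a GC review queue: one task per
// artifact key, storing only the event that queued it.
type queue map[string]string

// enqueue mimics an insert with ON CONFLICT DO UPDATE: a new task is
// inserted, or the event of the existing task is overwritten.
// It reports whether an existing task was updated.
func (q queue) enqueue(key, event string) (updated bool) {
	_, updated = q[key]
	q[key] = event
	return updated
}

func main() {
	q := queue{}

	// Two different events queue a review for the same manifest.
	q.enqueue("manifest-1", "manifest_upload")
	updated := q.enqueue("manifest-1", "tag_delete")

	fmt.Printf("tasks=%d updated=%t event=%s\n", len(q), updated, q["manifest-1"])
}
```

After both calls there is still a single task for `manifest-1`, and its event is the most recent one, `tag_delete`.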

Events

Manifests

Here we'll list all GC functions/triggers that are responsible for inserting/updating rows on the gc_manifest_review_queue table (all documented in the specification), as well as the corresponding value to be used for the new event column:

| DB Function | Triggering Event Identifier |
| --- | --- |
| gc_track_manifest_uploads | manifest_upload |
| gc_track_deleted_manifest_lists | manifest_list_delete |
| gc_track_deleted_tags | tag_delete |
| gc_track_switched_tags | tag_switch |

Blobs

Same but for the gc_blob_review_queue table:

| DB Function | Triggering Event Identifier |
| --- | --- |
| gc_track_blob_uploads | blob_upload |
| gc_track_deleted_manifests | manifest_delete |
| gc_track_deleted_layers | layer_delete |
Edited by João Pereira