Expose row count of some container registry database tables to Prometheus
Context
The online garbage collector of the container registry uses two tables to queue and process tasks: gc_manifest_review_queue and gc_blob_review_queue.
It's important to monitor the size of these tables, as it tells us how many tasks are queued and how saturated the workers are. Right now, we do this monitoring on the application side by periodically running a SELECT count(*) FROM <table> query against the database and exporting the result to Prometheus, as shown in the following sample:
# HELP registry_gc_queue_size The size of online GC review queues.
# TYPE registry_gc_queue_size gauge
registry_gc_queue_size{queue="gc_manifest_review_queue"} 2
registry_gc_queue_size{queue="gc_blob_review_queue"} 50
Problem
Ideally, we'd like to run this query outside of the application. Doing it on the application side means running the query on every instance, which wastes resources in a large cluster because every instance asks for the same thing. We interleave these queries with a randomized jitter on a 10-minute cadence so we don't put too much stress on the DB, but the duplication is still unnecessary. On top of that, multiple instances report different row counts for the same queue, so there is no single accurate value.
Solution
Export a queue size metric for each one of these tables from the server (DB cluster) side. If possible, it would be good to keep the same metric name and type.
If we can have the exact row count (SELECT count(*) FROM <table>) for these tables, that would be good, but it's also fine if we can only get an approximation (e.g., reltuples).
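To illustrate the approximate option, here is a minimal sketch of what a DB-side exporter could do. It assumes the exporter can read pg_class in the registry database and that the queue table names are unique across schemas. The query constant and the render_queue_size function are hypothetical names for illustration, not part of any existing exporter. Keep in mind that reltuples is only refreshed by VACUUM, ANALYZE, and autovacuum, and that PostgreSQL 14 and later report -1 for tables that have never been analyzed, so the sketch clamps negative values to zero.

```python
# Hypothetical query a DB-side exporter could run once per scrape.
# reltuples is the planner's row estimate, not an exact count.
APPROX_QUEUE_SIZE_SQL = """
SELECT relname, reltuples
FROM pg_class
WHERE relkind = 'r'
  AND relname IN ('gc_manifest_review_queue', 'gc_blob_review_queue')
"""


def render_queue_size(rows):
    """Render (relname, reltuples) rows as Prometheus text exposition lines."""
    lines = [
        "# HELP registry_gc_queue_size The size of online GC review queues.",
        "# TYPE registry_gc_queue_size gauge",
    ]
    for relname, reltuples in rows:
        # Clamp negative estimates (never-analyzed tables) to zero.
        estimate = max(int(reltuples), 0)
        lines.append(f'registry_gc_queue_size{{queue="{relname}"}} {estimate}')
    return "\n".join(lines)
```

Feeding the function the rows returned by the query keeps the metric name and type identical to the current application-side gauge, so existing dashboards would not need to change.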
If exporting these metrics is cheap enough, it would also be good to know the size of a few other tables, namely:
repositories
manifests
blobs
It is important to note that the tables above, unlike the gc_*_queue tables, are partitioned (64 partitions), so a global row count might be challenging or undesirable for performance reasons. If this can be done, it might be better to use a more generic name for the gauge, e.g. registry_database_row_count{table="<name>"}. We can then expose these in Grafana.
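For the partitioned tables, one cheap possibility is to sum the per-partition reltuples estimates instead of counting rows. The sketch below assumes a single level of partitioning under each parent table and read access to pg_inherits and pg_class. The query constant and the render_row_counts function are hypothetical, and the result is an estimate with the same freshness caveats as any reltuples value.

```python
from collections import defaultdict

# Hypothetical query: one row per partition, labeled with its parent table.
# Assumes a single level of partitioning under each parent.
PARTITION_ESTIMATES_SQL = """
SELECT parent.relname AS table_name, child.reltuples
FROM pg_inherits
JOIN pg_class parent ON parent.oid = pg_inherits.inhparent
JOIN pg_class child ON child.oid = pg_inherits.inhrelid
WHERE parent.relname IN ('repositories', 'manifests', 'blobs')
"""


def render_row_counts(rows):
    """Sum per-partition estimates and render one gauge sample per table."""
    totals = defaultdict(int)
    for table_name, reltuples in rows:
        # Negative values mean "unknown" (never analyzed); count them as zero.
        totals[table_name] += max(int(reltuples), 0)
    lines = ["# TYPE registry_database_row_count gauge"]
    for table_name in sorted(totals):
        lines.append(
            f'registry_database_row_count{{table="{table_name}"}} {totals[table_name]}'
        )
    return "\n".join(lines)
```

Because this reads only catalog rows, its cost does not grow with table size, which sidesteps the performance concern of a global count(*) across all partitions.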