feat(datastore): add database row count metrics with distributed locking
What does this MR do?
Related to Adjust online GC row count metrics (#1250 - closed).
Overview
This MR introduces an experimental database row count metrics collection system. This feature provides visibility into database table sizes through Prometheus metrics, helping with monitoring, capacity planning, and debugging.
Key Features
- Distributed Collection: Uses Redis-based distributed locking to ensure only one registry instance in a cluster collects metrics at a time. This follows the Redis distributed locks pattern (https://redis.io/docs/latest/develop/clients/patterns/distributed-locks/) and uses https://github.com/bsm/redislock, as suggested by the go-redis ecosystem list (https://github.com/redis/go-redis?tab=readme-ov-file#ecosystem).
- Configurable Intervals: Supports custom collection intervals and lock durations via configuration
- Thread-Safe Design: Concurrent collection with proper mutex handling
- Graceful Shutdown: Clean startup/shutdown with proper resource cleanup
- Performance Monitoring: New Prometheus metric `registry_database_row_count_collection_duration_seconds` tracks collection performance
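For illustration, a collection loop with these properties (ticker-driven intervals, mutex-guarded collection, context-based shutdown) could be sketched as follows; names and structure are assumptions, not the MR's actual code:

```go
package metrics

import (
	"context"
	"sync"
	"time"
)

// collector is an illustrative type, not the MR's actual implementation.
type collector struct {
	mu       sync.Mutex
	interval time.Duration
}

// run collects row counts on every tick until the context is cancelled,
// which provides the clean shutdown path described above.
func (c *collector) run(ctx context.Context, collect func(context.Context)) {
	ticker := time.NewTicker(c.interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return // graceful shutdown
		case <-ticker.C:
			c.mu.Lock() // prevent overlapping collection runs
			collect(ctx)
			c.mu.Unlock()
		}
	}
}
```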
Covered Row Counts
The first implemented query tracks the row count of the `gc_blob_review_queue` table. Additional queries will be added in separate MRs after a successful rollout.
Configuration
This feature depends on and reuses the existing `redis.cache` connection for distributed locking (no dedicated Redis instance needed initially):
```yaml
database:
  metrics:
    enabled: true      # Enable metrics collection
    interval: 10s      # Collection frequency
    leaseduration: 30s # Redis lock duration
```
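For illustration only, these keys could map to a Go configuration struct along these lines (struct and field names are assumptions, not the registry's actual configuration types):

```go
package configuration

import "time"

// DatabaseMetrics is a hypothetical mapping of the YAML keys above;
// the registry's real configuration structs may differ in naming and layout.
type DatabaseMetrics struct {
	Enabled       bool          `yaml:"enabled"`       // enable metrics collection
	Interval      time.Duration `yaml:"interval"`      // collection frequency, e.g. 10s
	LeaseDuration time.Duration `yaml:"leaseduration"` // Redis lock duration, e.g. 30s
}
```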
Failover Behavior
The collector uses a single-attempt locking strategy rather than continuous retry loops. When an instance fails to acquire the Redis lock at startup, it logs the event and exits the collection process entirely. This design choice:
- Reduces complexity and resource usage (no busy waiting)
- Provides clear single leadership with predictable behavior
- Relies on cluster scaling events (deployments, restarts, auto-scaling) for failover
- Aligns with container orchestration patterns where failed services are automatically restarted
If automatic retry behavior proves necessary in production, this can be revisited in a future iteration.
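A minimal sketch of the single-attempt strategy with github.com/bsm/redislock is shown below; the lock key, logging, and surrounding structure are assumptions for illustration, not the MR's actual code:

```go
package metrics

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/bsm/redislock"
	"github.com/redis/go-redis/v9"
)

// tryLeadCollection attempts to obtain the lock exactly once. If another
// instance already holds it, we log and return without retrying; failover
// relies on the next restart/scaling event re-running this path.
func tryLeadCollection(ctx context.Context, client *redis.Client, lease time.Duration) {
	locker := redislock.New(client)

	// NOTE: the lock key is illustrative. Passing nil options means no retry
	// strategy, i.e. a single attempt.
	lock, err := locker.Obtain(ctx, "registry:database-metrics-lock", lease, nil)
	if errors.Is(err, redislock.ErrNotObtained) {
		log.Println("database row count metrics lock already obtained by another instance")
		return
	}
	if err != nil {
		log.Printf("obtaining database row count metrics lock: %v", err)
		return
	}
	defer lock.Release(ctx)

	log.Println("obtained database row count metrics lock")

	// The holder periodically extends the lease while it keeps collecting.
	if err := lock.Refresh(ctx, lease, nil); err != nil {
		log.Printf("extending database row count metrics lock: %v", err)
	}
}
```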
Testing Locally
To test the distributed locking behavior:
- Start the first instance:

  ```shell
  ./bin/registry serve config.yml
  ```

- Start a second instance (different ports):

  ```shell
  REGISTRY_HTTP_ADDR=:5052 REGISTRY_HTTP_DEBUG_ADDR=:5053 ./bin/registry serve config.yml
  ```

- Expected log patterns:

  First instance (acquires lock):

  ```json
  {"level":"info","msg":"database row count metrics collection started","interval_s":10,"lease_duration_s":30}
  {"level":"info","msg":"obtained database row count metrics lock"}
  {"level":"info","msg":"database row count metric collected","query_name":"gc_blob_review_queue","count":0}
  {"level":"info","msg":"extended database row count metrics lock"}
  ```

  Second instance (lock denied):

  ```json
  {"level":"info","msg":"database row count metrics collection started","interval_s":10,"lease_duration_s":30}
  {"level":"info","msg":"database row count metrics lock already obtained by another instance"}
  ```

- Test failover: Stop the first instance and restart the second instance. The second instance should then acquire the lock and begin collecting metrics.
Prometheus Metrics
The feature exposes these new metrics at `/metrics`:

- `registry_database_rows{query_name="gc_blob_review_queue"}` - Current row count
- `registry_database_row_count_collection_duration_seconds` - Histogram of collection durations
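For reference, metrics with these names would typically be registered and updated with prometheus/client_golang roughly as follows; help strings, buckets, and helper names are assumptions, not necessarily what this MR ships:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Current row count per tracked query.
	databaseRows = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "registry_database_rows",
		Help: "Number of rows reported by a tracked database query.",
	}, []string{"query_name"})

	// How long each collection run takes.
	collectionDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "registry_database_row_count_collection_duration_seconds",
		Help:    "Duration of database row count metric collection runs.",
		Buckets: prometheus.DefBuckets,
	})
)

func init() {
	prometheus.MustRegister(databaseRows, collectionDuration)
}

// record is an illustrative helper showing how a collected count might be set.
func record(queryName string, count int64, elapsedSeconds float64) {
	databaseRows.WithLabelValues(queryName).Set(float64(count))
	collectionDuration.Observe(elapsedSeconds)
}
```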
Next Steps
After successful rollout of this foundational implementation:
- Add more table queries (repositories, manifests, blobs, etc.)
- Consider dedicated Redis instance for metrics if needed
- Performance tuning based on production metrics
Author checklist
- Assign one of the conventional-commit prefixes to the MR:
  - `fix`: Indicates a bug fix, triggers a patch release.
  - `feat`: Signals the introduction of a new feature, triggers a minor release.
  - `perf`: Focuses on performance improvements that don't introduce new features or fix bugs, triggers a patch release.
  - `docs`: Updates or changes to documentation. Does not trigger a release.
  - `style`: Changes that do not affect the code's functionality. Does not trigger a release.
  - `refactor`: Modifications to the code that do not fix bugs or add features but improve code structure or readability. Does not trigger a release.
  - `test`: Changes related to adding or modifying tests. Does not trigger a release.
  - `chore`: Routine tasks that don't affect the application, such as updating build processes, package manager configs, etc. Does not trigger a release.
  - `build`: Changes that affect the build system or external dependencies. May trigger a release.
  - `ci`: Modifications to continuous integration configuration files and scripts. Does not trigger a release.
  - `revert`: Reverts a previous commit. It could result in a patch, minor, or major release.
- Feature flags:
  - This change does not require a feature flag
  - Added feature flag: (Add the Feature flag tracking issue link here)
- Unit-tests:
  - Unit-tests are not required
  - I added unit tests
- Documentation:
  - Documentation is not required
  - I added documentation
  - I created or linked to an existing issue for every added or updated `TODO`, `BUG`, `FIXME` or `OPTIMIZE` prefixed comment
- Database changes including schema/background migrations:
  - Change does not introduce database changes
  - MR includes DB changes
    - Do not include code that depends on the schema migrations in the same commit. Split the MR into two or more.
    - Do not include code that depends on background migrations in the same release.
    - Manually run up and down migrations in a postgres.ai production database clone and post a screenshot of the result here.
    - If adding new schema migrations, make sure the `REGISTRY_SELF_MANAGED_RELEASE_VERSION` CI variable in `migrate.yml` is pointing to the latest GitLab self-managed released registry version. Find the correct registry version here. Make sure to select the branch of the latest GitLab release.
    - If adding new queries, extract a query plan from postgres.ai and post the link here. If changing existing queries, also extract a query plan for the current version for comparison.
    - I do not have access to postgres.ai and have made a comment on this MR asking for these to be run on my behalf.
- If adding a new background migration, follow the guide for performance testing new background migrations and add a report/summary to the MR with your analysis.
- Ensured this change is safe to deploy to individual stages in the same environment (`cny` -> `prod`). State-related changes can be troublesome due to having parts of the fleet processing (possibly related) requests in different ways.
- If the change contains a breaking change, apply the breaking change label.
- If the change is considered high risk, apply the label high-risk-change.
- Changes cannot be rolled back:
  - Change can be safely rolled back
  - Change can't be safely rolled back
    - Apply the label cannot-rollback.
    - Add a section to the MR description that includes the following details:
      - The reasoning behind why a release containing the presented MR cannot be rolled back (e.g. schema migrations or changes to the FS structure)
      - Detailed steps to revert/disable a feature introduced by the same change where a migration cannot be rolled back. (Note: ideally, MRs containing schema migrations should not contain feature changes.)
      - Ensure this MR does not add code that depends on these changes that cannot be rolled back.
Reviewer checklist
- Ensure the commit and MR title are still accurate.
- If the change contains a breaking change, verify the breaking change label.
- If the change is considered high risk, verify the label high-risk-change.
- Identify if the change can be rolled back safely. (Note: all other reasons for not being able to roll back will be sufficiently captured by major version changes.)