feat(datastore): add database row count metrics with distributed locking
What does this MR do?
Related to Adjust online GC row count metrics (#1250 - closed).
Overview
This MR introduces an experimental database row count metrics collection system. This feature provides visibility into database table sizes through Prometheus metrics, helping with monitoring, capacity planning, and debugging.
Key Features
- Distributed Collection: Uses Redis-based distributed locking to ensure only one registry instance in a cluster collects metrics at a time. This follows the Redis distributed locks pattern (https://redis.io/docs/latest/develop/clients/patterns/distributed-locks/) and uses https://github.com/bsm/redislock, as suggested by the go-redis ecosystem list (https://github.com/redis/go-redis?tab=readme-ov-file#ecosystem).
- Configurable Intervals: Supports custom collection intervals and lock durations via configuration
- Thread-Safe Design: Concurrent collection with proper mutex handling
- Graceful Shutdown: Clean startup/shutdown with proper resource cleanup
- Performance Monitoring: New Prometheus metric `registry_database_row_count_collection_duration_seconds` tracks collection performance
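For illustration, a collection loop with these properties (ticker-driven intervals, mutex-guarded collection, context-based shutdown) could be sketched as follows; names and structure are assumptions, not the MR's actual code:

```go
package metrics

import (
	"context"
	"sync"
	"time"
)

// collector is an illustrative type, not the MR's actual implementation.
type collector struct {
	mu       sync.Mutex
	interval time.Duration
}

// run collects row counts on every tick until the context is cancelled,
// which provides the clean shutdown path described above.
func (c *collector) run(ctx context.Context, collect func(context.Context)) {
	ticker := time.NewTicker(c.interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return // graceful shutdown
		case <-ticker.C:
			c.mu.Lock() // prevent overlapping collection runs
			collect(ctx)
			c.mu.Unlock()
		}
	}
}
```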
Covered Row Counts
The first implemented query tracks the row count of the `gc_blob_review_queue` table. Additional queries will be added in separate MRs after a successful rollout.
Configuration
This feature depends on and reuses the existing `redis.cache` connection for distributed locking (no dedicated Redis instance needed initially):
```yaml
database:
  metrics:
    enabled: true      # Enable metrics collection
    interval: 10s      # Collection frequency
    leaseduration: 30s # Redis lock duration
```
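For illustration only, these keys could map to a Go configuration struct along these lines (struct and field names are assumptions, not the registry's actual configuration types):

```go
package configuration

import "time"

// DatabaseMetrics is a hypothetical mapping of the YAML keys above;
// the registry's real configuration structs may differ in naming and layout.
type DatabaseMetrics struct {
	Enabled       bool          `yaml:"enabled"`       // enable metrics collection
	Interval      time.Duration `yaml:"interval"`      // collection frequency, e.g. 10s
	LeaseDuration time.Duration `yaml:"leaseduration"` // Redis lock duration, e.g. 30s
}
```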
Failover Behavior
The collector uses a single-attempt locking strategy rather than continuous retry loops. When an instance fails to acquire the Redis lock at startup, it logs the event and exits the collection process entirely. This design choice:
- Reduces complexity and resource usage (no busy waiting)
- Provides clear single leadership with predictable behavior
- Relies on cluster scaling events (deployments, restarts, auto-scaling) for failover
- Aligns with container orchestration patterns where failed services are automatically restarted
If automatic retry behavior proves necessary in production, this can be revisited in a future iteration.
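A minimal sketch of the single-attempt strategy with github.com/bsm/redislock is shown below; the lock key, logging, and surrounding structure are assumptions for illustration, not the MR's actual code:

```go
package metrics

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/bsm/redislock"
	"github.com/redis/go-redis/v9"
)

// tryLeadCollection attempts to obtain the lock exactly once. If another
// instance already holds it, we log and return without retrying; failover
// relies on the next restart/scaling event re-running this path.
func tryLeadCollection(ctx context.Context, client *redis.Client, lease time.Duration) {
	locker := redislock.New(client)

	// NOTE: the lock key is illustrative. Passing nil options means no retry
	// strategy, i.e. a single attempt.
	lock, err := locker.Obtain(ctx, "registry:database-metrics-lock", lease, nil)
	if errors.Is(err, redislock.ErrNotObtained) {
		log.Println("database row count metrics lock already obtained by another instance")
		return
	}
	if err != nil {
		log.Printf("obtaining database row count metrics lock: %v", err)
		return
	}
	defer lock.Release(ctx)

	log.Println("obtained database row count metrics lock")

	// The holder periodically extends the lease while it keeps collecting.
	if err := lock.Refresh(ctx, lease, nil); err != nil {
		log.Printf("extending database row count metrics lock: %v", err)
	}
}
```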
Testing Locally
To test the distributed locking behavior:
- Start the first instance:

  ```shell
  ./bin/registry serve config.yml
  ```

- Start a second instance (different ports):

  ```shell
  REGISTRY_HTTP_ADDR=:5052 REGISTRY_HTTP_DEBUG_ADDR=:5053 ./bin/registry serve config.yml
  ```

- Expected log patterns:

  First instance (acquires lock):

  ```json
  {"level":"info","msg":"database row count metrics collection started","interval_s":10,"lease_duration_s":30}
  {"level":"info","msg":"obtained database row count metrics lock"}
  {"level":"info","msg":"database row count metric collected","query_name":"gc_blob_review_queue","count":0}
  {"level":"info","msg":"extended database row count metrics lock"}
  ```

  Second instance (lock denied):

  ```json
  {"level":"info","msg":"database row count metrics collection started","interval_s":10,"lease_duration_s":30}
  {"level":"info","msg":"database row count metrics lock already obtained by another instance"}
  ```

- Test failover: Stop the first instance and restart the second instance. The second instance should then acquire the lock and begin collecting metrics.
Prometheus Metrics
The feature exposes these new metrics at `/metrics`:

- `registry_database_rows{query_name="gc_blob_review_queue"}` - Current row count
- `registry_database_row_count_collection_duration_seconds` - Histogram of collection durations
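For reference, metrics with these names would typically be registered and updated with prometheus/client_golang roughly as follows; help strings, buckets, and helper names are assumptions, not necessarily what this MR ships:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Current row count per tracked query.
	databaseRows = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "registry_database_rows",
		Help: "Number of rows reported by a tracked database query.",
	}, []string{"query_name"})

	// How long each collection run takes.
	collectionDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "registry_database_row_count_collection_duration_seconds",
		Help:    "Duration of database row count metric collection runs.",
		Buckets: prometheus.DefBuckets,
	})
)

func init() {
	prometheus.MustRegister(databaseRows, collectionDuration)
}

// record is an illustrative helper showing how a collected count might be set.
func record(queryName string, count int64, elapsedSeconds float64) {
	databaseRows.WithLabelValues(queryName).Set(float64(count))
	collectionDuration.Observe(elapsedSeconds)
}
```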
Next Steps
After successful rollout of this foundational implementation:
- Add more table queries (repositories, manifests, blobs, etc.)
- Consider dedicated Redis instance for metrics if needed
- Performance tuning based on production metrics
Author checklist
- Assign one of the conventional-commit prefixes to the MR:
  - `fix`: Indicates a bug fix, triggers a patch release.
  - `feat`: Signals the introduction of a new feature, triggers a minor release.
  - `perf`: Focuses on performance improvements that don't introduce new features or fix bugs, triggers a patch release.
  - `docs`: Updates or changes to documentation. Does not trigger a release.
  - `style`: Changes that do not affect the code's functionality. Does not trigger a release.
  - `refactor`: Modifications to the code that do not fix bugs or add features but improve code structure or readability. Does not trigger a release.
  - `test`: Changes related to adding or modifying tests. Does not trigger a release.
  - `chore`: Routine tasks that don't affect the application, such as updating build processes, package manager configs, etc. Does not trigger a release.
  - `build`: Changes that affect the build system or external dependencies. May trigger a release.
  - `ci`: Modifications to continuous integration configuration files and scripts. Does not trigger a release.
  - `revert`: Reverts a previous commit. It could result in a patch, minor, or major release.
- Feature flags:
  - This change does not require a feature flag
  - Added feature flag: (Add the Feature flag tracking issue link here)
- Unit-tests:
  - Unit-tests are not required
  - I added unit tests
- Documentation:
  - Documentation is not required
  - I added documentation
  - I created or linked to an existing issue for every added or updated `TODO`, `BUG`, `FIXME` or `OPTIMIZE` prefixed comment
- Database changes including schema/background migrations:
  - Change does not introduce database changes
  - MR includes DB changes
    - Do not include code that depends on the schema migrations in the same commit. Split the MR into two or more.
    - Do not include code that depends on background migrations in the same release.
    - Manually run up and down migrations in a postgres.ai production database clone and post a screenshot of the result here.
    - If adding new schema migrations, make sure the `REGISTRY_SELF_MANAGED_RELEASE_VERSION` CI variable in `migrate.yml` is pointing to the latest GitLab self-managed released registry version. Find the correct registry version here. Make sure to select the branch of the latest GitLab release.
    - If adding new queries, extract a query plan from postgres.ai and post the link here. If changing existing queries, also extract a query plan for the current version for comparison.
    - I do not have access to postgres.ai and have made a comment on this MR asking for these to be run on my behalf.
- If adding a new background migration, follow the guide for performance testing new background migrations and add a report/summary to the MR with your analysis.
- Ensured this change is safe to deploy to individual stages in the same environment (`cny` -> `prod`). State-related changes can be troublesome due to having parts of the fleet processing (possibly related) requests in different ways.
- If the change contains a breaking change, apply the breaking change label.
- If the change is considered high risk, apply the label high-risk-change.
- Changes cannot be rolled back:
  - Change can be safely rolled back
  - Change can't be safely rolled back
    - Apply the label cannot-rollback.
    - Add a section to the MR description that includes the following details:
      - The reasoning behind why a release containing the presented MR cannot be rolled back (e.g. schema migrations or changes to the FS structure)
      - Detailed steps to revert/disable a feature introduced by the same change where a migration cannot be rolled back. (Note: ideally, MRs containing schema migrations should not contain feature changes.)
      - Ensure this MR does not add code that depends on these changes that cannot be rolled back.
Reviewer checklist
- Ensure the commit and MR title are still accurate.
- If the change contains a breaking change, verify the breaking change label.
- If the change is considered high risk, verify the label high-risk-change.
- Identify if the change can be rolled back safely. (Note: all other reasons for not being able to roll back will be sufficiently captured by major version changes.)