Record latency attribution in container registry logs

The container registry logs currently provide little information beyond the core request metadata. In particular, they say little about where a request spent its time.

This makes it hard to attribute apdex drops, as a latency increase could be due to a number of factors:

Database
Redis
GCS
CPU overload
Lock contention
etc.

This was recently surfaced as part of gitlab-com/gl-infra/production#8376 (closed). There was no easy way to tie the slowdown to an underlying database issue. SREs and devs need to guess as to what the source of the latency might be.

Most of our other applications (most notably Rails and Gitaly) have rich latency attribution data in their logs. A recent example from Gitaly is https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2167.

Ideally we would have log fields for each external call contributing to latency:

database_duration_s
redis_duration_s
gcs_duration_s

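Ideally we would also log a count of calls for each of these, so that many fast calls can be told apart from a few slow ones.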

We discussed adding something like this to container registry as part of container-registry#739 (closed). In light of incidents requiring latency diagnosis, we may want to prioritise this work.

Since this is a matter of operability and production maturity, it might make sense to engage with the newly established Reliability:Practices SRE team.

cc @jdrpereira @steveazz @kwanyangu

Edited Aug 18, 2025 by 🤖 GitLab Bot 🤖