Record latency attribution in container registry logs
The container registry logs currently provide little information beyond the core request metadata. In particular, there is little information about where the request spent its time.
This makes it hard to attribute apdex drops, as a latency increase could be due to a number of factors:
- Database
- Redis
- GCS
- CPU overload
- Lock contention
- etc.
This was recently surfaced as part of gitlab-com/gl-infra/production#8376 (closed). There was no easy way to tie the slowdown to an underlying database issue. SREs and devs need to guess as to what the source of the latency might be.
Most of our other applications (most notably rails and gitaly) have rich latency attribution data in their logs. A recent example from gitaly is https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2167.
Ideally we would have log fields for each external call contributing to latency:
- database_duration_s
- redis_duration_s
- gcs_duration_s
And also a count of calls for each.
We discussed adding something like this to container registry as part of container-registry#739 (closed). In light of incidents requiring latency diagnosis, we may want to prioritise this work.
Since this is a matter of operability and production maturity, it might make sense to engage with the newly established Reliability:Practices SRE team.
cc @jdrpereira @steveazz @kwanyangu