Scale Rails webhook processing to handle container registry pull events
The container registry currently sends webhook notifications to Rails only for write operations. Rails processes them for usage metrics and other purposes.
Based on this analysis, enabling pull notifications would lead to a 170x increase in webhook traffic, while Rails can currently handle only a 2x increase. This makes Rails the primary bottleneck preventing the feature from being enabled.
My strong suspicion is that Rails tail (95th percentile) latency is already high and climbs further as the number of processed notifications grows.
One approach to address this could be to use tracing to analyse slow requests and pinpoint what exactly causes the high 95th percentile latency (database?). Another could be to explore using Snowplow events instead of direct webhook notifications. The Analytics Instrumentation team is developing a Go SDK for Snowplow (labkit!250), which would allow the registry to emit events directly to Snowplow. If Rails can consume Snowplow events for usage metrics, this could avoid the additional API load entirely while still capturing the necessary data.
Due to the scale and distributed nature of Rails, we can't really benchmark it in an artificial environment. We can, however, evaluate our change by observing the latency of the container-registry notifications HTTP worker and the number of retries, especially during the occasional spikes in sent notifications.
The goal would be to eliminate the need for retries and to have constant-time submission latency for notification events.
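
To make the two evaluation signals concrete, here is a minimal sketch of how the 95th percentile submission latency and the share of retried notifications could be computed from a batch of observations. The `Sample` type, its fields, and the function names are hypothetical and are not taken from the registry code base. In practice these numbers would more likely come from existing metrics than from ad hoc code.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// Sample is a hypothetical record of one notification submission.
type Sample struct {
	Latency time.Duration // time until the notification was accepted
	Retries int           // retries needed before it was accepted
}

// Percentile returns the nearest-rank p-th percentile (0 < p <= 100)
// of the submission latencies, or 0 for an empty input.
func Percentile(samples []Sample, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	lat := make([]time.Duration, len(samples))
	for i, s := range samples {
		lat[i] = s.Latency
	}
	sort.Slice(lat, func(i, j int) bool { return lat[i] < lat[j] })

	// Nearest-rank method: the smallest value with at least p% of samples at or below it.
	rank := int(math.Ceil(p * float64(len(lat)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(lat) {
		rank = len(lat)
	}
	return lat[rank-1]
}

// RetryRatio returns the fraction of samples that needed at least one retry.
func RetryRatio(samples []Sample) float64 {
	if len(samples) == 0 {
		return 0
	}
	retried := 0
	for _, s := range samples {
		if s.Retries > 0 {
			retried++
		}
	}
	return float64(retried) / float64(len(samples))
}

func main() {
	// Made-up observations, for illustration only.
	samples := []Sample{
		{Latency: 40 * time.Millisecond},
		{Latency: 55 * time.Millisecond},
		{Latency: 900 * time.Millisecond, Retries: 2},
		{Latency: 60 * time.Millisecond},
	}
	fmt.Println("p95 latency:", Percentile(samples, 95))
	fmt.Println("retry ratio:", RetryRatio(samples))
}
```

Comparing these two values between a quiet period and a notification spike would show whether a change moves us closer to the goal: a retry ratio near zero and a p95 that stays flat as volume grows.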