Network timeouts between the Rails backend and the Container Registry
Details
- Point of contact for this request: @10io
- If a call is needed, what is the proposed date and time of the call: not needed
- Additional call details (format, type of call): n/a
SRE Support Needed
🎋 Context
For some features, the Rails backend needs to contact the Container Registry.
Looking at this Sentry, it seems that we sometimes encounter network timeouts where the connection to the Container Registry can't be opened.
In gitlab-org/gitlab!50750 (merged), we improved the observability of the Container Registry Ruby client used by the Rails backend.
This led to this dashboard, where we can clearly see the network errors and which Container Registry URL was contacted.
The Container Registry ruby client uses 3 different timeouts for its network operations:
- Open timeout (10s)
- Read timeout (20s)
- Write timeout (30s)
From the Kibana dashboard above, the majority of the errors are open timeouts (the first timeout in the list above). As an example, see this error in Sentry. It happens when net/http.rb initializes the connection and opens it.
Note that because most of these errors occur while the connection is being established, they leave no trace in the Container Registry logs.
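To illustrate how the three timeouts map to distinct phases of a request, here is a minimal sketch using Ruby's standard `Net::HTTP`. It is not the actual Container Registry client code: the constant name, the helper names, and the host are hypothetical, and only the timeout values come from the list above.

```ruby
require 'net/http'

# Timeout values (in seconds) from the list above. The constant name is hypothetical.
REGISTRY_TIMEOUTS = { open: 10, read: 20, write: 30 }.freeze

# Builds an unconnected Net::HTTP object with the three timeouts applied.
# No network traffic happens until a request is started.
def build_registry_http(host, port)
  http = Net::HTTP.new(host, port)
  http.open_timeout = REGISTRY_TIMEOUTS[:open]   # establishing the TCP connection
  http.read_timeout = REGISTRY_TIMEOUTS[:read]   # waiting for response data
  http.write_timeout = REGISTRY_TIMEOUTS[:write] # sending the request body
  http
end

# Maps an exception to the phase of the request that timed out.
def timeout_phase(error)
  case error
  when Net::OpenTimeout then :open
  when Net::ReadTimeout then :read
  when Net::WriteTimeout then :write
  else :other
  end
end
```

An open timeout is raised before any request is sent, which is consistent with the absence of matching entries in the Container Registry logs.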
💥 Users impact
The Container Registry ruby client is used to power these features:
- Read and write operations on Container Registry objects from the UI
  - Usually, the outcome here is that the user will see an error on the page.
  - For write operations, we could have stale data.
- Returning Container Registry objects through the public REST or GraphQL API
  - The APIs will return an error.
- Cleanup policies
  - The limited capacity worker will reduce the number of jobs executed in parallel by one.
These impacts are low by definition and don't break any major feature.
Having said that, the cleanup policies workers deal with a non-trivial amount of daily work. Reducing the number of "slots" by one will lower their efficiency.
🚒 SRE Support Request
Given the above, we want to make sure that everything is working fine between the Rails backend and the Container Registry.
To my limited knowledge, some components (such as proxies) can be present between them.
- Can these intermediary components provide any logs?
- Are there any errors that could explain these "random" timeouts?
- Is any kind of rate limiting applied here that could cause connections to be dropped, so that the Container Registry Ruby client can't establish one?