Investigate "write: connection timed out" errors
Context
We have recently introduced support for error reporting with Sentry.
The rollout of this feature for GitLab.com happened on Jan 12, 2021 (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11297#note_483587713). Before this, we had no insight into the unexpected errors raised by the registry (unless we wanted to grep over logs). Since then, I have been watching the errors on Sentry to identify patterns and possible improvements.
Problem
I'm seeing a large number of unknown: unknown error: write tcp <some IP>-><other IP>: write: connection timed out errors (sample).
Because the IP addresses vary, so do the error messages, which makes it hard to get an exact count of these errors, but the number is relatively low (see this list).
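One way to get a usable count despite the varying addresses is to normalize the messages before grouping them. The following is a minimal sketch, assuming the messages have been exported from Sentry as plain strings; the function names and the sample messages (which use documentation-reserved IP ranges) are hypothetical, and only IPv4 addresses are handled.

```python
import re
from collections import Counter

# Matches an IPv4 address with an optional ":port" suffix.
# IPv6 addresses are not handled in this sketch.
_IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b")


def normalize(message: str) -> str:
    """Replace every IPv4 address (and port, if present) with a placeholder."""
    return _IPV4.sub("<addr>", message)


def count_by_pattern(messages):
    """Group messages that differ only by address and count each group."""
    return Counter(normalize(m) for m in messages)


# Hypothetical sample data, not taken from real events.
messages = [
    "unknown: unknown error: write tcp 192.0.2.10:5000->198.51.100.7:41234: write: connection timed out",
    "unknown: unknown error: write tcp 192.0.2.11:5000->203.0.113.9:52011: write: connection timed out",
    "unknown: unknown error: unexpected EOF",
]

counts = count_by_pattern(messages)
for pattern, n in counts.most_common():
    print(n, pattern)
```

With the addresses replaced by a placeholder, the two timeout messages above fall into the same group, so the total for this error class can be read directly from the counter.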
By correlating several of these events with logs (using the recently introduced correlation_id field), we can see that at least a good portion of these seem related to tag list requests (sample). Because these requests can take a long time for really large repositories, it's likely that this is due to a client timeout. However, this is not completely clear and needs additional investigation.
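The correlation described above could be repeated over a larger set of events with a small script. This is only a sketch under assumptions: it presumes the correlation IDs of the Sentry events and the matching log records have already been exported, and the `route` field name, the IDs, and the sample records are hypothetical (only `correlation_id` comes from the logs mentioned above).

```python
from collections import Counter


def routes_for_events(event_correlation_ids, log_records):
    """Count which request route each error event maps to, via correlation_id."""
    # Index log records by correlation ID, keeping the first record seen per ID.
    by_id = {}
    for record in log_records:
        by_id.setdefault(record["correlation_id"], record)

    tally = Counter()
    for cid in event_correlation_ids:
        record = by_id.get(cid)
        # "route" is an assumed field name for the matched request route.
        tally[record["route"] if record else "<no matching log>"] += 1
    return tally


# Hypothetical sample data for illustration only.
event_ids = ["abc1", "abc2", "abc3", "abc4"]
logs = [
    {"correlation_id": "abc1", "route": "GET /v2/{name}/tags/list"},
    {"correlation_id": "abc2", "route": "GET /v2/{name}/tags/list"},
    {"correlation_id": "abc3", "route": "GET /v2/{name}/blobs/{digest}"},
]

tally = routes_for_events(event_ids, logs)
for route, n in tally.most_common():
    print(n, route)
```

A tally dominated by tag list requests would support the client timeout hypothesis, while a more even spread across routes would suggest looking elsewhere.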
Proposal
- Identify the root cause of these errors, based on the analysis of a considerable portion of events (e.g., 20, randomly picked).
- Consider whether the current 500 Internal Server Error is the best status for these errors.