2020-10-16: Failed to open TCP connection to gitlab-registry:5000 (Connection refused - connect(2) for "gitlab-registry" port 5000)
Summary
Creating this incident to note and track around 16 distinct occurrences of the error: Failed to open TCP connection to gitlab-registry:5000 (Connection refused - connect(2) for "gitlab-registry" port 5000)
I suspect a transient fault in Google's networking: the error no longer appears to occur, and subsequent job retries succeeded.
There were no production alerts about these errors.
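To help tell a transient network fault apart from a registry that is actually down, a small probe along these lines could be run from an affected host. This is only a sketch: the function names `classify_connect_error` and `probe`, the retry count, and the delays are illustrative assumptions, not part of any GitLab tooling. The host and port come from the error message above.

```python
import errno
import socket
import time


def classify_connect_error(exc: OSError) -> str:
    """Map a socket error to a coarse label for triage (hypothetical helper)."""
    # Check DNS errors first: gaierror codes are not errno values.
    if isinstance(exc, socket.gaierror):
        return "dns_failure"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    if exc.errno == errno.ECONNREFUSED:
        return "connection_refused"
    if exc.errno == errno.EHOSTUNREACH:
        return "no_route_to_host"
    return "other"


def probe(host: str, port: int, attempts: int = 3,
          delay: float = 1.0, timeout: float = 2.0) -> list:
    """Try to open a TCP connection up to `attempts` times.

    Returns one label per attempt ("ok" or an error label) and
    stops early on the first successful connection.
    """
    results = []
    for attempt in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results.append("ok")
                break
        except OSError as exc:
            results.append(classify_connect_error(exc))
            if attempt < attempts - 1:
                time.sleep(delay)
    return results
```

Calling `probe("gitlab-registry", 5000)` returns a list of per-attempt labels. A failure followed by "ok" would be consistent with the transient behaviour described above, while repeated failures would point to a persistent problem.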
Timeline
All times UTC.
2020-10-16
- 18:39 - Error occurrence: Failed to open TCP connection to gitlab-registry:5000 (No route to host - connect(2) for "gitlab-registry" port 5000): https://sentry.gitlab.net/gitlab/gitlabcom/issues/1908587/?query=is%3Aunresolved%20500
- 20:15 - Error occurrence: Failed to open TCP connection to gitlab-registry:5000 (No route to host - connect(2) for "gitlab-registry" port 5000): https://sentry.gitlab.net/gitlab/gitlabcom/issues/1908617/?query=is%3Aunresolved%20500
- 20:21 - @cwoolley-gitlab brings errors from his pipeline jobs to our attention.
- 21:36 - @nnelson declares incident in Slack using the /incident declare command.
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
5 Whys
Lessons Learned
Corrective Actions
Guidelines
Edited by Nels Nelson