2021-01-20 Clone of gitlab-org/gitlab is timing out, preventing release tagging
Summary
When preparing the 13.8 release, the `merge:foss` job repeatedly failed. This is suspected to be the same issue as https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3104#note_486138053.
See the timeline below for full details and links.
Timeline
All times UTC.
2021-01-20
- 16:51 - Command to create an RC is triggered: https://ops.gitlab.net/gitlab-org/release/tools/-/jobs/2890564
- 16:52 - `merge:foss` job triggered by the RC fails: https://ops.gitlab.net/gitlab-org/merge-train/-/jobs/2890566
- 16:58 - Job is retried and fails again: https://ops.gitlab.net/gitlab-org/merge-train/-/jobs/2890569
- 17:18 - CI cache is flushed via "Clear Runners Cache"
- 17:19 - Job is retried and fails: https://ops.gitlab.net/gitlab-org/merge-train/-/jobs/2890684
- 17:37 - MR is created to clean the directory after a failed clone: gitlab-org/merge-train!36 (merged)
- 17:39 - Job is retried and fails again: https://ops.gitlab.net/gitlab-org/merge-train/-/jobs/2890813
- 17:50 - @mayra-cabrera declares incident in Slack.
- 18:05 - Scheduled `master` sync starts to fail as well: https://ops.gitlab.net/gitlab-org/merge-train/-/jobs/2891056
- 18:28 - MR is added to clone verbosely rather than silently: gitlab-org/merge-train!37 (merged)
- 18:33 - Job running with verbose clone: https://ops.gitlab.net/gitlab-org/merge-train/-/jobs/2891245. No further details are shown:

  ```
  debug2: channel 0: window 1945493 sent adjust 8192
  debug2: channel 0: window 1953685 sent adjust 8192
  debug2: channel 0: window 1961877 sent adjust 8192
  debug2: channel 0: window 1970069 sent adjust 8192
  debug2: channel 0: window 1978261 sent adjust 8192
  debug2: channel 0: window 1986453 sent adjust 8192
  debug2: channel 0: window 1994645 sent adjust 8192
  debug3: send packet: type 1
  debug1: channel 0: free: client-session, nchannels 1
  debug3: channel 0: status: The following connections are open:
  #0 client-session (t4 r0 i0/0 o0/0 e[write]/0 fd 4/5/6 sock -1 cc -1)
  debug1: fd 0 clearing O_NONBLOCK
  debug3: fd 1 is not O_NONBLOCK
  debug1: fd 2 clearing O_NONBLOCK
  Connection to gitlab.com closed by remote host.
  Transferred: sent 1888132, received 968892048 bytes, in 177.7 seconds
  Bytes per second: sent 10624.4, received 5451918.4
  debug1: Exit status -1
  ```
- 19:20 - We believe the problem might be the same as https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3104#note_486138053
- 19:34 - To unblock the RC, merge-train was executed locally by @rspeicher: https://gitlab.com/gitlab-org/gitlab-foss/commits/13-8-stable
- 19:41 - FOSS merge was skipped during the RC creation: gitlab-org/release-tools!1351 (merged)
- 19:46 - RC command was completed: https://ops.gitlab.net/gitlab-org/release/tools/-/jobs/2891919
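
The two merge-train changes in the timeline (cleaning the directory after a failed clone, and cloning verbosely rather than silently) could look roughly like the sketch below. This is not the code from gitlab-org/merge-train!36 or !37, which is not reproduced in this issue. The function name `clone_with_cleanup`, its parameters, the automatic retry loop, and the default values are all hypothetical. The only real interfaces used are `git clone --verbose`, Git's `GIT_SSH_COMMAND` environment variable, and OpenSSH's `-vvv` flag, which produces `debug1` to `debug3` output like the log above.

```python
import os
import shutil
import subprocess
import time


def clone_with_cleanup(repo_url, dest, attempts=3, delay=5.0, verbose=True,
                       runner=subprocess.run):
    """Clone repo_url into dest, deleting any partial checkout between attempts.

    Returns the number of the attempt that succeeded. All names and defaults
    are hypothetical; this is not the merge-train implementation.
    """
    env = dict(os.environ)
    if verbose:
        # Ask OpenSSH for maximum debug output so a dropped connection
        # leaves a trace in the job log.
        env["GIT_SSH_COMMAND"] = "ssh -vvv"
    cmd = ["git", "clone", "--verbose", repo_url, dest]

    for attempt in range(1, attempts + 1):
        # A failed clone can leave a partial directory behind, which would
        # make the next "git clone" refuse to run. Remove it first.
        shutil.rmtree(dest, ignore_errors=True)
        result = runner(cmd, env=env)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay)

    # Leave no partial checkout behind for later jobs on the same runner.
    shutil.rmtree(dest, ignore_errors=True)
    raise RuntimeError(f"git clone of {repo_url} failed after {attempts} attempts")
```

A call such as `clone_with_cleanup("git@gitlab.com:gitlab-org/gitlab.git", "/tmp/gitlab")` would retry up to three times. Retrying would not have resolved this incident, since the remote host kept closing the connection, but the cleanup step keeps a failed attempt from blocking the next one.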
Corrective Actions
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Time to detection:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - ...
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - ...
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
What were the root causes?
Incident Response Analysis
- How was the incident detected?
  - ...
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - ...
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
Lessons Learned
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)
Incident Review Stakeholders
Edited by Amy Phillips