Incident Investigation Follow-Up: 2022-03-11: SSL certificate problem with shared runners & submodules
Summary
During the incident we tried to find indications of this failure in sources such as logs, but we did not identify a good one. We believe the failure indications are most likely within the job logs themselves and are not expressed anywhere outside of the build/job. If possible, we'd like to identify a source to answer:
- how many failures there were
- how many projects were impacted
Unfortunately, we do not have a good way to measure this. I've opened gitlab-org/gitlab-runner#28950 to track improving our metrics on the runner side. In general, we did not notice a marked increase in the overall error count.
- the precise start and end time of the impact
See time range on the linked incident issue.
- should we have used blue/green deployments for this change?
What we should have done is note that this change was safe to revert without a runner manager restart, so no draining was necessary. At the time, we did discuss using blue/green but decided against it, given that the change was simple and fast to revert if there were issues.
Root cause
We saw failures for customers who met the following conditions in their CI configuration:
- If they were using submodules with relative paths
- If the relative paths didn't include the .git extension for the relative Git repositories
Although our docs say to use .git at the end of the repository, I believe it is pretty common for users to omit it.
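To check whether a project matches the two conditions above, a minimal sketch like the following could scan a .gitmodules file for relative submodule URLs that lack the .git suffix. The function name and the sample content are hypothetical and are not part of GitLab or GitLab Runner. It assumes relative URLs start with ./ or ../, and it only flags configuration; it does not reproduce the SSL failure itself.

```python
import re

# Matches a section header such as: [submodule "lib-a"]
SECTION_RE = re.compile(r'^\s*\[submodule\s+"(?P<name>[^"]+)"\]\s*$')
# Matches a url entry such as: url = ../../group/lib-a
URL_RE = re.compile(r'^\s*url\s*=\s*(?P<url>\S+)\s*$')


def relative_urls_missing_git_suffix(gitmodules_text):
    """Return (submodule name, url) pairs whose URL is relative and lacks '.git'.

    Hypothetical helper for illustration only.
    """
    flagged = []
    current = None
    for line in gitmodules_text.splitlines():
        section = SECTION_RE.match(line)
        if section:
            current = section.group("name")
            continue
        url_match = URL_RE.match(line)
        if url_match and current is not None:
            url = url_match.group("url")
            # Assumption: relative submodule URLs begin with ./ or ../
            is_relative = url.startswith("./") or url.startswith("../")
            if is_relative and not url.rstrip("/").endswith(".git"):
                flagged.append((current, url))
    return flagged


# Made-up sample content; the names and paths are not from the incident.
SAMPLE = """\
[submodule "lib-a"]
\tpath = vendor/lib-a
\turl = ../../group/lib-a
[submodule "lib-b"]
\tpath = vendor/lib-b
\turl = ../../group/lib-b.git
[submodule "lib-c"]
\tpath = vendor/lib-c
\turl = https://example.com/group/lib-c
"""

if __name__ == "__main__":
    for name, url in relative_urls_missing_git_suffix(SAMPLE):
        print(f"{name}: {url}")
```

In the sample, only lib-a is flagged: lib-b already ends in .git, and lib-c uses an absolute URL, so neither meets both conditions.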
Related Incident(s)
Originating issue(s): production#6563 (closed)