2019-11-11 CI runners and Registry errors and high latency
From 2019-11-11 10:40 UTC to 12:00 UTC, GitLab.com experienced significant delays in processing CI jobs. This appears to have been related to underlying issues on GCP, per: https://status.cloud.google.com/incident/cloud-datastore/19006
All times UTC.
- 11:10 - increased CI latency
- 11:40 - EOC is paged
- 11:45 - number of started CI jobs drops
- 11:46 - CMOC is paged
- 11:57 - posted to status.io - imoc/cmoc on
- 11:58 - update on GCP status page: https://status.cloud.google.com/?_ga=2.259914314.-1916113232.1529097922 Cloud Datastore and Networking are experiencing issues.
- 12:51 - GCP update that they have identified a potential cause and are rolling out mitigation.
- 13:45 - GCP update that mitigations have been put in place, though they continue to work on the issue: https://status.cloud.google.com/incident/storage/19009
- 13:50 - Registry request rates and error rate for the last 20 minutes are looking healthier.
potentially related to: #1348 (closed)
Relates to
- production-engineering #83991
- Michal Wasilewski added incident severity2 labels
- Michal Wasilewski mentioned in issue on-call-handovers#57 (closed)
- Michal Wasilewski changed the description
- Dave Smith changed the description
- Dave Smith changed title from 2019-11-11 CI runners errors to 2019-11-11 CI runners errors and high latency
I see that the issues with the shared runners occur again and again.
- Maintainer
This incident issue does not have any service attribution. Please add one or more of the appropriate service labels. Thanks for your help!
You are welcome to help improve this comment.
- Owner
CI Service graphs:
- Dave Smith assigned to @mwasilewski-gitlab
- Dave Smith changed the description
Registry is not working either.
I'm also having issues with registry.
- Developer
- Owner
Yes - registry does appear to be affected by these issues with GCP too:
- Owner
These issues appear to be related to this incident in GCP: https://status.cloud.google.com/incident/cloud-datastore/19006
Google has published that the issue was solved, but the registry still doesn't seem to work.
- Owner
@lucasayb97 - we still see the dashboard https://status.cloud.google.com/ - showing GCP cloud storage as having an incident which would be affecting registry. Sorry for the problems, we are awaiting the next update from them.
- Owner
https://status.cloud.google.com/incident/cloud-datastore/19006 should be updated at 05:30 PT. We are watching it, but it still shows as red. It does look, at a very early stage, like things are just starting to get better.
The issue was apparently solved by Google, but the registry still has problems.
- Owner
It does look like things are beginning to recover. We are still tracking the other related issue in GCP on https://status.cloud.google.com/incident/storage/19009.
Still not able to publish an image in registry.
Can't pull either ^^
- Owner
Thanks for the heads up - we'll look into this.
Now, for me it's working correctly!
- Dave Smith changed title from 2019-11-11 CI runners errors and high latency to 2019-11-11 CI runners and Registry errors and high latency
- Dave Smith changed the description
- Dave Smith added 1 deleted label
- Andrew Newdigate added VendorGoogle label
- Owner
Confounding factor note: some team members were having issues loading dashboards due to the underlying issues. We should again note this in our IncidentReview and come up with plans to mitigate.
- Dave Smith changed the description
- Dave Smith marked this issue as related to #1348 (closed)
- Dave Smith mentioned in issue #1348 (closed)
- Michal Wasilewski mentioned in issue on-call-handovers#58 (closed)
- Dave Smith changed the description
- Owner
- Maintainer
I think this incident has ended. Added a link to the RCA issue.
- Cameron McFarland closed
- ops-gitlab-net mentioned in issue #1355 (closed)
- cznic mentioned in issue cznic/xc#2 (closed)
- Marin Jankovski added ServiceCI Runners label and removed 1 deleted label
- Maintainer
This incident was closed before the IncidentResolved label was applied, or after a review was started but not completed. The issue is being reopened so that it will appear on the Production Incidents board. Please mark it as either IncidentResolved or IncidentReview-Completed before closing.
Please review the Incident Workflow section on the Incident Management handbook page for more information.
- 🤖 GitLab Bot 🤖 added auto updated label