2019-11-11 CI runners and Registry errors and high latency
Summary
From 2019-11-11 10:40 UTC to 12:00 UTC, GitLab.com experienced significant delays in processing CI jobs. This appears to have been related to underlying issues on GCP, per https://status.cloud.google.com/incident/cloud-datastore/19006
Timeline
All times UTC.
2019-11-11
- 11:10 - increased CI latency
- 11:40 - EOC is paged
- 11:45 - number of started CI jobs drops
- 11:46 - CMOC is paged
- 11:57 - posted to status.io - imoc/cmoc on
- 11:58 - update on GCP status page: https://status.cloud.google.com/?_ga=2.259914314.-1916113232.1529097922 Cloud Datastore and Networking are experiencing issues.
- 12:51 - GCP update that they have identified a potential cause and are rolling out mitigation.
- 13:45 - GCP update that mitigations have been put in place, though they continue to work on the issue: https://status.cloud.google.com/incident/storage/19009
- 13:50 - Registry request rates and error rate for the last 20 minutes are looking healthier.
potentially related to: #1348 (closed)
Relates to
- production-engineering #83991
Activity
- Michal Wasilewski added incident severity2 labels
- Michal Wasilewski mentioned in issue on-call-handovers#57 (closed)
- Michal Wasilewski changed the description
- Dave Smith changed the description
- Dave Smith changed title from 2019-11-11 CI runners errors to 2019-11-11 CI runners errors and high latency
I see that the issues with the shared runners occur again and again.
- Maintainer
This incident issue does not have any service attribution. Please add one or more of the appropriate service labels that are prefixed with Service:. Thanks for your help!
You are welcome to help improve this comment.
- Owner
CI Service graphs:
- Dave Smith assigned to @mwasilewski-gitlab
- Dave Smith changed the description
Registry is not working either.
I'm also having issues with registry.
- Developer
- Owner
Yes - registry does appear to be affected by these issues with GCP too:
Edited by Dave Smith
- Owner
These issues appear to be related to this incident in GCP: https://status.cloud.google.com/incident/cloud-datastore/19006
Google has published that the issue was solved, but the registry still doesn't seem to work.
- Owner
@lucasayb97 - we still see the dashboard https://status.cloud.google.com/ - showing GCP cloud storage as having an incident which would be affecting registry. Sorry for the problems, we are awaiting the next update from them.
- Owner
https://status.cloud.google.com/incident/cloud-datastore/19006 should be updated at 05:30 PT. We are watching it, but it still shows as red. It does look, at a very early stage, like things are just starting to get better.
The issue was apparently solved by Google, but the registry still has problems.
- Owner
It does look like things are beginning to recover. We are still tracking the other related issue in GCP on https://status.cloud.google.com/incident/storage/19009.
Still not able to publish an image in registry.
Can't pull either ^^
- Owner
Thanks for the heads up - we'll look into this.
Now, for me it's working correctly!
- Dave Smith changed title from 2019-11-11 CI runners errors and high latency to 2019-11-11 CI runners and Registry errors and high latency
- Dave Smith changed the description
- Dave Smith added 1 deleted label
- Andrew Newdigate added VendorGoogle label
- Owner
Confounding factor note: some team members were having issues loading dashboards due to the underlying issues. We should again note this in our IncidentReview and come up with plans to mitigate.
- Dave Smith changed the description
- Dave Smith marked this issue as related to #1348 (closed)
- Dave Smith mentioned in issue #1348 (closed)
- Michal Wasilewski mentioned in issue on-call-handovers#58 (closed)
- Dave Smith changed the description
- Owner
- Maintainer
I think this issue has ended. Added a link to the RCA issue.
- Cameron McFarland closed
- ops-gitlab-net mentioned in issue #1355 (closed)
- cznic mentioned in issue cznic/xc#2 (closed)
- Marin Jankovski added ServiceCI Runners label and removed 1 deleted label
- Maintainer
This incident was closed before the IncidentResolved label was applied, or after a review was started but not completed. The issue is being reopened so that it will appear on the Production Incidents board. Please mark it as either IncidentResolved or IncidentReview-Completed before closing.
Please review the Incident Workflow section on the Incident Management handbook page for more information.
/open
- 🤖 GitLab Bot 🤖 added auto updated label