Reliability Team Staff Report for period: 2020-06-16 - 2020-06-23
Director's Notes
Team Updates
Core Infrastructure
Datastores
Observability
This week we're continuing to make strides migrating o11y services from VMs to k8s:
- pubsubbeat: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/8962
- Alertmanager: &234 (closed)
- Grafana (prep): https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/7788
We've recently become the custodians of the pubsubbeat open-source project. After we forked it to GitLab, Google archived their GitHub project, so we're now the maintainers!
Additionally, we're continuing to focus on https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10095 to determine ways to diversify our stack and how best to use SaaS and self-managed components in our pipeline.
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Dave Smith |
SRE 8 Hour | Devin Sylva |
SRE 8 Hour | Craig Barrett |
SRE 8 Hour | Henri Philipps |
SRE 8 Hour | Nels Nelson |
PagerDuty Incidents
* Number of incidents: **58**
Created | Summary |
---|---|
2020-06-16T14:53:45Z | [21941] Firing 1 - monitor.gitlab.net is down |
2020-06-16T20:13:07Z | [21943] Firing 1 - Increased Server Response Errors |
2020-06-16T21:57:54Z | [21944] Firing 1 - Large amount of Sidekiq Queued jobs |
2020-06-16T23:44:54Z | [21946] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down |
2020-06-17T05:58:09Z | [21947] Firing 1 - Large amount of Sidekiq Queued jobs |
2020-06-17T08:50:08Z | [21949] Firing 1 - Redis cluster gitlab is missing instances |
2020-06-17T08:50:08Z | [21950] Firing 1 - Last successful walg basebackup was seen 11m 17s ago. |
2020-06-17T15:21:40Z | [21951] Firing 1 - Gitaly is down on file-cny-01-stor-gprd.c.gitlab-production.internal |
2020-06-17T16:51:59Z | [21953] Firing 1 - Redis cluster gitlab is missing instances |
2020-06-17T16:51:59Z | [21954] Firing 1 - Last successful walg basebackup was seen 11m 25s ago. |
2020-06-17T18:38:04Z | [21955] Firing 1 - Gitaly error rate is too high: 7.11 |
2020-06-17T19:26:07Z | [21956] Firing 1 - Gitaly error rate is too high: 7.58 |
2020-06-17T23:37:43Z | [21958] Firing 1 - Chef client failures have reached critical levels |
2020-06-18T11:12:50Z | [21963] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-06-18T13:31:52Z | [21964] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-06-18T13:42:21Z | [21965] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-06-18T13:42:35Z | [21966] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-06-18T13:47:14Z | [21967] Firing 1 - Last WALE backup was seen 20m 8s ago. |
2020-06-18T13:51:50Z | [21968] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-06-18T14:13:45Z | [21969] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2020-06-18T14:17:50Z | [21970] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-06-18T14:33:45Z | [21971] Firing 1 - Failed to collect Redis metrics Check the status of redis on redis-sidekiq-03-db-gprd.c.gitlab-production.internal:9121 with gitlab-ctl status . |
2020-06-18T17:19:32Z | [21972] Firing 1 - The Disk Utilization per Device per Node resource of the gitaly service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2020-06-18T22:00:33Z | [21974] Multiple full backups are created daily by WAL-E (should be once per day) |
2020-06-19T00:56:59Z | [21975] Firing 1 - Redis cluster gitlab is missing instances |
2020-06-19T08:58:22Z | [21977] Firing 1 - Redis cluster gitlab is missing instances |
2020-06-19T09:20:02Z | [21978] Firing 1 - The Disk Utilization per Device per Node resource of the gitaly service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2020-06-19T16:58:41Z | [21981] Firing 1 - Redis cluster gitlab is missing instances |
2020-06-19T17:20:17Z | [21982] Firing 1 - The Disk Utilization per Device per Node resource of the gitaly service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2020-06-19T19:40:28Z | [21983] Firing 1 - Chef client failures have reached critical levels |
2020-06-19T20:09:28Z | [21984] Firing 1 - Chef client failures have reached critical levels |
2020-06-20T01:00:09Z | [21986] Firing 1 - Redis cluster gitlab is missing instances |
2020-06-20T01:00:09Z | [21987] Firing 1 - Last successful walg basebackup was seen 12m 21s ago. |
2020-06-20T09:02:01Z | [21989] Firing 1 - Last successful walg basebackup was seen 12m 30s ago. |
2020-06-20T09:02:01Z | [21990] Firing 1 - Redis cluster gitlab is missing instances |
2020-06-20T16:07:40Z | [21995] Firing 1 - Postgres transactions showing high rate of statement timeouts |
2020-06-20T16:16:54Z | [21996] Firing 1 - Large amount of Sidekiq Queued jobs |
2020-06-20T17:03:22Z | [21998] Firing 1 - Redis cluster gitlab is missing instances |
2020-06-20T17:03:22Z | [21997] Firing 1 - Last successful walg basebackup was seen 12m 38s ago. |
2020-06-21T01:03:41Z | [22001] Firing 1 - Redis cluster gitlab is missing instances |
2020-06-21T01:03:41Z | [22000] Firing 1 - Last successful walg basebackup was seen 12m 46s ago. |
2020-06-21T09:05:09Z | [22002] Firing 1 - Last successful walg basebackup was seen 12m 54s ago. |
2020-06-21T09:05:09Z | [22003] Firing 1 - Redis cluster gitlab is missing instances |
2020-06-21T17:07:00Z | [22005] Firing 1 - Last successful walg basebackup was seen 13m 2s ago. |
2020-06-21T20:43:50Z | [22006] Firing 2 - IncreasedServerResponseErrors |
2020-06-21T20:59:05Z | [22007] Firing 2 - IncreasedServerResponseErrors |
2020-06-21T21:03:20Z | [22008] Firing 2 - IncreasedServerResponseErrors |
2020-06-21T21:09:05Z | [22009] Firing 1 - Increased Server Response Errors |
2020-06-21T22:57:24Z | [22010] Firing 1 - Large amount of Sidekiq Queued jobs |
2020-06-22T01:17:50Z | [22011] Firing 3 - IncreasedServerResponseErrors |
2020-06-22T02:49:22Z | [22012] Firing 1 - Increased Server Response Errors |
2020-06-22T03:14:50Z | [22013] Firing 1 - Increased Server Response Errors |
2020-06-22T03:19:50Z | [22014] Firing 1 - Increased Server Response Errors |
2020-06-22T03:24:50Z | [22015] Firing 3 - IncreasedServerResponseErrors |
2020-06-22T06:57:39Z | [22016] Firing 1 - Large amount of Sidekiq Queued jobs |
2020-06-22T14:57:54Z | [22018] Firing 1 - Large amount of Sidekiq Queued jobs |
2020-06-22T15:37:50Z | [22020] Firing 1 - Increased Server Response Errors |
2020-06-23T01:11:59Z | [22021] Firing 1 - Last successful walg basebackup was seen 13m 34s ago. |
7 Day Issue Stats
- Oncall issues : 2
- Access Request : 0
- Change Issues : 3
- Incident Issues : 19
- CorrectiveAction Issues : 0
Change Issues
- 2020-06-20T10:20:51Z - Repository migration on gitlab.com (nfs-file42) - glopezfernandez
- 2020-06-20T10:15:26Z - Repository migration on gitlab.com (nfs-file45) - glopezfernandez
- 2020-06-20T10:12:29Z - Repository migration on gitlab.com - unassigned
Incident Issues
- 2020-06-22T19:23:59Z - Elasticsearch cluster not responding on 2020-06-22 around 17:00 UTC - unassigned | ~S3 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2316
- 2020-06-22T16:01:07Z - 2020-06-22 - Increased Server Response Errors - nnelson | ~S4 | ServicePages |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2313
- 2020-06-22T10:58:44Z - 2020-06-22: Quality Engineering experiencing SSL Cert issues with qa-tunnel.gitlab.info SSL cert - hphilipps | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2310
- 2020-06-22T09:06:33Z - Increased Sidekiq mailers error rate - craigf | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2309
- 2020-06-22T03:27:15Z - Intermittent error spikes for pages backend - craig | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2308
- 2020-06-21T20:48:32Z - 2020-06-21 - Increased server response errors - nnelson | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2306
- 2020-06-20T16:20:33Z - 2020-06-20 - Postgres transactions showing high rate of statement timeouts - nnelson | ~S3 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2305
- 2020-06-20T13:35:40Z - Elevated error rate for web service - hphilipps | ~S4 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2304
- 2020-06-19T19:46:33Z - 2020-06-19 - Chef client failures have reached critical levels - nnelson | ~S4 | ServiceInfrastructure |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2300
- 2020-06-18T21:55:14Z - 2020-06-18 - Multiple full backups are created daily by WAL-E (should be once per day) - nnelson | ~S3 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2298
- 2020-06-18T14:40:59Z - 2020-06-18: Redis down on redis-sidekiq-03 - hphilipps | ~S3 | ServiceRedis |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2296
- 2020-06-17T18:41:48Z - 2020-06-17 - Gitaly error rate is too high: 7.11 (for file-praefect-02-stor-gprd.c.gitlab-production.internal) - nnelson | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2294
- 2020-06-17T16:59:01Z - 2020-06-17 - [Note: ops environment only] Redis cluster gitlab is missing instances - nnelson | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2293
- 2020-06-17T11:11:22Z - 2020-06-17: errors on /api/v4/internal/allowed - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2291
- 2020-06-17T10:08:04Z - 2020-06-17: Cannot create new sessions on GitLab.com - unassigned | ~S3 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2290
- 2020-06-17T00:00:00Z - 2020-06-17: Elevated error rates for praefect - craig | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2289
- 2020-06-16T22:01:36Z - 2020-06-16 - Large amount of Sidekiq Queued jobs - nnelson | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2288
- 2020-06-16T15:42:20Z - 2020-06-16: Python dependency not installed in our CNG images prevents some use of RST file rendering - stanhu | ~S3 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2285
- 2020-06-16T15:25:34Z - 2020-06-16: Unable to deploy to production due missing configuration - marin | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2284
CorrectiveAction Issues
- 2020-06-22T11:23:28Z - Investigate number of concurrent connections made by some of our services to the same external addresses - unassigned
- 2020-06-17T21:27:47Z - Remove storage shards from deploy nodes' Praefect configuration - unassigned
- 2020-06-17T13:38:24Z - Review cache TTLs (and global default TTL) - unassigned
- 2020-06-16T16:40:37Z - "ci_jwt_signing_key" needs to be configured in all GitLab environments - skarbek
Open Issue Stats
- Oncall issues : 5
- Change issues : 9
- Incident issues : 16
- Access Request : 4
- CorrectiveAction : 97
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2020-06-20T10:20:51Z | glopezfernandez | Repository migration on gitlab.com (nfs-file42) |
2020-06-20T10:15:26Z | glopezfernandez | Repository migration on gitlab.com (nfs-file45) |
2020-06-20T10:12:29Z | unassigned | Repository migration on gitlab.com |
2020-06-12T09:25:28Z | unassigned | Repository migration on gitlab.com (nfs-file09) |
2020-06-12T09:25:17Z | unassigned | Repository migration on gitlab.com (nfs-file08) |
2020-06-12T09:25:06Z | unassigned | Repository migration on gitlab.com (nfs-file07) |
2020-06-11T20:51:56Z | unassigned | Repository migration on gitlab.com (nfs-file03) |
2020-06-08T22:05:46Z | nnelson | Migrate large projects off file-42-stor-gprd to file-02-stor-gprd |
2020-03-26T19:16:25Z | alejandro | Rotate credentials for user gitlab-superuser |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-06-22T19:23:59Z | unassigned | Elasticsearch cluster not responding on 2020-06-22 around 17:00 UTC |
2020-06-22T09:06:33Z | craigf | Increased Sidekiq mailers error rate |
2020-06-22T03:27:15Z | craig | Intermittent error spikes for pages backend |
2020-06-18T21:55:14Z | nnelson | 2020-06-18 - Multiple full backups are created daily by WAL-E (should be once per day) |
2020-06-18T14:40:59Z | hphilipps | 2020-06-18: Redis down on redis-sidekiq-03 |
2020-06-17T10:08:04Z | unassigned | 2020-06-17: Cannot create new sessions on GitLab.com |
2020-06-16T15:42:20Z | stanhu | 2020-06-16: Python dependency not installed in our CNG images prevents some use of RST file rendering |
2020-06-16T15:25:34Z | marin | 2020-06-16: Unable to deploy to production due missing configuration |
2020-06-10T13:53:13Z | ahanselka | 2020-06-10: Elevated web latency |
2020-06-09T11:27:23Z | nolith | 2020-06-09: post-deployment migration failure |
2020-06-08T04:08:25Z | unassigned | 2020-06-08: High rate of canary errors: DDoS |
2020-06-05T13:39:49Z | unassigned | 2020-06-05: increased error rates on the web service |
2020-06-05T07:58:05Z | unassigned | 2020-06-05: surge in authorized_project_update jobs is saturating catchall workers |
2020-06-04T03:17:59Z | cmiskell | 2020-06-04 Large load spike on API fleet causing response degradation |
2020-05-29T09:07:54Z | nolith | 2020-05-29: HTTP 401s on various components of the GitLab UI |
2020-05-29T05:21:12Z | ggillies | 2020-05-29: gitlab.com is down |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-06-19T13:23:10Z | hphilipps | Write runbook for Project Export |
2020-06-10T19:02:48Z | unassigned | Import request (for alex-solutions/core): alex-app |
2020-05-25T05:05:45Z | albertoramos | Archived repository missing |
2020-03-30T13:38:11Z | brentnewton | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
2019-10-23T13:05:14Z | cmcfarland | cleanup registered nodes in chef |
This issue was automatically generated using oncall-robot-assistant
Edited by AnthonySandoval