
Weekly Reliability (SRE) Team Newsletter – On-call Period: 2021-01-12 - 2021-01-19

Announcements

  • GCP Account Team Q&A - On Wed, Jan 20, we have two open sessions with our GCP account team. Everyone is welcome to join to hear more and to ask questions about anything GCP-related. Attend whichever session works better for you, or catch up on the recording afterwards.
  • Kubernetes firedrills continue. Please watch for the incoming calendar invites.
  • GitLab Learn - https://gitlab.edcast.com/ - see #company-fyi https://gitlab.slack.com/archives/C010XFJFTHN/p1610380760108800
  • Stage Group Dashboards are up! Thanks, ~"team::Scalability" - https://dashboards.gitlab.net/dashboards/f/stage-groups/stage-groups
  • EOC/IMOC: please review the Change Management and Incident Management handbook pages before the start of each shift. If you have questions, bring them up with your manager.

Engineering Week in Review Highlights:

Team Updates

Core Infrastructure

Datastores

Observability

  • The Jaeger Readiness Review is blocked, awaiting implementation of a restricted IAP Google Group configuration that scopes permissions to the Engineering department.
  • The team is investigating the urgency of a Redis 6 upgrade.
  • Thanos on Kubernetes readiness review.

On-Call During This Period

| Schedule | Username |
| --- | --- |
| SRE 8-hour Americas | Alejandro Rodriguez |
| SRE 8-hour Americas | Cindy Pallares |
| SRE 8-hour Americas | Cameron McFarland |
| SRE 8-hour APAC | Craig Barrett |
| SRE 8-hour EMEA | Craig Furman |

PagerDuty Incidents

* Number of incidents: **58**
Created Summary
2021-01-12T00:27:51Z [34419] Firing 1 - Some repositories are in read-only mode.
2021-01-12T01:20:19Z [34420] Firing 1 - GitLab Job has failed
2021-01-12T02:23:46Z [34422] Please see incident declaration in Slack channel: https://slack.com/app_redirect?channel=CB7P5CJS1&team=T02592416
2021-01-12T03:37:44Z [34424] Firing 1 - Blackbox probes for https://dev.gitlab.org are failing.
2021-01-12T13:34:50Z [34434] Firing 1 - Increased HAProxy Backend Connection Errors
2021-01-12T13:34:50Z [34435] Firing 1 - Increased Server Connection Errors
2021-01-12T13:43:35Z [34437] Firing 1 - Increased Server Connection Errors
2021-01-12T13:43:36Z [34438] Firing 1 - Increased HAProxy Backend Connection Errors
2021-01-12T15:16:50Z [34440] Firing 1 - Increased Server Connection Errors
2021-01-12T15:16:51Z [34441] Firing 1 - Increased HAProxy Backend Connection Errors
2021-01-12T15:28:14Z [34443] Firing 1 - Blackbox probes for https://about.gitlab.com/handbook/values/ are failing.
2021-01-12T21:01:50Z [34456] Firing 5 - IncreasedServerConnectionErrors
2021-01-12T21:01:51Z [34457] Firing 5 - IncreasedBackendConnectionErrors
2021-01-12T21:02:20Z [34458] Firing 5 - IncreasedServerResponseErrors
2021-01-12T21:09:59Z [34461] Firing 3 - BlackboxProbeFailures
2021-01-12T21:12:20Z [34463] Firing 5 - IncreasedServerResponseErrors
2021-01-12T21:17:20Z [34465] Firing 2 - IncreasedServerConnectionErrors
2021-01-12T21:22:21Z [34467] Firing 1 - Increased Server Response Errors
2021-01-12T22:12:32Z [34470] Firing 1 - The Disk Space Utilization per Device per Node resource of the ci-runners service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
2021-01-13T07:02:32Z [34478] Firing 1 - The Disk Space Utilization per Device per Node resource of the patroni service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
2021-01-14T00:09:59Z [34504] Firing 1 - Blackbox probes for https://version.gitlab.com are failing.
2021-01-14T00:50:32Z [34505] Firing 1 - GitLab Job has failed
2021-01-14T01:40:17Z [34507] Firing 1 - GitLab Job has failed
2021-01-14T01:47:52Z [34508] Firing 1 - thanos is restarting frequently
2021-01-14T01:54:28Z [34509] Firing 1 - Chef client failures have reached critical levels
2021-01-14T11:32:35Z [34521] Firing 1 - Increased HAProxy Backend Connection Errors
2021-01-14T11:32:36Z [34522] Firing 1 - Increased Server Connection Errors
2021-01-14T13:16:50Z [34524] Firing 2 - IncreasedErrorRateOtherBackends
2021-01-14T15:35:32Z [34526] Firing 1 - The Disk Space Utilization per Device per Node resource of the postgres-archive service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
2021-01-14T15:54:32Z [34527] Firing 1 - The Disk Space Utilization per Device per Node resource of the postgres-archive service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
2021-01-14T20:27:52Z [34529] Firing 1 - Some repositories are in read-only mode.
2021-01-14T22:03:32Z [34534] Firing 1 - The Disk Space Utilization per Device per Node resource of the postgres-delayed service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
2021-01-14T22:43:20Z [34536] Firing 1 - Increased Server Connection Errors
2021-01-15T00:21:32Z [34542] Firing 1 - The Disk Space Utilization per Device per Node resource of the nfs service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
2021-01-15T01:16:13Z [34545] Firing 1 - Chef client failures have reached critical levels
2021-01-15T03:28:59Z [34548] Firing 1 - Blackbox probes for https://registry.ops.gitlab.net are failing.
2021-01-15T12:36:45Z [34551] Firing 1 - Blackbox probes for https://about.gitlab.com/handbook/engineering/projects/ are failing.
2021-01-15T12:51:45Z [34552] Firing 1 -
2021-01-15T15:58:33Z [34555] Firing 1 - The Disk Space Utilization per Device per Node resource of the postgres-archive service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
2021-01-15T18:20:05Z [34557] Firing 1 - Increased Error Rate Across Fleet
2021-01-15T18:22:15Z [34558] Firing 2 - BlackboxProbeFailures
2021-01-15T21:51:46Z [34570] Firing 1 - The grafana SLI of the monitoring service (main stage) has an error rate violating SLO
2021-01-15T22:11:32Z [34572] Firing 1 - The Disk Space Utilization per Device per Node resource of the postgres-delayed service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
2021-01-16T00:11:14Z [34585] Firing 1 - Blackbox probes for https://gitlab.com/explore/projects/starred are failing.
2021-01-16T00:46:45Z [34587] Firing 1 - Blackbox probes for https://pre.gitlab.com are failing.
2021-01-16T23:39:43Z [34609] Pingdom check check:https://gitlab.com/api/v4/projects/13083 is down
2021-01-16T23:40:05Z [34610] Firing 4 - IncreasedErrorRateOtherBackends
2021-01-16T23:40:08Z [34611] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down
2021-01-16T23:40:19Z [34612] Firing 1 - High Error Rate on Front End Web
2021-01-16T23:40:30Z [34613] Firing 12 - BlackboxProbeFailures
2021-01-16T23:42:05Z [34615] Firing 7 - IncreasedServerResponseErrors
2021-01-16T23:42:41Z [34616] Firing 2 - PostgreSQL_ServiceDown
2021-01-16T23:43:05Z [34617] Firing 2 - IncreasedBackendConnectionErrors
2021-01-16T23:43:05Z [34618] Firing 2 - IncreasedServerConnectionErrors
2021-01-17T15:03:50Z [34670] Firing 4 - IncreasedServerResponseErrors
2021-01-17T16:34:32Z [34673] Firing 1 - The Disk Space Utilization per Device per Node resource of the security service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
2021-01-18T00:25:46Z [34679] Need approval to start C2 rate-limiting change
2021-01-18T03:27:32Z [34684] Firing 1 - The Disk Space Utilization per Device per Node resource of the gitlab service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.

7 Day Issue Stats

  • Oncall issues: 3
  • Access Request: 0
  • Change Issues: 1
  • Incident Issues: 41
  • CorrectiveAction Issues: 1

Change Issues

Incident Issues

CorrectiveAction Issues

Open Issue Stats

Open Change Issues

Created Summary
2021-01-11T16:50:32Z Enable automated database reindexing (on workdays, once a day)

Open Incident Issues

Created Summary

Open Oncall Issues

Created Summary
2021-01-14T19:08:09Z Project Import Request - skylight-tools: salesforce-integrations
2021-01-14T19:07:13Z Project Import Request - skylight-tools: renohub
2021-01-14T19:04:10Z Project Import Request - skylight-tools: cas
2020-12-31T15:49:42Z One-Time Export for MakeMeReach
2020-12-18T22:29:14Z CI clones fail for repositories with a path ending in a period
2020-09-14T18:52:09Z PS Congregate VM for GitHost to GitLab.com Migration - Afilias
2020-09-02T13:47:51Z disable-chef-client isn't preserved over reboots
2020-08-11T16:39:37Z Investigate slow child pipeline triggering on pre.gitlab.com
2020-07-28T18:19:35Z PS Congregate VM for BitBucket Server to GitLab.com Migration
2020-07-28T17:43:40Z Project Import Request - ciorg/bridge/am-child-pool/api
2020-03-30T13:38:11Z jobs.gitlab.com cert expired unnoticed on 2020-03-28
Edited by Alberto Ramos