Weekly Reliability (SRE) Team Newsletter – On-call Period: 2021-01-12 - 2021-01-19
Announcements
- GCP Account Team Q&A - On Wed, Jan 20th, we have two open sessions with our GCP account team. You're all welcome to join to hear more and to ask questions about anything GCP related. Feel free to attend whichever session works better for you, or catch up on the recording afterwards.
- Kubernetes Firedrills Continue. Please take notice of the incoming Calendar invites.
- GitLab Learn - https://gitlab.edcast.com/ - see #company-fyi https://gitlab.slack.com/archives/C010XFJFTHN/p1610380760108800
- Stage Group Dashboards are up! Thanks, ~"team::Scalability" - https://dashboards.gitlab.net/dashboards/f/stage-groups/stage-groups
- EOC/IMOC: please review the Change Management and Incident Management handbook pages before the start of each shift. If you have questions, please bring them up with your manager.
Engineering Week in Review Highlights:
Team Updates
Core Infrastructure
Datastores
- We continue to clean up inefficient queries that load our DB, such as the Select 1 query. The DB has been doing better since early January, although there are still CPU spikes that @Finotto continues to investigate.
- We worked with @craig to extend our DB filesystem in production to 16 TB (thanks @craig!). The GCP snapshot we took on the leader node forced a Patroni failover, so we should avoid snapshotting a running PG leader as much as possible.
- 3rd Geo Promotion Test in staging, happening tomorrow at 3pm. @hphilipps continues to lead this effort.
- We are working hard with Security and Compliance to validate our DB Benchmarking Environment as another production-like environment for GitLab in GCP.
- We are also focusing on completing the first phase of our "Move Free-inactive Gitaly projects to HDD" project.
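
The snapshot lesson in the list above (avoid snapshotting a running PG leader) could be backed by a small pre-check before any snapshot is taken. The following is a minimal sketch under stated assumptions, not the team's actual tooling: it assumes each node's role has already been looked up by some earlier step, that roles are labelled "leader" and "replica", and the function name `choose_snapshot_node` and the node names are hypothetical.

```python
from typing import Mapping, Optional


def choose_snapshot_node(roles: Mapping[str, str]) -> Optional[str]:
    """Return a node that is safe to snapshot, or None if no replica exists.

    `roles` maps a node name to its role ("leader" or "replica").
    How the roles are fetched is out of scope (assumption: an earlier
    step queried the cluster and built this mapping).
    """
    replicas = sorted(name for name, role in roles.items() if role == "replica")
    # Never fall back to the leader: snapshotting it is what forced the failover.
    return replicas[0] if replicas else None


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    cluster = {
        "patroni-01": "leader",
        "patroni-02": "replica",
        "patroni-03": "replica",
    }
    print(choose_snapshot_node(cluster))
```

Returning `None` instead of the leader makes the caller decide explicitly what to do when no replica is available, rather than silently repeating the risky action.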
Observability
- Jaeger Readiness Review is blocked awaiting implementation of restricted IAP Google Group configuration to scope permissions to the Engineering department.
- The team is investigating the urgency of a Redis 6 upgrade.
- Thanos on Kubernetes readiness review.
On-Call During This Period
Schedule | Username |
---|---|
SRE 8-hour Americas | Alejandro Rodriguez |
SRE 8-hour Americas | Cindy Pallares |
SRE 8-hour Americas | Cameron McFarland |
SRE 8-hour APAC | Craig Barrett |
SRE 8-hour EMEA | Craig Furman |
PagerDuty Incidents
* Number of incidents: **58**
Created | Summary |
---|---|
2021-01-12T00:27:51Z | [34419] Firing 1 - Some repositories are in read-only mode. |
2021-01-12T01:20:19Z | [34420] Firing 1 - GitLab Job has failed |
2021-01-12T02:23:46Z | [34422] Please see incident declaration in Slack channel: https://slack.com/app_redirect?channel=CB7P5CJS1&team=T02592416 |
2021-01-12T03:37:44Z | [34424] Firing 1 - Blackbox probes for https://dev.gitlab.org are failing. |
2021-01-12T13:34:50Z | [34434] Firing 1 - Increased HAProxy Backend Connection Errors |
2021-01-12T13:34:50Z | [34435] Firing 1 - Increased Server Connection Errors |
2021-01-12T13:43:35Z | [34437] Firing 1 - Increased Server Connection Errors |
2021-01-12T13:43:36Z | [34438] Firing 1 - Increased HAProxy Backend Connection Errors |
2021-01-12T15:16:50Z | [34440] Firing 1 - Increased Server Connection Errors |
2021-01-12T15:16:51Z | [34441] Firing 1 - Increased HAProxy Backend Connection Errors |
2021-01-12T15:28:14Z | [34443] Firing 1 - Blackbox probes for https://about.gitlab.com/handbook/values/ are failing. |
2021-01-12T21:01:50Z | [34456] Firing 5 - IncreasedServerConnectionErrors |
2021-01-12T21:01:51Z | [34457] Firing 5 - IncreasedBackendConnectionErrors |
2021-01-12T21:02:20Z | [34458] Firing 5 - IncreasedServerResponseErrors |
2021-01-12T21:09:59Z | [34461] Firing 3 - BlackboxProbeFailures |
2021-01-12T21:12:20Z | [34463] Firing 5 - IncreasedServerResponseErrors |
2021-01-12T21:17:20Z | [34465] Firing 2 - IncreasedServerConnectionErrors |
2021-01-12T21:22:21Z | [34467] Firing 1 - Increased Server Response Errors |
2021-01-12T22:12:32Z | [34470] Firing 1 - The Disk Space Utilization per Device per Node resource of the ci-runners service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2021-01-13T07:02:32Z | [34478] Firing 1 - The Disk Space Utilization per Device per Node resource of the patroni service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2021-01-14T00:09:59Z | [34504] Firing 1 - Blackbox probes for https://version.gitlab.com are failing. |
2021-01-14T00:50:32Z | [34505] Firing 1 - GitLab Job has failed |
2021-01-14T01:40:17Z | [34507] Firing 1 - GitLab Job has failed |
2021-01-14T01:47:52Z | [34508] Firing 1 - thanos is restarting frequently |
2021-01-14T01:54:28Z | [34509] Firing 1 - Chef client failures have reached critical levels |
2021-01-14T11:32:35Z | [34521] Firing 1 - Increased HAProxy Backend Connection Errors |
2021-01-14T11:32:36Z | [34522] Firing 1 - Increased Server Connection Errors |
2021-01-14T13:16:50Z | [34524] Firing 2 - IncreasedErrorRateOtherBackends |
2021-01-14T15:35:32Z | [34526] Firing 1 - The Disk Space Utilization per Device per Node resource of the postgres-archive service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2021-01-14T15:54:32Z | [34527] Firing 1 - The Disk Space Utilization per Device per Node resource of the postgres-archive service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2021-01-14T20:27:52Z | [34529] Firing 1 - Some repositories are in read-only mode. |
2021-01-14T22:03:32Z | [34534] Firing 1 - The Disk Space Utilization per Device per Node resource of the postgres-delayed service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2021-01-14T22:43:20Z | [34536] Firing 1 - Increased Server Connection Errors |
2021-01-15T00:21:32Z | [34542] Firing 1 - The Disk Space Utilization per Device per Node resource of the nfs service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2021-01-15T01:16:13Z | [34545] Firing 1 - Chef client failures have reached critical levels |
2021-01-15T03:28:59Z | [34548] Firing 1 - Blackbox probes for https://registry.ops.gitlab.net are failing. |
2021-01-15T12:36:45Z | [34551] Firing 1 - Blackbox probes for https://about.gitlab.com/handbook/engineering/projects/ are failing. |
2021-01-15T12:51:45Z | [34552] Firing 1 - |
2021-01-15T15:58:33Z | [34555] Firing 1 - The Disk Space Utilization per Device per Node resource of the postgres-archive service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2021-01-15T18:20:05Z | [34557] Firing 1 - Increased Error Rate Across Fleet |
2021-01-15T18:22:15Z | [34558] Firing 2 - BlackboxProbeFailures |
2021-01-15T21:51:46Z | [34570] Firing 1 - The grafana SLI of the monitoring service (main stage) has an error rate violating SLO |
2021-01-15T22:11:32Z | [34572] Firing 1 - The Disk Space Utilization per Device per Node resource of the postgres-delayed service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2021-01-16T00:11:14Z | [34585] Firing 1 - Blackbox probes for https://gitlab.com/explore/projects/starred are failing. |
2021-01-16T00:46:45Z | [34587] Firing 1 - Blackbox probes for https://pre.gitlab.com are failing. |
2021-01-16T23:39:43Z | [34609] Pingdom check check:https://gitlab.com/api/v4/projects/13083 is down |
2021-01-16T23:40:05Z | [34610] Firing 4 - IncreasedErrorRateOtherBackends |
2021-01-16T23:40:08Z | [34611] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down |
2021-01-16T23:40:19Z | [34612] Firing 1 - High Error Rate on Front End Web |
2021-01-16T23:40:30Z | [34613] Firing 12 - BlackboxProbeFailures |
2021-01-16T23:42:05Z | [34615] Firing 7 - IncreasedServerResponseErrors |
2021-01-16T23:42:41Z | [34616] Firing 2 - PostgreSQL_ServiceDown |
2021-01-16T23:43:05Z | [34617] Firing 2 - IncreasedBackendConnectionErrors |
2021-01-16T23:43:05Z | [34618] Firing 2 - IncreasedServerConnectionErrors |
2021-01-17T15:03:50Z | [34670] Firing 4 - IncreasedServerResponseErrors |
2021-01-17T16:34:32Z | [34673] Firing 1 - The Disk Space Utilization per Device per Node resource of the security service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2021-01-18T00:25:46Z | [34679] Need approval to start C2 rate-limiting change |
2021-01-18T03:27:32Z | [34684] Firing 1 - The Disk Space Utilization per Device per Node resource of the gitlab service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
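
To cross-check the incident count above, or to see how pages were spread across the week, the table rows can be tallied by date. This is a minimal sketch, assuming the rows are available as plain text in the `Created | Summary |` layout used above; the function name `incidents_per_day` is hypothetical, and only three sample rows from the table are included.

```python
from collections import Counter
from typing import Iterable


def incidents_per_day(rows: Iterable[str]) -> Counter:
    """Count table rows per calendar day (UTC), keyed by YYYY-MM-DD."""
    counts: Counter = Counter()
    for row in rows:
        created = row.split("|", 1)[0].strip()
        # Skip header and separator lines: only ISO-8601 timestamps
        # have a "T" between the date and the time (index 10).
        if len(created) >= 11 and created[10] == "T":
            counts[created[:10]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Created | Summary |",
        "---|---|",
        "2021-01-12T00:27:51Z | [34419] Firing 1 - Some repositories are in read-only mode. |",
        "2021-01-12T01:20:19Z | [34420] Firing 1 - GitLab Job has failed |",
        "2021-01-14T00:09:59Z | [34504] Firing 1 - Blackbox probes for https://version.gitlab.com are failing. |",
    ]
    for day, n in sorted(incidents_per_day(sample).items()):
        print(day, n)
```

Summing the per-day counts over the full table should match the reported number of incidents.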
7 Day Issue Stats
- Oncall issues : 3
- Access Request : 0
- Change Issues : 1
- Incident Issues : 41
- CorrectiveAction Issues : 1
Change Issues
- 2021-01-13T13:18:06Z - 2nd Geo Promotion Test in staging
Incident Issues
- 2021-01-18T20:15:20Z - Prometheus has slow rule evaluations | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3357
- 2021-01-18T20:15:19Z - Prometheus has slow rule evaluations | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3356
- 2021-01-18T16:51:51Z - Prometheus has slow rule evaluations | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3354
- 2021-01-18T16:51:49Z - Prometheus has slow rule evaluations | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3353
- 2021-01-18T16:41:10Z - Prometheus has no targets | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3351
- 2021-01-18T16:41:07Z - Prometheus has no targets | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3350
- 2021-01-18T16:21:38Z - SnitchHeartBeat | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3349
- 2021-01-18T16:21:38Z - SnitchHeartBeat | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3348
- 2021-01-18T16:21:38Z - SnitchHeartBeat | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3347
- 2021-01-18T15:01:36Z - 2021-01-18 CI_PRE_CLONE_SCRIPT is failing | reliability~3760141 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3346
- 2021-01-18T13:56:06Z - 2021-01-18: Latency spike in frontend / sshServices | reliability~3760141 | ServiceGit |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3345
- 2021-01-18T13:48:32Z - 2021-01-18: sidekiq urgent-other shard has elevated queue apdex | reliability~3760142 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3344
- 2021-01-17T16:40:44Z - 2021-01-17: Ops security scan node out of disk space | reliability~3760142 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3342
- 2021-01-17T15:08:41Z - 2021-01-17: Increase in frontened errors to shell-gke in us-east1-d | reliability~3760141 | ServiceGit |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3341
- 2021-01-17T09:12:03Z - 2021-01-17: Degraded latency on web canary | reliability~3760142 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3340
- 2021-01-17T06:21:04Z - 2021-01-17: file-44 goserver violating apdex SLO | reliability~3760142 | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3339
- 2021-01-16T23:56:55Z - 2021-01-17 - database failover triggered by GCP snapshots | reliability~3760140 | ServicePatroni |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3338
- 2021-01-15T22:01:15Z - The grafana SLI of the monitoring service (main stage) has an error rate violating SLO | reliability~3760142 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3337
- 2021-01-15T21:24:32Z - 2021-01-15: Gitaly-59 apdex poor | reliability~3760141 | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3336
- 2021-01-15T18:23:00Z - 2021-01-15: file-58 many git pack operations | reliability~3760140 | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3335
- 2021-01-14T22:46:38Z - 2021-01-14: Increased Server Connection Errors registry | reliability~3760141 | ServiceContainer Registry |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3332
- 2021-01-14T20:29:17Z - 2021-01-14: Some repositories are in read-only mode. | reliability~3760141 | ServicePraefect |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3331
- 2021-01-14T17:27:18Z - 2021-01-14: The proxy SLI of the praefect service (main stage) has an error rate violating SLO | reliability~3760141 | ServicePraefect |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3329
- 2021-01-14T11:38:06Z - 2021-01-13: 500 error when visiting the general projects setting page | reliability~3760140 | ServiceGitLab Rails |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3327
- 2021-01-14T02:14:43Z - 2021-01-14: thanos is restarting frequently | reliability~3760142 | ServicePrometheus |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3326
- 2021-01-13T15:57:57Z - 2021-01-13: The workhorse SLI of the web service (cny stage) has an apdex violating SLO | reliability~3760141 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3323
- 2021-01-13T07:12:27Z - 2021-01-13: Postgres disks over 90% full | reliability~3760141 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3319
- 2021-01-12T22:14:13Z - 2021-01-12: The Disk Space Utilization per Device per Node resource of the ci-runners service (main stage), component saturated | reliability~3760142 | ServiceCI Runners |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3318
- 2021-01-12T21:03:17Z - 2021-01-12: IncreasedServerResponseErrors on pages nodes | reliability~3760140 | ServicePages |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3317
- 2021-01-12T20:06:07Z - 2021-01-12: Node file-praefect-02-stor-gprd.c.gitlab-production.internal, goserver component is violating its apdex SLO | reliability~3760141 | ServicePraefect |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3316
- 2021-01-12T19:49:51Z - 2021-01-12: The server SLI of the registry service (main stage) has an apdex violating SLO | reliability~3760140 | ServiceContainer Registry |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3315
- 2021-01-12T15:29:34Z - 2021-01-12: Blackbox probes for https://about.gitlab.com/handbook/values/ are failing. | reliability~3760141 | ServiceInfrastructure |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3314
- 2021-01-12T13:43:33Z - 2021-01-12: Degraded latency on registry server SLI | reliability~3760142 | ServiceContainer Registry |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3312
- 2021-01-12T13:36:59Z - 2021-01-12: Elevated error rates on various frontend SLIs | reliability~3760141 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3311
- 2021-01-12T12:48:31Z - 2021-01-12: Flappy log rejection errors from audit logs | reliability~3760142 | ServiceLogging |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3310
- 2021-01-12T10:43:36Z - 2021-01-12: CI IP address quota > 90% | reliability~3760142 | ServiceCI Runners |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3309
- 2021-01-12T09:48:12Z - 2021-01-12: thanos-compact has not run for over 30 days in production | reliability~3760142 | ~"Service::Monitoring" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3308
- 2021-01-12T08:53:16Z - 2021-01-12: altssh haproxy exporters down | reliability~3760142 | ServiceHAProxy |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3307
- 2021-01-12T02:23:48Z - 2021-01-12 Incoming emails not being processed | reliability~3760141 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3306
- 2021-01-12T01:48:44Z - 2021-01-12 - GitLab Job has failed | reliability~3760142 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3305
- 2021-01-12T00:43:51Z - 2021-01-12 - Some repositories are in read-only mode | reliability~3760142 | ServicePraefect |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3304
CorrectiveAction Issues
- 2021-01-18T12:31:58Z - Add checklist for change management reviewers and approvers to runbooks with reference in handbook
- 2021-01-14T17:22:26Z - Monitor Google Cloud Storage as an independent service
- 2021-01-13T13:43:04Z - Create lifecycle for docker images and containers on runner-01-inf-ops
- 2021-01-13T10:27:39Z - Alert when latency recording rules reference histogram buckets that do not exist
- 2021-01-13T09:05:01Z - Consider opening incident issues for monitoring alerts
- 2021-01-12T20:42:32Z - Document and share the existence of GCP's Net Intelligence dashboards
- 2021-01-12T18:37:24Z - Exclude Canceled requests from the Gitaly error rate
- 2021-01-12T11:41:22Z - Add region (AZ) labels to VM infrastructure
Open Issue Stats
- Oncall issues : 11
- Change issues : 1
- Incident issues : 34
- Access Request : 3
- CorrectiveAction : 142
Open Change Issues
Created | Summary |
---|---|
2021-01-11T16:50:32Z | Enable automated database reindexing (on workdays, once a day) |
Open Incident Issues
Created | Summary |
---|---|
Open Oncall Issues
Created | Summary |
---|---|
2021-01-14T19:08:09Z | Project Import Request - skylight-tools: salesforce-integrations |
2021-01-14T19:07:13Z | Project Import Request - skylight-tools: renohub |
2021-01-14T19:04:10Z | Project Import Request - skylight-tools: cas |
2020-12-31T15:49:42Z | One-Time Export for MakeMeReach |
2020-12-18T22:29:14Z | CI clones fail for repositories with a path ending in a period |
2020-09-14T18:52:09Z | PS Congregate VM for GitHost to GitLab.com Migration - Afilias |
2020-09-02T13:47:51Z | disable-chef-client isn't preserved over reboots |
2020-08-11T16:39:37Z | Investigate slow child pipeline triggering on pre.gitlab.com |
2020-07-28T18:19:35Z | PS Congregate VM for BitBucket Server to GitLab.com Migration |
2020-07-28T17:43:40Z | Project Import Request - ciorg/bridge/am-child-pool/api |
2020-03-30T13:38:11Z | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
Edited by Alberto Ramos