Weekly Reliability (SRE) Team Newsletter – On-call Period: 2020-12-29 to 2021-01-05

Announcements

Welcome back to everyone who took a long holiday break, and a huge thank you to everyone who covered on-call shifts over the holidays.

Engineering Week in Review Highlights:

Team Updates

Core Infrastructure

Datastores

Observability


On-Call During This Period

| Schedule | Username |
| --- | --- |
| SRE 8-hour Americas | Cindy Pallares |
| SRE 8-hour Americas | Nels Nelson |
| SRE 8-hour APAC | Graeme Gillies |
| SRE 8-hour EMEA | Michal Wasilewski |

PagerDuty Incidents

* Number of incidents: **20**

| Created | Summary |
| --- | --- |
| 2020-12-29T06:27:50Z | [34042] Firing 1 - Increased Error Rate Across Fleet |
| 2020-12-29T11:26:02Z | [34051] Pingdom check check:gitlab-org/gitlab-foss#1 (closed) is down |
| 2020-12-29T11:26:04Z | [34052] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down |
| 2020-12-29T11:26:04Z | [34053] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/merge_requests/ is down |
| 2020-12-29T11:26:20Z | [34054] Firing 2 - IncreasedErrorRateOtherBackends |
| 2020-12-29T11:26:35Z | [34055] Firing 1 - High Error Rate on Front End Web |
| 2020-12-29T11:27:16Z | [34056] Firing 8 - BlackboxProbeFailures |
| 2020-12-30T00:05:49Z | [34091] Firing 1 - Last successful WAL-G basebackup was seen 42.87s ago for env gprd. |
| 2020-12-30T03:07:47Z | [34095] Firing 1 - GitLab Job has failed |
| 2020-12-30T06:10:47Z | [34097] Firing 1 - Last successful WAL-G basebackup was seen 48.95s ago for env gprd. |
| 2020-12-30T09:20:08Z | [34101] Firing 1 - prometheus is unreachable |
| 2021-01-01T01:41:50Z | [34131] Firing 2 - IncreasedErrorRateOtherBackends |
| 2021-01-01T03:18:20Z | [34133] Firing 2 - IncreasedErrorRateOtherBackends |
| 2021-01-01T05:23:35Z | [34139] Firing 2 - IncreasedErrorRateOtherBackends |
| 2021-01-01T07:11:05Z | [34142] Firing 2 - IncreasedErrorRateOtherBackends |
| 2021-01-01T07:39:05Z | [34144] Firing 2 - IncreasedErrorRateOtherBackends |
| 2021-01-02T00:05:48Z | [34155] Firing 1 - Last successful WAL-G basebackup was seen 42.77s ago for env gprd. |
| 2021-01-02T01:26:32Z | [34157] Firing 1 - GitLab Job has failed |
| 2021-01-02T06:10:48Z | [34159] Firing 1 - Last successful WAL-G basebackup was seen 48.86s ago for env gprd. |
| 2021-01-04T14:56:15Z | [34212] Firing 1 - Alertmanager is failing sending notifications |

7 Day Issue Stats

  • Oncall issues: 0
  • Access Request: 0
  • Change Issues: 0
  • Incident Issues: 12
  • CorrectiveAction Issues: 0

Change Issues

Incident Issues

CorrectiveAction Issues

Open Issue Stats

Open Change Issues

| Created | Summary |
| --- | --- |

Open Incident Issues

| Created | Summary |
| --- | --- |
| 2021-01-04T13:00:15Z | 2021-01-04: Job logs slow to load on first view |
| 2020-12-31T05:04:51Z | 2020-12-31: Intermittent errors on clone on a few projects |

Open Oncall Issues

| Created | Summary |
| --- | --- |
| 2020-12-18T22:29:14Z | CI clones fail for repositories with a path ending in a period |
| 2020-10-27T14:20:44Z | One-Time Export for micro_x |
| 2020-09-14T18:52:09Z | PS Congregate VM for GitHost to GitLab.com Migration - Afilias |
| 2020-09-02T13:47:51Z | disable-chef-client isn't preserved over reboots |
| 2020-08-11T16:39:37Z | Investigate slow child pipeline triggering on pre.gitlab.com |
| 2020-07-28T18:19:35Z | PS Congregate VM for BitBucket Server to GitLab.com Migration |
| 2020-07-28T17:43:40Z | Project Import Request - ciorg/bridge/am-child-pool/api |
| 2020-03-30T13:38:11Z | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
Edited by Alberto Ramos