Weekly Reliability (SRE) Team Newsletter – On-call Period: 2021-01-19 - 2021-01-26
Announcements
- https://gitlab.com/gitlab-cookbooks/deprecated-gitlab-elastic/
- https://gitlab.com/gitlab-cookbooks/deprecated-gitlab_elasticsearch
The chef-server indicated that these were not in use, so this should not break any existing recipes. Please report any issues to #sre_observability.
Engineering Week in Review Highlights:
Team Updates
Core Infrastructure
Core-Infra is taking a hack/catch-up week (https://gitlab.com/groups/gitlab-com/gl-infra/-/milestones/122), paying down some tech debt and doing some planning for next quarter.
Datastores
- We continue to focus on kicking off the migration of "Gitaly unused free projects to HDDs". Alejandro and Ahmad made a lot of progress here and here.
- We are preparing for the DB upgrade to v12 (with a near-zero-downtime approach). This project will kick off very soon; see the details here: &384 (closed)
- We continue to progress the controls review of the DB Benchmarking env with Compliance. Today we'll identify the last pieces of SRE work in the env before we can get the green light from Compliance.
- Our Patroni upgrade tests in staging worked well (they were slow due to staging's lack of availability), and we are ready to take the upgrade to production.
Observability
The team's effort remains focused on the same primary projects:
- KAS readiness
- Redis 6 Upgrade
- Elasticsearch Upgrade
- &366
We've temporarily deprioritized the Thanos k8s migration to accommodate interrupt work.
The Jaeger production deployment remains blocked on IAP security concerns:
- readiness!59 (comment 468565731)
- https://gitlab.com/gitlab-com/team-member-epics/access-requests/-/issues/8181
On-Call During This Period
Schedule | Username |
---|---|
SRE 8-hour Americas | Alex Hanselka |
SRE 8-hour Americas | Cindy Pallares |
SRE 8-hour APAC | Devin Sylva |
SRE 8-hour EMEA | Henri Philipps |
PagerDuty Incidents
Created | Summary |
---|---|
2021-01-19T00:12:59Z | [34734] Firing 2 - BlackboxProbeFailures |
2021-01-19T00:28:50Z | [34739] Firing 1 - Increased Error Rate Across Fleet |
2021-01-19T03:28:46Z | [34751] Firing 1 - The prometheus_alert_sender SLI of the monitoring service (main stage) has an error rate violating SLO |
2021-01-19T03:37:45Z | [34752] Firing 1 - The prometheus_alert_sender SLI of the monitoring service (main stage) has an error rate violating SLO |
2021-01-19T14:52:45Z | [34776] Firing 1 - The grafana SLI of the monitoring service (main stage) has an error rate violating SLO |
2021-01-19T17:09:05Z | [34778] Firing 1 - Increased Error Rate Across Fleet |
2021-01-19T17:32:44Z | [34785] Firing 1 - Blackbox probes for https://int.gprd.gitlab.net/users/sign_in are failing. |
2021-01-19T17:35:37Z | [34786] Firing 4 - IncreasedErrorRateOtherBackends |
2021-01-19T17:40:43Z | [34795] Firing 1 - Postgres Replication lag (in bytes) is high |
2021-01-19T18:27:20Z | [34802] Firing 1 - Increased Error Rate Across Fleet |
2021-01-19T18:32:26Z | [34815] Firing 2 - PostgreSQL_ReplicationLagBytesTooLarge |
2021-01-19T18:33:44Z | [34818] Firing 1 - |
2021-01-19T18:35:56Z | [34819] Firing 2 - PostgreSQL_ReplicationLagTooLarge |
2021-01-19T18:45:45Z | [34821] Firing 1 - The grafana SLI of the monitoring service (main stage) has an error rate violating SLO |
2021-01-20T01:29:14Z | [34850] Firing 1 - Blackbox probes for https://gitlab.com/explore/projects/starred are failing. |
2021-01-20T01:57:15Z | [34851] Firing 1 - Blackbox probes for sytses/test-2#1 are failing. |
2021-01-20T14:17:06Z | [34911] Firing 1 - Increased Server Connection Errors |
2021-01-20T14:17:06Z | [34912] Firing 1 - Increased HAProxy Backend Connection Errors |
2021-01-20T14:25:05Z | [34913] Firing 1 - Increased HAProxy Backend Connection Errors |
2021-01-20T14:25:06Z | [34914] Firing 1 - Increased Server Connection Errors |
2021-01-20T15:39:45Z | [34919] Firing 1 - The grafana SLI of the monitoring service (main stage) has an error rate violating SLO |
2021-01-20T15:41:45Z | [34920] Firing 1 - The public_dashboards_thanos_query SLI of the monitoring service (main stage) has an apdex violating SLO |
2021-01-20T15:41:46Z | [34921] Firing 1 - The public_dashboards_thanos_query SLI of the monitoring service (main stage) has an error rate violating SLO |
2021-01-20T15:41:46Z | [34922] Firing 1 - The trickster SLI of the monitoring service (main stage) has an apdex violating SLO |
2021-01-20T15:41:46Z | [34923] Firing 1 - The trickster SLI of the monitoring service (main stage) has an error rate violating SLO |
2021-01-20T16:13:32Z | [34929] Firing 1 - The Puma Worker Saturation per Node resource of the api service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2021-01-20T16:37:32Z | [34932] Firing 1 - The Puma Worker Saturation per Node resource of the api service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2021-01-20T17:52:39Z | [34938] Please see incident declaration in Slack channel: https://slack.com/app_redirect?channel=CB7P5CJS1&team=T02592416 |
2021-01-21T09:53:05Z | [34991] Firing 1 - Increased HAProxy Backend Connection Errors |
2021-01-21T09:53:05Z | [34992] Firing 1 - Increased Server Connection Errors |
2021-01-21T09:58:50Z | [34994] Firing 1 - Increased HAProxy Backend Connection Errors |
2021-01-21T09:58:50Z | [34995] Firing 1 - Increased Server Connection Errors |
2021-01-21T12:36:14Z | [35002] Firing 1 - Blackbox probes for https://staging.gitlab.com/gitlab-com/operations/issues/42 are failing. |
2021-01-21T14:13:32Z | [35011] Firing 1 - The Puma Worker Saturation per Node resource of the api service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2021-01-21T15:13:32Z | [35013] Firing 1 - The Puma Worker Saturation per Node resource of the api service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2021-01-21T16:13:32Z | [35019] Firing 1 - The Puma Worker Saturation per Node resource of the api service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2021-01-21T21:14:39Z | [35040] Pingdom check check:https://license.gitlab.com/users/sign_in is down |
2021-01-21T21:15:45Z | [35041] Firing 1 - Blackbox probes for https://license.gitlab.com are failing. |
2021-01-21T21:17:39Z | [35044] Please see incident declaration in Slack channel: https://slack.com/app_redirect?channel=CB7P5CJS1&team=T02592416 |
2021-01-22T06:18:47Z | [35065] Pingdom check check:https://gitlab-examples.gitlab.io/ is down |
2021-01-22T06:19:12Z | [35066] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/issues is down |
2021-01-22T06:19:16Z | [35067] Pingdom check check:https://gitlab.com/api/v4/projects/13083 is down |
2021-01-22T06:19:20Z | [35068] Firing 3 - IncreasedErrorRateOtherBackends |
2021-01-22T06:19:37Z | [35069] Firing 1 - High Error Rate on Front End Web |
2021-01-22T06:19:44Z | [35070] Firing 6 - BlackboxProbeFailures |
2021-01-22T06:24:35Z | [35081] Firing 4 - IncreasedServerResponseErrors |
2021-01-23T03:29:00Z | [35152] Firing 1 - Blackbox probes for https://registry.ops.gitlab.net are failing. |
2021-01-23T03:30:14Z | [35153] Firing 1 - Blackbox probes for https://ops.gitlab.net/users/sign_in are failing. |
2021-01-23T04:19:00Z | [35157] Firing 1 - Alertmanager is failing sending notifications |
2021-01-23T07:04:45Z | [35170] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2021-01-23T07:40:15Z | [35172] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2021-01-23T07:42:45Z | [35173] Firing 1 - Connection of Redis replicas to the master is flapping |
2021-01-23T08:34:00Z | [35176] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2021-01-23T08:39:00Z | [35177] Firing 1 - Connection of Redis replicas to the master is flapping |
2021-01-23T09:36:30Z | [35182] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2021-01-23T11:46:30Z | [35184] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2021-01-23T11:48:00Z | [35185] Firing 1 - Connection of Redis replicas to the master is flapping |
2021-01-23T12:45:01Z | [35188] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2021-01-23T12:53:01Z | [35189] Firing 1 - Connection of Redis replicas to the master is flapping |
2021-01-23T13:07:00Z | [35192] Needing a second set of eyes on incident 3404 |
2021-01-23T13:46:45Z | [35195] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2021-01-23T13:47:00Z | [35196] Firing 1 - Redis Switch Master |
2021-01-23T14:47:30Z | [35199] Firing 1 - Redis Switch Master |
2021-01-23T14:57:32Z | [35200] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2021-01-23T15:48:00Z | [35205] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2021-01-23T15:49:30Z | [35206] Firing 1 - Redis Switch Master |
2021-01-23T16:58:30Z | [35207] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2021-01-23T17:52:15Z | [35208] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2021-01-23T18:12:15Z | [35210] Firing 1 - Failed to collect Redis metrics. Check the status of redis on redis-sidekiq-02-db-gprd.c.gitlab-production.internal:9121 with gitlab-ctl status. |
2021-01-23T18:53:15Z | [35211] Firing 1 - Redis master missing for redis-sidekiq |
2021-01-23T18:57:00Z | [35212] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2021-01-23T20:03:15Z | [35214] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2021-01-23T20:33:45Z | [35215] Firing 1 - Alertmanager is failing sending notifications |
2021-01-23T21:31:45Z | [35219] Please see incident declaration in Slack channel: https://slack.com/app_redirect?channel=CB7P5CJS1&team=T02592416 |
2021-01-23T22:15:15Z | [35221] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2021-01-23T22:25:31Z | [35222] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2021-01-23T22:34:45Z | [35223] Firing 1 - Connection of Redis replicas to the master is flapping |
2021-01-23T22:40:30Z | [35224] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2021-01-23T22:40:46Z | [35225] Firing 1 - Alertmanager is failing sending notifications |
2021-01-23T23:15:30Z | [35229] Firing 1 - Connection of Redis replicas to the master is flapping |
2021-01-24T00:18:00Z | [35232] Firing 1 - Alertmanager is failing sending notifications |
2021-01-24T11:44:35Z | [35246] Firing 5 - IncreasedServerResponseErrors |
2021-01-24T12:01:05Z | [35247] Firing 5 - IncreasedServerResponseErrors |
2021-01-25T22:48:16Z | [35306] Firing 1 - Alertmanager is failing sending notifications |
7 Day Issue Stats
- Oncall issues : 6
- Access Request : 0
- Change Issues : 4
- Incident Issues : 42
- CorrectiveAction Issues : 0
Change Issues
- 2021-01-25T10:31:09Z - Enable the ActionCable feature in production
- 2021-01-25T08:41:23Z - Move websockets traffic to the websockets deployment in production
- 2021-01-21T00:17:53Z - Proxy email.customers.gitlab.com domain through Cloudflare
- 2021-01-19T19:24:13Z - [staging] Downgrade patroni followed by a re-test of the ansible playbook to upgrade the pypi patroni package to v2.0.1
Incident Issues
- 2021-01-25T09:10:55Z - Thanos compaction has not run in 24 hours. | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3412
- 2021-01-24T11:50:14Z - Thanos compaction halted | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3409
- 2021-01-24T09:25:58Z - Thanos has failing storage operations | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3408
- 2021-01-23T23:52:28Z - Thanos compaction halted | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3407
- 2021-01-23T23:52:13Z - Thanos compaction halted | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3406
- 2021-01-23T21:31:46Z - 2021-01-23 Incoming emails not being processed | reliability~3760141 | ServiceMailroom |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3405
- 2021-01-23T08:44:10Z - 2021-01-23: memory usage burst in redis-sidekiq | reliability~3760140 | ServiceRedis |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3404
- 2021-01-22T19:12:24Z - 2021-01-22: Shared Runner Error SLI violating SLO | reliability~3760142 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3403
- 2021-01-22T09:45:13Z - Thanos compaction halted | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3401
- 2021-01-22T06:31:27Z - 2021-01-22: Brief glitch across all services | reliability~3760141 | ServiceGCP |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3400
- 2021-01-21T21:17:42Z - 2021-01-21: license prod down | reliability~3760139 | ServiceLicense |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3398
- 2021-01-21T17:24:07Z - 2021-01-21: Problematic query causing elevated error rates | reliability~3760142 | ServiceCI Runners |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3397
- 2021-01-21T17:03:09Z - Prometheus has no targets | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3396
- 2021-01-21T17:03:08Z - Prometheus has no targets | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3395
- 2021-01-21T16:22:41Z - 2021-01-21: puma and workhorse SLI violating SLO | reliability~3760140 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3394
- 2021-01-21T10:27:25Z - 2021-01-21: The workhorse SLI of the web service has an apdex violating SLO | reliability~3760141 | ~"Service::Workhorse" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3391
- 2021-01-21T09:04:49Z - 2021-01-21: migration time out - 20210115084949_add_repository_read_only_to_groups | reliability~3760140 | ServiceGitLab Rails |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3390
- 2021-01-21T05:53:28Z - Thanos compaction halted | | reliability~13296694 |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3389
- 2021-01-21T05:53:13Z - Thanos compaction halted | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3388
- 2021-01-21T03:30:47Z - 2021-01-21: Elevated 5xx errors for https_git | reliability~3760142 | ServiceCI Runners |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3387
- 2021-01-20T20:46:52Z - 2021-01-20: Canary API Workhorse SLO violation | reliability~3760141 | ~"Service::Workhorse" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3384
- 2021-01-20T17:52:41Z - 2021-01-20: Missing error ratio monitoring on Gitaly nodes | reliability~3760142 | ~"Service::Monitoring" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3381
- 2021-01-20T17:50:37Z - 2021-01-20 Clone of gitlab-org/gitlab is timing out, preventing release tagging | reliability~3760142 | ServiceInfrastructure |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3380
- 2021-01-20T16:11:15Z - GitLab Org Docker Shared runners can't download/upload cache | reliability~3760142 | ServiceCI Runners |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3379
- 2021-01-20T15:51:16Z - The public_dashboards_thanos_query SLI of the monitoring service (main stage) has an error rate violating SLO | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3378
- 2021-01-20T15:51:15Z - The trickster SLI of the monitoring service (main stage) has an error rate violating SLO | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3377
- 2021-01-20T15:51:15Z - The trickster SLI of the monitoring service (main stage) has an apdex violating SLO | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3376
- 2021-01-20T15:51:15Z - The public_dashboards_thanos_query SLI of the monitoring service (main stage) has an apdex violating SLO | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3375
- 2021-01-20T15:49:15Z - The grafana SLI of the monitoring service (main stage) has an error rate violating SLO | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3374
- 2021-01-20T11:52:26Z - 2021-01-20: Deployments blocked by code change failing QA tests | reliability~3760140 | ServiceGitLab Rails |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3373
- 2021-01-19T17:21:59Z - 2021-01-19 - Elevated Error rates across fleet, workhorse | reliability~3760139 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3370
- 2021-01-19T14:16:43Z - Thanos compaction halted | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3368
- 2021-01-19T14:16:43Z - Thanos compaction halted | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3367
- 2021-01-19T14:09:54Z - 2021-01-19: The nfs service (main stage) is receiving more requests than normal | reliability~3760142 | ServicePages |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3366
- 2021-01-19T08:49:35Z - Prometheus has slow rule evaluations | | reliability~13296694 |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3365
- 2021-01-19T08:49:35Z - Prometheus has slow rule evaluations | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3364
- 2021-01-19T07:49:35Z - Prometheus has slow rule evaluations | | reliability~13296694 |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3363
- 2021-01-19T03:38:15Z - The prometheus_alert_sender SLI of the monitoring service (main stage) has an error rate violating SLO | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3362
- 2021-01-19T03:29:28Z - Thanos compaction halted | reliability~3760142 | ServicePrometheus |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3361
- 2021-01-19T01:00:59Z - 2021-01-19: Spike in API calls triggered pager | reliability~3760142 | ServiceGitLab Rails |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3360
- 2021-01-19T00:45:20Z - Prometheus has slow rule evaluations | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3359
- 2021-01-19T00:45:20Z - Prometheus has slow rule evaluations | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3358
CorrectiveAction Issues
- 2021-01-23T22:50:29Z - Corrective action/cleanup - get Redis-Sidekiq nodes okay on memsize in Terraform
- 2021-01-21T09:52:39Z - Long running database transaction monitoring
- 2021-01-20T09:19:43Z - Investigate the cause of connections being terminated in HAProxy
- 2021-01-19T17:22:14Z - Web-Pages HAProxy logs need more visibility
- 2021-01-19T11:56:51Z - Add apdex and error metrics to the git/gitlab_shell SLI
- 2021-01-19T00:01:15Z - Upgrade all GKE clusters to 1.18
Open Issue Stats
- Oncall issues : 17
- Change issues : 2
- Incident issues : 39
- Access Request : 2
- CorrectiveAction : 146
Open Change Issues
Created | Summary |
---|---|
2021-01-25T10:31:09Z | Enable the ActionCable feature in production |
2021-01-11T16:50:32Z | Enable automated database reindexing (on workdays, once a day) |
Open Incident Issues
Created | Summary |
---|---|
2021-01-21T17:24:07Z | 2021-01-21: Problematic query causing elevated error rates |
2021-01-20T17:50:37Z | 2021-01-20 Clone of gitlab-org/gitlab is timing out, preventing release tagging |
Open Oncall Issues
Created | Summary |
---|---|
2021-01-19T18:44:03Z | Project Import Request - privatestorage: notifyme-backend |
2021-01-19T18:42:40Z | Project Import Request - privatestorage: qatools |
2021-01-19T18:41:09Z | Project Import Request - privatestorage: privatestorageweb |
2021-01-19T18:39:53Z | Project Import Request - privatestorage: privatestorageops |
2021-01-19T18:37:52Z | Project Import Request - privatestorage: privatestoragedesktop |
2021-01-19T18:31:55Z | Project Import Request - privatestorage: bizops |
2021-01-14T19:08:09Z | Project Import Request - skylight-tools: salesforce-integrations |
2021-01-14T19:07:13Z | Project Import Request - skylight-tools: renohub |
2021-01-14T19:04:10Z | Project Import Request - skylight-tools: cas |
2020-12-31T15:49:42Z | One-Time Export for MakeMeReach |
2020-12-18T22:29:14Z | CI clones fail for repositories with a path ending in a period |
2020-09-14T18:52:09Z | PS Congregate VM for GitHost to GitLab.com Migration - Afilias |
2020-09-02T13:47:51Z | disable-chef-client isn't preserved over reboots |
2020-08-11T16:39:37Z | Investigate slow child pipeline triggering on pre.gitlab.com |
2020-07-28T18:19:35Z | PS Congregate VM for BitBucket Server to GitLab.com Migration |
2020-07-28T17:43:40Z | Project Import Request - ciorg/bridge/am-child-pool/api |
2020-03-30T13:38:11Z | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |