Weekly Reliability (SRE) Team Newsletter – On-call Period: 2021-01-19 - 2021-01-26

Announcements

Cookbook deprecation! The following two Elasticsearch cookbooks were not in use and have been archived:

The chef-server indicated that these were not in use, so this should not break any existing recipes. Please report any issues to #sre_observability.

- Late-breaking: a Soft PCL is in place for Jan 26th. See details here: https://gitlab.slack.com/archives/CB3LSMEJV/p1611668502142600

Engineering Week in Review Highlights:

Team Updates

Core Infrastructure

Core-Infra is taking a hack/catchup week (https://gitlab.com/groups/gitlab-com/gl-infra/-/milestones/122) to pay down some tech debt and do some planning for next quarter.

Datastores

We continue to focus on kicking off the migration of "Gitaly unused free projects to HDDs". Alejandro and Ahmad made a lot of progress here and here.
We are preparing for the DB upgrade to v12 (with a near-zero-downtime approach). This project will kick off very soon; see the details here: &384 (closed)
We continue to progress the controls review of the DB Benchmarking env with Compliance. Today we'll identify the last pieces of SRE work in the env before we can get the green light from Compliance.
Our Patroni upgrade tests in staging worked well (they were slow due to staging's lack of availability), and we are ready to take the upgrade to production.

Observability

The team's effort remains in the same primary projects:

KAS readiness
Redis 6 Upgrade
Elasticsearch Upgrade
&366

We've temporarily deprioritized the Thanos k8s migration to accommodate interrupt work:

https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11655

The Jaeger production deployment remains blocked on IAP security concerns.

On-Call During This Period

Schedule	Username
SRE 8-hour Americas	Alex Hanselka
SRE 8-hour Americas	Cindy Pallares
SRE 8-hour APAC	Devin Sylva
SRE 8-hour EMEA	Henri Philipps

PagerDuty Incidents

* Number of incidents: **84**

Created	Summary
2021-01-19T00:12:59Z	[34734] Firing 2 - BlackboxProbeFailures
2021-01-19T00:28:50Z	[34739] Firing 1 - Increased Error Rate Across Fleet
2021-01-19T03:28:46Z	[34751] Firing 1 - The prometheus_alert_sender SLI of the monitoring service (`main` stage) has an error rate violating SLO
2021-01-19T03:37:45Z	[34752] Firing 1 - The prometheus_alert_sender SLI of the monitoring service (`main` stage) has an error rate violating SLO
2021-01-19T14:52:45Z	[34776] Firing 1 - The grafana SLI of the monitoring service (`main` stage) has an error rate violating SLO
2021-01-19T17:09:05Z	[34778] Firing 1 - Increased Error Rate Across Fleet
2021-01-19T17:32:44Z	[34785] Firing 1 - Blackbox probes for https://int.gprd.gitlab.net/users/sign_in are failing.
2021-01-19T17:35:37Z	[34786] Firing 4 - IncreasedErrorRateOtherBackends
2021-01-19T17:40:43Z	[34795] Firing 1 - Postgres Replication lag (in bytes) is high
2021-01-19T18:27:20Z	[34802] Firing 1 - Increased Error Rate Across Fleet
2021-01-19T18:32:26Z	[34815] Firing 2 - PostgreSQL_ReplicationLagBytesTooLarge
2021-01-19T18:33:44Z	[34818] Firing 1 -
2021-01-19T18:35:56Z	[34819] Firing 2 - PostgreSQL_ReplicationLagTooLarge
2021-01-19T18:45:45Z	[34821] Firing 1 - The grafana SLI of the monitoring service (`main` stage) has an error rate violating SLO
2021-01-20T01:29:14Z	[34850] Firing 1 - Blackbox probes for https://gitlab.com/explore/projects/starred are failing.
2021-01-20T01:57:15Z	[34851] Firing 1 - Blackbox probes for sytses/test-2#1 are failing.
2021-01-20T14:17:06Z	[34911] Firing 1 - Increased Server Connection Errors
2021-01-20T14:17:06Z	[34912] Firing 1 - Increased HAProxy Backend Connection Errors
2021-01-20T14:25:05Z	[34913] Firing 1 - Increased HAProxy Backend Connection Errors
2021-01-20T14:25:06Z	[34914] Firing 1 - Increased Server Connection Errors
2021-01-20T15:39:45Z	[34919] Firing 1 - The grafana SLI of the monitoring service (`main` stage) has an error rate violating SLO
2021-01-20T15:41:45Z	[34920] Firing 1 - The public_dashboards_thanos_query SLI of the monitoring service (`main` stage) has an apdex violating SLO
2021-01-20T15:41:46Z	[34921] Firing 1 - The public_dashboards_thanos_query SLI of the monitoring service (`main` stage) has an error rate violating SLO
2021-01-20T15:41:46Z	[34922] Firing 1 - The trickster SLI of the monitoring service (`main` stage) has an apdex violating SLO
2021-01-20T15:41:46Z	[34923] Firing 1 - The trickster SLI of the monitoring service (`main` stage) has an error rate violating SLO
2021-01-20T16:13:32Z	[34929] Firing 1 - The Puma Worker Saturation per Node resource of the api service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
2021-01-20T16:37:32Z	[34932] Firing 1 - The Puma Worker Saturation per Node resource of the api service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
2021-01-20T17:52:39Z	[34938] Please see incident declaration in Slack channel: https://slack.com/app_redirect?channel=CB7P5CJS1&team=T02592416
2021-01-21T09:53:05Z	[34991] Firing 1 - Increased HAProxy Backend Connection Errors
2021-01-21T09:53:05Z	[34992] Firing 1 - Increased Server Connection Errors
2021-01-21T09:58:50Z	[34994] Firing 1 - Increased HAProxy Backend Connection Errors
2021-01-21T09:58:50Z	[34995] Firing 1 - Increased Server Connection Errors
2021-01-21T12:36:14Z	[35002] Firing 1 - Blackbox probes for https://staging.gitlab.com/gitlab-com/operations/issues/42 are failing.
2021-01-21T14:13:32Z	[35011] Firing 1 - The Puma Worker Saturation per Node resource of the api service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
2021-01-21T15:13:32Z	[35013] Firing 1 - The Puma Worker Saturation per Node resource of the api service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
2021-01-21T16:13:32Z	[35019] Firing 1 - The Puma Worker Saturation per Node resource of the api service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
2021-01-21T21:14:39Z	[35040] Pingdom check check:https://license.gitlab.com/users/sign_in is down
2021-01-21T21:15:45Z	[35041] Firing 1 - Blackbox probes for https://license.gitlab.com are failing.
2021-01-21T21:17:39Z	[35044] Please see incident declaration in Slack channel: https://slack.com/app_redirect?channel=CB7P5CJS1&team=T02592416
2021-01-22T06:18:47Z	[35065] Pingdom check check:https://gitlab-examples.gitlab.io/ is down
2021-01-22T06:19:12Z	[35066] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/issues is down
2021-01-22T06:19:16Z	[35067] Pingdom check check:https://gitlab.com/api/v4/projects/13083 is down
2021-01-22T06:19:20Z	[35068] Firing 3 - IncreasedErrorRateOtherBackends
2021-01-22T06:19:37Z	[35069] Firing 1 - High Error Rate on Front End Web
2021-01-22T06:19:44Z	[35070] Firing 6 - BlackboxProbeFailures
2021-01-22T06:24:35Z	[35081] Firing 4 - IncreasedServerResponseErrors
2021-01-23T03:29:00Z	[35152] Firing 1 - Blackbox probes for https://registry.ops.gitlab.net are failing.
2021-01-23T03:30:14Z	[35153] Firing 1 - Blackbox probes for https://ops.gitlab.net/users/sign_in are failing.
2021-01-23T04:19:00Z	[35157] Firing 1 - Alertmanager is failing sending notifications
2021-01-23T07:04:45Z	[35170] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2021-01-23T07:40:15Z	[35172] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2021-01-23T07:42:45Z	[35173] Firing 1 - Connection of Redis replicas to the master is flapping
2021-01-23T08:34:00Z	[35176] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2021-01-23T08:39:00Z	[35177] Firing 1 - Connection of Redis replicas to the master is flapping
2021-01-23T09:36:30Z	[35182] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2021-01-23T11:46:30Z	[35184] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2021-01-23T11:48:00Z	[35185] Firing 1 - Connection of Redis replicas to the master is flapping
2021-01-23T12:45:01Z	[35188] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2021-01-23T12:53:01Z	[35189] Firing 1 - Connection of Redis replicas to the master is flapping
2021-01-23T13:07:00Z	[35192] Needing a second set of eyes on incident 3404
2021-01-23T13:46:45Z	[35195] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2021-01-23T13:47:00Z	[35196] Firing 1 - Redis Switch Master
2021-01-23T14:47:30Z	[35199] Firing 1 - Redis Switch Master
2021-01-23T14:57:32Z	[35200] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2021-01-23T15:48:00Z	[35205] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2021-01-23T15:49:30Z	[35206] Firing 1 - Redis Switch Master
2021-01-23T16:58:30Z	[35207] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2021-01-23T17:52:15Z	[35208] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2021-01-23T18:12:15Z	[35210] Firing 1 - Failed to collect Redis metrics Check the status of redis on `redis-sidekiq-02-db-gprd.c.gitlab-production.internal:9121` with `gitlab-ctl status`.
2021-01-23T18:53:15Z	[35211] Firing 1 - Redis master missing for redis-sidekiq
2021-01-23T18:57:00Z	[35212] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2021-01-23T20:03:15Z	[35214] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2021-01-23T20:33:45Z	[35215] Firing 1 - Alertmanager is failing sending notifications
2021-01-23T21:31:45Z	[35219] Please see incident declaration in Slack channel: https://slack.com/app_redirect?channel=CB7P5CJS1&team=T02592416
2021-01-23T22:15:15Z	[35221] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2021-01-23T22:25:31Z	[35222] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2021-01-23T22:34:45Z	[35223] Firing 1 - Connection of Redis replicas to the master is flapping
2021-01-23T22:40:30Z	[35224] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2021-01-23T22:40:46Z	[35225] Firing 1 - Alertmanager is failing sending notifications
2021-01-23T23:15:30Z	[35229] Firing 1 - Connection of Redis replicas to the master is flapping
2021-01-24T00:18:00Z	[35232] Firing 1 - Alertmanager is failing sending notifications
2021-01-24T11:44:35Z	[35246] Firing 5 - IncreasedServerResponseErrors
2021-01-24T12:01:05Z	[35247] Firing 5 - IncreasedServerResponseErrors
2021-01-25T22:48:16Z	[35306] Firing 1 - Alertmanager is failing sending notifications

7 Day Issue Stats

Oncall issues : 6
Access Request : 0
Change Issues : 4
Incident Issues : 42
CorrectiveAction Issues : 0

Change Issues

2021-01-25T10:31:09Z - Enable the ActionCable feature in production
2021-01-25T08:41:23Z - Move websockets traffic to the websockets deployment in production
2021-01-21T00:17:53Z - Proxy email.customers.gitlab.com domain through Cloudflare
2021-01-19T19:24:13Z - [staging] Downgrade patroni followed by a re-test of the ansible playbook to upgrade the pypi patroni package to v2.0.1

Incident Issues

2021-01-25T09:10:55Z - Thanos compaction has not run in 24 hours. | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3412
2021-01-24T11:50:14Z - Thanos compaction halted | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3409
2021-01-24T09:25:58Z - Thanos has failing storage operations | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3408
2021-01-23T23:52:28Z - Thanos compaction halted | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3407
2021-01-23T23:52:13Z - Thanos compaction halted | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3406
2021-01-23T21:31:46Z - 2021-01-23 Incoming emails not being processed | reliability~3760141 | ServiceMailroom | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3405
2021-01-23T08:44:10Z - 2021-01-23: memory usage burst in redis-sidekiq | reliability~3760140 | ServiceRedis | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3404
2021-01-22T19:12:24Z - 2021-01-22: Shared Runner Error SLI violating SLO | reliability~3760142 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3403
2021-01-22T09:45:13Z - Thanos compaction halted | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3401
2021-01-22T06:31:27Z - 2021-01-22: Brief glitch across all services | reliability~3760141 | ServiceGCP | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3400
2021-01-21T21:17:42Z - 2021-01-21: license prod down | reliability~3760139 | ServiceLicense | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3398
2021-01-21T17:24:07Z - 2021-01-21: Problematic query causing elevated error rates | reliability~3760142 | ServiceCI Runners | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3397
2021-01-21T17:03:09Z - Prometheus has no targets | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3396
2021-01-21T17:03:08Z - Prometheus has no targets | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3395
2021-01-21T16:22:41Z - 2021-01-21: puma and workhorse SLI violating SLO | reliability~3760140 | ServiceWeb | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3394
2021-01-21T10:27:25Z - 2021-01-21: The workhorse SLI of the web service has an apdex violating SLO | reliability~3760141 | ~"Service::Workhorse" | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3391
2021-01-21T09:04:49Z - 2021-01-21: migration time out - 20210115084949_add_repository_read_only_to_groups | reliability~3760140 | ServiceGitLab Rails | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3390
2021-01-21T05:53:28Z - Thanos compaction halted | | reliability~13296694 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3389
2021-01-21T05:53:13Z - Thanos compaction halted | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3388
2021-01-21T03:30:47Z - 2021-01-21: Elevated 5xx errors for https_git | reliability~3760142 | ServiceCI Runners | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3387
2021-01-20T20:46:52Z - 2021-01-20: Canary API Workhorse SLO violation | reliability~3760141 | ~"Service::Workhorse" | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3384
2021-01-20T17:52:41Z - 2021-01-20: Missing error ratio monitoring on Gitaly nodes | reliability~3760142 | ~"Service::Monitoring" | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3381
2021-01-20T17:50:37Z - 2021-01-20 Clone of gitlab-org/gitlab is timing out, preventing release tagging | reliability~3760142 | ServiceInfrastructure | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3380
2021-01-20T16:11:15Z - GitLab Org Docker Shared runners can't download/upload cache | reliability~3760142 | ServiceCI Runners | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3379
2021-01-20T15:51:16Z - The public_dashboards_thanos_query SLI of the monitoring service (main stage) has an error rate violating SLO | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3378
2021-01-20T15:51:15Z - The trickster SLI of the monitoring service (main stage) has an error rate violating SLO | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3377
2021-01-20T15:51:15Z - The trickster SLI of the monitoring service (main stage) has an apdex violating SLO | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3376
2021-01-20T15:51:15Z - The public_dashboards_thanos_query SLI of the monitoring service (main stage) has an apdex violating SLO | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3375
2021-01-20T15:49:15Z - The grafana SLI of the monitoring service (main stage) has an error rate violating SLO | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3374
2021-01-20T11:52:26Z - 2021-01-20: Deployments blocked by code change failing QA tests | reliability~3760140 | ServiceGitLab Rails | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3373
2021-01-19T17:21:59Z - 2021-01-19 - Elevated Error rates across fleet, workhorse | reliability~3760139 | ServiceWeb | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3370
2021-01-19T14:16:43Z - Thanos compaction halted | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3368
2021-01-19T14:16:43Z - Thanos compaction halted | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3367
2021-01-19T14:09:54Z - 2021-01-19: The nfs service (main stage) is receiving more requests than normal | reliability~3760142 | ServicePages | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3366
2021-01-19T08:49:35Z - Prometheus has slow rule evaluations | | reliability~13296694 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3365
2021-01-19T08:49:35Z - Prometheus has slow rule evaluations | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3364
2021-01-19T07:49:35Z - Prometheus has slow rule evaluations | | reliability~13296694 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3363
2021-01-19T03:38:15Z - The prometheus_alert_sender SLI of the monitoring service (main stage) has an error rate violating SLO | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3362
2021-01-19T03:29:28Z - Thanos compaction halted | reliability~3760142 | ServicePrometheus | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3361
2021-01-19T01:00:59Z - 2021-01-19: Spike in API calls triggered pager | reliability~3760142 | ServiceGitLab Rails | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3360
2021-01-19T00:45:20Z - Prometheus has slow rule evaluations | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3359
2021-01-19T00:45:20Z - Prometheus has slow rule evaluations | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3358

CorrectiveAction Issues

2021-01-23T22:50:29Z - Corrective action/cleanup - get Redis-Sidekiq nodes okay on memsize in Terraform
2021-01-21T09:52:39Z - Long running database transaction monitoring
2021-01-20T09:19:43Z - Investigate the cause of connections being terminated in HAProxy
2021-01-19T17:22:14Z - Web-Pages HAProxy logs need more visibility
2021-01-19T11:56:51Z - Add apdex and error metrics to the git/gitlab_shell SLI
2021-01-19T00:01:15Z - Upgrade all GKE clusters to 1.18

Open Issue Stats

Open Change Issues

Created	Summary
2021-01-25T10:31:09Z	Enable the ActionCable feature in production
2021-01-11T16:50:32Z	Enable automated database reindexing (on workdays, once a day)

Open Incident Issues

Created	Summary
2021-01-21T17:24:07Z	2021-01-21: Problematic query causing elevated error rates
2021-01-20T17:50:37Z	2021-01-20 Clone of gitlab-org/gitlab is timing out, preventing release tagging

Open Oncall Issues

Created	Summary
2021-01-19T18:44:03Z	Project Import Request - privatestorage: notifyme-backend
2021-01-19T18:42:40Z	Project Import Request - privatestorage: qatools
2021-01-19T18:41:09Z	Project Import Request - privatestorage: privatestorageweb
2021-01-19T18:39:53Z	Project Import Request - privatestorage: privatestorageops
2021-01-19T18:37:52Z	Project Import Request - privatestorage: privatestoragedesktop
2021-01-19T18:31:55Z	Project Import Request - privatestorage: bizops
2021-01-14T19:08:09Z	Project Import Request - skylight-tools: salesforce-integrations
2021-01-14T19:07:13Z	Project Import Request - skylight-tools: renohub
2021-01-14T19:04:10Z	Project Import Request - skylight-tools: cas
2020-12-31T15:49:42Z	One-Time Export for MakeMeReach
2020-12-18T22:29:14Z	CI clones fail for repositories with a path ending in a period
2020-09-14T18:52:09Z	PS Congregate VM for GitHost to GitLab.com Migration - Afilias
2020-09-02T13:47:51Z	disable-chef-client isn't preserved over reboots
2020-08-11T16:39:37Z	Investigate slow child pipeline triggering on pre.gitlab.com
2020-07-28T18:19:35Z	PS Congregate VM for BitBucket Server to GitLab.com Migration
2020-07-28T17:43:40Z	Project Import Request - ciorg/bridge/am-child-pool/api
2020-03-30T13:38:11Z	jobs.gitlab.com cert expired unnoticed on 2020-03-28

Edited 4 years ago by Brent Newton
