OnCall report for period: 2019-08-13 - 2019-08-20
Oncall during this period
Schedule | Username |
---|---|
SRE | Alex Hanselka |
SRE | Craig Furman |
SRE | Nels Nelson |
PagerDuty Incidents
* Number of incidents: **43**
Created | Summary |
---|---|
2019-08-13T09:48:12Z | [14764] Firing 1 - PostgreSQL dead tuples is too large |
2019-08-13T09:53:44Z | [14765] Firing 5 - PostgreSQL_ReplicaStaleXmin |
2019-08-13T13:39:51Z | [14766] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2019-08-13T14:00:52Z | [14767] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2019-08-14T08:27:22Z | [14771] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/ is down |
2019-08-14T08:29:57Z | [14772] Firing 1 - patroni-01-db-gprd.c.gitlab-production.internal postgres service appears down |
2019-08-14T08:33:52Z | [14773] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2019-08-14T08:58:42Z | [14774] Firing 1 - Unused Replication Slots for patroni-04-db-gprd.c.gitlab-production.internal |
2019-08-14T09:03:38Z | [14775] Firing 2 - IncreasedErrorRateOtherBackends |
2019-08-14T09:06:58Z | [14776] Firing 1 - PostgreSQL dead tuples is too large |
2019-08-14T09:28:57Z | [14777] Firing 1 - Postgres exporter is showing errors for the last hour |
2019-08-14T09:32:16Z | [14778] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-08-14T10:04:52Z | [14779] Pingdom check check:https://gitlab.com/api/v4/projects/13083 is down |
2019-08-14T10:07:17Z | [14780] Firing 1 - Patroni is down |
2019-08-14T10:23:30Z | [14781] Firing 1 - Patroni is down |
2019-08-14T11:03:17Z | [14782] Firing 1 - Patroni is down |
2019-08-14T11:59:13Z | [14784] Firing 1 - Postgres Replication lag is over 2 minutes |
2019-08-14T11:59:13Z | [14783] Firing 1 - Postgres Replication lag (in bytes) is high |
2019-08-14T11:59:28Z | [14785] Firing 1 - Postgres seems to be processing very few transactions |
2019-08-14T12:54:43Z | [14786] Firing 1 - Unused Replication Slots for patroni-04-db-gprd.c.gitlab-production.internal |
2019-08-14T13:04:37Z | [14787] Pingdom check check:https://gitlab.com/api/v4/projects/13083 is down |
2019-08-14T13:12:39Z | [14788] Firing 1 - The public dashboard page is down |
2019-08-14T13:24:38Z | [14789] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-ce/issues is down |
2019-08-14T14:01:54Z | [14790] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2019-08-14T17:32:15Z | [14791] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2019-08-14T17:32:32Z | [14792] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-08-15T09:27:14Z | [14796] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-08-15T09:27:14Z | [14795] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2019-08-15T10:06:17Z | [14797] Firing 1 - postgres-dr-archive-01-db-gprd.c.gitlab-production.internal postgres service appears down |
2019-08-15T10:26:58Z | [14798] Archival postgres replicas have stopped recovery |
2019-08-15T11:04:56Z | [14799] Firing 1 - Postgres exporter is showing errors for the last hour |
2019-08-15T15:24:45Z | [14800] Firing 1 - Postgres seems to be processing very few transactions |
2019-08-15T21:57:42Z | [14801] Firing 1 - patroni-08-db-gprd.c.gitlab-production.internal postgres service appears down |
2019-08-15T22:21:15Z | [14802] Firing 1 - Unused Replication Slots for patroni-04-db-gprd.c.gitlab-production.internal |
2019-08-15T22:56:42Z | [14803] Firing 1 - Postgres exporter is showing errors for the last hour |
2019-08-16T08:02:31Z | [14804] Firing 1 - postgres-dr-delayed-01-db-gprd.c.gitlab-production.internal postgres service appears down |
2019-08-16T09:39:14Z | [14805] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-08-16T09:41:29Z | [14806] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2019-08-16T13:57:29Z | [14807] Firing 1 - postgres-dr-delayed-01-db-gprd.c.gitlab-production.internal postgres service appears down |
2019-08-16T23:40:37Z | [14808] Firing 1 - High 4xx Error Rate on Docker Registry |
2019-08-18T22:09:50Z | [14809] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-08-19T21:03:21Z | [14810] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-08-20T00:45:14Z | [14811] Firing 1 - ops.gitlab.net is returning errors for 10m |
7 Day Issue Stats
- Oncall issues : 4
- Access Request : 0
- Change Issues : 6
- Incident Issues : 8
- CorrectiveAction Issues : 0
Change Issues
- 2019-08-19T20:34:59Z - Migrate Ops instance to CloudSQL - ahanselka
- 2019-08-16T12:04:45Z - Remove patroni-01 from the failover selection. - unassigned
- 2019-08-15T13:12:56Z - Experimentally apply pipeline configuration changes in production. - andrewn
- 2019-08-15T00:01:16Z - Change fastly configuration for about.gitlab.com to use GCS - alejandro
- 2019-08-14T19:42:03Z - Removal of unused configuration files in patroni nodes - unassigned
- 2019-08-13T21:05:45Z - Tweaking (decreasing) idle_in_transaction_session_timeout on Production - unassigned
Incident Issues
- 2019-08-16T07:52:17Z - expired certificate for *.ops.gitlab.net - craigf | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1064
- 2019-08-15T12:57:36Z - Corrupted index: index_events_on_project_id_and_id - craigf | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1060
- 2019-08-14T14:28:47Z - Thanos and Prometheus not responding under load - unassigned | ~S1 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1056
- 2019-08-14T08:42:25Z - 2019-08-14: increased latencies caused by a database failover - craigf | ~S1 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1054
- 2019-08-13T13:45:28Z - 2019-08-13: Elevated web latency and errors - craigf | ~S3 | ~"Service:Web" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1052
- 2019-08-13T13:40:16Z - 2019-08-13: Elevated error rate from Github integration - craigf | ~S4 | ~"Service:Sidekiq" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1051
- 2019-08-13T09:55:12Z - 2019-08-13: Postgres: too many dead tuples and stale replication - craigf | ~S4 | ~"Service:Postgres" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1050
- 2019-08-13T08:38:49Z - Reports of delays in sending email - craigf | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1049
CorrectiveAction Issues
- 2019-08-15T19:42:43Z - Enable Event Logging for Route53 on AWS Cloudtrail for all GitLab orgs - unassigned
- 2019-08-15T09:11:28Z - Why did one read replica queue client connections after a failover? - craigf
- 2019-08-15T08:57:00Z - Old postgres primaries should rejoin cluster after failover without manual intervention - unassigned
- 2019-08-14T17:51:26Z - Communication during incidents - glopezfernandez
- 2019-08-14T17:03:34Z - Patroni runbooks and tooling - Finotto
- 2019-08-14T17:02:49Z - Patroni runbooks: add enable statistics - craigf
Open Issue Stats
- Oncall issues : 22
- Change issues : 15
- Incident issues : 0
- Access Request : 4
- CorrectiveAction : 89
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2019-08-19T20:34:59Z | ahanselka | Migrate Ops instance to CloudSQL |
2019-08-16T12:04:45Z | unassigned | Remove patroni-01 from the failover selection. |
2019-08-14T19:42:03Z | unassigned | Removal of unused configuration files in patroni nodes |
2019-08-13T21:05:45Z | unassigned | Tweaking (decreasing) idle_in_transaction_session_timeout on Production |
2019-08-06T10:23:45Z | adescoms | Force eager provisioning of GCP disks after size increase |
2019-08-02T18:52:15Z | nnelson | Migrate large projects off file-28-stor-gprd to file-36-stor-gprd |
2019-08-02T18:51:05Z | nnelson | Migrate large projects off file-24-stor-gprd to file-36-stor-gprd |
2019-08-01T21:29:30Z | nnelson | Migrate large projects off file-25-stor-gprd to file-36-stor-gprd |
2019-07-19T09:19:15Z | jarv | Reduce unicorn worker count on two web workers |
2019-07-16T16:23:18Z | cmcfarland | Enable pages access control setting in gitlab.rb |
2019-07-13T01:03:58Z | msmiley | Prevent residual HAProxy processes by setting hard-stop-after |
2019-07-09T03:15:10Z | cmiskell | Enable camoproxy functionality |
2019-06-19T06:59:56Z | unassigned | Implement pipeline quotas on GitLab.com |
2019-06-07T16:08:53Z | aamarsanaa | add more storage nodes and rebalance existing ones |
2019-03-19T17:32:50Z | Finotto | Convert PK/FK from int4 to int8: events.id, push_event_payloads.event_id, and ci_build_trace_sections.id. Stage 1 of 2. |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2019-08-16T12:30:32Z | unassigned | The GCP snapshot restore pipeline is broken for production |
2019-08-14T06:59:01Z | craigf | Add more besteffort workers to the Sidekiq fleet |
2019-08-13T12:49:50Z | craigf | Get archived log from sidekiq worker |
2019-08-08T17:13:45Z | unassigned | Identify limits that would have prevented recent incidents |
2019-08-08T15:19:47Z | hphilipps | RCA: Gitaly n+1 calls causing bad latency and sidekiq queues to grow |
2019-08-05T19:49:35Z | ansdval | RCA: Note Creation on commit via API Calls Halted new_note Sidekiq Queue |
2019-08-01T15:58:00Z | hphilipps | RCA Gitaly latency below SLO |
2019-08-01T04:47:51Z | unassigned | update GitLab.com to not require 2FA on the Gitlab.com side for SAML Logins via Okta |
2019-07-31T13:01:20Z | hphilipps | Gitaly SLO alerts should go to pagerduty |
2019-07-31T12:55:40Z | hphilipps | Alert on gitaly file descriptors |
2019-07-30T09:03:49Z | unassigned | Update Snowplow credentials on customers.gitlab.com |
2019-07-29T07:52:37Z | unassigned | Console host: patroni hostname doesnt resolve through consul dns |
2019-07-17T09:44:07Z | andrewn | Identify and Kill Zombie Sidekiq Jobs |
2019-07-03T20:00:25Z | cmiskell | DNS: Wildcard record for "serverless-evaluation.sec.gitlab.net" |
2019-06-18T09:15:35Z | unassigned | Create alert for registry latency or memory |
2019-06-18T01:44:01Z | unassigned | Review load-balancing configuration for registry |
2019-06-13T03:47:53Z | aamarsanaa | Update Query source to Global in Grafana dashboards that are not pulling any metrics |
2019-06-11T11:30:31Z | unassigned | no alert for customer.gitlab.com being down |
2019-06-04T17:29:12Z | ahanselka | Whitelist Atlassian IP address Space for API Calls |
2019-05-27T14:16:28Z | unassigned | Automate SSLMate Certificate grabs |
2019-03-25T17:19:06Z | nnelson | Many staging alerts still paging production |
2017-10-13T00:36:06Z | unassigned | security - add CAA records to DNS |
This issue was automatically generated using oncall-robot-assistant
Edited by Dave Smith