OnCall report for period: 2017-12-19 - 2017-12-26

Oncall during this period

Schedule  Username
AMA       Ilya Frolov
AMA       Jason Tevnan
EU        Pablo Carranza
EU        John Northrup
EU        Ilya Frolov
EU        John Jarvis

PagerDuty Incidents

  • Number of incidents: 16

Created               Summary
2017-12-19T19:09:51Z  [#1167] Postgres Replication lag is over 200MB
2017-12-20T09:47:11Z  [#1168 (closed)] Pingdom check GitLab.com master branch is down
2017-12-20T10:21:01Z  [#1169 (closed)] PostgreSQL replication slot with an stale xmin which can cause bloat on the primary
2017-12-20T10:56:21Z  [#1170 (closed)] PostgreSQL replication slot with an stale xmin which can cause bloat on the primary
2017-12-20T11:17:08Z  [#1171] PostgreSQL replication slot with an stale xmin which can cause bloat on the primary
2017-12-20T19:16:55Z  [#1172] No disk space left on / on prometheus-01.us-east1-d.gce.gitlab-runners.gitlab.net: 0%
2017-12-20T19:17:25Z  [#1173] No disk space left on / on prometheus-01.us-east1-c.gce.gitlab-runners.gitlab.net: 0%
2017-12-20T19:46:10Z  [#1174] No disk space left on / on prometheus-01.nyc1.do.gitlab-runners.gitlab.net: 0%
2017-12-20T22:58:40Z  [#1175] No disk space left on / on prometheus-01.us-east1-d.gce.gitlab-runners.gitlab.net: 0%
2017-12-20T22:59:10Z  [#1176 (closed)] No disk space left on / on prometheus-01.us-east1-c.gce.gitlab-runners.gitlab.net: 0%
2017-12-22T10:35:31Z  [#1177 (closed)] Gitaly latency on nfs-file-11.stor.gitlab.com has been over 1m during the last 5m
2017-12-22T12:06:36Z  [#1178 (closed)] PostgreSQL replication slot with an stale xmin which can cause bloat on the primary
2017-12-22T13:41:55Z  [#1179 (closed)] No disk space left on /opt/gitlab on runners-cache-5.gitlab.com: 997.3m%
2017-12-24T11:05:40Z  [#1180 (closed)] CPU use percent is extremely high on db3.cluster.gitlab.com for the past 2 hours.
2017-12-25T07:31:28Z  [#1181 (closed)] Gitaly latency on nfs-file-04.stor.gitlab.com has been over 1m during the last 5m
2017-12-25T07:54:03Z  [#1182 (closed)] Gitaly latency on nfs-file-04.stor.gitlab.com has been over 1m during the last 5m

Issues

Stats for the last oncall period

  • Total number of oncall issues opened in the last oncall shift: 10
    • Access Request: 0
    • Critical: 1
  • Total number of oncall issues closed in this milestone: 0
    • Access Request: 0
    • Critical: 0

Open OnCall Issues

  • Total number of open oncall issues: 11
    • Access Request: 0
    • Critical: 1

Created              Assignee    Summary
24 Dec 17 12:33 UTC  unassigned  High database load on primary database
21 Dec 17 16:51 UTC  unassigned  Database specialists should be on-call for database related problems
21 Dec 17 14:03 UTC  unassigned  Add alert for sequential reads
21 Dec 17 13:59 UTC  unassigned  Alert on errors in the pgbouncer log
21 Dec 17 11:33 UTC  tmaczukin   Add cleaning mechanism for runners-cache-X machines
21 Dec 17 09:25 UTC  unassigned  validate end-to-end artifact uploading on staging and enable on production
20 Dec 17 12:06 UTC  unassigned  webhooks broken after ssl update
12 Dec 17 18:54 UTC  ahanselka   Need account/access to OpenVAS security scanner
23 Nov 17 10:37 UTC  unassigned  Cleanup SSL certificates
15 Nov 17 10:34 UTC  unassigned  Detect and alarm on long-running orphan processes on sidekiq
06 Nov 17 14:30 UTC  unassigned  Alarms should go off when we fail to create azure snapshots

Weekly Ops

Graphs (images not included):

  • Web/Git/API p95 latency
  • Gitaly p95 latency
  • NFS timeouts
  • Sidekiq CPU
  • API CPU
  • Git CPU
  • Web CPU

This issue was automatically generated using oncall-robot-assistant