OnCall report for period: 2020-04-28 - 2020-05-05
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Amar Amarsanaa |
SRE 8 Hour | Craig Miskell |
SRE 8 Hour | Craig Furman |
SRE 8 Hour | Nels Nelson |
PagerDuty Incidents
* Number of incidents: **44**
Created | Summary |
---|---|
2020-04-28T07:23:38Z | [20230] Firing 2 - IncreasedServerResponseErrors |
2020-04-28T13:32:36Z | [20236] Firing 1 - Large number of overdue pull mirror jobs |
2020-04-28T14:52:22Z | [20237] Firing 2 - PrometheusManyRestarts |
2020-04-28T15:36:34Z | [20238] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-04-28T15:36:34Z | [20239] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-04-28T18:01:05Z | [20241] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-04-28T18:01:05Z | [20240] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-04-28T18:07:02Z | [20242] Active Distributed Cred Stuffing Attack |
2020-04-28T18:16:19Z | [20243] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-04-29T06:34:58Z | [20256] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO |
2020-04-29T13:10:51Z | [20264] Firing 1 - Large number of overdue pull mirror jobs |
2020-04-29T14:11:22Z | [20267] Firing 1 - Large number of overdue pull mirror jobs |
2020-04-29T14:52:36Z | [20268] Firing 1 - Large number of overdue pull mirror jobs |
2020-04-29T15:45:30Z | [20271] Firing 1 - Chef client failures have reached critical levels |
2020-04-29T16:20:19Z | [20274] HTTP500's during CI job artifact uploads |
2020-04-29T16:20:20Z | [20275] HTTP500's during CI job artifact uploads |
2020-04-29T16:56:51Z | [20276] Firing 1 - Large number of overdue pull mirror jobs |
2020-04-29T22:40:52Z | [20279] Security Incident - rotating DNS mgmt in AWS |
2020-04-30T00:04:22Z | [20281] Firing 2 - PrometheusManyRestarts |
2020-04-30T10:15:37Z | [20297] This is a manual paging test. |
2020-04-30T12:28:51Z | [20300] Firing 1 - Large number of overdue pull mirror jobs |
2020-04-30T12:50:50Z | [20302] Firing 1 - Large number of overdue pull mirror jobs |
2020-04-30T15:48:25Z | [20310] Pingdom check check:https://gitlab-examples.gitlab.io/ is down |
2020-04-30T18:40:58Z | [20314] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-30T21:44:28Z | [20315] Need to clear out password expiries for ~4k users |
2020-05-01T00:06:58Z | [20317] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO |
2020-05-01T13:09:58Z | [20330] Firing 1 - Last WALE backup was seen 20m 4s ago. |
2020-05-01T16:18:58Z | [20332] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO |
2020-05-01T17:01:23Z | [20333] Firing 1 - 5% disk space left |
2020-05-01T17:32:26Z | [20335] Firing 1 - 5% disk space left |
2020-05-01T20:56:50Z | [20338] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-05-01T21:30:51Z | [20339] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-05-01T22:51:50Z | [20340] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-05-03T00:30:31Z | [20361] Firing 1 - SSL certificate for https://githost.io expires in 23h 29m 58s |
2020-05-04T03:51:58Z | [20384] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO |
2020-05-04T16:41:34Z | [20401] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-05-04T16:41:34Z | [20402] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-05-04T17:56:04Z | [20403] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-05-04T17:56:05Z | [20404] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-05-04T19:03:58Z | [20406] Firing 1 - HPA unable to scale up |
2020-05-04T19:22:45Z | [20407] Accidentally deleted user |
2020-05-04T21:26:29Z | [20410] Firing 1 - HPA unable to scale up |
2020-05-04T23:43:38Z | [20413] Firing 1 - 5% disk space left |
2020-05-04T23:52:29Z | [20414] Firing 1 - HPA unable to scale up |
7 Day Issue Stats
- Oncall issues : 4
- Access Request : 0
- Change Issues : 1
- Incident Issues : 42
- CorrectiveAction Issues : 0
Change Issues
- 2020-04-28T17:57:12Z - Create new gitaly storage shard node `file-51-stor-gprd` to replace `file-41-stor-gprd` in the configured rotation for storing new projects - nnelson
Incident Issues
- 2020-05-05T00:51:56Z - UA challenge on CloudFlare - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2068
- 2020-05-04T19:32:36Z - 2020-05-04: Accidentally deleted user - nnelson | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2067
- 2020-05-04T19:10:00Z - 2020-05-04: HPA unable to scale up - nnelson | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2066
- 2020-05-04T18:37:34Z - staging.GitLab.com is down for 30 minutes - nnelson | ~S4 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2065
- 2020-05-04T16:32:20Z - 2020-05-04: Anomaly detection: The `web-pages` service (`main` stage) is receiving more requests than normal - nnelson | ~S4 | ServicePages |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2063
- 2020-05-04T14:53:11Z - 2020-05-04: mail queues building up - nnelson | ~S3 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2062
- 2020-05-04T13:49:00Z - 2020-05-04: pages elevated error rate - craigf | ~S4 | ServicePages |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2061
- 2020-05-04T10:31:56Z - 2020-05-04: possible data loss via external diffs migration - ahmadsherif | ~S4 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2060
- 2020-05-04T08:52:01Z - 2020-05-04: Rails logs not loading - craigf | ~S4 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2059
- 2020-05-04T04:00:29Z - 2020-05-04 The `patroni` service (`main` stage) has a apdex score (latency) below SLO - cmiskell | ~S4 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2057
- 2020-05-03T17:06:52Z - The `postgres-archive` service (`main` stage), `cpu` component has a saturation exceeding SLO and is close to its capacity limit - nnelson | ~S4 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2056
- 2020-05-03T01:21:42Z - githost.io certificate about to expire - cmiskell | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2055
- 2020-05-02T00:14:12Z - 2020-05-02 The `web` service, `puma` component, `cny` stage, has an error burn-rate exceeding SLO - cmiskell | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2054
- 2020-05-01T21:02:35Z - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m - nnelson | ~S4 | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2051
- 2020-05-01T17:09:33Z - 5% disk space left on multiple api nodes - nnelson | ~S4 | ServiceAPI |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2050
- 2020-05-01T16:43:36Z - The `patroni` service (`main` stage) has a apdex score (latency) below SLO - nnelson | ~S4 | ServicePatroni |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2049
- 2020-05-01T13:16:00Z - 2020-05-01: postgres WAL-E is apparently delayed - craigf | ~S4 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2048
- 2020-05-01T04:20:34Z - 2020-05-01 Workhorse + puma 1h SLO burn breached - cmiskell | ~S4 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2047
- 2020-05-01T00:19:46Z - 2020-05-01 The `patroni` service (`main` stage) has a apdex score (latency) below SLO - cmiskell | ~S4 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2046
- 2020-04-30T19:25:41Z - 2020-04-30 - The `workhorse` component of the `web` service, (`cny` stage), has a apdex-score burn rate outside of SLO - nnelson | ~S4 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2044
- 2020-04-30T16:26:15Z - 2020-04-30 - The `workhorse` component of the `web` service, (`cny` stage), has a apdex-score burn rate outside of SLO - nnelson | ~S4 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2043
- 2020-04-30T15:51:46Z - 2020-04-30 - Pingdom check: https://gitlab-examples.gitlab.io/ is down - nnelson | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2042
- 2020-04-30T15:39:31Z - 2020-04-30 - Workhorse and web-pages SLO violations possibly from a web crawler - nnelson | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2041
- 2020-04-30T08:28:09Z - 2020-04-30: Some slow workhorse requests - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2039
- 2020-04-30T08:28:07Z - 2020-04-30: slightly degraded web cny latency - craigf | ~S4 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2038
- 2020-04-30T06:59:28Z - 2020-04-30 Workhorse canary component apdex-score burn rate outside of SLO - cmiskell | ~S4 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2037
- 2020-04-30T03:22:34Z - 2020-04-30 Sidekiq apdex drop - cmiskell | ~S4 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2035
- 2020-04-30T00:16:58Z - 2020-04-30 Thanos OOM (cgroup limit) on production - cmiskell | ~S4 | ServiceThanos |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2034
- 2020-04-29T22:41:54Z - Route53 Access Key Leak - Rotate Access Key - cmiskell | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2033
- 2020-04-29T16:24:56Z - 2020-04-29 - Large number of overdue pull mirror jobs - nnelson | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2032
- 2020-04-29T16:20:09Z - HTTP500's during CI job artifact uploads - nnelson | ~S2 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2031
- 2020-04-29T15:49:10Z - 2020-04-29 - Chef client failures have reached critical levels - nnelson | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2030
- 2020-04-29T15:43:40Z - 2020-04-29 - The `pages` service (`main` stage) has an error-ratio exceeding SLO - nnelson | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2029
- 2020-04-29T13:18:12Z - 2020-04-29: Delayed pull mirrors - craigf | ~S4 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2027
- 2020-04-29T13:18:09Z - 2020-04-29: Delayed pull mirrors - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2026
- 2020-04-29T10:05:27Z - 2020-04-29: Elastic cloud prod log cluster health is yellow - unassigned | ~S4 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2025
- 2020-04-29T09:33:50Z - 2020-04-29: log ingestion is lagging - craigf | ~S4 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2024
- 2020-04-29T09:33:49Z - 2020-04-29: log ingestion has stopped - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2023
- 2020-04-29T07:44:04Z - 2020-04-29 The `patroni` service (`main` stage) has a apdex score (latency) below SLO - cmiskell | ~S4 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2022
- 2020-04-28T18:46:05Z - 2020-04-28: Credential stuffing distributed attack on-going - nnelson | ~S2 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2020
- 2020-04-28T13:28:28Z - 2020-04-28: elevated sidekiq job latency - craigf | ~S3 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2016
- 2020-04-28T07:43:17Z - 2020-04-28: Error burst from git ssh backends - jarv | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2015
CorrectiveAction Issues
- 2020-05-04T16:43:52Z - investigate charts on the monitoring dashboard missing data - mwasilewski-gitlab
- 2020-04-30T13:31:20Z - Three Typos to be corrected in Change Management Page - unassigned
- 2020-04-29T15:35:24Z - document in runbooks in /docs/gitaly the use of housekeeping button for a lot of upload-packs processes - mwasilewski-gitlab
- 2020-04-29T11:39:43Z - pgbouncer process exporter group is not working on pgbouncer production nodes - unassigned
- 2020-04-29T09:00:32Z - tune down the Elastic write queue rejections alert - mwasilewski-gitlab
- 2020-04-28T15:26:32Z - find a way to kill running sidekiq jobs - mwasilewski-gitlab
Open Issue Stats
- Oncall issues : 5
- Change issues : 1
- Incident issues : 1
- Access Request : 5
- CorrectiveAction : 90
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2020-04-14T20:25:05Z | cindy | Migrate large projects off file-25-stor-gprd to file-01-stor-gprd |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-05-04T10:31:56Z | ahmadsherif | 2020-05-04: possible data loss via external diffs migration |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-05-01T16:01:36Z | nnelson | Import request (for alex-solutions/core): alex-app |
2020-04-23T12:53:53Z | unassigned | Set GITLAB_QA_FORMLESS_LOGIN_TOKEN variable on /etc/gitlab/gitlab.rb on live environments |
2020-03-30T13:38:11Z | brentnewton | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
2020-03-23T23:43:57Z | ggillies | Manually remove project |
2019-10-23T13:05:14Z | cmcfarland | cleanup registered nodes in chef |
This issue was automatically generated using oncall-robot-assistant