OnCall report for period: 2019-09-17 - 2019-09-24
Oncall during this period
Schedule | Username |
---|---|
SRE | Alex Hanselka |
SRE | Hendrik Meyer |
SRE | Henri Philipps |
PagerDuty Incidents
Created | Summary |
---|---|
2019-09-17T09:14:24Z | [15083] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-17T10:14:07Z | [15084] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-17T10:25:35Z | [15085] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-17T11:36:57Z | [15086] Firing 1 - postgres-dr-archive-01-db-gprd.c.gitlab-production.internal postgres service appears down |
2019-09-17T12:20:27Z | [15087] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-17T12:42:52Z | [15088] Firing 2 - AlertmanagerNotificationsFailing |
2019-09-17T12:53:14Z | [15089] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-09-17T13:33:43Z | [15090] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-09-17T15:43:44Z | [15091] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-09-17T15:52:28Z | [15092] Firing 1 - postgres-dr-archive-01-db-gprd.c.gitlab-production.internal postgres service appears down |
2019-09-17T18:12:50Z | [15093] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-18T04:31:36Z | [15096] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-18T07:05:12Z | [15097] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-18T08:05:49Z | [15098] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-18T08:07:19Z | [15099] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/ is down |
2019-09-18T08:50:37Z | [15100] Firing 1 - Increased Error Rate Across Fleet |
2019-09-18T08:50:57Z | [15101] Firing 1 - Gitaly is down on file-35-stor-gprd.c.gitlab-production.internal |
2019-09-18T08:55:31Z | [15102] Firing 1 - High Rails Error Rate on Front End |
2019-09-18T12:44:41Z | [15103] Firing 1 - 1% disk space left |
2019-09-19T02:48:56Z | [15104] Firing 1 - Gitaly is down on file-33-stor-gprd.c.gitlab-production.internal |
2019-09-19T02:53:24Z | [15105] Firing 1 - High Rails Error Rate on Front End |
2019-09-19T12:58:35Z | [15107] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-19T14:56:06Z | [15108] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-19T15:13:05Z | [15109] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-19T15:28:35Z | [15110] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-19T18:08:49Z | [15111] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-20T00:58:05Z | [15112] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-20T01:54:21Z | [15113] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-20T02:42:05Z | [15114] Firing 1 - Gitaly error rate is too high: 10.22 |
2019-09-20T03:17:07Z | [15116] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-20T03:26:05Z | [15117] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-20T04:03:35Z | [15118] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-20T04:10:13Z | [15119] Firing 1 - Gitaly is down on file-09-stor-gprd.c.gitlab-production.internal |
2019-09-20T04:15:29Z | [15120] Firing 1 - Gitaly is down on file-09-stor-gprd.c.gitlab-production.internal |
2019-09-20T04:15:57Z | [15121] Firing 1 - Gitaly is down on file-09-stor-gprd.c.gitlab-production.internal |
2019-09-20T04:25:50Z | [15122] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-20T05:13:35Z | [15123] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-20T06:57:49Z | [15124] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-20T07:34:20Z | [15125] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-20T07:43:20Z | [15126] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-20T08:01:21Z | [15127] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-21T01:01:36Z | [15128] Firing 1 - Chef client failures have reached critical levels |
2019-09-21T03:58:26Z | [15129] Firing 1 - The public dashboard page is down |
2019-09-21T13:26:09Z | [15130] Firing 1 - Gitaly error rate is too high: 7.87 |
2019-09-22T13:31:58Z | [15131] Need approval to execute a production change #1177 (closed) |
2019-09-23T03:50:49Z | [15132] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-23T04:07:49Z | [15133] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-23T09:14:35Z | [15134] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-23T13:37:57Z | [15137] Firing 1 - Chef client failures have reached critical levels |
2019-09-23T14:38:41Z | [15138] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-09-24T00:44:54Z | [15144] Firing 1 - Large number of overdue pull mirror jobs: 16090 |
2019-09-24T01:47:05Z | [15146] Firing 1 - Gitaly error rate is too high: 29.20 |
7 Day Issue Stats
- Oncall issues : 2
- Access Request : 0
- Change Issues : 14
- Incident Issues : 9
- CorrectiveAction Issues : 0
Change Issues
- 2019-09-23T16:47:40Z - WIP: Rotate Gitaly Token - unassigned
- 2019-09-23T05:20:12Z - Add gitlab shared-runners managers to whitelist - ggillies
- 2019-09-23T03:42:08Z - Add gitlab-ci Cloud NAT IP addresses to haproxy whitelist - cmiskell
- 2019-09-22T13:22:59Z - Enable `fsyncObjectFiles` on git storage to prevent file corruption - unassigned
- 2019-09-20T23:05:35Z - Bump process limits for Patroni and PgBouncer - ahmadsherif
- 2019-09-19T21:39:42Z - Upgrade Patroni to v1.6.0 - ahmadsherif
- 2019-09-18T20:57:03Z - Add environment variables to GitLab.com servers - ahanselka
- 2019-09-18T15:16:48Z - Migrate large projects off file-26-stor-gprd to file-39-stor-gprd - nnelson
- 2019-09-18T12:04:34Z - Expand and use Cloud NAT in gprd - cmiskell
- 2019-09-18T10:00:20Z - Add two more nodes to the realtime fleet - unassigned
- 2019-09-18T01:05:09Z - WIP: Increase Patroni's patience when talking with Consul - msmiley
- 2019-09-17T17:43:23Z - Add snowplow_enabled string to config for license server - cmcfarland
- 2019-09-17T11:27:34Z - Increase wal-prefetch for archive replica - abrandl
- 2019-09-17T08:05:27Z - Update /tmp directory permissions of file-33-stor-gprd.c.gitlab-production.internal - jarv
Incident Issues
- 2019-09-24T00:25:28Z - gitlab-org/gitlab project returning 404 - unassigned | ~S1 | ~"Service:Git" ~"Service:Gitaly" ~"Service:Sidekiq" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1183
- 2019-09-21T13:50:07Z - 2019-09-21 `GetArchive` caused high gitaly error rates alert on single project - unassigned | ~S4 | ~"Service:Gitaly" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1176
- 2019-09-20T04:43:44Z - 2019-09-20 Gitaly server file-09-stor-gprd rebooted due to GCP Host error - T4cC0re | ~S1 | ~"Service:Gitaly" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1174
- 2019-09-20T02:41:29Z - 2018-09-20 Some board lists missing for some users - ashmckenzie | ~S1 | ~"Service:GitLab Rails" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1173
- 2019-09-19T10:08:28Z - 2019-09-19 Spike in DB errors causing increased request latency - T4cC0re | ~S1 | ~"Service:API" ~"Service:Postgres" ~"Service:Web" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1171
- 2019-09-18T16:51:19Z - CI Minutes warning message - skarbek | ~S4 | ~"Service:CI Runners" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1169
- 2019-09-18T08:58:41Z - Gitaly down on file-35 - T4cC0re | ~S2 | ~"Service:Gitaly" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1165
- 2019-09-18T07:31:54Z - A spike of `invalid checksum digest format` errors causes elevated 5xx rate. - T4cC0re | ~S4 | ~"Service:Registry" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1164
- 2019-09-17T16:51:46Z - GitLab logging infrastructure down since ~16h00 UTC - dawsmith | ~S2 | ~"Service:ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1161
CorrectiveAction Issues
Open Issue Stats
- Oncall issues : 14
- Change issues : 19
- Incident issues : 3
- Access Request : 4
- CorrectiveAction : 83
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2019-09-23T16:47:40Z | unassigned | WIP: Rotate Gitaly Token |
2019-09-20T23:05:35Z | ahmadsherif | Bump process limits for Patroni and PgBouncer |
2019-09-19T21:39:42Z | ahmadsherif | Upgrade Patroni to v1.6.0 |
2019-09-18T12:04:34Z | cmiskell | Expand and use Cloud NAT in gprd |
2019-09-18T10:00:20Z | unassigned | Add two more nodes to the realtime fleet |
2019-09-18T01:05:09Z | msmiley | WIP: Increase Patroni's patience when talking with Consul |
2019-09-17T17:43:23Z | cmcfarland | Add snowplow_enabled string to config for license server |
2019-09-16T17:10:18Z | cmcfarland | Run chef with new us-central role in ops-us-central |
2019-09-12T08:00:43Z | andrewn | Add more hosts to the pipeline fleet |
2019-09-09T20:43:17Z | ahanselka | Experimental Changes to runner configuration - grow understanding on runner cron issues. |
2019-08-29T09:35:07Z | mwasilewski-gitlab | switch logging to the new ES7 clusters |
2019-08-27T14:44:48Z | asaba | Enable additional reCAPTCHA protection for Credential Stuffing in 12.2.3 |
2019-08-16T12:04:45Z | unassigned | Remove patroni-01 from the failover selection. |
2019-08-14T19:42:03Z | gerardo.herzig | Removal of unused configuration files in patroni nodes |
2019-08-06T10:23:45Z | adescoms | Force eager provisioning of GCP disks after size increase |
2019-07-16T16:23:18Z | cmcfarland | WIP: Enable pages access control setting in gitlab.rb |
2019-07-02T20:54:46Z | devin | Migrate to Hashed Storage from legacy project storage |
2019-06-19T06:59:56Z | unassigned | Implement pipeline quotas on GitLab.com |
2019-03-19T17:32:50Z | Finotto | Convert PK/FK from int4 to int8: events.id, push_event_payloads.event_id, and ci_build_trace_sections.id. Stage 1 of 2. |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2019-09-20T02:41:29Z | ashmckenzie | 2018-09-20 Some board lists missing for some users |
2019-09-19T10:08:28Z | T4cC0re | 2019-09-19 Spike in DB errors causing increased request latency |
2019-09-18T08:58:41Z | T4cC0re | Gitaly down on file-35 |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2019-09-23T23:41:12Z | unassigned | Security Req: Pipelines Deleted from Accidental Commits |
2019-09-17T18:10:06Z | unassigned | Import request (for nq-develop): nq-app-android |
2019-09-14T10:06:43Z | hphilipps | file-15-stor-gprd rebooted |
2019-09-12T12:34:42Z | hphilipps | file-33-stor-gprd rebooted |
2019-09-12T08:04:32Z | ahanselka | Rotate Gitaly authentication tokens after 12.3.2 deploy |
2019-09-11T19:47:14Z | unassigned | Add more hosts to the pipeline fleet |
2019-09-11T04:19:29Z | hphilipps | file-35-stor-gprd rebooted |
2019-09-06T19:54:36Z | unassigned | GitLab CE review apps are failing to acquire External IP for NGINX Ingress Controller |
2019-09-03T07:50:23Z | unassigned | Production Kibana returning 502s occasionally |
2019-07-03T20:00:25Z | cmiskell | DNS: Wildcard record for "serverless-evaluation.sec.gitlab.net" |
2019-06-18T09:15:35Z | unassigned | Create alert for registry latency or memory |
2019-06-13T03:47:53Z | aamarsanaa | Update Query source to Global in Grafana dashboards that are not pulling any metrics |
2019-03-25T17:19:06Z | nnelson | Many staging alerts still paging production |
2017-10-13T00:36:06Z | unassigned | security - add CAA records to DNS |
This issue was automatically generated using oncall-robot-assistant