OnCall report for period: 2019-09-24 - 2019-10-01
Oncall during this period
Schedule | Username |
---|---|
SRE | Ahmad Sherif |
SRE | Alejandro Rodriguez |
SRE | Hendrik Meyer |
SRE | Craig Miskell |
PagerDuty Incidents
* Number of incidents: **32**
Created | Summary |
---|---|
2019-09-24T14:51:07Z | [15149] Firing 1 - High 4xx Error Rate on Docker Registry |
2019-09-24T23:46:39Z | [15161] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-25T04:44:07Z | [15162] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-25T05:09:14Z | [15163] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-09-25T09:44:22Z | [15164] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-25T10:30:20Z | [15165] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-25T12:07:57Z | [15166] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-09-25T12:45:29Z | [15167] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-09-25T14:06:42Z | [15168] Firing 1 - Postgres is generating XLOG too fast, expect this to cause replication lag |
2019-09-25T15:15:30Z | [15169] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-09-25T16:35:37Z | [15171] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-25T22:02:22Z | [15176] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-26T12:32:40Z | [15179] Firing 1 - 1% disk space left |
2019-09-26T14:31:18Z | [15182] Firing 1 - CPU use percent is extremely high on pubsub-rails-inf-gprd.c.gitlab-production.internal for the past 2 hours. |
2019-09-26T14:32:50Z | [15183] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-27T03:30:51Z | [15192] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-27T06:57:08Z | [15193] Firing 1 - Prometheus not connected to any Alertmanagers |
2019-09-27T09:10:36Z | [15194] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-27T12:27:22Z | [15196] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-27T12:39:07Z | [15197] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-27T12:54:39Z | [15198] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-27T13:44:36Z | [15199] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-27T20:53:08Z | [15200] Firing 1 - Gitaly error rate is too high: 5.18 |
2019-09-29T08:35:41Z | [15202] Firing 1 - The alert test file is missing! |
2019-09-29T08:36:15Z | [15203] Pingdom check check:https://gitlab.com/gitlab-com/gitlab-com-infrastructure/tree/master is down |
2019-09-30T04:13:59Z | [15207] Firing 1 - GitLab.com is down for 2 minutes |
2019-09-30T04:14:00Z | [15208] Firing 1 - GitLab.com is down for 2 minutes |
2019-09-30T14:04:04Z | [15210] Firing 1 - Gitaly error rate is too high: 10.84 |
2019-09-30T14:10:42Z | [15211] Firing 1 - Postgres seems to be processing very few transactions |
2019-09-30T14:12:11Z | [15212] Firing 1 - patroni-02-db-gprd.c.gitlab-production.internal postgres service appears down |
2019-09-30T22:53:09Z | [15213] Firing 1 - Large number of overdue pull mirror jobs: 12287 |
2019-09-30T23:38:07Z | [15214] Firing 1 - Large number of overdue pull mirror jobs: 16273.5 |
7 Day Issue Stats
- Oncall issues : 2
- Access Request : 0
- Change Issues : 9
- Incident Issues : 5
- CorrectiveAction Issues : 0
Change Issues
- 2019-10-01T00:17:53Z - Add gitlab_server::hack_system_gitconfig recipe to gitaly nodes - cmiskell
- 2019-09-27T10:54:08Z - Roll out Cloud NAT to CI shared runners - hphilipps
- 2019-09-27T09:14:43Z - Enable PlantUML integration on GitLab.com - jarv
- 2019-09-27T03:10:41Z - Disable pcid and pti on api-16-sv-gprd - cmiskell
- 2019-09-26T17:34:05Z - Cleanup moved gitlab-foss issues - felipe_artur
- 2019-09-26T11:16:31Z - Increase the node pool size of the production GKE cluster - jarv
- 2019-09-26T09:08:08Z - Update robots.txt for asset crawling - jarv
- 2019-09-26T02:44:53Z - Reduce git FE fleet size - cmiskell
- 2019-09-25T21:22:42Z - Delete zero-byte files on file-33-stor-gprd.c.gitlab-production.internal for Project id:13691441 TARDIS_MM/ewc/admin - nnelson
Incident Issues
- 2019-09-29T11:10:43Z - 2019-09-29: Partial degradation on canary - ahmadsherif | ~S2 | ~"Service:Web" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1207
- 2019-09-28T21:12:52Z - Multiple reports of slowness + errors; sidekiq ASAP priority fleet super busy - cmiskell | ~S2 | ~"Service:Sidekiq" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1204
- 2019-09-27T19:19:28Z - elasticsearch group search leaks private data - mdelaossa | ~S1 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1200
- 2019-09-26T13:14:24Z - 2019-09-26: InfluxDB nodes were not using the persistent data disk - ahmadsherif | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1193
- 2019-09-25T09:29:49Z - 2019-09-25: High error rate on Docker Registry LBs - ahmadsherif | ~S4 | ~"Service:Registry" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1186
CorrectiveAction Issues
- 2019-10-01T00:24:02Z - Remove pgio from staging patroni servers - unassigned
- 2019-09-30T19:21:10Z - Update Runbooks to be more useful during an accidental project deletion. - unassigned
- 2019-09-30T17:12:42Z - DB Export - GitLab.com Admin Accounts & Token Details - unassigned
- 2019-09-28T23:20:56Z - Reduce threshold for alerting on sidekiq new_note queue. - cmiskell
Open Issue Stats
- Oncall issues : 10
- Change issues : 16
- Incident issues : 0
- Access Request : 4
- CorrectiveAction : 78
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2019-09-27T10:54:08Z | hphilipps | Roll out Cloud NAT to CI shared runners |
2019-09-26T17:34:05Z | felipe_artur | Cleanup moved gitlab-foss issues |
2019-09-23T16:47:40Z | unassigned | WIP: Rotate Gitaly Token |
2019-09-19T21:39:42Z | ahmadsherif | Upgrade Patroni to v1.6.0 |
2019-09-18T10:00:20Z | unassigned | Add two more nodes to the realtime fleet |
2019-09-18T01:05:09Z | msmiley | WIP: Increase Patroni's patience when talking with Consul |
2019-09-16T17:10:18Z | cmcfarland | Run chef with new us-central role in ops-us-central |
2019-09-12T08:00:43Z | andrewn | Add more hosts to the pipeline fleet |
2019-09-09T20:43:17Z | ahanselka | Experimental Changes to runner configuration - grow understanding on runner cron issues. |
2019-08-29T09:35:07Z | mwasilewski-gitlab | switch logging to the new ES7 clusters |
2019-08-27T14:44:48Z | asaba | Enable additional reCAPTCHA protection for Credential Stuffing in 12.2.3 |
2019-08-16T12:04:45Z | unassigned | Remove patroni-01 from the failover selection. |
2019-08-14T19:42:03Z | gerardo.herzig | Removal of unused configuration files in patroni nodes |
2019-07-02T20:54:46Z | devin | Migrate to Hashed Storage from legacy project storage |
2019-06-19T06:59:56Z | unassigned | Implement pipeline quotas on GitLab.com |
2019-03-19T17:32:50Z | Finotto | Convert PK/FK from int4 to int8: events.id, push_event_payloads.event_id, and ci_build_trace_sections.id. Stage 1 of 2. |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2019-09-29T20:36:21Z | cmiskell | file-35-stor-gprd rebooted |
2019-09-14T10:06:43Z | hphilipps | file-15-stor-gprd rebooted |
2019-09-12T12:34:42Z | hphilipps | file-33-stor-gprd rebooted |
2019-09-12T08:04:32Z | ahanselka | Rotate Gitaly authentication tokens after 12.3.2 deploy |
2019-09-11T04:19:29Z | hphilipps | file-35-stor-gprd rebooted |
2019-09-03T07:50:23Z | unassigned | Production Kibana returning 502s occasionally |
2019-06-18T09:15:35Z | unassigned | Create alert for registry latency or memory |
2019-06-13T03:47:53Z | aamarsanaa | Update Query source to Global in Grafana dashboards that are not pulling any metrics |
2019-03-25T17:19:06Z | nnelson | Many staging alerts still paging production |
2017-10-13T00:36:06Z | cmiskell | security - add CAA records to DNS |
This issue was automatically generated using oncall-robot-assistant