OnCall report for period: 2019-11-05 - 2019-11-12
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Amar Amarsanaa |
SRE 8 Hour | Henri Philipps |
SRE 8 Hour | Cameron McFarland |
SRE 8 Hour | Michal Wasilewski |
SRE 8 Hour | Craig Miskell |
PagerDuty Incidents
* Number of incidents: **95**
Created | Summary |
---|---|
2019-11-05T08:31:57Z | [15710] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-05T09:22:52Z | [15711] Firing 2 - IncreasedErrorRateOtherBackends |
2019-11-05T12:07:35Z | [15713] Firing 1 - Increased Error Rate Across Fleet |
2019-11-05T16:09:06Z | [15723] Firing 1 - Increased Error Rate Across Fleet |
2019-11-05T16:13:09Z | [15724] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-05T16:51:42Z | [15726] Firing 1 - Last WALE backup was seen 20m 1s ago. |
2019-11-05T20:28:02Z | [15731] Firing 2 - IncreasedErrorRateOtherBackends |
2019-11-05T20:34:07Z | [15732] Firing 1 - The public dashboard page is down |
2019-11-05T23:18:19Z | [15734] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-06T00:02:37Z | [15735] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-06T10:35:57Z | [15739] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-06T10:45:38Z | [15741] Firing 1 - Increased Error Rate Across Fleet |
2019-11-06T10:46:00Z | [15742] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down |
2019-11-06T11:03:47Z | [15746] increased latency and error rates across the entire fleet |
2019-11-06T11:12:32Z | [15747] Firing 1 - Last WALE backup was seen 16m 45s ago. |
2019-11-06T11:13:41Z | [15748] Firing 1 - 1% disk space left |
2019-11-06T12:01:16Z | [15749] Firing 1 - Last WALE backup was seen 20m 9s ago. |
2019-11-06T12:36:46Z | [15750] Firing 1 - Last WALE backup was seen 20m 1s ago. |
2019-11-06T12:49:25Z | [15751] Firing 3 - ProcessCommitWorkersTooHigh |
2019-11-06T13:05:34Z | [15752] Firing 1 - Increased Error Rate Across Fleet |
2019-11-06T13:08:22Z | [15753] Firing 1 - Last WALE backup was seen 20m 5s ago. |
2019-11-06T13:20:37Z | [15754] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down |
2019-11-06T13:21:02Z | [15756] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/merge_requests/ is down |
2019-11-06T13:21:16Z | [15757] Pingdom check check:gitlab-org/gitlab-foss#1 (closed) is down |
2019-11-06T13:21:24Z | [15758] Firing 1 - Increased Error Rate Across Fleet |
2019-11-06T13:22:31Z | [15760] Pingdom check check:https://gitlab.com/projects/new is down |
2019-11-06T13:22:38Z | [15761] Firing 1 - High Error Rate on Front End Web |
2019-11-06T13:23:06Z | [15762] Pingdom check check:https://gitlab.com/ is down |
2019-11-06T13:23:08Z | [15763] Firing 87 - IncreasedBackendConnectionErrors |
2019-11-06T13:23:15Z | [15764] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/issues is down |
2019-11-06T13:23:32Z | [15765] Firing 1 - GitLab.com is down for 2 minutes |
2019-11-06T13:23:32Z | [15766] Firing 1 - GitLab.com is down for 2 minutes |
2019-11-06T13:23:39Z | [15767] Firing 21 - IncreasedServerResponseErrors |
2019-11-06T13:23:45Z | [15768] Pingdom check check:https://gitlab.com/api/v4/projects/13083 is down |
2019-11-06T13:23:56Z | [15769] Firing 1 - PGBouncer is logging "no more connections allowed" errors |
2019-11-06T13:24:10Z | [15771] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down |
2019-11-06T13:44:42Z | [15775] Firing 3 - ProcessCommitWorkersTooHigh |
2019-11-06T14:09:06Z | [15776] Firing 1 - Gitaly error rate is too high: 11.20 |
2019-11-06T14:34:23Z | [15777] Firing 1 - Alertmanager is failing sending notications |
2019-11-06T14:34:23Z | [15778] Firing 1 - Alertmanager is failing sending notications |
2019-11-06T14:41:51Z | [15781] Firing 1 - Alertmanager is failing sending notications |
2019-11-06T14:41:51Z | [15780] Firing 1 - Alertmanager is failing sending notications |
2019-11-06T14:45:45Z | [15783] Pingdom check check:https://gitlab.com/gitlab-com/gitlab-com-infrastructure/tree/master is down |
2019-11-06T14:51:51Z | [15784] Firing 1 - Alertmanager is failing sending notications |
2019-11-06T14:51:52Z | [15785] Firing 1 - Alertmanager is failing sending notications |
2019-11-06T15:36:00Z | [15789] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-06T16:02:30Z | [15790] Firing 1 - 1% disk space left |
2019-11-06T16:08:26Z | [15791] Firing 4 - IncreasedServerResponseErrors |
2019-11-06T16:08:53Z | [15792] Firing 4 - IncreasedServerConnectionErrors |
2019-11-06T16:08:54Z | [15793] Firing 2 - IncreasedBackendConnectionErrors |
2019-11-06T16:20:31Z | [15794] Firing 1 - Last WALE backup was seen 20m 14s ago. |
2019-11-06T17:36:01Z | [15796] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-06T17:45:33Z | [15797] Firing 1 - CPU use percent is extremely high on fe-11-lb-gprd.c.gitlab-production.internal for the past 2 hours. |
2019-11-06T19:27:58Z | [15798] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-06T19:39:29Z | [15799] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-07T08:07:22Z | [15802] Firing 2 - IncreasedErrorRateOtherBackends |
2019-11-07T09:22:58Z | [15805] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-07T11:26:56Z | [15807] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-07T12:07:35Z | [15808] Firing 1 - Increased Error Rate Across Fleet |
2019-11-07T20:14:14Z | [15813] Firing 1 - Chef client failures have reached critical levels |
2019-11-08T01:08:42Z | [15815] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-08T09:08:32Z | [15817] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-08T09:44:15Z | [15819] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-08T17:09:26Z | [15822] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-08T18:20:33Z | [15823] Firing 1 - Gitaly: High CPU usage on file-40-stor-gprd.c.gitlab-production.internal |
2019-11-08T18:37:42Z | [15824] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-08T20:59:13Z | [15825] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-10T02:07:10Z | [15828] Pingdom check check:https://version.gitlab.com/ is down |
2019-11-10T02:42:13Z | [15830] Pingdom check check:https://version.gitlab.com/ is down |
2019-11-10T02:44:28Z | [15831] Firing 1 - Chef client failures have reached critical levels |
2019-11-10T04:32:12Z | [15833] Pingdom check check:https://version.gitlab.com/ is down |
2019-11-10T04:52:12Z | [15834] Pingdom check check:https://version.gitlab.com/ is down |
2019-11-10T06:42:11Z | [15835] Pingdom check check:https://version.gitlab.com/ is down |
2019-11-10T07:17:10Z | [15836] Pingdom check check:https://version.gitlab.com/ is down |
2019-11-10T08:57:12Z | [15837] Pingdom check check:https://version.gitlab.com/ is down |
2019-11-10T10:17:29Z | [15838] Firing 2 - PostgreSQL_ExporterErrors |
2019-11-10T10:52:10Z | [15839] Pingdom check check:https://version.gitlab.com/ is down |
2019-11-10T11:07:11Z | [15840] Pingdom check check:https://version.gitlab.com/ is down |
2019-11-10T11:47:09Z | [15842] Pingdom check check:https://version.gitlab.com/ is down |
2019-11-10T12:22:11Z | [15843] Pingdom check check:https://version.gitlab.com/ is down |
2019-11-10T12:32:11Z | [15844] Pingdom check check:https://version.gitlab.com/ is down |
2019-11-10T13:12:12Z | [15845] Pingdom check check:https://version.gitlab.com/ is down |
2019-11-11T04:08:09Z | [15847] Firing 1 - thanos is restarting frequently |
2019-11-11T11:38:25Z | [15855] Firing 1 - Increased Error Rate Across Fleet |
2019-11-11T11:46:21Z | [15857] CI service is downgraded, a number of other issues potentially related |
2019-11-11T12:02:59Z | [15859] Firing 1 - Chef client failures have reached critical levels |
2019-11-11T12:07:36Z | [15860] Firing 2 - ChefClientErrorCritical |
2019-11-11T12:07:50Z | [15861] Firing 1 - Increased Error Rate Across Fleet |
2019-11-11T12:29:24Z | [15863] Firing 2 - IncreasedServerResponseErrors |
2019-11-11T13:28:59Z | [15865] Firing 1 - Chef client failures have reached critical levels |
2019-11-11T14:18:07Z | [15867] Firing 1 - Increased Server Response Errors |
2019-11-11T17:38:43Z | [15870] Stale DNS entry pointing to another website |
2019-11-11T20:07:35Z | [15871] Firing 1 - Increased Error Rate Across Fleet |
2019-11-12T00:22:31Z | [15873] Firing 1 - Gitaly: High CPU usage on file-38-stor-gprd.c.gitlab-production.internal |
2019-11-12T01:27:19Z | [15874] Firing 1 - Postgres exporter is showing errors for the last hour |
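Many of the rows above are repeats of the same alert. As a rough sketch (not part of oncall-robot-assistant), the table can be grouped by alert text to see which alerts paged most often. The sample rows are copied from the table; the `normalize` and `tally` helpers and the `<duration>` placeholder are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# A few rows copied from the table above; a real run would feed in all of them.
ROWS = """\
2019-11-05T08:31:57Z | [15710] Firing 1 - PostgreSQL dead tuples is too large |
2019-11-05T16:51:42Z | [15726] Firing 1 - Last WALE backup was seen 20m 1s ago. |
2019-11-06T11:12:32Z | [15747] Firing 1 - Last WALE backup was seen 16m 45s ago. |
2019-11-06T13:23:08Z | [15763] Firing 87 - IncreasedBackendConnectionErrors |
2019-11-10T02:07:10Z | [15828] Pingdom check check:https://version.gitlab.com/ is down |
2019-11-10T02:42:13Z | [15830] Pingdom check check:https://version.gitlab.com/ is down |
"""

# Matches "<created> | [<incident id>] <summary> |" as laid out in the table.
ROW_RE = re.compile(r"^(?P<created>\S+) \| \[(?P<id>\d+)\] (?P<summary>.*?) \|$")


def normalize(summary: str) -> str:
    """Strip per-incident details so repeats of one alert group together."""
    # Drop the "Firing N - " prefix added by the alerting integration.
    summary = re.sub(r"^Firing \d+ - ", "", summary)
    # Replace durations such as "20m 1s" with a fixed placeholder.
    summary = re.sub(r"\d+m \d+s", "<duration>", summary)
    return summary


def tally(text: str) -> Counter:
    """Count incidents per normalized alert summary."""
    counts = Counter()
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            counts[normalize(match.group("summary"))] += 1
    return counts


counts = tally(ROWS)
for name, n in counts.most_common():
    print(n, name)
```

Normalizing before counting is a design choice: without it, each WALE backup alert would count as a distinct summary because the elapsed time differs per incident.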
7 Day Issue Stats
- Oncall issues : 3
- Access Request : 1
- Change Issues : 13
- Incident Issues : 16
- CorrectiveAction Issues : 0
Change Issues
- 2019-11-12T08:03:52Z - increase memory limits on Gitaly nodes to prevent OOM kills while we investigate saturation - andrewn
- 2019-11-12T00:15:23Z - Set nopcid on api fleet - cmiskell
- 2019-11-11T18:16:40Z - Remove stale DNS for private-runners-manager-1 and 2 - cmcfarland
- 2019-11-11T09:15:58Z - enable gitaly info/refs caching using the gitaly_inforef_uploadpack_cache feature flag - mwasilewski-gitlab
- 2019-11-08T15:07:36Z - Upgrade ruby to 2.6 on customers.gitlab.com - cmcfarland
- 2019-11-08T08:10:56Z - Reduce used_memory_rss of redis-server on redis-cache so it is within maxmemory limit - unassigned
- 2019-11-08T04:27:19Z - Increase sidekiq/pullmirror process count from 4 => 5 - unassigned
- 2019-11-08T03:37:48Z - Scaling up RO pgbouncers to 3 - unassigned
- 2019-11-07T00:40:44Z - Switch traffic off pgbouncer-03 and restart to fix pgbouncer_exporter - cmiskell
- 2019-11-06T16:55:56Z - Resize log volumes on all patroni nodes - craig
- 2019-11-06T00:09:14Z - Delete stale residual Redis RDB files from redis-cache hosts - msmiley
- 2019-11-05T21:31:17Z - Update PgBouncer to v1.12.0 - unassigned
- 2019-11-05T19:36:38Z - Migrate some active projects away from file-02. - cmcfarland
Incident Issues
- 2019-11-11T20:09:47Z - 2019-11-11 20:08 UTC Increased Error Rate Across Fleet - cmcfarland | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1351
- 2019-11-11T11:41:35Z - 2019-11-11 CI runners and Registry errors and high latency - mwasilewski-gitlab | ~S2 | ~"Service:CI Runners" ~"Service:Registry" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1349
- 2019-11-11T11:21:59Z - 2019-11-11 sidekiq error rates are high - unassigned | ~S4 | ~"Service:Sidekiq" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1348
- 2019-11-11T11:00:03Z - 2019-11-11: all services behind IAP are unavailable - unassigned | ~S2 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1347
- 2019-11-11T04:33:45Z - 2019-11-11 04:24 The gitaly service (main stage) has a apdex score (latency) below SLO - cmiskell | ~S3 | ~"Service:Gitaly" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1345
- 2019-11-10T04:46:12Z - 2019-11-10 04:32: version.gitlab.com down - cmiskell | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1343
- 2019-11-10T04:33:37Z - 2019-11-10 04:30 ci-runners apdex (latency) below SLO - cmiskell | ~S3 | ~"Service:CI Runners" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1342
- 2019-11-06T12:57:46Z - 2019-11-06: process_commit sidekiq queue high - unassigned | | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1329
- 2019-11-06T12:56:15Z - Stale file handles on production nodes - jarv | ~S3 | ~"Service:Infrastructure" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1328
- 2019-11-06T10:41:22Z - 2019-11-06: increased latency and error rates across the fleet - unassigned | ~S1 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1327
- 2019-11-05T20:38:45Z - 2019-11-05 The public dashboard page is down - cmcfarland | ~S2 | ~"Service:Dashboards" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1324
- 2019-11-05T16:11:08Z - 2019-11-05 Increased Error Rate Across Fleet - cmcfarland | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1320
- 2019-11-05T15:35:00Z - 2019-11-05 The ci-runners service (main stage) has a apdex score (latency) below SLO - unassigned | ~S4 | ~"Service:CI Runners" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1319
- 2019-11-05T15:26:29Z - 2019-11-05 The web service (cny stage) has a apdex score (latency) below SLO - unassigned | | ~"Service:Web" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1318
- 2019-11-05T13:10:34Z - increased traffic to selected gitaly nodes result in higher gitaly latency - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1314
- 2019-11-05T09:41:03Z - brief disk I/O latency increase on gitaly nodes led to an increase in 5xx errors in production - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1313
CorrectiveAction Issues
- 2019-11-07T07:25:52Z - test interrupting running sidekiq jobs - unassigned
- 2019-11-06T20:03:17Z - Tweak alert for disk use of /var/log on patroni servers to alert sooner - cmcfarland
- 2019-11-06T15:49:29Z - Update runbooks to provide good documentation on how to associate sidekiq jobs with projects (or users). - unassigned
Open Issue Stats
- Oncall issues : 15
- Change issues : 25
- Incident issues : 1
- Access Request : 6
- CorrectiveAction : 71
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2019-11-12T08:03:52Z | andrewn | increase memory limits on Gitaly nodes to prevent OOM kills while we investigate saturation |
2019-11-12T00:15:23Z | cmiskell | Set nopcid on api fleet |
2019-11-11T09:15:58Z | mwasilewski-gitlab | enable gitaly info/refs caching using the gitaly_inforef_uploadpack_cache feature flag |
2019-11-08T08:10:56Z | unassigned | Reduce used_memory_rss of redis-server on redis-cache so it is within maxmemory limit |
2019-11-08T04:27:19Z | unassigned | Increase sidekiq/pullmirror process count from 4 => 5 |
2019-11-08T03:37:48Z | unassigned | Scaling up RO pgbouncers to 3 |
2019-11-06T00:09:14Z | msmiley | Delete stale residual Redis RDB files from redis-cache hosts |
2019-11-05T21:31:17Z | unassigned | Update PgBouncer to v1.12.0 |
2019-10-30T12:59:31Z | unassigned | WIP enhance logging capabilities to PostgreSQL instances (log_connections & log_disconnections) |
2019-10-16T14:37:43Z | nnelson | Migrate large projects off file-33-stor-gprd to file-40-stor-gprd |
2019-10-15T17:18:17Z | gerardo.herzig | WIP CR for fixing backup logging |
2019-10-15T15:13:30Z | nnelson | Migrate large projects off file-34-stor-gprd to file-40-stor-gprd |
2019-10-10T07:47:57Z | jarv | WIP: Remove the shared filesystem mounts on staging.gitlab.com |
2019-10-02T18:40:35Z | nnelson | Migrate large projects off file-33-stor-gprd to file-40-stor-gprd |
2019-09-27T10:54:08Z | hphilipps | Roll out Cloud NAT to CI shared runners |
2019-09-18T10:00:20Z | unassigned | Add two more nodes to the realtime fleet |
2019-09-12T08:00:43Z | andrewn | Add more hosts to the pipeline fleet |
2019-09-09T20:43:17Z | ahanselka | Experimental Changes to runner configuration - grow understanding on runner cron issues. |
2019-08-29T09:35:07Z | mwasilewski-gitlab | switch logging to the new ES7 clusters |
2019-08-27T14:44:48Z | asaba | Enable additional reCAPTCHA protection for Credential Stuffing in 12.2.3 |
2019-08-16T12:04:45Z | unassigned | Remove patroni-01 from the failover selection. |
2019-08-14T19:42:03Z | gerardo.herzig | Removal of unused configuration files in patroni nodes |
2019-07-02T20:54:46Z | devin | Migrate to Hashed Storage from legacy project storage |
2019-06-19T06:59:56Z | unassigned | Implement pipeline quotas on GitLab.com |
2019-03-19T17:32:50Z | Finotto | Convert PK/FK from int4 to int8: events.id, push_event_payloads.event_id, and ci_build_trace_sections.id. Stage 1 of 2. |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2019-10-29T03:36:15Z | unassigned | 2019-10-29 3:30 UTC Database statement timeouts while inserting data |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2019-11-08T21:47:11Z | unassigned | Import request (for square): java |
2019-11-07T13:42:12Z | dstull | version app autodevops auto deploy to production fails |
2019-11-01T14:08:54Z | aamarsanaa | Resource saturation: sidekiq_workers/sidekiq, pullmirrors fleet |
2019-11-01T13:56:45Z | aamarsanaa | Resource saturation: redis_memory on redis-cache |
2019-10-29T18:32:57Z | unassigned | New env var & secret for Zuora OAuth Client in customers.gitlab.com |
2019-10-23T13:05:14Z | unassigned | cleanup registered nodes in chef |
2019-10-14T09:44:00Z | unassigned | Rollout SIDEKIQ_MONITOR_WORKER=1 across the sidekiq fleet |
2019-10-10T08:07:26Z | craigf | sidekiq, git error ratio metrics are not showing in the dashboard |
2019-10-08T18:52:44Z | cmcfarland | increase api nodes cpu utilization by adding more unicorn workers |
2019-10-08T10:43:58Z | unassigned | rails console scripts getting OOM killed on console-01-sv-gprd followed by high disk IO and VM being unresponsive |
2019-09-03T07:50:23Z | unassigned | Production Kibana returninig 502s occasionally |
2019-08-30T14:34:47Z | unassigned | Customer Staging - Sentry is not reporting |
2019-06-18T09:15:35Z | unassigned | Create alert for registry latency or memory |
2019-06-13T03:47:53Z | aamarsanaa | Update Query source to Global in Grafana dashboards that are not pulling any metrics |
2017-10-13T00:36:06Z | cmiskell | security - add CAA records to DNS |
This issue was automatically generated using oncall-robot-assistant