OnCall report for period: 2019-08-06 - 2019-08-13
Oncall during this period
Schedule | Username |
---|---|
SRE | Devin Sylva |
SRE | Amar Amarsanaa |
SRE | Henri Philipps |
SRE | Michal Wasilewski |
SRE | Craig Miskell |
SRE | Craig Furman |
PagerDuty Incidents
* Number of incidents: **48**
Created | Summary |
---|---|
2019-08-06T10:49:52Z | [14700] Firing 2 - IncreasedErrorRateOtherBackends |
2019-08-06T16:33:41Z | [14701] Firing 1 - Postgres seems to be processing very few transactions |
2019-08-06T16:52:27Z | [14702] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/ is down |
2019-08-06T16:52:32Z | [14703] Pingdom check check:https://gitlab.com/projects/new is down |
2019-08-06T16:52:36Z | [14704] Firing 1 - High Error Rate on Front End Web |
2019-08-06T16:52:36Z | [14705] Firing 1 - Increased Error Rate Across Fleet |
2019-08-06T16:52:42Z | [14706] Firing 1 - The alert test file is missing! |
2019-08-06T17:31:16Z | [14707] Pingdom check check:https://gitlab.com/gitlab-com/gitlab-com-infrastructure/tree/master is down |
2019-08-06T17:31:51Z | [14708] Firing 2 - IncreasedErrorRateOtherBackends |
2019-08-06T17:47:13Z | [14709] Firing 1 - Postgres seems to be processing very few transactions |
2019-08-06T18:07:56Z | [14710] Firing 1 - PostgreSQL dead tuples is too large |
2019-08-06T18:43:04Z | [14712] Firing 1 - Gitaly error rate is too high: 11.16 |
2019-08-07T07:47:31Z | [14714] Firing 2 - ChefClientErrorCritical |
2019-08-08T05:32:43Z | [14721] Firing 1 - PostgreSQL dead tuples is too large |
2019-08-08T05:36:57Z | [14722] Firing 5 - PostgreSQL_ReplicaStaleXmin |
2019-08-08T06:16:52Z | [14723] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2019-08-08T06:59:51Z | [14724] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2019-08-08T07:30:33Z | [14725] Firing 1 - PostgreSQL dead tuples is too large |
2019-08-08T07:34:57Z | [14726] Firing 5 - PostgreSQL_ReplicaStaleXmin |
2019-08-08T08:17:53Z | [14729] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2019-08-08T09:04:46Z | [14730] Firing 1 - Large amount of Sidekiq Queued jobs: 63550 |
2019-08-08T09:09:53Z | [14731] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2019-08-08T09:38:52Z | [14732] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2019-08-08T10:14:51Z | [14733] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2019-08-08T12:05:57Z | [14734] Firing 1 - PostgreSQL dead tuples is too large |
2019-08-08T12:45:41Z | [14735] Firing 1 - PostgreSQL dead tuples is too large |
2019-08-08T13:06:12Z | [14736] Firing 1 - PostgreSQL replication slot patroni_04_db_gprd_c_gitlab_production_internal on patroni-01-db-gprd.c.gitlab-production.internal is falling behind. |
2019-08-08T16:25:41Z | [14738] Firing 3 - PostgreSQL_ReplicaStaleXmin |
2019-08-08T16:31:31Z | [14739] Firing 1 - PostgreSQL dead tuples is too large |
2019-08-08T17:25:41Z | [14740] Firing 1 - PostgreSQL replication slot patroni_05_db_gprd_c_gitlab_production_internal on patroni-01-db-gprd.c.gitlab-production.internal is falling behind. |
2019-08-08T17:57:17Z | [14741] Firing 1 - PostgreSQL dead tuples is too large |
2019-08-08T21:00:57Z | [14742] Firing 5 - PostgreSQL_ReplicaStaleXmin |
2019-08-08T21:04:56Z | [14743] Firing 1 - PostgreSQL dead tuples is too large |
2019-08-08T23:18:15Z | [14744] Firing 1 - PostgreSQL dead tuples is too large |
2019-08-09T01:20:41Z | [14745] Firing 1 - PostgreSQL dead tuples is too large |
2019-08-09T01:26:13Z | [14746] Firing 5 - PostgreSQL_ReplicaStaleXmin |
2019-08-09T04:14:50Z | [14747] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-08-09T06:05:24Z | [14748] Firing 1 - PostgreSQL dead tuples is too large |
2019-08-09T08:33:12Z | [14749] Firing 5 - PostgreSQL_ReplicaStaleXmin |
2019-08-09T09:32:41Z | [14750] Firing 1 - PostgreSQL dead tuples is too large |
2019-08-10T00:30:32Z | [14752] Firing 2 - SSLCertExpiresSoon |
2019-08-11T23:44:52Z | [14755] Firing 2 - IncreasedBackendConnectionErrors |
2019-08-11T23:45:37Z | [14756] Firing 2 - IncreasedServerConnectionErrors |
2019-08-13T01:54:18Z | [14759] Firing 1 - Connection of Redis replicas to the master is flapping |
2019-08-13T01:54:34Z | [14760] Firing 1 - Redis master missing for redis-sidekiq |
2019-08-13T02:09:47Z | [14761] Firing 1 - Redis master missing for redis-sidekiq |
2019-08-13T02:12:24Z | [14762] Firing 1 - Redis cluster redis-sidekiq is missing instances |
2019-08-13T02:44:49Z | [14763] Firing 1 - Redis cluster redis-sidekiq is missing instances |
7 Day Issue Stats
- Oncall issues : 3
- Access Request : 0
- Change Issues : 6
- Incident Issues : 6
- CorrectiveAction Issues : 0
Change Issues
- 2019-08-09T14:36:53Z - Block customers.gitlab.com access by country. - cmcfarland
- 2019-08-08T18:18:03Z - Restart Consul agents to apply TLS configurations - unassigned
- 2019-08-07T19:19:55Z - Roll back chef client on license.gitlab.com - cmcfarland
- 2019-08-07T06:30:25Z - rotate the package key for gitlab/pre-release - jarv
- 2019-08-06T14:56:46Z - Experimentally increase the number of sidekiq processes on sidekiq-besteffort-01-sv-gprd.c.gitlab-production.internal - unassigned
- 2019-08-06T10:23:45Z - Force eager provisioning of GCP disks after size increase - adescoms
Incident Issues
- 2019-08-09T09:36:03Z - postgres slow down - unassigned | | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1046
- 2019-08-08T20:51:52Z - Error 500 creating merge requests due to internal_ids timeout - unassigned | | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1043
- 2019-08-08T13:11:10Z - Merge requests getting closed inadvertently - unassigned | ~S1 | ~"Service:Sidekiq" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1040
- 2019-08-08T06:28:30Z - Gitaly performance problems on some servers - unassigned | ~S1 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1039
- 2019-08-07T14:02:19Z - consul: SSL certs expired on Aug 3 - ansdval | ~S4 | ~"Service:Consul" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1037
- 2019-08-06T17:08:28Z - Postgres Blip while Switching Master - unassigned | ~S2 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1035
CorrectiveAction Issues
- 2019-08-08T22:11:38Z - sshguard Blocking access to consul servers - cmiskell
- 2019-08-08T17:13:45Z - Identify limits that would have prevented recent incidents - unassigned
- 2019-08-07T20:29:08Z - Upgrade Consul to Allow for Live TLS reloads - unassigned
- 2019-08-06T08:12:22Z - Address besteffort slowdown - andrewn
Open Issue Stats
- Oncall issues : 20
- Change issues : 12
- Incident issues : 0
- Access Request : 4
- CorrectiveAction : 88
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2019-08-06T10:23:45Z | adescoms | Force eager provisioning of GCP disks after size increase |
2019-08-02T18:52:15Z | nnelson | Migrate large projects off file-28-stor-gprd to file-36-stor-gprd |
2019-08-02T18:51:05Z | nnelson | Migrate large projects off file-24-stor-gprd to file-36-stor-gprd |
2019-08-01T21:29:30Z | nnelson | Migrate large projects off file-25-stor-gprd to file-36-stor-gprd |
2019-07-19T09:19:15Z | jarv | Reduce unicorn worker count on two web workers |
2019-07-16T16:23:18Z | cmcfarland | Enable pages access control setting in gitlab.rb |
2019-07-13T01:03:58Z | msmiley | Prevent residual HAProxy processes by setting hard-stop-after |
2019-07-09T03:15:10Z | cmiskell | Enable camoproxy functionality |
2019-06-21T12:55:43Z | unassigned | Tune down idle_in_transaction_session_timeout |
2019-06-19T06:59:56Z | unassigned | Implement pipeline quotas on GitLab.com |
2019-06-07T16:08:53Z | aamarsanaa | add more storage nodes and rebalance existing ones |
2019-03-19T17:32:50Z | Finotto | Convert PK/FK from int4 to int8: events.id, push_event_payloads.event_id, and ci_build_trace_sections.id. Stage 1 of 2. |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2019-08-08T19:33:07Z | hphilipps | RCA Merge requests getting closed inadvertently |
2019-08-08T17:13:45Z | unassigned | Identify limits that would have prevented recent incidents |
2019-08-08T15:19:47Z | hphilipps | RCA: Gitaly n+1 calls causing bad latency and sidekiq queues to grow |
2019-08-05T19:49:35Z | ansdval | Note Creation on commit via API Calls Halted new_note Sidekiq Queue |
2019-08-01T15:58:00Z | hphilipps | RCA Gitaly latency below SLO |
2019-08-01T04:47:51Z | unassigned | update GitLab.com to not require 2FA on the Gitlab.com side for SAML Logins via Okta |
2019-07-31T13:01:20Z | hphilipps | Gitaly SLO alerts should go to pagerduty |
2019-07-31T12:55:40Z | hphilipps | Alert on gitaly file descriptors |
2019-07-30T09:03:49Z | unassigned | Update Snowplow credentials on customers.gitlab.com |
2019-07-29T07:52:37Z | unassigned | Console host: patroni hostname doesn't resolve through consul dns |
2019-07-17T09:44:07Z | unassigned | Identify and Kill Zombie Sidekiq Jobs |
2019-07-03T20:00:25Z | cmiskell | DNS: Wildcard record for "serverless-evaluation.sec.gitlab.net" |
2019-06-18T09:15:35Z | unassigned | Create alert for registry latency or memory |
2019-06-18T01:44:01Z | unassigned | Review load-balancing configuration for registry |
2019-06-13T03:47:53Z | aamarsanaa | Update Query source to Global in Grafana dashboards that are not pulling any metrics |
2019-06-11T11:30:31Z | unassigned | no alert for customer.gitlab.com being down |
2019-06-04T17:29:12Z | ahanselka | Whitelist Atlassian IP address Space for API Calls |
2019-05-27T14:16:28Z | unassigned | Automate SSLMate Certificate grabs |
2019-03-25T17:19:06Z | nnelson | Many staging alerts still paging production |
2017-10-13T00:36:06Z | unassigned | security - add CAA records to DNS |
This issue was automatically generated using oncall-robot-assistant
Edited by Dave Smith