OnCall report for period: 2019-08-27 - 2019-09-03
Oncall during this period
Schedule | Username |
---|---|
SRE | Ahmad Sherif |
SRE | Craig Barrett |
SRE | Amar Amarsanaa |
SRE | Nels Nelson |
PagerDuty Incidents
* Number of incidents: **45**
Created | Summary |
---|---|
2019-08-27T19:24:29Z | [14902] Firing 1 - Unused Replication Slots for patroni-10-db-gprd.c.gitlab-production.internal |
2019-08-27T19:30:46Z | [14903] Firing 1 - PostgreSQL dead tuples is too large |
2019-08-27T19:52:23Z | [14905] Firing 1 - Postgres exporter is showing errors for the last hour |
2019-08-27T20:33:30Z | [14906] Firing 1 - patroni-04-db-gprd.c.gitlab-production.internal postgres service appears down |
2019-08-28T04:58:21Z | [14907] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-08-28T05:14:05Z | [14908] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-08-28T15:11:28Z | [14909] Firing 1 - Unused Replication Slots for patroni-10-db-gprd.c.gitlab-production.internal |
2019-08-28T17:58:38Z | [14910] Firing 1 - Postgres seems to be processing very few transactions |
2019-08-28T22:12:55Z | [14911] Firing 1 - The haproxy service is less available than normal |
2019-08-29T01:45:30Z | [14913] Firing 1 - Postgres seems to be processing very few transactions |
2019-08-29T06:13:15Z | [14917] Firing 1 - The haproxy service is less available than normal |
2019-08-29T11:58:28Z | [14918] Firing 1 - The alert test file is missing! |
2019-08-29T12:04:08Z | [14919] Firing 1 - Gitaly error rate is too high: 10.29 |
2019-08-29T12:04:10Z | [14920] Firing 1 - Gitaly error rate is too high: 4.00 |
2019-08-29T12:09:37Z | [14921] Firing 1 - Gitaly error rate is too high: 12.40 |
2019-08-29T12:14:06Z | [14922] Firing 1 - Gitaly error rate is too high: 3.87 |
2019-08-29T12:14:53Z | [14923] Firing 1 - Gitaly error rate is too high: 10.18 |
2019-08-29T12:19:39Z | [14924] Firing 1 - Gitaly error rate is too high: 3.13 |
2019-08-29T12:24:36Z | [14925] Firing 1 - Gitaly error rate is too high: 3.49 |
2019-08-29T12:43:57Z | [14926] Firing 1 - The alert test file is missing! |
2019-08-29T12:53:57Z | [14927] Firing 1 - The alert test file is missing! |
2019-08-29T13:01:08Z | [14928] Firing 1 - Gitaly error rate is too high: 3.00 |
2019-08-29T13:20:27Z | [14929] Firing 1 - The alert test file is missing! |
2019-08-29T13:27:12Z | [14930] Firing 1 - The alert test file is missing! |
2019-08-29T13:32:06Z | [14931] Firing 1 - Gitaly error rate is too high: 12.07 |
2019-08-29T13:37:04Z | [14932] Firing 1 - Gitaly error rate is too high: 8.58 |
2019-08-29T13:48:36Z | [14933] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-08-29T18:56:06Z | [14935] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-08-29T19:13:56Z | [14936] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-08-29T20:59:28Z | [14937] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-08-29T23:42:28Z | [14939] Firing 1 - GitLab Pages staging front-end IP could've been changed |
2019-08-30T11:20:28Z | [14940] Firing 1 - Chef client failures have reached critical levels |
2019-08-30T11:52:57Z | [14941] Firing 1 - The haproxy service is less available than normal |
2019-08-30T13:24:17Z | [14944] Firing 2 - PrometheusUnreachable |
2019-08-30T15:05:59Z | [14945] Firing 1 - The haproxy service is less available than normal |
2019-08-30T17:19:16Z | [14947] Firing 1 - prometheus is unreachable |
2019-08-31T09:49:42Z | [14948] Firing 2 - PrometheusUnreachable |
2019-08-31T11:36:55Z | [14949] Firing 1 - The metrics service is less available than normal |
2019-09-02T12:52:20Z | [14950] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-02T16:12:41Z | [14951] Firing 1 - Postgres seems to be processing very few transactions |
2019-09-02T16:30:21Z | [14952] Firing 1 - Last successful WALE basebackup was seen 8m 6s ago. |
2019-09-03T00:30:29Z | [14953] Firing 1 - Last successful WALE basebackup was seen 8m 14s ago. |
2019-09-03T02:28:47Z | [14954] Firing 1 - GitLab Camoproxy is down for 2 minutes |
2019-09-03T02:28:48Z | [14955] Firing 1 - GitLab Camoproxy is not responding correctly for 2 minutes |
2019-09-03T06:08:20Z | [14956] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
7 Day Issue Stats
- Oncall issues : 2
- Access Request : 0
- Change Issues : 11
- Incident Issues : 6
- CorrectiveAction Issues : 1
Change Issues
- 2019-09-02T14:04:16Z - Create index to fix admin dashboard - abrandl
- 2019-09-02T13:06:40Z - Move pipeline jobs to pipeline fleet - andrewn
- 2019-09-02T09:52:46Z - Use 2 pgbouncers per replica in production - ahmadsherif
- 2019-08-30T08:44:53Z - Move archived logs to coldline storage - craigf
- 2019-08-30T08:09:13Z - Add 3 more pipeline sidekiq nodes in production - andrewn
- 2019-08-29T17:02:18Z - Restore patroni-01 to a useful state - nnelson
- 2019-08-29T15:20:24Z - Update secret configuration for production registry GKE - unassigned
- 2019-08-29T14:50:46Z - Loss of visibility into gitlab-monitor metrics - unassigned
- 2019-08-29T09:35:07Z - switch logging to the new ES7 clusters - mwasilewski-gitlab
- 2019-08-28T04:41:39Z - Increase unicorn workers on git frontend nodes - cmiskell
- 2019-08-27T14:44:48Z - Enable additional reCAPTCHA protection for Credential Stuffing in 12.2.3 - unassigned
Incident Issues
- 2019-09-02T18:22:03Z - 2019-09-02: Staging logs were inaccessible - ahmadsherif | ~S4 | ~"Service:ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1113
- 2019-08-31T19:47:12Z - Attempted hijack of GCP project/domain - alejandro | | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1107
- 2019-08-30T10:35:27Z - Image upload links not generated properly from Markdown for some users - unassigned | ~S2 | ~"Service:GitLab Rails" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1105
- 2019-08-29T13:44:49Z - 2019-08-29: Partial degradation on file-06 due to possible abuse - ahmadsherif | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1099
- 2019-08-28T22:34:45Z - It appears that git fetch no longer works on gitlab-ce project - nnelson | ~S2 | ~"Service:Git" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1097
- 2019-08-27T19:43:50Z - Database failover - nnelson | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1094
CorrectiveAction Issues
Open Issue Stats
- Oncall issues : 16
- Change issues : 18
- Incident issues : 0
- Access Request : 4
- CorrectiveAction : 91
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2019-09-02T09:52:46Z | ahmadsherif | Use 2 pgbouncers per replica in production |
2019-08-29T17:02:18Z | nnelson | Restore patroni-01 to a useful state |
2019-08-29T14:50:46Z | unassigned | Loss of visibility into gitlab-monitor metrics |
2019-08-29T09:35:07Z | mwasilewski-gitlab | switch logging to the new ES7 clusters |
2019-08-28T04:41:39Z | cmiskell | Increase unicorn workers on git frontend nodes |
2019-08-27T14:44:48Z | unassigned | Enable additional reCAPTCHA protection for Credential Stuffing in 12.2.3 |
2019-08-16T12:04:45Z | unassigned | Remove patroni-01 from the failover selection. |
2019-08-14T19:42:03Z | gerardo.herzig | Removal of unused configuration files in patroni nodes |
2019-08-13T21:05:45Z | unassigned | Tweaking (decreasing) idle_in_transaction_session_timeout on Production |
2019-08-06T10:23:45Z | adescoms | Force eager provisioning of GCP disks after size increase |
2019-08-02T18:52:15Z | nnelson | Migrate large projects off file-28-stor-gprd to file-36-stor-gprd |
2019-08-02T18:51:05Z | nnelson | Migrate large projects off file-24-stor-gprd to file-36-stor-gprd |
2019-08-01T21:29:30Z | nnelson | Migrate large projects off file-25-stor-gprd to file-36-stor-gprd |
2019-07-16T16:23:18Z | cmcfarland | Enable pages access control setting in gitlab.rb |
2019-07-13T01:03:58Z | msmiley | Prevent residual HAProxy processes by setting hard-stop-after |
2019-07-09T03:15:10Z | cmiskell | Enable camoproxy functionality |
2019-06-19T06:59:56Z | unassigned | Implement pipeline quotas on GitLab.com |
2019-03-19T17:32:50Z | Finotto | Convert PK/FK from int4 to int8: events.id, push_event_payloads.event_id, and ci_build_trace_sections.id. Stage 1 of 2. |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2019-09-03T07:50:23Z | unassigned | Production Kibana returning 502s occasionally |
2019-08-08T15:19:47Z | hphilipps | RCA: Gitaly n+1 calls causing bad latency and sidekiq queues to grow |
2019-08-05T19:49:35Z | ansdval | RCA: Note Creation on commit via API Calls Halted new_note Sidekiq Queue |
2019-08-01T15:58:00Z | hphilipps | RCA Gitaly latency below SLO |
2019-08-01T04:47:51Z | unassigned | update GitLab.com to not require 2FA on the Gitlab.com side for SAML Logins via Okta |
2019-07-31T12:55:40Z | hphilipps | Alert on gitaly file descriptors |
2019-07-30T09:03:49Z | unassigned | Update Snowplow credentials on customers.gitlab.com |
2019-07-03T20:00:25Z | cmiskell | DNS: Wildcard record for "serverless-evaluation.sec.gitlab.net" |
2019-06-18T09:15:35Z | unassigned | Create alert for registry latency or memory |
2019-06-18T01:44:01Z | unassigned | Review load-balancing configuration for registry |
2019-06-13T03:47:53Z | aamarsanaa | Update Query source to Global in Grafana dashboards that are not pulling any metrics |
2019-06-11T11:30:31Z | unassigned | no alert for customer.gitlab.com being down |
2019-06-04T17:29:12Z | ahanselka | Whitelist Atlassian IP address Space for API Calls |
2019-05-27T14:16:28Z | unassigned | Automate SSLMate Certificate grabs |
2019-03-25T17:19:06Z | nnelson | Many staging alerts still paging production |
2017-10-13T00:36:06Z | unassigned | security - add CAA records to DNS |
This issue was automatically generated using oncall-robot-assistant
Edited by Dave Smith