OnCall report for period: 2019-09-10 - 2019-09-17
Oncall during this period
Schedule | Username |
---|---|
SRE | Devin Sylva |
SRE | Amar Amarsanaa |
SRE | Henri Philipps |
PagerDuty Incidents
Created | Summary |
---|---|
2019-09-10T09:40:47Z | [15033] Firing 1 - SSL certificate for https://sentry.gitlab.net expires in 4d 14h 19m 58s |
2019-09-10T18:59:44Z | [15035] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-09-11T03:38:27Z | [15037] Firing 1 - Gitaly is down on file-35-stor-gprd.c.gitlab-production.internal |
2019-09-11T06:22:56Z | [15038] Firing 1 - The frontend service is less available than normal |
2019-09-11T15:58:42Z | [15041] Firing 2 - PrometheusNotConnectedToAlertmanagers |
2019-09-11T15:59:56Z | [15042] Firing 1 - Gitaly is down on file-33-stor-gprd.c.gitlab-production.internal |
2019-09-11T16:46:09Z | [15043] Firing 1 - Increased Error Rate Across Fleet |
2019-09-11T16:46:51Z | [15044] Firing 1 - Increased Error Rate Across Fleet |
2019-09-11T16:54:10Z | [15046] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-ce/tree/master is down |
2019-09-11T16:55:21Z | [15047] Pingdom check check:https://gitlab.com/gitlab-com/gitlab-com-infrastructure/tree/master is down |
2019-09-11T16:56:07Z | [15048] Firing 1 - Increased Error Rate Across Fleet |
2019-09-11T18:28:10Z | [15049] Firing 1 - Increased Error Rate Across Fleet |
2019-09-11T19:50:16Z | [15051] Firing 11 - PostgreSQL_ReplicaStaleXmin |
2019-09-11T23:36:28Z | [15052] Firing 11 - PostgreSQL_ReplicaStaleXmin |
2019-09-12T10:32:56Z | [15054] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-09-12T11:48:44Z | [15055] Firing 1 - PostgreSQL replication slot patroni_03_db_gprd_c_gitlab_production_internal on patroni-01-db-gprd.c.gitlab-production.internal is falling behind. |
2019-09-12T12:18:13Z | [15056] Firing 1 - Gitaly is down on file-39-stor-gprd.c.gitlab-production.internal |
2019-09-12T17:14:40Z | [15057] Firing 2 - IncreasedBackendConnectionErrors |
2019-09-12T17:17:36Z | [15058] Firing 2 - RegistryDown |
2019-09-12T19:35:08Z | [15059] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-13T00:30:32Z | [15061] Firing 2 - SSLCertExpiresSoon |
2019-09-13T07:22:21Z | [15063] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-13T11:12:56Z | [15064] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-09-13T12:11:27Z | [15065] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-09-13T14:25:27Z | [15066] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-09-13T22:37:24Z | [15068] Firing 1 - 1% disk space left |
2019-09-14T09:32:55Z | [15070] Firing 1 - Gitaly is down on file-15-stor-gprd.c.gitlab-production.internal |
2019-09-16T08:36:41Z | [15075] Firing 1 - Postgres is generating XLOG too fast, expect this to cause replication lag |
2019-09-16T09:34:05Z | [15076] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-16T11:21:26Z | [15077] Firing 1 - Gitaly is down on file-20-stor-gprd.c.gitlab-production.internal |
2019-09-16T12:56:31Z | [15078] Firing 1 - Chef client failures have reached critical levels |
2019-09-16T13:22:41Z | [15079] Firing 1 - Postgres Replication lag is over 1 hour on archive recovery replica |
2019-09-17T05:57:28Z | [15080] Firing 1 - Gitaly is down on file-33-stor-gprd.c.gitlab-production.internal |
2019-09-17T05:59:06Z | [15081] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2019-09-17T06:02:08Z | [15082] Firing 1 - High Rails Error Rate on Front End |
7 Day Issue Stats
- Oncall issues : 5
- Access Request : 0
- Change Issues : 9
- Incident Issues : 3
- CorrectiveAction Issues : 0
Change Issues
- 2019-09-17T07:40:45Z - Enable debug headers in HAProxy - T4cC0re
- 2019-09-17T02:30:11Z - Enable global serial-port-logging to stackdriver on gstg + gprd GCP projects - cmiskell
- 2019-09-16T18:32:57Z - Migrate large projects off file-30-stor-gprd to file-39-stor-gprd - nnelson
- 2019-09-16T17:10:18Z - Run chef with new us-central role in ops-us-central - cmcfarland
- 2019-09-16T09:03:25Z - Enable Let's Encrypt integration for gstg - T4cC0re
- 2019-09-16T07:42:07Z - Stand up a dedicated export sidekiq fleet - andrewn
- 2019-09-13T17:32:42Z - Migrate large projects off file-27-stor-gprd to file-38-stor-gprd - nnelson
- 2019-09-12T08:00:43Z - Add more hosts to the pipeline fleet - andrewn
- 2019-09-11T20:08:21Z - Disable Elasticsearch search functionality for S1/P1 mitigation - devin
Incident Issues
- 2019-09-12T17:25:42Z - Registry returning 503 errors - unassigned | | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1148
- 2019-09-11T18:13:52Z - 500 Errors on Comments, issues and Todo's - unassigned | ~S2 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1143
- 2019-09-11T16:57:27Z - EE and CE projects not loading - unassigned | ~S3 | ~"Service:Gitaly" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1142
CorrectiveAction Issues
- 2019-09-13T15:42:06Z - Consider the use of a Docker image to contain tooling for local workstation use of Kubernetes - skarbek
- 2019-09-12T20:19:05Z - Investigate creating a kubectl wrapper script with a production warning - unassigned
- 2019-09-12T20:12:38Z - Investigate the ability to utilize kubectl from a bastion or proxy node - skarbek
- 2019-09-12T20:04:37Z - Investigate limiting cluster permission for Engineers - unassigned
- 2019-09-12T19:57:26Z - Lockdown our Kubernetes Clusters to only a specific set of IP addresses - unassigned
Open Issue Stats
- Oncall issues : 11
- Change issues : 15
- Incident issues : 0
- Access Request : 4
- CorrectiveAction : 90
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2019-09-17T02:30:11Z | cmiskell | Enable global serial-port-logging to stackdriver on gstg + gprd GCP projects |
2019-09-16T18:32:57Z | nnelson | Migrate large projects off file-30-stor-gprd to file-39-stor-gprd |
2019-09-16T17:10:18Z | cmcfarland | Run chef with new us-central role in ops-us-central |
2019-09-12T08:00:43Z | andrewn | Add more hosts to the pipeline fleet |
2019-09-09T20:43:17Z | ahanselka | Experimental Changes to runner configuration - grow understanding on runner cron issues. |
2019-08-29T09:35:07Z | mwasilewski-gitlab | switch logging to the new ES7 clusters |
2019-08-27T14:44:48Z | asaba | Enable additional reCAPTCHA protection for Credential Stuffing in 12.2.3 |
2019-08-16T12:04:45Z | unassigned | Remove patroni-01 from the failover selection. |
2019-08-14T19:42:03Z | gerardo.herzig | Removal of unused configuration files in patroni nodes |
2019-08-13T21:05:45Z | ahmadsherif | Tweaking (decreasing) idle_in_transaction_session_timeout on Production |
2019-08-06T10:23:45Z | adescoms | Force eager provisioning of GCP disks after size increase |
2019-07-16T16:23:18Z | cmcfarland | Enable pages access control setting in gitlab.rb |
2019-07-13T01:03:58Z | msmiley | Prevent residual HAProxy processes by setting hard-stop-after |
2019-06-19T06:59:56Z | unassigned | Implement pipeline quotas on GitLab.com |
2019-03-19T17:32:50Z | Finotto | Convert PK/FK from int4 to int8: events.id, push_event_payloads.event_id, and ci_build_trace_sections.id. Stage 1 of 2. |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2019-09-14T10:06:43Z | hphilipps | file-15-stor-gprd rebooted |
2019-09-12T12:34:42Z | hphilipps | file-33-stor-gprd rebooted |
2019-09-11T19:47:14Z | unassigned | Add more hosts to the pipeline fleet |
2019-09-11T04:19:29Z | hphilipps | file-35-stor-gprd rebooted |
2019-09-06T19:54:36Z | unassigned | GitLab CE review apps are failing to acquire External IP for NGINX Ingress Controller |
2019-09-03T07:50:23Z | unassigned | Production Kibana returning 502s occasionally |
2019-07-03T20:00:25Z | cmiskell | DNS: Wildcard record for "serverless-evaluation.sec.gitlab.net" |
2019-06-18T09:15:35Z | unassigned | Create alert for registry latency or memory |
2019-06-13T03:47:53Z | aamarsanaa | Update Query source to Global in Grafana dashboards that are not pulling any metrics |
2019-03-25T17:19:06Z | nnelson | Many staging alerts still paging production |
2017-10-13T00:36:06Z | unassigned | security - add CAA records to DNS |
This issue was automatically generated using oncall-robot-assistant