2021-09-19: The grpc_requests SLI of the kas service (`main` stage) has an error rate violating SLO
Current Status
A brief but dramatic dip in service apdex for the file-38 Gitaly shard, correlated with a spike in the KAS service error ratio. No recurrence of the event has been observed.
This appears to have been a single user making a series of large commits to a fairly large project (a Linux kernel fork with 4.4 GB worth of files).
Timeline
Recent Events (available internally only):
All times UTC.
2021-09-19
- 21:50 - @nnelson declares incident in Slack.
Corrective Actions
Corrective actions should be added here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- ...
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Action" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - ...
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - ...
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
- What were the root causes?
  - ...
Incident Response Analysis
- How was the incident detected?
  - ...
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - ...
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
Lessons Learned
- ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)
Activity
- ops-gitlab-net assigned to @nnelson and @kwanyangu
- ops-gitlab-net changed the severity to Medium - S3
- Nels Nelson changed the description
- ops-gitlab-net mentioned in issue on-call-handovers#2045 (closed)
- 🤖 GitLab Bot 🤖 added NeedsRootCause label
- Maintainer
Production checks fail because there are blockers:
- Production has no active incidents
- Production has change requests in progress
GitLab Deployment Health Status - overview
- All services healthy
- No active deployment
- Contributor
These appear to have been ten invocations of `gitlab.agent.reverse_tunnel.rpc.ReverseTunnel` within a single 2 millisecond interval. (Edited by Nels Nelson)
- Contributor
The alerts have resolved, and the event pattern has not recurred. Marking as mitigated.
- Nels Nelson added IncidentMitigated label and removed IncidentActive label
- Nels Nelson changed the description
- Contributor
These appear to be mostly `/gitaly.RepositoryService/WriteRef` invocations for this project: https://log.gprd.gitlab.net/goto/9f5db26714ff4bca1bf55d49f7133b71
- Nels Nelson changed the description
- Nels Nelson added RootCauseSaturation label
- Nels Nelson closed
- Nels Nelson added IncidentResolved label and removed IncidentMitigated label
- Contributor
There appear to have been a handful of commits from a particular user on a large Linux kernel fork project, along with multiple branch deletions in a very short time frame.
- 🤖 GitLab Bot 🤖 removed NeedsRootCause label
- Craig Miskell reopened
- Nels Nelson closed
- Owner
Occurring again (and we were getting this on and off for a few hours in the small hours of 2021-09-19 UTC).
There are some indications that the problem is KAS receiving 502s from what I believe are internal calls to the API (int.gprd.gitlab.net), mainly for the reverse tunnel, but also others at the same time:
- Owner
int.api.gprd.net is an internal load balancer to our front-end haproxy nodes. And a quick (and lucky) grep on fe-01 gives us this:
Sep 20 00:01:06 fe-01-lb-gprd haproxy[23422]: 10.222.44.126:42168 [20/Sep/2021:00:01:05.421] https~ TLSv1.3 api_rate_limit/localhost 3/0/1/1110/1114 502 530 505 - - ---- 11077/5987/4871/4870/0 0/0 {gitlab-kas/v14.2.2/2b22fa6} "GET /api/v4/internal/kubernetes/agent_info HTTP/1.1"
which matches up directly with the most recent alert I received. Inspecting the logs on one node manually, we see a burst of 502s:
sudo zgrep " 502 " /var/log/haproxy.log.1.gz | awk '{print $3}' | sort | uniq -c
<snip>
  5 00:01:01
  2 00:01:02
  2 00:01:03
  4 00:01:04
 27 00:01:05
 33 00:01:06
  1 00:01:07
 30 00:01:08
 29 00:01:09
 32 00:01:10
  2 00:01:11
  2 00:01:12
  2 00:01:13
  5 00:01:14
<snip>
over the space of 6 seconds (it never goes above single digits for the rest of the hour in those logs). Metrics only show us aggregation to the 5xx level, but we can see the same spike there:
There are other spikes, but the key fact about the one at 00:01 is the affected nodes; it's not visible in the screenshot, but they are fe-01, fe-04, fe-07, fe-10, fe-13, fe-16, which are the ones in a single zone (us-east1-c).
The incident declared from 21:35 was on 02, 05, 08, 11, 14, 17 (zone us-east1-d).
What we don't see is 502s from workhorse at the relevant time, which suggests nginx is returning these 502s because it couldn't (or thought it couldn't) talk to workhorse. (Edited by Craig Miskell)
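Going back to the haproxy one-liner above, the same log could also be split by backend rather than by second to confirm where the 502s are landing; a sketch, assuming the custom log format of the sample line quoted earlier (where field 10 carries `backend/server`):

```shell
# Count 502s per backend/server in the same fe-01 log, to check whether the
# burst is concentrated on the api backends. Field positions are assumed from
# the sample line above (field 3 = time, field 10 = backend/server); adjust
# if the log format differs on other nodes.
sudo zgrep " 502 " /var/log/haproxy.log.1.gz \
  | awk '{print $10}' | sort | uniq -c | sort -rn | head
```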
- Owner
A suspiciously smoking gun in the general vicinity:
Not visible in the screenshot:
Sep 20, 2021 @ 00:00:51.000 Created pod: gitlab-nginx-ingress-controller-5c5cc67d85-mdxqk
Sep 20, 2021 @ 00:00:51.000 New size: 27; reason: cpu resource utilization (percentage of request) above target
and other logs indicating a scaling event on the nginx ingress controller on that specific zonal cluster, about 15 seconds before the 502s are logged. Once is an accident, but we see the same pattern on us-east1-d at 21:35, when the previous incident occurred (https://log.gprd.gitlab.net/goto/e05256d45e7667b02c3ffa1837dafdec), and for another event that paged at 2021-09-19 06:05 UTC and affected us-east1-b (https://thanos.gitlab.net/graph?g0.expr=sum(rate(grpc_server_handled_total%7Benv%3D%22gprd%22%2C%20job%3D%22gitlab-kas%22%2C%20grpc_code%3D%22Unavailable%22%7D%5B1m%5D))%20by%20(grpc_service)&g0.tab=0&g0.stacked=0&g0.range_input=30m&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D&g0.end_input=2021-09-19%2006%3A18%3A34&g0.moment_input=2021-09-19%2006%3A18%3A34) we see the exact same activity (https://log.gprd.gitlab.net/goto/cd64b7b0d6feb3bb75a5b43ea002c49c).
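For reference, the Thanos link above encodes the query below; a minimal sketch of re-running it from a shell, assuming thanos.gitlab.net exposes the standard Prometheus HTTP query API to the caller and that any required authentication is handled separately:

```shell
# Per-service rate of gRPC calls that gitlab-kas answered with code
# Unavailable, evaluated at the end of the window used in the link above.
# A range query over the surrounding half hour would be the closer
# equivalent of the graph itself.
curl -sG 'https://thanos.gitlab.net/api/v1/query' \
  --data-urlencode 'query=sum(rate(grpc_server_handled_total{env="gprd", job="gitlab-kas", grpc_code="Unavailable"}[1m])) by (grpc_service)' \
  --data-urlencode 'time=2021-09-19T06:18:34Z'
```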
So, nginx-ingress-controller scaling events are resulting in 502s being thrown up to haproxy and further to clients; KAS is noticing/alerting due to the low rate of traffic through it, but this is affecting all API clients (public included), for brief bursts. Seems like something we should get on top of.
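One way to line those scaling events up against the 502 bursts directly from the affected cluster is to pull recent events for the controller objects; a sketch, assuming kubectl access to the zonal cluster in question (the object name follows the pod name quoted above and is not verified here):

```shell
# List recent cluster events mentioning the nginx ingress controller (HPA
# rescales, ReplicaSet pod creations/terminations) and compare their
# timestamps with the 502 bursts seen on the haproxy nodes.
kubectl get events -A --sort-by=.lastTimestamp \
  | grep -i 'nginx-ingress-controller'
```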
- Craig Miskell reopened
- Craig Miskell added IncidentActive label and removed IncidentResolved label
- Developer
I thought I would bring this previously closed incident to your attention just in case it is helpful.
- Craig Miskell added IncidentMitigated label and removed IncidentActive label
- ops-gitlab-net mentioned in issues on-call-handovers#2046 through on-call-handovers#2048 (closed)
- Mikhail Mazurskiy mentioned in issue gitlab-org/cluster-integration/gitlab-agent#166 (closed)
- Mikhail Mazurskiy marked this issue as related to gitlab-org/cluster-integration/gitlab-agent#166 (closed)
- Mikhail Mazurskiy marked this issue as related to gitlab-org/cluster-integration/gitlab-agent#157 (closed)
- Mikhail Mazurskiy marked this issue as related to gitlab-org/cluster-integration/gitlab-agent#159 (closed)
- Mikhail Mazurskiy marked this issue as related to gitlab-org/cluster-integration/gitlab-agent#167 (closed)
- ops-gitlab-net mentioned in issues on-call-handovers#2049 through on-call-handovers#2054 (closed)
- Mikhail Mazurskiy mentioned in issue #5573 (closed)
- ops-gitlab-net mentioned in issues on-call-handovers#2055 through on-call-handovers#2109 (closed)
- Owner
With https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14210 as the corrective action, and the near-term plan to remove the nginx ingress, I think it's safe to close this as resolved.
- John Jarvis closed
- John Jarvis added IncidentResolved label and removed IncidentMitigated label
- Owner
To be clear to all on this issue, this is not related to ingress-nginx.
KAS does not use ingress-nginx at all. Ingress objects in Kubernetes may be implemented by different controllers implementing the Ingress specification, and you may have multiple controllers in use at the same time (which we do in all our clusters).
KAS uses the GKE Ingress, whose controller is installed by default in all our GKE clusters. The actual implementation uses a GCP HTTPS Load Balancer under the hood.
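For anyone who wants to check which controller actually serves a given Ingress, something along these lines works; a sketch with placeholder object names, not taken from this incident:

```shell
# List Ingress objects across namespaces, then inspect one of interest to see
# which ingress class (and therefore which controller) it is bound to. Newer
# objects set .spec.ingressClassName; older ones use the
# kubernetes.io/ingress.class annotation.
kubectl get ingress --all-namespaces
kubectl get ingress <name> -n <namespace> \
  -o jsonpath='{.spec.ingressClassName} {.metadata.annotations.kubernetes\.io/ingress\.class}{"\n"}'
```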
https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14210 is not a corrective action for this, as there is no ingress-nginx involved. I don't know what the root cause was, but this is not it.
Perhaps we should do some general messaging to everyone to remind them that KAS, PlantUML, Prometheus and other services we deploy do use a Kubernetes Ingress, but do not use ingress-nginx.
- Owner
@ggillies What you say is mostly right, but you've missed a subtle point: KAS was reporting errors because it was getting 502s from calls back to the API (experiencing whatever is going on with ingress-nginx and scale-up/down events).
- Owner
@cmiskell doh, thanks for the heads up, and yes, that makes sense because the internal API is going through ingress-nginx (for the moment), so the original line of thinking is correct. My apologies to all, I didn't read closely enough!
- John Jarvis mentioned in issue reliability-reports#75 (closed)