GRPC::DeadlineExceeded in Clusters::Agents::NotifyGitPushWorker
Reported in !126405 (comment 1488032556) by @tigerwnz:
We appear to get occasional (a handful per day) timeouts from this job, see https://log.gprd.gitlab.net/app/r/s/JsXxv:
4:Deadline Exceeded. debug_error_string:{UNKNOWN:Deadline Exceeded {created_time:"2023-07-25T19:58:12.231442343+00:00", grpc_status:4}}
I don't think these are anything to worry about as they pass when the job is retried, but perhaps you have some ideas about what the cause might be.
Activity
- Mikhail Mazurskiy added devops::deploy group::environments section::cd type::bug labels
- Mikhail Mazurskiy added to epic &6730
- Mikhail Mazurskiy mentioned in merge request !126405 (merged)
- Author Maintainer
This is not ok, obviously. The only thing kas does for this RPC is publish the event to a Redis topic. Traces for RPCs slower than 0.1s show that the slowest one during the last 2 days was 221ms. This means that either:
- the problematic calls were not traced (we sample at a 10% rate), or
- the problem happens before requests reach kas.
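For illustration only, the server-side work behind this RPC amounts to a single Redis publish (a rough Ruby sketch; kas itself is written in Go, and the channel name and payload shape here are placeholders, not the real values):

```ruby
require 'redis'
require 'json'

# Rough equivalent of what kas does for this RPC: one PUBLISH to a Redis topic.
# Channel name and payload are illustrative placeholders.
redis = Redis.new(url: ENV.fetch('REDIS_URL', 'redis://localhost:6379'))
redis.publish('agent_git_push_events', { project_id: 42, ref: 'refs/heads/main' }.to_json)
```

A single publish like this should complete in well under the deadline, which is consistent with the 221ms worst case seen in the traces.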
Edited by Mikhail Mazurskiy
- Author Maintainer
Any ideas on how to debug this from the Rails side? Timeout on the Rails side is 2 seconds.
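For context, the Rails-side call is roughly the following (a minimal sketch using the grpc gem; the stub class, RPC name, address, and request object are placeholders rather than the actual Gitlab::Kas client code). When the 2-second deadline elapses before kas responds, the call raises GRPC::DeadlineExceeded with status 4, matching the error in the description:

```ruby
require 'grpc'

# Placeholder stub standing in for the proto-generated kas client.
stub = Gitlab::Agent::Notifications::Stub.new('kas.internal.example:8153',
                                              :this_channel_is_insecure)
request = Object.new # placeholder for the proto request message

begin
  # gRPC deadlines are absolute timestamps; 2 seconds from now mirrors
  # the Rails-side timeout mentioned above.
  stub.git_push_event(request, deadline: Time.now + 2)
rescue GRPC::DeadlineExceeded
  # gRPC status 4: the server did not answer within the deadline.
  # Re-raising lets Sidekiq's retry mechanism re-enqueue the job.
  raise
end
```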
- Author Maintainer
@shinya.maeda Maybe you have some thoughts on this issue?
- Maintainer
@ash2k I can look into this issue.
@timofurrer It seems you put an emoji reaction on this issue. I'm not sure if you're working on this issue already. If not, please let me know.
- Maintainer
@shinya.maeda I wasn't, sorry for the confusion.
- Shinya Maeda changed milestone to %Backlog
- Shinya Maeda assigned to @shinya.maeda
- Maintainer
- Sidekiq Worker details: https://dashboards.gitlab.net/d/sidekiq-worker-detail/sidekiq-worker-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-worker=Clusters::Agents::NotifyGitPushWorker
- Worker log: https://log.gprd.gitlab.net/app/r/s/ac4QI
- Worker errors: https://log.gprd.gitlab.net/app/r/s/3ZycG
- Maintainer
As of today, the error rate is 30 / 1,670,767 = 0.001795582% per day.
At GitLab.com scale, it's quite common for a small number of Sidekiq jobs to fail due to intermittent errors, and even so Sidekiq is resilient thanks to its built-in retry mechanism. In these failure cases, the failed jobs were retried and succeeded on the second attempt (example).
In addition, an error log indicates that there was an intermittent network issue on the Sidekiq worker nodes:
14:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.216.8.102:8153: Failed to connect to remote host: Connection refused. debug_error_string:{UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.216.8.102:8153: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2023-07-27T10:49:22.747141754+00:00"}}
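For reference, the self-healing behaviour described above is Sidekiq's standard retry mechanism; a minimal sketch of a worker relying on it (plain Sidekiq shown for brevity; the real worker includes GitLab's ApplicationWorker, and the retry count here is an assumption, not the production setting):

```ruby
require 'sidekiq'

class NotifyGitPushWorker
  include Sidekiq::Worker
  # Failed jobs are re-enqueued with exponential backoff, so an intermittent
  # DeadlineExceeded or Unavailable error on the first attempt usually
  # succeeds on a later attempt without manual intervention.
  sidekiq_options retry: 3

  def perform(agent_id)
    # Call kas here; letting GRPC::DeadlineExceeded propagate is what
    # triggers the retry.
  end
end
```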
- Author Maintainer
Thanks for looking into it! It's annoying, but I suppose we have more important things to worry about. I wonder if it's one of those cases where kas is being restarted but somehow Sidekiq is still trying to access the "old" Pod.
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#13482 (closed)
- Shinya Maeda unassigned @shinya.maeda
- Mikhail Mazurskiy closed
- Developer
I have a large Premium customer currently experiencing this issue.
Use case: Our team is reporting intermittent timeouts to the GitLab Kubernetes servers that the GitLab Kubernetes agents talk to at kas.gitlab.com. It seems to be affecting other users too. Is this a known issue or something being worked on?
- Author Maintainer
Looking at the logs, I think the customer is experiencing CI tunnel routing timeouts (gitlab-org/cluster-integration/gitlab-agent#386 - closed), which I think you've found already. This issue is unrelated to the customer's problem.
- 🤖 GitLab Bot 🤖 added customer label
- Mikhail Mazurskiy reopened
- Author Maintainer
I'm reopening since this is triggering monitoring alerts.
- Mikhail Mazurskiy removed customer label
- Mikhail Mazurskiy changed milestone to %16.3
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#13601 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#13698 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#13762 (closed)
- 🤖 GitLab Bot 🤖 changed milestone to %16.4
- 🤖 GitLab Bot 🤖 added missed:16.3 label
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#13854 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#13975 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14072 (closed)
- Ghost User mentioned in issue gitlab-org/ci-cd/deploy-stage/environments-group/general#34 (closed)
- Nicolò Maria Mezzopera added workflow::refinement label
- Nicolò Maria Mezzopera changed milestone to %Backlog
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14140 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14235 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14347 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14442 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14510 (closed)
- Developer
Setting this to severity::4 based on "I don't think these are anything to worry about as they pass when the job is retried" from the description.
- Viktor Nagy (GitLab) added severity::4 label
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14607 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14672 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14791 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14886 (closed)
- Brittany Wilkerson mentioned in issue gitlab-com/gl-infra/delivery#19491 (closed)
- Developer
Following the outdated bugs procedure, I'm marking this as awaiting feedback.
- Viktor Nagy (GitLab) added awaiting feedback label
- Author Maintainer
This is still happening. The failures are all recorded after 2-3 seconds of execution. IIRC, the gRPC timeout for monolith->kas is 10s, so why are we getting a timeout?
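One way to narrow this down would be to log the deadline the client actually sends and the elapsed time at the moment the error is raised (a hypothetical instrumentation sketch; the wrapper, stub, and RPC names are illustrative, not the real Gitlab::Kas client internals):

```ruby
require 'grpc'

# Hypothetical wrapper: confirms whether the effective deadline is really 10s,
# or whether something shorter (2-3s) is being applied per call.
def notify_git_push_with_timing(stub, request, timeout_seconds)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  stub.git_push_event(request, deadline: Time.now + timeout_seconds)
rescue GRPC::DeadlineExceeded => e
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  Sidekiq.logger.warn(
    "kas NotifyGitPush deadline exceeded: configured=#{timeout_seconds}s " \
    "elapsed=#{elapsed.round(3)}s grpc_status=#{e.code}"
  )
  raise
end
```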
- Filippo Merante Caparrotta mentioned in merge request !173255 (merged)
- Developer
@ash2k is this issue still relevant? There is an open community MR for it but the reviewer is requesting documentation changes. Where should this be documented?
Thanks!