GRPC::DeadlineExceeded in Clusters::Agents::NotifyGitPushWorker
Reported in !126405 (comment 1488032556) by @tigerwnz:
We appear to get occasional (a handful per day) timeouts from this job, see https://log.gprd.gitlab.net/app/r/s/JsXxv:
4:Deadline Exceeded. debug_error_string:{UNKNOWN:Deadline Exceeded {created_time:"2023-07-25T19:58:12.231442343+00:00", grpc_status:4}}
I don't think these are anything to worry about as they pass when the job is retried, but perhaps you have some ideas about what the cause might be.
Activity
- Mikhail Mazurskiy added devops::deploy group::environments section::cd type::bug labels
- Mikhail Mazurskiy added to epic &6730
- Mikhail Mazurskiy mentioned in merge request !126405 (merged)
- Author Maintainer
This is not ok, obviously. The only thing kas does for this RPC is publish the event to a Redis topic. Traces for RPCs slower than 0.1s show that the slowest one during the last 2 days was 221ms. This means that either:
- the problematic calls were not traced (we sample at a 10% rate), or
- the problem happens before requests reach kas.
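For illustration only, the server-side work behind this RPC amounts to a single Redis publish (a rough Ruby sketch; kas itself is written in Go, and the channel name and payload shape here are placeholders, not the real values):

```ruby
require 'redis'
require 'json'

# Rough equivalent of what kas does for this RPC: one PUBLISH to a Redis topic.
# Channel name and payload are illustrative placeholders.
redis = Redis.new(url: ENV.fetch('REDIS_URL', 'redis://localhost:6379'))
redis.publish('agent_git_push_events', { project_id: 42, ref: 'refs/heads/main' }.to_json)
```

A single publish like this should complete in well under the deadline, which is consistent with the 221ms worst case seen in the traces.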
Edited by Mikhail Mazurskiy
- Author Maintainer
Any ideas on how to debug this from the Rails side? Timeout on the Rails side is 2 seconds.
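For context, the Rails-side call is roughly the following (a minimal sketch using the grpc gem; the stub class, RPC name, address, and request object are placeholders rather than the actual Gitlab::Kas client code). When the 2-second deadline elapses before kas responds, the call raises GRPC::DeadlineExceeded with status 4, matching the error in the description:

```ruby
require 'grpc'

# Placeholder stub standing in for the proto-generated kas client.
stub = Gitlab::Agent::Notifications::Stub.new('kas.internal.example:8153',
                                              :this_channel_is_insecure)
request = Object.new # placeholder for the proto request message

begin
  # gRPC deadlines are absolute timestamps; 2 seconds from now mirrors
  # the Rails-side timeout mentioned above.
  stub.git_push_event(request, deadline: Time.now + 2)
rescue GRPC::DeadlineExceeded
  # gRPC status 4: the server did not answer within the deadline.
  # Re-raising lets Sidekiq's retry mechanism re-enqueue the job.
  raise
end
```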
- Author Maintainer
@shinya.maeda Maybe you have some thoughts on this issue?
- Maintainer
@ash2k I can look into this issue.
@timofurrer It seems you put an emoji reaction on this issue. I'm not sure if you're working on this issue already. If not, please let me know.
- Maintainer
@shinya.maeda I wasn't, sorry for the confusion.
- Shinya Maeda changed milestone to %Backlog
- Shinya Maeda assigned to @shinya.maeda
- Maintainer
- Sidekiq Worker details: https://dashboards.gitlab.net/d/sidekiq-worker-detail/sidekiq-worker-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-worker=Clusters::Agents::NotifyGitPushWorker
- Worker log: https://log.gprd.gitlab.net/app/r/s/ac4QI
- Worker errors: https://log.gprd.gitlab.net/app/r/s/3ZycG
- Maintainer
As of today, the error rate is 30 / 1,670,767 = 0.001795582% per day.
At GitLab.com scale, it's quite common for a small number of Sidekiq jobs to fail due to intermittent errors, and even so Sidekiq is resilient thanks to its built-in retry mechanism. In these failure cases, the failed jobs were retried and succeeded on the second attempt (example).
In addition, an error log indicates that there was an intermittent network issue on the Sidekiq worker nodes:
14:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.216.8.102:8153: Failed to connect to remote host: Connection refused. debug_error_string:{UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.216.8.102:8153: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2023-07-27T10:49:22.747141754+00:00"}}
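For reference, the self-healing behaviour described above is Sidekiq's standard retry mechanism; a minimal sketch of a worker relying on it (plain Sidekiq shown for brevity; the real worker includes GitLab's ApplicationWorker, and the retry count here is an assumption, not the production setting):

```ruby
require 'sidekiq'

class NotifyGitPushWorker
  include Sidekiq::Worker
  # Failed jobs are re-enqueued with exponential backoff, so an intermittent
  # DeadlineExceeded or Unavailable error on the first attempt usually
  # succeeds on a later attempt without manual intervention.
  sidekiq_options retry: 3

  def perform(agent_id)
    # Call kas here; letting GRPC::DeadlineExceeded propagate is what
    # triggers the retry.
  end
end
```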
- Author Maintainer
Thanks for looking into it! It's annoying, but I suppose we have more important things to worry about. I wonder if it's one of those cases where kas is being restarted but somehow Sidekiq is still trying to access the "old" Pod.
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#13482 (closed)
- Shinya Maeda unassigned @shinya.maeda
- Mikhail Mazurskiy closed
- Developer
I have a large Premium customer currently experiencing this issue.
Use case: Our team is reporting intermittent timeouts to the GitLab Kubernetes servers that the GitLab Kubernetes agents talk to at kas.gitlab.com. It seems to be affecting other users too. Is this a known issue or something being worked on?
- Author Maintainer
Looking at the logs, I think the customer is experiencing CI tunnel routing timeouts (gitlab-org/cluster-integration/gitlab-agent#386 - closed), which I think you've found already. This issue is unrelated to the customer's problem.
- 🤖 GitLab Bot 🤖 added customer label
- Mikhail Mazurskiy reopened
- Author Maintainer
I'm reopening since this is triggering monitoring alerts.
- Mikhail Mazurskiy removed customer label
- Mikhail Mazurskiy changed milestone to %16.3
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#13601 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#13698 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#13762 (closed)
- 🤖 GitLab Bot 🤖 changed milestone to %16.4
- 🤖 GitLab Bot 🤖 added missed:16.3 label
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#13854 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#13975 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14072 (closed)
- Ghost User mentioned in issue gitlab-org/ci-cd/deploy-stage/environments-group/general#34 (closed)
- Nicolò Maria Mezzopera added workflow::refinement label
- Nicolò Maria Mezzopera changed milestone to %Backlog
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14140 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14235 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14347 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14442 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14510 (closed)
- Developer
Setting this to severity::4 based on "I don't think these are anything to worry about as they pass when the job is retried" from the description.
- Viktor Nagy (GitLab) added severity::4 label
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14607 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14672 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14791 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#14886 (closed)
- Brittany Wilkerson mentioned in issue gitlab-com/gl-infra/delivery#19491 (closed)
- Developer
Following the outdated bugs procedure, I'm marking this as awaiting feedback.
- Viktor Nagy (GitLab) added awaiting feedback label
- Author Maintainer
This is still happening. The failures are all recorded after 2-3 seconds of execution. IIRC, the gRPC timeout for monolith->kas is 10s, so why are we getting a timeout?
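One way to narrow this down would be to log the deadline the client actually sends and the elapsed time at the moment the error is raised (a hypothetical instrumentation sketch; the wrapper, stub, and RPC names are illustrative, not the real Gitlab::Kas client internals):

```ruby
require 'grpc'

# Hypothetical wrapper: confirms whether the effective deadline is really 10s,
# or whether something shorter (2-3s) is being applied per call.
def notify_git_push_with_timing(stub, request, timeout_seconds)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  stub.git_push_event(request, deadline: Time.now + timeout_seconds)
rescue GRPC::DeadlineExceeded => e
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  Sidekiq.logger.warn(
    "kas NotifyGitPush deadline exceeded: configured=#{timeout_seconds}s " \
    "elapsed=#{elapsed.round(3)}s grpc_status=#{e.code}"
  )
  raise
end
```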
- Filippo Merante Caparrotta mentioned in merge request !173255 (merged)
- Developer
@ash2k is this issue still relevant? There is an open community MR for it but the reviewer is requesting documentation changes. Where should this be documented?
Thanks!