
Observability::AlertQueryWorker causing "Failed to open TCP connection" in sidekiq logs without network route to GOB

Problem Statement

A few customers with air-gapped GitLab deployments have reported that, since upgrading to GitLab 17.4, the following error messages appear in their sidekiq logs:

{ "severity": "ERROR", "time": "2024-09-20T12:15:12.857Z", "meta.feature_category": "metrics", "correlation_id": "7b250aa86a7c864834bc50f142551778", "meta.caller_id": "Observability::AlertQueryWorker", "exception.class": "Net::OpenTimeout", "exception.message": "Failed to open TCP connection to 172.64.155.229:443 (execution expired)", "exception.backtrace": [ "gems/gitlab-http/lib/net_http/connect_patch.rb:52:in `initialize'", "gems/gitlab-http/lib/net_http/connect_patch.rb:52:in `open'", "gems/gitlab-http/lib/net_http/connect_patch.rb:52:in `block in connect'", "timeout (0.3.2) lib/timeout.rb:189:in `block in timeout'", "timeout (0.3.2) lib/timeout.rb:196:in `timeout'", "gems/gitlab-http/lib/net_http/connect_patch.rb:50:in `connect'", "gems/gitlab-http/lib/gitlab/http_v2/net_http_adapter.rb:24:in `connect'", "net-http (0.4.1) lib/net/http.rb:1580:in `do_start'", "net-http (0.4.1) lib/net/http.rb:1569:in `start'", ...

It appears that the worker observability_alert_query_worker is attempting to open a connection to observe.gitlab.com. The affected customers were able to stop the errors by deleting the job from cron, but the job comes back after sidekiq is restarted.
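To check whether an instance is affected, the structured sidekiq log can be filtered for entries like the one above. The following is a minimal sketch, not an official GitLab tool: the helper names `parse_json_line` and `worker_connection_errors` are hypothetical, and the only things it relies on are the `meta.caller_id` and `exception.class` fields shown in the log excerpt.

```ruby
require "json"

# Hypothetical helper: parse one log line, returning nil for anything
# that is not a JSON object (plain-text lines, blank lines, and so on).
def parse_json_line(line)
  parsed = JSON.parse(line)
  parsed.is_a?(Hash) ? parsed : nil
rescue JSON::ParserError
  nil
end

# Hypothetical helper: return the parsed log entries that were emitted by
# the given worker and carry the given exception class. The field names
# are taken from the log excerpt in this issue.
def worker_connection_errors(lines,
                             caller_id: "Observability::AlertQueryWorker",
                             exception_class: "Net::OpenTimeout")
  lines.filter_map do |line|
    entry = parse_json_line(line)
    next if entry.nil?
    next unless entry["meta.caller_id"] == caller_id
    next unless entry["exception.class"] == exception_class

    entry
  end
end
```

For example, `worker_connection_errors(File.foreach(path))` returns the matching entries for a log file at `path`, where `path` is wherever the deployment writes its JSON sidekiq log (an assumption that depends on the installation method). An empty result after the 17.4 upgrade would suggest the instance is not hitting this problem.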

Reach

1.5 = Small reach (~5% to ~25%).

I've looked for more reports or issues and found only the 2 tickets from customers asking about this, and 1 other where the error was found in the logs but was not the subject of the investigation. The reports have in common that the errors started after upgrading to 17.4. The problem likely doesn't involve GitLab.com at all; only self-managed.

Impact

0.5 = Low impact 0.25 = Minimal impact

The only symptom is noise in logs. Customers are curious about its purpose. Customers have a viable workaround.

Confidence

50% = Low confidence

There are few reports and I don't know what that worker does.

Effort

  • If it is unintentional, stop the worker from connecting to observe.gitlab.com for self-managed deployments - 0.5
  • If it's intentional, create documentation that explains what it is and what it does so that customers who detect it aren't concerned about privacy - 0.5
  • Add a way for users to disable it permanently

Definition of Done

  • [ ] The problem is well understood by the PM to decide if they want to move forward with this idea or drop it
  • [ ] The problem is well described and detailed with necessary requirements for product design to understand the problem
  • [ ] The problem is well described and detailed with necessary requirements for engineering to understand the problem

Zendesk tickets

Edited by Chris Nightingale