Observability::AlertQueryWorker causing "Failed to open TCP connection" in sidekiq logs without network route to GOB
Problem Statement
A few customers with air-gapped GitLab deployments have reported that, since upgrading to GitLab 17.4, they are finding the following error messages in their sidekiq logs:
{ "severity": "ERROR", "time": "2024-09-20T12:15:12.857Z", "meta.feature_category": "metrics", "correlation_id": "7b250aa86a7c864834bc50f142551778", "meta.caller_id": "Observability::AlertQueryWorker", "exception.class": "Net::OpenTimeout", "exception.message": "Failed to open TCP connection to 172.64.155.229:443 (execution expired)", "exception.backtrace": [ "gems/gitlab-http/lib/net_http/connect_patch.rb:52:in `initialize'", "gems/gitlab-http/lib/net_http/connect_patch.rb:52:in `open'", "gems/gitlab-http/lib/net_http/connect_patch.rb:52:in `block in connect'", "timeout (0.3.2) lib/timeout.rb:189:in `block in timeout'", "timeout (0.3.2) lib/timeout.rb:196:in `timeout'", "gems/gitlab-http/lib/net_http/connect_patch.rb:50:in `connect'", "gems/gitlab-http/lib/gitlab/http_v2/net_http_adapter.rb:24:in `connect'", "net-http (0.4.1) lib/net/http.rb:1580:in `do_start'", "net-http (0.4.1) lib/net/http.rb:1569:in `start'", ...
It appears that the worker observability_alert_query_worker is attempting to open a connection to observe.gitlab.com. The affected customers were able to stop the errors by deleting the job from cron, but the job comes back after sidekiq is restarted.
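To check whether a given instance is affected, the log can be filtered for entries like the one above. The following is a minimal sketch, assuming the sidekiq log is written as one JSON object per line and uses the field names shown in the quoted entry (`meta.caller_id`, `exception.class`). The helper names `parse_json_line` and `alert_query_timeouts` are hypothetical and not part of GitLab.

```ruby
require 'json'

# Worker class name, taken from "meta.caller_id" in the log entry quoted above.
WORKER = 'Observability::AlertQueryWorker'

# Hypothetical helper: parse one log line, returning nil if it is not valid JSON.
def parse_json_line(line)
  JSON.parse(line)
rescue JSON::ParserError
  nil
end

# Hypothetical helper: return the parsed entries that were logged by the
# worker and whose exception class is Net::OpenTimeout.
# Lines that are not JSON objects are skipped.
def alert_query_timeouts(lines)
  lines.filter_map do |line|
    entry = parse_json_line(line)
    entry if entry.is_a?(Hash) &&
             entry['meta.caller_id'] == WORKER &&
             entry['exception.class'] == 'Net::OpenTimeout'
  end
end
```

For example, `alert_query_timeouts(File.foreach(path)).length` would count the matching entries, where `path` is the location of the sidekiq log on the instance (it varies by installation method, so it is left as a parameter here).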
Reach
1.5 = Small reach (~5% to ~25%).
I've looked for more reports or issues and found only 2 tickets from customers asking about this, and 1 other where the error was found in the logs but was not the subject of the investigation. The reports have in common that the errors started after upgrading to 17.4. The problem likely doesn't involve GitLab.com at all, only self-managed.
Impact
0.5 = Low impact 0.25 = Minimal impact
The only symptom is noise in logs. Customers are curious about its purpose. Customers have a viable workaround.
Confidence
50% = Low confidence
There are few reports and I don't know what that worker does.
Effort
- If it is unintentional, stop the worker from connecting to observe.gitlab.com for self-managed deployments - 0.5
- If it's intentional, create documentation that explains what it is and what it does so that customers who detect it aren't concerned about privacy - 0.5
- Add a way for users to disable it permanently
Definition of Done
- [ ] The problem is well understood by the PM to decide if they want to move forward with this idea or drop it
- [ ] The problem is well described and detailed with necessary requirements for product design to understand the problem
- [ ] The problem is well described and detailed with necessary requirements for engineering to understand the problem