2024-07-11: BlackboxProbeFailures
Customer Impact
Current Status
More information will be added as we investigate the issue. For customers believed to be affected by this incident, please subscribe to this issue or monitor our status page for further updates.

References and helpful links

Recent Events (available internally only):
- Feature Flag Log - Chatops to toggle Feature Flags Documentation
- Infrastructure Configurations
- GCP Events (e.g. host failure)
Deployment Guidance
- Deployments Log | Gitlab.com Latest Updates
- Reach out to Release Managers for S1/S2 incidents to discuss Rollbacks, Hot Patching or speeding up deployments. | Rollback Runbook | Hot Patch Runbook
Use the following links to create related issues to this incident if additional work needs to be completed after it is resolved:
- Corrective action ❙ Infradev
- Incident Review ❙ Infra investigation followup
- Confidential Support contact ❙ QA investigation
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Security Note: If anything abnormal is found during the course of your investigation, please do not hesitate to contact security.
No timeline items have been added yet.
Activity
- A deleted user added IncidentActive ServiceBlackbox Source::IMAIncidentDeclare a:BlackboxProbeFailures incident severity4 labels
- Ghost User assigned to @swainaina and @sxuereb
- Ghost User changed the description
- Ghost User changed the severity to Low - S4
- Ghost User added a resource link
- Ghost User added a resource link
- Owner
We have 2 endpoints that failed, but both endpoints load for me
https://gitlab.com/gitlab-org/gitlab-foss/-/issues/1
https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1
Edited by Steve Xuereb
- Owner
Following "how to find logs", we see these entries:
ts=2024-07-11T13:02:52.547099816Z caller=main.go:119 module=http_2xx target=https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1 level=info msg="Received HTTP response" status_code=500
ts=2024-07-11T13:02:52.547186941Z caller=main.go:119 module=http_2xx target=https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1 level=info msg="Invalid HTTP response status code, wanted 2xx" status_code=500
ts=2024-07-11T13:02:53.668941031Z caller=main.go:119 module=http_gitlab_com_auth_2xx target=https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/57 level=info msg="Received HTTP response" status_code=500
ts=2024-07-11T13:02:53.669009369Z caller=main.go:119 module=http_gitlab_com_auth_2xx target=https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/57 level=info msg="Invalid HTTP response status code, wanted 2xx" status_code=500
- Owner
It seems like it keeps getting 500 errors:
ts=2024-07-11T13:05:03.739544998Z caller=main.go:119 module=http_2xx target=https://gitlab.com/gitlab-org/gitlab-foss/-/issues/1 level=info msg="Received HTTP response" status_code=500
ts=2024-07-11T13:05:03.739598747Z caller=main.go:119 module=http_2xx target=https://gitlab.com/gitlab-org/gitlab-foss/-/issues/1 level=info msg="Invalid HTTP response status code, wanted 2xx" status_code=500
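A minimal sketch of how probe log lines like these could be checked programmatically, assuming they are plain logfmt (key=value pairs, with optional double-quoted values). The helper names `parse_logfmt` and `probe_failure?` are hypothetical and not part of blackbox_exporter, and the 200..299 range is an assumption drawn from the "wanted 2xx" message.

```ruby
# Hypothetical helpers for reading probe log lines; not part of blackbox_exporter.

# Parse one logfmt line (key=value pairs, values optionally double-quoted) into a Hash.
def parse_logfmt(line)
  pairs = line.scan(/(\w+)=("(?:[^"\\]|\\.)*"|\S+)/)
  pairs.to_h do |key, value|
    [key, value.start_with?('"') ? value[1..-2] : value]
  end
end

# A probe entry counts as failed when its status code is outside 200..299
# (assumption based on the "wanted 2xx" log message).
def probe_failure?(entry)
  !(200..299).cover?(Integer(entry.fetch('status_code')))
end

line = 'ts=2024-07-11T13:05:03.739544998Z caller=main.go:119 module=http_2xx ' \
       'target=https://gitlab.com/gitlab-org/gitlab-foss/-/issues/1 level=info ' \
       'msg="Received HTTP response" status_code=500'

entry = parse_logfmt(line)
puts "#{entry['target']} failed: #{probe_failure?(entry)}"
```

Filtering a batch of lines with `probe_failure?` would give the list of failing targets without reading each entry by hand.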
- Maintainer
There was a deployment https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/3497013 that may have caused this:
deployer finished a deployer pipeline of 17.2.202407111000-deb4fbaf4da.ce9b4b95185 on gprd-cny which had end-to-end wall clock duration of 27 minutes (and sum of pipeline stage durations was 13 minutes)
Edited by Silvester Wainaina
- Owner
Most of the errors are:
NoMethodError: undefined method `id' for nil:NilClass
cache_key = format(GitlabSubscriptions::UserAddOnAssignment::USER_ADD_ON_ASSIGNMENT_CACHE_KEY, user_id: user.id)
                                                                                                            ^^^
- Maintainer
"exception.backtrace": [
"ee/lib/cloud_connector/base_available_service_data.rb:61:in `add_on_purchases_assigned_to'",
"ee/lib/cloud_connector/base_available_service_data.rb:25:in `allowed_for?'",
"ee/app/policies/ee/issue_policy.rb:16:in `block (2 levels) in <module:IssuePolicy>'",
"app/models/ability.rb:89:in `allowed?'",
"app/controllers/application_controller.rb:213:in `can?'",
"ee/app/controllers/ee/projects/issues_controller.rb:28:in `block (2 levels) in <module:IssuesController>'",
"lib/gitlab/ip_address_state.rb:11:in `with'",
"ee/app/controllers/ee/application_controller.rb:45:in `set_current_ip_address'",
"app/controllers/application_controller.rb:469:in `set_current_admin'",
"lib/gitlab/session.rb:11:in `with_session'",
"app/controllers/application_controller.rb:460:in `set_session_storage'",
"lib/gitlab/i18n.rb:114:in `with_locale'",
"app/controllers/application_controller.rb:453:in `set_locale'",
"app/controllers/application_controller.rb:444:in `set_current_context'",
"ee/lib/omni_auth/strategies/group_saml.rb:41:in `other_phase'",
"lib/gitlab/metrics/elasticsearch_rack_middleware.rb:16:in `call'",
"lib/gitlab/middleware/sidekiq_shard_awareness_validation.rb:20:in `block in call'",
"lib/gitlab/sidekiq_sharding/validator.rb:42:in `enabled'",
"lib/gitlab/middleware/sidekiq_shard_awareness_validation.rb:20:in `call'",
"lib/gitlab/middleware/memory_report.rb:13:in `call'",
"lib/gitlab/middleware/speedscope.rb:13:in `call'",
"lib/gitlab/database/load_balancing/rack_middleware.rb:23:in `call'",
"lib/gitlab/middleware/rails_queue_duration.rb:33:in `call'",
"lib/gitlab/etag_caching/middleware.rb:21:in `call'",
"lib/gitlab/metrics/rack_middleware.rb:16:in `block in call'",
"lib/gitlab/metrics/web_transaction.rb:46:in `run'",
"lib/gitlab/metrics/rack_middleware.rb:16:in `call'",
"lib/gitlab/middleware/go.rb:24:in `call'",
"lib/gitlab/middleware/query_analyzer.rb:11:in `block in call'",
"lib/gitlab/database/query_analyzer.rb:40:in `within'",
"lib/gitlab/middleware/query_analyzer.rb:11:in `call'",
"lib/gitlab/middleware/organizations/current.rb:20:in `call'",
"lib/gitlab/middleware/multipart.rb:173:in `call'",
"lib/gitlab/middleware/read_only/controller.rb:50:in `call'",
"lib/gitlab/middleware/read_only.rb:18:in `call'",
"lib/gitlab/middleware/unauthenticated_session_expiry.rb:18:in `call'",
"lib/gitlab/middleware/same_site_cookies.rb:27:in `call'",
"lib/gitlab/middleware/path_traversal_check.rb:34:in `call'",
"lib/gitlab/middleware/handle_malformed_strings.rb:21:in `call'",
"lib/gitlab/middleware/basic_health_check.rb:25:in `call'",
"lib/gitlab/middleware/handle_ip_spoof_attack_error.rb:25:in `call'",
"lib/gitlab/middleware/request_context.rb:15:in `call'",
"lib/gitlab/middleware/webhook_recursion_detection.rb:15:in `call'",
"config/initializers/fix_local_cache_middleware.rb:11:in `call'",
"lib/gitlab/middleware/compressed_json.rb:44:in `call'",
"lib/gitlab/middleware/rack_multipart_tempfile_factory.rb:19:in `call'",
"lib/gitlab/middleware/sidekiq_web_static.rb:20:in `call'",
"lib/gitlab/metrics/requests_rack_middleware.rb:79:in `call'",
"lib/gitlab/middleware/release_env.rb:13:in `call'"
],
- Owner
A logged-out curl request to canary gives me a 500 error:
$ curl -s -I https://gitlab.com/gitlab-org/gitlab-foss/-/issues/1 -i --cookie 'gitlab_canary=true'
HTTP/2 500
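For scripting the same check, here is a rough Ruby equivalent of the curl call, as a sketch only. It assumes the `gitlab_canary=true` cookie is all that is needed to route the request to canary, as in the curl command; the helper names `canary_head_request` and `canary_status` are hypothetical.

```ruby
require 'net/http'
require 'uri'

# Build (but do not send) a HEAD request carrying the gitlab_canary cookie,
# mirroring the curl flags -I and --cookie. Helper names are hypothetical.
def canary_head_request(url)
  uri = URI.parse(url)
  request = Net::HTTP::Head.new(uri)
  request['Cookie'] = 'gitlab_canary=true'
  [uri, request]
end

# Send the request and return the HTTP status code as an Integer.
def canary_status(url)
  uri, request = canary_head_request(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request).code.to_i
  end
end

uri, request = canary_head_request('https://gitlab.com/gitlab-org/gitlab-foss/-/issues/1')
puts "#{request.method} #{uri} Cookie: #{request['Cookie']}"
```

Calling `canary_status` with the same URL performs the actual request; no session cookie is sent, so the request is unauthenticated like the curl one.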
- Steve Xuereb added blocks deployments label
- Owner
We are draining canary, so I'm upgrading this to severity3.
- John Jarvis added severity3 label and removed severity4 label
- Ghost User changed the severity to Medium - S3
- Owner
Adding blocks deployments because we don't want this deployment to go through
- Owner
We are disabling canary
- Owner
Job to disable canary: https://ops.gitlab.net/gitlab-com/chatops/-/jobs/14623204
- Owner
It seems like this merge request is causing the problem: gitlab-org/gitlab!159065 (merged)
def clear_user_add_on_assigment_cache!(eligible_user_ids)
  cache_keys = eligible_user_ids.map do |user_id|
    format(GitlabSubscriptions::UserAddOnAssignment::USER_ADD_ON_ASSIGNMENT_CACHE_KEY, user_id: user_id)
  end
- Owner
Seems like I was wrong; the actual code that is failing is:
cache_key = format(GitlabSubscriptions::UserAddOnAssignment::USER_ADD_ON_ASSIGNMENT_CACHE_KEY, user_id: user.id)
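A minimal sketch of why this line raises for logged-out requests, assuming `user` is nil when nobody is signed in (consistent with the `nil:NilClass` error above). The constant value, the `User` struct, and both method names are hypothetical stand-ins, and the nil guard is only one possible approach, not necessarily what the fix MR does.

```ruby
# Hypothetical stand-in for the real constant; its actual template is not shown in this incident.
USER_ADD_ON_ASSIGNMENT_CACHE_KEY = 'user-%{user_id}-add-on-assignments'

# Minimal stand-in for the user model.
User = Struct.new(:id)

# Same shape as the failing line: raises NoMethodError when user is nil.
def cache_key_unguarded(user)
  format(USER_ADD_ON_ASSIGNMENT_CACHE_KEY, user_id: user.id)
end

# Guarded variant: an anonymous (nil) user has no cache key.
def cache_key_guarded(user)
  return nil if user.nil?

  format(USER_ADD_ON_ASSIGNMENT_CACHE_KEY, user_id: user.id)
end

begin
  cache_key_unguarded(nil)
rescue NoMethodError => e
  puts "unguarded: #{e.class}"
end

puts "guarded: #{cache_key_guarded(nil).inspect}"
puts "signed in: #{cache_key_guarded(User.new(42))}"
```

This matches the symptoms: the probes and the logged-out curl request hit the nil path, while signed-in page loads did not.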
- Maintainer
This is where the method is called: https://gitlab.com/gitlab-org/gitlab/-/blame/master/ee/lib/cloud_connector/base_available_service_data.rb#L61
Edited by Silvester Wainaina
- Owner
The merge request gitlab-org/gitlab!156650 (merged) matches the stack trace.
- Developer
A fix MR: gitlab-org/gitlab!159104 (merged)
- Owner
It appears the MR already has the ~"Pick into auto-deploy" label applied. I believe this is the only special label needed to get this deployed.
Edited by Matt Miller
- Steve Xuereb mentioned in merge request gitlab-org/gitlab!156650 (merged)
- 🤖 GitLab Bot 🤖 added RootCauseNeeded label
- Ghost User mentioned in issue on-call-handovers#5121 (closed)
- Mohamed Hamda mentioned in merge request gitlab-org/gitlab!159104 (merged)
- Jenny Kim added deployment-blocked label
- Jenny Kim mentioned in issue gitlab-org/release/tasks#11657 (closed)
- Owner
As of the most recent deployment, canary has been re-enabled, and I'm now seeing successful unauthenticated responses from the service:
> curl -s -I https://gitlab.com/gitlab-org/gitlab-foss/-/issues/1 -i --cookie 'gitlab_canary=true'
HTTP/2 200
- Owner
Based on this I'm going to mark this incident as resolved.
- Ghost User mentioned in issue on-call-handovers#5122 (closed)
- Adeline Yeung removed blocks deployments label
- Adeline Yeung removed deployment-blocked label
- Adeline Yeung added IncidentResolved label and removed IncidentActive label
- Adeline Yeung added RootCauseSoftware-Change label and removed RootCauseNeeded label
- Adeline Yeung added ServiceAPI label and removed ServiceBlackbox label
- Maintainer
This incident was automatically closed because it has the IncidentResolved label.
Note: All incidents are closed automatically when they are resolved, even when there is a pending review. Please see the Incident Workflow section on the Incident Management handbook page for more information.
- 🤖 GitLab Bot 🤖 closed
- 🤖 GitLab Bot 🤖 changed the incident status to Resolved by closing the incident
- Nailia Iskhakova mentioned in issue gitlab-org/quality/quality-engineering/team-tasks#2859 (closed)
- Jenny Kim added Deploys-blocked-gprd10hr Deploys-blocked-gstg10hr labels
- Jenny Kim added deployment-blocked label
- GitLab Release Tools Bot mentioned in issue gitlab-org/release/tasks#11694 (closed)
- ops-gitlab-net mentioned in issue reliability-reports#249 (closed)