Duo Health Check does not work well for SHM

Problem

While handling recent support requests from self-hosted models (SHM) customers, we found several problems with health-checking SHM setups.

These issues can be broken down further into two major problem areas:

1. There are two independent health-check (HC) systems. We have a split-brain issue where some parts of the SHM configuration use the Duo HC that is also used for Cloud Connector, while other parts use a custom health check.
   1. The Duo HC (Admin > GitLab Duo > Run health check) uses the cloudConnectorStatus GraphQL call and related logic.
   2. The SHM HC (Admin > Self-hosted models > Edit model > Test Connection) uses the aiSelfHostedModelConnectionCheck GraphQL call and related logic.
2. Both HCs have bugs that mislead users. See below.

Duo HC problems

Problems we found with the Duo HC when using SHM:

1. Uses the wrong probe set. It switches to "self-hosted mode" based on a license check for an Offline Cloud License. This is not correct: these licenses are only provided to air-gapped customers, but we have customers who:
   1. Are not air-gapped but still want to run their own AI gateway
   2. Are not air-gapped and want to both run their own AI gateway and use the cloud-connected one
   These customers use an Online Cloud License instead and synchronize with CustomersDot.
2. Cannot check hybrid setups. It operates globally at the instance level because it was originally built for Cloud Connector. However, SHM customers can exert fine-grained control over AI features and route only some features to a particular AI gateway, be that self-hosted or remote. The Health Check UI currently has no means of expressing this. It operates either fully in self-hosted mode, in which case it health-checks the customer's AI gateway instance, or in cloud mode, in which case it health-checks our own systems.
3. The end-to-end test uses an incorrect setup. The System exchange health check (EndToEndProbe) sends a code completion request via CodeSuggestionsClient#test_completion. However, the implementation of this method bypasses critical application logic that runs under real-world feature usage with SHM, so an incorrect token is sent to the AI gateway. This results in an error: 401 Unauthorized. The problem is that it requests the code_suggestions service instead of the self_hosted_models service.
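The token mismatch behind the 401 can be illustrated with a minimal, self-contained sketch. This is plain Ruby, not GitLab or AI gateway code: `Token`, `gateway_status`, and the rule that the gateway compares the token's service against a required service are assumptions made purely for illustration. Only the two service names and the 401 outcome come from the description above.

```ruby
# Hypothetical stand-in for a service-scoped access token.
Token = Struct.new(:service)

# Assumed gateway rule (illustration only): a request to a self-hosted
# model is accepted only if the token was issued for self_hosted_models.
def gateway_status(token, required_service: :self_hosted_models)
  token.service == required_service ? 200 : 401
end

# What the probe effectively does: it always requests the
# code_suggestions service, regardless of the SHM configuration.
probe_token = Token.new(:code_suggestions)

# What real-world feature usage with SHM would request instead.
feature_token = Token.new(:self_hosted_models)

puts gateway_status(probe_token)   # prints 401
puts gateway_status(feature_token) # prints 200
```

The point of the sketch is that the health check fails even though the same setup works for real feature traffic, which is why the result misleads users.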

Proposal

Problem 1: Use Ai::SelfHostedModel.any? or Ai::FeatureSetting.any? instead of the license check?
Problem 2: This likely requires a design overhaul of the HC page?
Problem 3: We should stop patching every possible entry point to AI logic with self_hosted_models? checks. This is brittle, and we will likely keep missing entry points going forward. We need a proper abstraction that centralizes configuration and permission checks so that everything goes through the same code path.
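One possible shape for such an abstraction is sketched below, assuming nothing about the real codebase. `FeatureSetting`, `AiRouting`, the `:cloud` provider value, and the use of the feature name as the cloud service name are all hypothetical. The sketch only shows the idea of answering all three questions (is SHM in use, which service does a feature need, which gateway does each feature target) from one place.

```ruby
# Hypothetical per-feature setting; not the real Ai::FeatureSetting model.
FeatureSetting = Struct.new(:feature, :provider, keyword_init: true)

# Hypothetical central routing object that all AI entry points would consult.
class AiRouting
  def initialize(settings)
    @provider_by_feature = settings.to_h { |s| [s.feature, s.provider] }
  end

  # Problem 1: derive the mode from configuration, not from the license type.
  def self_hosted_in_use?
    @provider_by_feature.value?(:self_hosted)
  end

  def self_hosted?(feature)
    @provider_by_feature[feature] == :self_hosted
  end

  # Problem 3: a single place that decides which service a token is requested for.
  # Assumption: cloud features use their own name as the service name.
  def service_for(feature)
    self_hosted?(feature) ? :self_hosted_models : feature
  end

  # Problem 2: group features by target so a health check could probe each gateway.
  def features_by_target
    @provider_by_feature.keys.group_by { |f| self_hosted?(f) ? :self_hosted : :cloud }
  end
end

routing = AiRouting.new([
  FeatureSetting.new(feature: :code_suggestions, provider: :self_hosted),
  FeatureSetting.new(feature: :duo_chat, provider: :cloud)
])

p routing.service_for(:code_suggestions) # prints :self_hosted_models
p routing.features_by_target
```

With a hybrid configuration like the one above, a health check built on `features_by_target` could report results per gateway instead of switching the whole page into a single mode.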

SHM HC problems

~~The check uses an incorrect setup. Same problem as above with the Duo HC E2E probe, except that here we are using CodeSuggestionsClient#test_model_connection, which also uses the wrong service name (code_suggestions instead of self_hosted_models).~~
Fixed with !182869 (merged)

References

https://docs.google.com/document/d/1GB3WIiJSzWWYlL6pdQ6Y5pble1ru0GC15w5yFGQHNa0/

Edited Feb 28, 2025 by Paul Phillips