Duo Health Check does not work well for SHM
Problem
During recent support requests from SHM (self-hosted models) customers, we found several problems with health-checking SHM setups.
These issues can be broken down further into two major problem areas:
- There are two independent health-check (HC) systems. We have a split-brain issue where some parts of the SHM configuration use the Duo HC that is also used for Cloud Connector, while other parts use a custom health-check.
  - The Duo HC (Admin > GitLab Duo > Run health check) uses the `cloudConnectorStatus` GraphQL call and related logic.
  - The SHM HC (Admin > Self-hosted models > Edit model > Test Connection) uses the `aiSelfHostedModelConnectionCheck` GraphQL call and related logic.
- Both HCs have bugs that mislead users. See below.
Duo HC problems
Problems we found with the Duo HC when using SHM:
- Uses wrong probe set. It switches to "self-hosted mode" based on a license check for an Offline Cloud License. This is not correct; these licenses are only provided to air-gapped customers, but we have customers who:
  - Are not air-gapped but still want to run their own AI gateway
  - Are not air-gapped and want to run both their own AI gateway and use the cloud-connected one

  These customers will be using an Online Cloud License instead and synchronize with CustomersDot.
- Cannot check hybrid setups. It operates globally at the instance level since it was built for Cloud Connector originally. However, SHM customers can exert fine-grained control over AI features and route only some features to a particular AI gateway, be that self-hosted or remote. The Health Check UI currently has no means of expressing this. It operates either fully in self-hosted mode, in which case it health-checks the customer's AI gateway instance, or in cloud mode, where it health-checks our own systems.
- The end-to-end test uses incorrect setup. The System exchange health check (`EndToEndProbe`) sends a code completion request via `CodeSuggestionsClient#test_completion`. However, the implementation of this method bypasses critical application logic that real-world feature usage with SHM goes through, so an incorrect token is sent to the AI gateway. This results in a `401 Unauthorized` error. The problem is that it requests the `code_suggestions` service instead of the `self_hosted_models` service here.
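To make the hybrid case concrete, here is a minimal plain-Ruby sketch of what a per-gateway health check would need as input. `FeatureRoute` and `health_check_targets` are hypothetical names for illustration, not GitLab classes, and the gateway URLs are placeholders. It only shows that a single instance-level mode cannot describe a setup where features are routed to different gateways, whereas grouping features by gateway yields one probe target per gateway.

```ruby
# Hypothetical stand-in for a per-feature routing record; not a GitLab class.
FeatureRoute = Struct.new(:feature, :provider, :gateway_url, keyword_init: true)

# Group features by the gateway they are routed to, so that each gateway
# could be probed once and the result reported for the features it serves.
def health_check_targets(routes)
  routes.group_by(&:gateway_url).transform_values { |group| group.map(&:feature) }
end

# Placeholder URLs and feature names, assumed for illustration only.
example_routes = [
  FeatureRoute.new(feature: :code_completions, provider: :self_hosted,
                   gateway_url: "http://self-hosted-gateway.example.com"),
  FeatureRoute.new(feature: :duo_chat, provider: :cloud,
                   gateway_url: "https://cloud-gateway.example.com"),
  FeatureRoute.new(feature: :code_generations, provider: :self_hosted,
                   gateway_url: "http://self-hosted-gateway.example.com")
]

health_check_targets(example_routes).each do |url, features|
  puts "#{url} serves #{features.join(', ')}"
end
```

With this shape, the HC page could list one result per gateway instead of a single global status, which is roughly what Problem 2 in the proposal would have to design for.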
Proposal
- Problem 1: Use `Ai::SelfHostedModel.any?` or `Ai::FeatureSetting.any?` instead of license check?
- Problem 2: This likely requires a design overhaul of the HC page?
- Problem 3: We should stop patching up all possible entry points to AI logic with `self_hosted_models?` checks since this is brittle and it's likely that we will keep missing more of these going forward. We need a proper abstraction here that centralizes configuration and permission checks so that everything goes through the same code path.
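As a rough illustration of Problems 1 and 3, the following plain-Ruby sketch puts both decisions behind one object. `FeatureConfig` and `AiRequestContext` are hypothetical names and do not exist in the GitLab codebase; only the service names `self_hosted_models` and `code_suggestions` come from this issue. The mode is derived from configuration rather than from the license type, and the service to authenticate against is resolved in a single place that both the health check and real feature traffic would call.

```ruby
# Hypothetical stand-in for one row of per-feature AI configuration.
FeatureConfig = Struct.new(:feature, :self_hosted, :cloud_service, keyword_init: true)

# Hypothetical abstraction; a sketch under the assumptions stated above.
class AiRequestContext
  SELF_HOSTED_SERVICE = :self_hosted_models

  def initialize(feature_configs)
    @configs = feature_configs.to_h { |config| [config.feature, config] }
  end

  # Problem 1: decide from configuration, not from the license type.
  def any_self_hosted?
    @configs.each_value.any?(&:self_hosted)
  end

  # Problem 3: the only place that maps a feature to a service name.
  # Raises KeyError for features that have no configuration.
  def service_for(feature)
    config = @configs.fetch(feature)
    config.self_hosted ? SELF_HOSTED_SERVICE : config.cloud_service
  end
end

context = AiRequestContext.new([
  FeatureConfig.new(feature: :code_completions, self_hosted: true,
                    cloud_service: :code_suggestions)
])
puts context.service_for(:code_completions) # prints "self_hosted_models"
```

If the end-to-end probe resolved its service through something like `service_for` instead of hard-coding `code_suggestions`, it would request the same token as the real code completion path.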
SHM HC problems
- ~~The check uses incorrect setup. Same problem as above with the Duo HC E2E probe, just that here we are using `CodeSuggestionsClient#test_model_connection`, which also uses the wrong service name (`code_suggestions` instead of `self_hosted_models`).~~ - Fixed with !182869 (merged)