[CC lib - Python] Audit behaviour when AIGW_CUSTOMER_PORTAL_URL is not set in backend
# Context

We learned that for SHM, `AIGW_CUSTOMER_PORTAL_URL` is not set by design. It may be either missing (unset) or set to an empty string. In the short term, we should audit the CC library code to clearly understand how the library behaves depending on the `config.customer_portal_url` passed to `CompositeProvider` ([AI GW](https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/blob/9efcb5af3f0a340784c02714805516914673d926/ai_gateway/api/server.py#L109)).

# Problem

We need to clearly understand the behavior when the URL is missing vs. an empty string vs. some invalid value. That includes the flow and side effects (logging), as well as the impact on the JWKS cache. AFAIK, we will fail at discovery (URL construction error), but we need to double-check and write it down.

# References

Context (Slack):

- https://gitlab.slack.com/archives/C04KWTK3GFJ/p1738669287225029
- https://gitlab.slack.com/archives/C04KWTK3GFJ/p1738336961746229

# Results

I tested everything in GDK, with a locally running CDot and a locally running AI GW.
The setup is very similar to what is described here - https://gitlab.com/gitlab-org/cloud-connector/gitlab-cloud-connector/-/merge_requests/71#how-to-set-up-and-validate-locally <table> <tr> <th> `AIGW_GITLAB_API_URL` ENV state </th> <th>GDK Healthcheck status</th> <th>JWKS set status</th> <th>Relevant CC logs</th> </tr> <tr> <td>Set and Correct</td> <td>OK</td> <td> OK: * 2 local AI GW keys, * 2 GitLab instance keys, * 2 local CDot keys `JWKS refreshed` OK state cached for 24h or until pod restart </td> <td> \- </td> </tr> <tr> <td>Unset (not declared in .env)</td> <td> ERR: `Authentication with the AI gateway services failed: AI Gateway returned code 401: {"error":"Forbidden by auth provider"}` </td> <td> OK-ish: * 2 local AI GW keys, * 2 GitLab instance keys, * 2 PROD CDot keys - we default config to `customer_portal_url: str = "https://customers.gitlab.com"`, see [AI GW code](https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/blob/c5d6bd4abea00c668b4402d3ce724ff3a503789c/ai_gateway/config.py#L188) `JWKS refreshed` OK state cached for 24h or until pod restart </td> <td> `JWTError: Signature verification failed` is logged - because my token is issued by LOCAL CDot </td> </tr> <tr> <td>Explicitly set to empty</td> <td> ERR: Same `401` as above </td> <td> NOT OK: 4/6 keys are fetched `Incomplete JWKS cached: some key providers failed, no old cache to fall back to` INCOMPLETE state cached for 24h or until pod restart </td> <td> First, we log an "obscure" (not self-explanatory) error on URL construction - TODO: **improve** - `{"status_code": null, "exception_class": "MissingSchema", "backtrace": "Traceback (most recent call last):\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/gitlab_cloud_connector/providers.py", line 309, in _fetch_well_known\n res = requests.get(url=url, timeout=REQUEST_TIMEOUT_SECONDS)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File 
"/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/requests/api.py", line 73, in get\n return request("get", url, params=params, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/requests/api.py", line 59, in request\n return session.request(method=method, url=url, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/requests/sessions.py", line 575, in request\n prep = self.prepare_request(req)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/requests/sessions.py", line 484, in prepare_request\n p.prepare(\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/requests/models.py", line 367, in prepare\n self.prepare_url(url, params)\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/requests/models.py", line 438, in prepare_url\n raise MissingSchema(\nrequests.exceptions.MissingSchema: Invalid URL '/.well-known/openid-configuration': No scheme supplied. Perhaps you meant https:///.well-known/openid-configuration?\nfetch_well_known failed\n", "extra": {}, "correlation_id": "01JKAXCQQV09CW2JE2H7B0FEFH", "logger": "cloud_connector", "level": "error", "type": "mlops", "stage": "main", "timestamp": "2025-02-05T11:19:44.019283Z", "message": "Invalid URL '/.well-known/openid-configuration': No scheme supplied. Perhaps you meant https:///.well-known/openid-configuration?"}` Then, `JWTError: Signature verification failed` is logged </td> </tr> <tr> <td> Set to an invalid value (e.g. `=abc`) </td> <td> Same `401` as in `Explicitly set to empty` ^^^ </td> <td> Same as in `Explicitly set to empty` ^^^ </td> <td> Same as in `Explicitly set to empty` ^^^ We should improve this error! </td> </tr> <tr> <td> Set to an unreachable host (e.g. 
replace the port) </td> <td> Same `401` as in `Explicitly set to empty` ^^^ </td> <td> Same as in `Explicitly set to empty` ^^^ </td> <td> First, we see `{"status_code": null, "exception_class": "ConnectionError", "backtrace": "Traceback (most recent call last):\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/urllib3/connection.py", line 196, in _new_conn\n sock = connection.create_connection(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/urllib3/util/connection.py", line 85, in create_connection\n raise err\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/urllib3/util/connection.py", line 73, in create_connection\n sock.connect(sa)\nConnectionRefusedError: [Errno 61] Connection refused\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 789, in urlopen\n response = self._make_request(\n ^^^^^^^^^^^^^^^^^^^\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 495, in _make_request\n conn.request(\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/urllib3/connection.py", line 398, in request\n self.endheaders()\n File "/Users/al/.asdf/installs/python/3.11.11/lib/python3.11/http/client.py", line 1298, in endheaders\n self._send_output(message_body, encode_chunked=encode_chunked)\n File "/Users/al/.asdf/installs/python/3.11.11/lib/python3.11/http/client.py", line 1058, in _send_output\n self.send(msg)\n File "/Users/al/.asdf/installs/python/3.11.11/lib/python3.11/http/client.py", line 996, in send\n self.connect()\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/urllib3/connection.py", line 236, in connect\n self.sock = self._new_conn()\n ^^^^^^^^^^^^^^^^\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/urllib3/connection.py", 
line 211, in _new_conn\n raise NewConnectionError(\nurllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x154667050>: Failed to establish a new connection: [Errno 61] Connection refused\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/requests/adapters.py", line 667, in send\n resp = conn.urlopen(\n ^^^^^^^^^^^^^\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 843, in urlopen\n retries = retries.increment(\n ^^^^^^^^^^^^^^^^^^\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/urllib3/util/retry.py", line 519, in increment\n raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nurllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=6666): Max retries exceeded with url: /.well-known/openid-configuration (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x154667050>: Failed to establish a new connection: [Errno 61] Connection refused'))\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/gitlab_cloud_connector/providers.py", line 309, in _fetch_well_known\n res = requests.get(url=url, timeout=REQUEST_TIMEOUT_SECONDS)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/requests/api.py", line 73, in get\n return request("get", url, params=params, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/requests/api.py", line 59, in request\n return session.request(method=method, url=url, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File 
"/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/requests/sessions.py", line 589, in request\n resp = self.send(prep, **send_kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/requests/sessions.py", line 703, in send\n r = adapter.send(request, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/al/dev/ai-assist/.venv/lib/python3.11/site-packages/requests/adapters.py", line 700, in send\n raise ConnectionError(e, request=request)\nrequests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=6666): Max retries exceeded with url: /.well-known/openid-configuration (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x154667050>: Failed to establish a new connection: [Errno 61] Connection refused'))\nfetch_well_known failed\n", "extra": {}, "correlation_id": "01JKAY46Z7G5BR1422CATYKJE0", "logger": "cloud_connector", "level": "error", "type": "mlops", "stage": "main", "timestamp": "2025-02-05T11:32:33.375352Z", "message": "HTTPConnectionPool(host='127.0.0.1', port=6666): Max retries exceeded with url: /.well-known/openid-configuration (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x154667050>: Failed to establish a new connection: [Errno 61] Connection refused'))"}` Then, `JWTError: Signature verification failed` is logged </td> </tr> </table>

# Thoughts, Questions to answer

* In the short term, should we "allow" an empty ENV var for the CDot host (with SHM in mind)? By "allowing" I mean not even trying to resolve the host and, as a result, not failing, and treating the JWKS of 4/6 keys as complete. Or is that too much tinkering, and should we go straight to making the OIDC provider list configurable in AI GW? (https://gitlab.com/gitlab-org/gitlab/-/issues/517088)
* If we do so, is there some way to detect that we are in an SHM context? (to make the setup more reliable)
* We need better error messages around host resolution. They should be self-explanatory. We can keep the whole stack trace, etc., but it would be nice to have a simple problem statement. Previously, it was not a priority, because we had full control over the AI GW and DWF envs and we were sure everything was set as expected, so the probability of such an issue was minimal. Now, with SHM, it is more important.

# Action items

* [x] Create an issue to improve our error messages (logs) around OIDC provider host resolution - https://gitlab.com/gitlab-org/cloud-connector/gitlab-cloud-connector/-/issues/59
* [x] Open an issue to decide if we handle `AIGW_GITLAB_API_URL=` as a special case - see https://gitlab.com/gitlab-org/gitlab/-/issues/517089
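
To illustrate the "simple problem statement" idea from the Thoughts section, here is a minimal sketch of validating the provider URL before the discovery request is made. It is not the CC library's code: `ProviderURLError` and `build_discovery_url` are hypothetical names, and only the `/.well-known/openid-configuration` path is taken from the logs above. The `None` branch is defensive, since AI GW currently falls back to a default when the variable is unset.

```python
from typing import Optional
from urllib.parse import urlsplit

# Path taken from the logs above; everything else in this sketch is hypothetical.
WELL_KNOWN_PATH = "/.well-known/openid-configuration"


class ProviderURLError(ValueError):
    """Hypothetical error type that carries a one-line problem statement."""


def build_discovery_url(base_url: Optional[str], env_name: str) -> str:
    """Return the OIDC discovery URL or raise a self-explanatory error.

    `env_name` is only used to make the message point at the right setting.
    """
    if base_url is None:
        raise ProviderURLError(f"{env_name} is not set: cannot build the OIDC discovery URL")

    stripped = base_url.strip()
    if not stripped:
        raise ProviderURLError(
            f"{env_name} is set to an empty string: cannot build the OIDC discovery URL"
        )

    parts = urlsplit(stripped)
    # A value such as "abc" has neither a scheme nor a host part.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ProviderURLError(
            f"{env_name}={base_url!r} is not a valid http(s) URL: "
            "expected something like https://host[:port]"
        )

    return stripped.rstrip("/") + WELL_KNOWN_PATH


if __name__ == "__main__":
    # Walk through the states covered by the results table.
    for value in ("https://customers.gitlab.com", None, "", "abc"):
        try:
            print(build_discovery_url(value, "AIGW_CUSTOMER_PORTAL_URL"))
        except ProviderURLError as exc:
            print(f"error: {exc}")
```

A check like this would only cover the "empty" and "invalid value" rows. The unreachable-host case still has to be handled around the HTTP call itself, where the full stack trace could be kept next to a similar one-line summary.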