Backwards Compatibility: Duo Chat Error Handling Improvements
## Problem to solve
* There are a number of issues affecting Duo Chat that impact customers - see [this GDoc](https://docs.google.com/document/d/1iQjwi_sDsnz0pFfNWwiH9_3eNl5cRowSxf0ydyY_RbI/edit?tab=t.0) (private).
* Backwards compatibility is the root cause for a significant number of these.
### More detail
* Our extensions have a unique problem that most other GitLab components do not - the most recent version has to support older GitLab instances. In general, we solve this with in-app version checks, which trigger graceful fallbacks.
As an example, a new feature released in the VS Code Extension in Dec 2024 (aka %"17.7") must not break the extension for a user running an instance as old as 16.1 (as per [our guidance here](https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/blob/33fa5a0e11f9a08af698ed4573998886df0ecbd1/README.md#minimum-supported-version)).
This is particularly challenging with GraphQL calls - since the API will reject a request that includes a field that doesn't exist (yet).
In addition:
* Older versions of GitLab are more difficult to test (once you go off the 'happy path' of the GDK).
* Duo features are experiencing a high rate of iteration.
* Duo Chat does not have comprehensive coverage with its error handlers. `*`
**All of the above means that Duo features within the extensions are relatively bug-prone.**
`*` This has been evident in previous bugs, where often errors were not surfaced when they should have been, or surfaced in odd ways. Combined with the above point, this leads to a sub-optimal experience for users.
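
The in-app version check with a graceful fallback, as described above, can be sketched roughly as follows. This is a minimal illustration and not the extensions' actual implementation: `buildChatQuery`, the `chatMessages` query shape, `hypotheticalNewField`, and the 17.7 threshold for that field are all assumptions made for the example.

```typescript
// Hypothetical sketch: only request a newer GraphQL field when the
// connected instance is recent enough to know about it.
interface Version {
  major: number;
  minor: number;
}

// Parses strings such as "17.7.2-ee"; returns undefined when unparseable.
function parseVersion(raw: string): Version | undefined {
  const match = /^(\d+)\.(\d+)/.exec(raw);
  if (!match) return undefined;
  return { major: Number(match[1]), minor: Number(match[2]) };
}

function isAtLeast(current: Version, required: Version): boolean {
  if (current.major !== required.major) return current.major > required.major;
  return current.minor >= required.minor;
}

// Assumed threshold for the invented field below.
const NEW_FIELD_MIN_VERSION: Version = { major: 17, minor: 7 };

function buildChatQuery(instanceVersion: string): string {
  const version = parseVersion(instanceVersion);
  // An unknown version falls back to the older, safer query.
  const supportsNewField =
    version !== undefined && isAtLeast(version, NEW_FIELD_MIN_VERSION);
  const fields = ['id', 'content'];
  if (supportsNewField) fields.push('hypotheticalNewField');
  return `query { chatMessages { nodes { ${fields.join(' ')} } } }`;
}
```

Defaulting to the older query when the version cannot be parsed is a deliberate choice in this sketch: an omitted field degrades a feature, whereas an unknown field causes the whole request to be rejected.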
## Proposed solutions
A number of improvements could be made in reporting and tracking these errors.
1. Catch and gracefully handle known unhandled errors.
* Discover which error paths are currently unhandled.
2. Better error messages (and current status) for users.
* Will also help customers and our staff who are configuring Duo to know what the next step is.
3. Surfacing in-the-field errors to developers.
4. Enable developers to test the extension more effectively, to prevent future incident-causing bugs.
## Crew Members
<table>
<tr>
<th></th>
<th>Engineers</th>
</tr>
<tr>
<td>DRI</td>
<td>
@tristan.read
</td>
</tr>
<tr>
<td>Technical Leads</td>
<td>
@tristan.read
@dbernardi
</td>
</tr>
</table>
---
:white_check_mark: - Done
:construction: - In Progress
:white_circle: - Not Started
## :sparkles: Catch and handle uncaught errors
<table>
<tr>
<th></th>
<th>Description</th>
<th>Sub Epic / Issues</th>
<th>Status</th>
<th>Technical Lead</th>
</tr>
<tr>
<td>Duo Chat should throw out error if something goes wrong</td>
<td>Proposed fix after finding a gap in Duo Chat where an error should have been caught and displayed.</td>
<td>
:white_circle: [Duo Chat should throw out error if something goes wrong](https://gitlab.com/gitlab-org/editor-extensions/gitlab-lsp/-/issues/651)
</td>
<td>
:white_circle:
</td>
<td></td>
</tr>
<tr>
<td>Duo Chat add timeout errors</td>
<td>Add code to ensure that the app continues to function if a request to the server hangs forever.</td>
<td>
:white_circle: [[VS Code] Ensure Duo Chat continues to operate after a network hang or timeout](https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/issues/1746)
:white_circle: [[LS] Ensure Duo Chat continues to operate after a network hang or timeout](https://gitlab.com/gitlab-org/editor-extensions/gitlab-lsp/-/issues/664)
</td>
<td>
:white_circle:
</td>
<td></td>
</tr>
<tr>
<td>Ensure top level try/catch statements</td>
<td>We had similar efforts in the LS previously. Node.js processes are 'fragile' in that if an error bubbles up to the top of the stack, it will crash the process/app/webview. Thus, it's important that all entry points are wrapped in a try/catch.</td>
<td>
:white_check_mark: https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/issues/1699+
:construction: <a href="https://gitlab.com/gitlab-org/editor-extensions/gitlab-lsp/-/issues/647">[LS] Ensure Duo Chat services and packages have top-level try/catch statements</a>
</td>
<td>
:white_check_mark: [Webview solution in VSCode](https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/merge_requests/2323) in %"17.9"
</td>
<td>
@dbernardi
</td>
</tr>
<tr>
<td>Ensure Duo Chat uses GraphQL fallback for Action Cable errors</td>
<td>A recent incident exposed a part of Duo Chat where there is no graceful fallback from sockets to GraphQL.</td>
<td>
:white_circle: [Duo Chat should fall back to graphql if action cable subscription fails](https://gitlab.com/gitlab-org/editor-extensions/gitlab-lsp/-/issues/665)
</td>
<td>
:white_circle:
</td>
<td></td>
</tr>
<tr>
<td>Allow users to configure the timeout for chat startup</td>
<td>
After debugging a customer issue (https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/issues/1699+), it would be a good idea to add a configurable timeout so that longer startup periods do not cause Duo features to time out.
</td>
<td>
:white_check_mark: https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/issues/1826+
</td>
<td>
:white_check_mark: [Config timeout MR](https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/merge_requests/2497) in %"17.11"
</td>
<td>
@viktomas
</td>
</tr>
<tr>
<td>
\[Spike\] Audit Duo Chat for unhandled error cases
</td>
<td>Recent incidents have shown that Duo Chat has poor "error boundary" coverage. Let's audit the code/functionality (timeboxed) to identify high-risk gaps.</td>
<td></td>
<td></td>
<td></td>
</tr>
</table>
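
Two of the items above, timeout errors and top-level try/catch, can be illustrated with a short sketch. The helper names `withTimeout`, `TimeoutError`, and `guardEntryPoint` are hypothetical and do not refer to existing code in the extensions or the LS. The sketch assumes the hanging request is exposed as a promise.

```typescript
// Hypothetical error type so callers can tell timeouts from other failures.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Request timed out after ${ms}ms`);
    this.name = 'TimeoutError';
  }
}

// Rejects with TimeoutError if `work` has not settled within `ms`.
// Note: this does not cancel the underlying request, it only stops waiting.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    // Always clear the timer so it cannot keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Wraps an entry point so that any error is handed to `onError`
// instead of bubbling to the top of the stack.
function guardEntryPoint<A extends unknown[]>(
  handler: (...args: A) => Promise<void>,
  onError: (error: unknown) => void,
): (...args: A) => Promise<void> {
  return async (...args: A) => {
    try {
      await handler(...args);
    } catch (error) {
      onError(error);
    }
  };
}
```

In this arrangement the `onError` callback is where a user-facing message would be shown, and a configurable timeout would simply be the `ms` value read from user settings.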
## :sparkles: Better error messages for users
<table>
<tr>
<th></th>
<th>Description</th>
<th>Sub Epic / Issues</th>
<th>Status</th>
<th>Technical Lead</th>
</tr>
<tr>
<td>Improve VS Code webview error messages</td>
<td>Ensure that the extension catches and handles errors before the 'An error occurred while loading view' state.</td>
<td>
:white_check_mark: [[VS Code] DuoChat: Improve VS Code webview error messages](https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/issues/1699)
</td>
<td>
:white_check_mark: Deliverable for %"17.9"
</td>
<td>
@dbernardi
</td>
</tr>
<tr>
<td>First Iteration: Show the current Duo Chat status to the user</td>
<td>
@tristan.read : IMO - this should mostly target the new LS chat implementation rather than VS Code.
</td>
<td>
:white_check_mark: [[VS Code] Show the current Duo Chat status to the user](https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/issues/1712)
:white_check_mark: [[VS Code] Introduce Chat state manager in the extension and use LS policies to check chat availability](https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/issues/1678)
</td>
<td>
:white_check_mark: Deliverable for %"17.8"
</td>
<td>
@dbernardi
</td>
</tr>
<tr>
<td>Add logging around Node API Timing on Webview startup</td>
<td>In relation to earlier findings, it will be beneficial to add more timestamp logging while the webview is starting up. When debugging, this will clarify how long components take to start.</td>
<td>
:white_circle: https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/issues/1827+
</td>
<td>
:white_circle:
</td>
<td></td>
</tr>
<tr>
<td>
Second Iteration:
Trigger the Diagnostics page when users press the chat `STATUS` button
</td>
<td>
With the work from this epic, https://gitlab.com/groups/gitlab-org/editor-extensions/-/epics/82+, a diagnostics page is produced which will eventually show all states of Duo Chat, Code Suggestions, authentication, etc. The current state shows version numbers.
</td>
<td>
:white_check_mark: https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/issues/1760+
:white_check_mark: https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/issues/1798+
</td>
<td>
:white_check_mark: [Feature State Diag MR](https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/merge_requests/2442) in %"17.11"
:white_check_mark: [STATUS button MR](https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/merge_requests/2526) in %"18.0"
</td>
<td>
@dbernardi
</td>
</tr>
<tr>
<td>
Third Iteration: Use Webviews to Indicate Duo Chat Status to Users
Introduce more States for Duo Chat
</td>
<td>
@dbernardi : The way this is set up, these issues will follow: https://gitlab.com/groups/gitlab-org/-/epics/15661+ which is expected to be complete in %"17.9"
</td>
<td>
:x: https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/issues/1755+
:x: <a href="https://gitlab.com/gitlab-org/gitlab/-/issues/507562">UX states for Duo Chat</a>
:x: https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/issues/1754+
</td>
<td>
:white_check_mark: https://gitlab.com/gitlab-org/editor-extensions/gitlab-lsp/-/issues/994+ was completed in %18.2. The decision was made NOT to move forward with these diagnostics as a webview.
</td>
<td>
@dbernardi
</td>
</tr>
</table>
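
The timestamp logging proposed for webview startup could look something like the sketch below. `StartupTimeline`, the `[webview-startup]` log prefix, and the step names are hypothetical and chosen only for illustration. The clock is injectable so the behaviour can be checked deterministically.

```typescript
type Clock = () => number;

// Records how long each startup step took relative to the previous one
// and to the start of the timeline.
class StartupTimeline {
  private readonly now: Clock;
  private readonly start: number;
  private last: number;
  private readonly entries: string[] = [];

  constructor(now: Clock = Date.now) {
    this.now = now;
    this.start = now();
    this.last = this.start;
  }

  // Call once after each startup step completes.
  mark(step: string): string {
    const t = this.now();
    const line = `[webview-startup] ${step}: +${t - this.last}ms (total ${t - this.start}ms)`;
    this.last = t;
    this.entries.push(line);
    return line;
  }

  // Returns a copy of all lines, e.g. to attach to a diagnostics report.
  report(): string[] {
    return [...this.entries];
  }
}
```

A caller would create one timeline when the webview begins loading and call `mark` after each step, passing the resulting lines to whatever logger is already in use.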
## :sparkles: Surfacing errors to developers
<table>
<tr>
<th></th>
<th>Description</th>
<th>Sub Epic / Issues</th>
<th>Status</th>
<th>Technical Lead</th>
</tr>
<tr>
<td>Track GraphQL errors using Sentry</td>
<td>
We can selectively add Sentry reporting to certain errors.
A team Sentry project already exists.
Add coverage to some key functionality. This will allow GitLab to notice and act upon trends in the errors, e.g. a spike in proxy error rates.
</td>
<td>
:white_circle: [[LS] Track GraphQL errors using Sentry](https://gitlab.com/gitlab-org/editor-extensions/gitlab-lsp/-/issues/646)
:white_circle: [[VS Code] Track GraphQL errors using Sentry](https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/issues/1737)
:white_circle: [[JetBrains] Track GraphQL errors using Sentry](https://gitlab.com/gitlab-org/editor-extensions/gitlab-jetbrains-plugin/-/issues/852)
</td>
<td>
:white_circle:
</td>
<td></td>
</tr>
<tr>
<td>Surface Sentry errors in a useful, low-friction way.</td>
<td>
Weekly digest for new Sentry errors. Leverage AI analysis?
Set up rotation for team/extension 'Sentry officer' to check Sentry regularly.
</td>
<td></td>
<td></td>
<td></td>
</tr>
</table>
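
Selective reporting, as described in the first row above, might be structured along these lines. The `ErrorReporter` interface is a hypothetical stand-in for whichever reporting SDK call is eventually used, and the message-based `categorize` function is a deliberately crude assumption. Real code would more likely rely on typed error classes.

```typescript
// Hypothetical abstraction over the real reporting SDK.
interface ErrorReporter {
  report(error: Error, tags: Record<string, string>): void;
}

type ErrorCategory = 'graphql' | 'proxy' | 'other';

// Crude, illustrative classification based on the error message.
function categorize(error: Error): ErrorCategory {
  const message = error.message.toLowerCase();
  if (message.includes('graphql')) return 'graphql';
  if (message.includes('proxy')) return 'proxy';
  return 'other';
}

// Reports only errors whose category is in the allow-list.
// Returns true when the error was forwarded to the reporter.
function reportSelectively(
  error: Error,
  reporter: ErrorReporter,
  allowed: ReadonlySet<ErrorCategory>,
): boolean {
  const category = categorize(error);
  if (!allowed.has(category)) return false;
  reporter.report(error, { category });
  return true;
}
```

Tagging each report with its category is what would make trends, such as a spike in proxy errors, visible when the reports are aggregated.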
## :sparkles: Developer enablement
_Note: these don't quite fit under 'Error Handling' - we may want to move this / create a sub-epic._
_Here's the epic for these:_ https://gitlab.com/groups/gitlab-org/-/epics/16410+s
<table>
<tr>
<th></th>
<th>Description</th>
<th>Sub Epic / Issues</th>
<th>Status</th>
<th>Technical Lead</th>
</tr>
<tr>
<td>Testing guide</td>
<td>
Developing locally, while covering all combinations of extension version and GitLab instance version, is tricky. Contributors, especially those outside of the core team, may not know what the supported configurations and common pitfalls / edge cases are.
Provide a testing guide to set all contributors up for success.
</td>
<td>
[Create a guide for testing Duo features against older gitlab instances](https://gitlab.com/gitlab-org/editor-extensions/meta/-/issues/189)
</td>
<td>
:white_check_mark: [MR](https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/merge_requests/2376) in %"17.9"
</td>
<td>
@tristan.read
</td>
</tr>
<tr>
<td>Set up a reference environment of older GitLab</td>
<td>Have at least one older environment always available for team members to test against.</td>
<td>
:white_circle: [[Spike] Set up an Editor Extensions reference GitLab environment for one older GitLab instance version](https://gitlab.com/gitlab-org/editor-extensions/meta/-/issues/192)
</td>
<td>
:white_circle:
</td>
<td>
@tristan.read
</td>
</tr>
<tr>
<td>Extension test on the monolith</td>
<td>We should ensure that changing a vital desktop extension API on the monolith is flagged with a failing test.</td>
<td>
:white_circle: [Add a smoke test of Duo functionality on a Desktop Editor Extension to the GitLab monolith pipeline](https://gitlab.com/gitlab-org/gitlab/-/issues/509257)
</td>
<td>
:white_circle:
</td>
<td>
@tristan.read
</td>
</tr>
</table>