
Analyse test gaps for recent AI-related Incidents

List of recent incidents here: gitlab-com/gl-infra/production#18329 (closed)

Incidents Test Gap Analysis

Date | Incident | Test Gap Analysis issue | Test Gap Analysis Status | Is there remaining work to address this test gap? | Decision | Owning team
2024-08-12

A customer encountered an issue when the Language Server failed to authenticate with the GitLab monolith, preventing the retrieval of code suggestions.

https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/issues/1522+

https://gitlab.com/gitlab-org/gitlab/-/work_items/493230

Done

Yes. In Progress.

Team: Test Platform

Conclusion: https://gitlab.com/gitlab-org/gitlab/-/work_items/493230#note_2138314453

Planned and ongoing work is in epic: gitlab-org/quality&82

  • Code Suggestions
  • Editor Extensions
2024-08-08

severity3 All Claude 3.5 features are down due to an Anthropic outage

Discussion here #497139 (comment 2138091802)

Done

No

No test gap identified: the incident was caused by a third-party outage and was caught by e2e tests.

It sparked an SET discussion about the e2e tests' dependency on third parties and how to handle it. There might be follow-up work, but it is not directly related to this particular incident.

AI Framework
2024-08-06

Ultimate Dotcom customers with Duo Pro licensing were unable to use:

  • IntelliJ + WebStorm with Code Suggestions
  • VS Code with Code Suggestions

gitlab-org/editor-extensions/gitlab-jetbrains-plugin#561 (closed)

gitlab-org/editor-extensions/gitlab-jetbrains-plugin#561 (comment 2173005803)

In progress

Yes

Investigation is in progress

There is a gap that can be covered with e2e.

Work in progress: discussing new e2e coverage for using extensions with different network configurations to instances.

  • Editor Extensions
  • Code Suggestions
2024-08-02

severity2 2024-08-02: Duo Chat getting A1000 / Bad Gatewa... (gitlab-com/gl-infra/production#18357 - closed)

#497139

Done

No

This was caught by e2e tests.

No test gap identified: the incident was caused by a third-party outage and was caught by e2e tests.

AI Framework
2024-07-31

severity3

2024-07-31: Some users of AI may get 401 unauthorized

Corrective action issue: gitlab-com/gl-infra/production#18349 (closed)

Test gap discussion:

gitlab-com/gl-infra/production#18349 (comment 2173250486)

Done

Yes.

Team: Cloud connector

There is a test gap, but the tests are technically infeasible to implement.

There are multiple action items to prevent key-rotation-related incidents in the future. See the Cloud Connector team's epic: https://gitlab.com/groups/gitlab-org/-/epics/15142

Provision, Cloud Connection

2024-07-24

(it is the same incident as the row below)

priority1 severity1 https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/5700+

https://gitlab.com/gitlab-org/gitlab/-/work_items/493232 (it is the same as the row below)

Done

No

The bug was picked up by e2e tests on staging. The test was not blocking at the time. It now blocks deploys.

The second incident (SM) happened because the original fix was not backported.

Decision: No further action is needed from Test Platform. Similar defects will be caught by the existing blocking e2e tests.

2024-07-11

severity3 2024-07-11: Code suggestions erroneously return... (gitlab-com/gl-infra/production#18262 - closed)

https://gitlab.com/gitlab-org/gitlab/-/work_items/493232

Done

No

Same as above.

Provision, Cloud Connection

2024-07-01

severity3 2024-07-01: Some users are seeing an API error ... (gitlab-com/gl-infra/production#18219 - closed)

gitlab-com/gl-infra/production#18219 (comment 2185399848)

In progress

Not sure, because the root cause is still not clear.

Root cause is not fully clear. Incident post-action needs improvement.

e2e tests failed on canary and staging.canary because of this incident, but these tests are not blocking, because AI providers are not reliable.

Some unit tests were added by Duo Chat team.

2024-06-25

severity3 https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18191+

Incident Review: 2024-06-25: Error Code A1001 o... (gitlab-com/gl-infra/production#18204 - closed)

#498049 (closed)

Done

No

No e2e test gap found.

Low-level test gap identified and covered as a part of post-incident action in !157566 (diffs)

Duo Chat
2024-05-24

severity2 2024-05-24: code suggestions returning 403 in p... (gitlab-com/gl-infra/production#18062 - closed)

#497151

Done

Yes

Team: Test Governance (Q4 OKR)

This was caught by existing blocking e2e tests, but not on staging-canary, probably because the cache remains valid for some time.

Action item: increase coverage for CloudConnector by selectively running ai-gateway tests for CloudConnector changes.

Cloud Connector
2024-05-24

severity3 2024-05-24: QA smoke (gstg) for code suggestion... (gitlab-com/gl-infra/production#18065 - closed)

#498050 (closed)

Done

No

Nothing to improve. The bug was caught by existing tests in staging-canary, which blocked the deployment.

IDE
Edited by Ksenia Kolpakova