Analyse test gaps for recent AI-related Incidents
List of recent incidents here: gitlab-com/gl-infra/production#18329 (closed)
Incidents Test Gap Analysis
Date | Incident | Test Gap Analysis issue | Test Gap Analysis Status | Is there remaining work to address this test gap? | Decision | Owning team |
---|---|---|---|---|---|---|
2024-08-12 |
A customer encountered an issue when the Language Server failed to authenticate with the GitLab monolith, preventing the retrieval of code suggestions. https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/issues/1522+ |
|
Yes. In Progress. Team: Test Platform |
Conclusion: https://gitlab.com/gitlab-org/gitlab/-/work_items/493230#note_2138314453 Planned and ongoing work is in epic: gitlab-org/quality&82 |
|
|
2024-08-08 |
severity3 All Claude 3.5 features are down due to an Anthropic outage |
Discussion here #497139 (comment 2138091802) |
|
No |
No test gap identified, as the incident was caused by a third-party outage and was also caught by e2e tests. It sparked an SET discussion about the e2e tests' dependency on third parties and how to handle it. There might be follow-up work, but it is not directly related to this particular incident |
AI Framework |
2024-08-06 |
Ultimate Dotcom Customers with Duo Pro licensing were unable to use
gitlab-org/editor-extensions/gitlab-jetbrains-plugin#561 (closed) |
gitlab-org/editor-extensions/gitlab-jetbrains-plugin#561 (comment 2173005803) |
Yes. Investigation is in progress |
There is a gap that can be covered with e2e tests. Work in progress: discussing new e2e coverage for using extensions with different network configurations to instances |
|
|
2024-08-02 |
severity2 2024-08-02: Duo Chat getting A1000 / Bad Gatewa... (gitlab-com/gl-infra/production#18357 - closed) |
|
No |
No test gap identified, as the incident was caused by a third-party outage and was caught by e2e tests. |
AI Framework | |
2024-07-31 |
Corrective action issue: gitlab-com/gl-infra/production#18349 (closed) Test gap discussion: |
|
Yes. Team: Cloud Connector |
There is a test gap, but tests are technically infeasible to implement. There are multiple action items to prevent key-rotation-related incidents in the future. Cloud Connector team's epic: https://gitlab.com/groups/gitlab-org/-/epics/15142 |
||
2024-07-24 (it is the same incident as the row below) |
priority1 severity1 https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/5700+ |
https://gitlab.com/gitlab-org/gitlab/-/work_items/493232 (it is the same as the row below) |
|
No |
The bug was picked up by e2e tests on staging. The test was not The second incident (SM) happened because the original fix was not backported. Decision: No further action is needed from Test Platform. Similar defects will be caught by existing blocking e2e tests |
|
2024-07-11 |
severity3 2024-07-11: Code suggestions erroneously return... (gitlab-com/gl-infra/production#18262 - closed) |
|
No | same as above | ||
2024-07-01 |
severity3 2024-07-01: Some users are seeing an API error ... (gitlab-com/gl-infra/production#18219 - closed) |
Not sure, because the root cause is still not clear. |
The root cause is not fully clear. Incident post-action needs improvement. e2e tests failed on canary and staging.canary because of this incident, but these tests are not blocking because AI providers are not reliable. Some unit tests were added by the Duo Chat team. |
|||
2024-06-25 |
severity3 https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18191+ Incident Review: 2024-06-25: Error Code A1001 o... (gitlab-com/gl-infra/production#18204 - closed) |
|
No |
No e2e test gap found. A low-level test gap was identified and covered as part of the post-incident action in !157566 (diffs) |
Duo Chat | |
2024-05-24 |
severity2 2024-05-24: code suggestions returning 403 in p... (gitlab-com/gl-infra/production#18062 - closed) |
|
Yes. Team: Test Governance (Q4 OKR) |
This was caught by existing blocking e2e tests, but not on staging-canary, probably because the cache remains valid for some time. Action item: increase coverage for CloudConnector by selectively running ai-gateway tests for CloudConnector changes |
Cloud Connector | |
2024-05-24 |
severity3 2024-05-24: QA smoke (gstg) for code suggestion... (gitlab-com/gl-infra/production#18065 - closed) |
|
No | Nothing to improve. The bug was caught by existing tests in staging-canary and blocked the deployment. | IDE |