
AI-related Incidents - Trends and Action Items

Context

The AI teams have had many incidents recently. None of them qualified for an FCL, and they do not seem to be slowing down. The intent of this issue is to identify additional data and action items that we can begin driving forward.

Incidents

Date | Incident | Review | Root Cause | Owning Team
2024-09-25

Anthropic rate limiting errors

N/A

Anthropic had an invisible rate limit on our account; fallback mechanisms would help handle this.

AI Framework
2024-08-12

A customer encountered an issue when the Language Server failed to authenticate with the GitLab monolith, preventing the retrieval of code suggestions.

https://gitlab.com/gitlab-org/gitlab/-/work_items/493230+

GitLab monolith authentication

  • Code Suggestions
  • Editor Extensions
2024-08-08

severity3 All Claude 3.5 features are down due to an Anthropic outage

Anthropic's infrastructure provider (upstream dependency)

AI Framework
2024-08-06

Ultimate Dotcom customers with Duo Pro licensing were unable to use:

  • IntelliJ + WebStorm with Code Suggestions
  • VS Code with Code Suggestions

gitlab-org/editor-extensions/gitlab-jetbrains-plugin#561 (closed)

Code Suggestions + Certificates

Solution MRs

  • Editor Extensions
  • Code Suggestions
2024-08-02

severity2 2024-08-02: Duo Chat getting A1000 / Bad Gatewa... (#18357 - closed)

Anthropic API

AI Framework
2024-07-31

severity3

2024-07-31: Some users of AI may get 401 unauthorized

https://gitlab.com/gitlab-org/customers-gitlab-com/-/issues/7112 Rotate jwt_signing_key

Provision, Cloud Connection

2024-07-24

priority1 severity1 https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/5700+

https://gitlab.com/gitlab-org/gitlab/-/work_items/493232+ (same as the row below)

2024-07-11

severity3 2024-07-11: Code suggestions erroneously return... (#18262 - closed)

https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18285+

Issue/Root Cause seems to be the same as the row above

https://gitlab.com/gitlab-org/gitlab/-/work_items/493230+

https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18285#summary Conflicting cache keys / stale cache.

Provision, Cloud Connection

2024-07-01

severity3 2024-07-01: Some users are seeing an API error ... (#18219 - closed)

2024-06-25

severity3 https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18191+

Incident Review: 2024-06-25: Error Code A1001 o... (#18204 - closed)

#18204 (closed) (Local environment matching production would help)

Duo Chat
2024-05-24

severity2 2024-05-24: code suggestions returning 403 in p... (#18062 - closed)

2024-05-24: Incident Review: code suggestions r... (#18064 - closed)

#18064 (closed) No backwards compatibility with caching

Cloud Connector
2024-05-24

severity3 2024-05-24: QA smoke (gstg) for code suggestion... (#18065 - closed)

#18065 (comment 1921564299) State change mismatch, environment mismatch

IDE

Trends

  • Debugging through the stack, identifying the root cause, and determining team ownership are all difficult and unclear
  • Caching, state changes, and environment mismatches ("it worked locally") recur as root causes
  • The incidents all concern whether our customers can use the AI features they pay for (licensing, access, etc.), as opposed to problems with an actual LLM or the feature itself
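
Two of the root causes listed above ("No backwards compatibility with caching" and "stale cache") point at cached data outliving the code or state that produced it. One common mitigation, shown here only as a minimal sketch and not as how Cloud Connector works, is to put a schema version into every cache key and to expire entries after a short TTL. All names below (`CACHE_SCHEMA_VERSION`, `cache_key`, `TTLCache`) are hypothetical, and the in-memory dictionary stands in for whatever cache store is actually used.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical schema version: bump it whenever the cached payload's shape changes,
# so entries written by older code are never read by newer code.
CACHE_SCHEMA_VERSION = 2


def cache_key(namespace: str, identifier: str, version: int = CACHE_SCHEMA_VERSION) -> str:
    """Build a key that changes whenever the payload schema version changes."""
    return f"{namespace}:v{version}:{identifier}"


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (stand-in for a real cache store)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Drop expired entries so stale data is never served.
            del self._entries[key]
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)
```

With this shape, a deploy that changes the payload format reads from fresh keys instead of misinterpreting old entries, and the TTL bounds how long a stale entitlement can be served. The trade-off is a burst of cache misses right after each version bump.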

Action Items

Corrective Action | DRI | Status

Implement a model fallback strategy in the event of an outage

Implement a multi-provider strategy to reduce d... (gitlab-org&14873 - closed)

AI Framework

Finished, but blocked from release by https://gitlab.com/gitlab-org/gitlab/-/issues/455110 (In dev)

Improve our Model Flexibility (flexible model deployments and updates)

AI Framework

Under investigation

GitLab Duo - Demo Environment for Solutions Arc... (gitlab-org/quality/quality-engineering/team-tasks#2933 - closed)

@m_gill , @jeffersonmartin, @poffey21

In Progress

Reduce dependency on Sidekiq and improve latency

Improve resiliency of Sidekiq during AI outages

Isolate the LLM worker in its own shard

AI Framework

In progress

Streamline AI logging for easier debugging, including self-hosted

AI Framework

(video)

Customers should have confidence in their Cloud... (gitlab-org&14518)

Cloud Connector

Fix stale Cloud Connector service catalog (gitlab-org/gitlab!154094 - merged)

Cloud Connector

Remove Add-On caching

Cloud Connector

https://gitlab.com/groups/gitlab-org/-/epics/15142+

Cloud Connector

In progress (long-running)

Review use of caching throughout Cloud Connecto... (gitlab-org/gitlab#485042 - closed)

Cloud Connector

Under discussion

Preventative Testing

https://gitlab.com/gitlab-org/fulfillment/meta/-/issues/1871+

Test Platforms

Review and document AI-related specs that block... (gitlab-org/quality/quality-engineering/team-tasks#2899)

Test Platforms

Not started

Test AI end-to-end across all systems

gitlab-org/gitlab#491036 (closed)

Test Platforms

In progress

Runbooks

Create AI troubleshooting/incident runbook (gitlab-org/gitlab#467191 - closed)

AI Framework

Resolved with the runbooks below

Enhance AI Incident Monitoring and Response Pro... (gitlab-org/gitlab#474608 - closed)

AI Framework

Duo Chat runbook (gitlab-org/gitlab#474548 - closed)

Duo Chat

Create Cloud Connector incident runbook (for en... (gitlab-org/cloud-connector-team/team-tasks#177 - closed)

Cloud Connector
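
The first corrective action in the table above, a model fallback strategy for provider outages and rate limiting, could look roughly like the sketch below. It is an illustration under stated assumptions, not the AI Framework implementation: `ProviderError`, `Provider`, and `complete_with_fallback` are hypothetical names, and each provider is modelled as a plain callable that raises `ProviderError` on an outage or a rate-limit response.

```python
from typing import Callable, List, Sequence, Tuple


class ProviderError(Exception):
    """Raised by a provider callable on outage or rate limiting (hypothetical)."""


# A provider is a (name, callable) pair; the callable takes a prompt and returns text.
Provider = Tuple[str, Callable[[str], str]]


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, completion) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            # Record the failure and move on to the next provider in the list.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the completion makes it possible to log which provider actually served a request, which also helps with the debugging and ownership difficulties noted under Trends.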

Edited by Michelle Gill