Improve resilience of AI-powered features during Sidekiq outages
Description
AI-powered features are currently experiencing service disruptions during Sidekiq outages, significantly impacting user experience. This issue aims to explore and implement solutions to improve the resilience of these features during Sidekiq incidents.
Problem Statement
- During Sidekiq outages, AI-powered features become unusable
- The current architecture adds unnecessary round-trip delays, which slow response times
- Sidekiq is not the ideal tool for latency-sensitive operations where users expect immediate responses
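To make the round-trip cost in the points above concrete, here is a minimal Ruby sketch of a latency model. It is not taken from the GitLab codebase: the method names, the 5-second timeout, and every millisecond figure are illustrative assumptions, chosen only to show how queue wait time can dominate during an outage while a direct call is unaffected by it.

```ruby
# Illustrative latency model. Every number below is a made-up assumption,
# not a measurement from any real system.
USER_TIMEOUT_MS = 5_000 # hypothetical point at which a user gives up

# Time a user waits when the request goes through a background queue:
# enqueue the job, wait for a worker, call the model, deliver the result.
def queued_path_ms(enqueue_ms:, queue_wait_ms:, model_ms:, delivery_ms:)
  enqueue_ms + queue_wait_ms + model_ms + delivery_ms
end

# Time a user waits when the web process calls the model service itself.
def direct_path_ms(model_ms:)
  model_ms
end

# A response only counts as usable if it arrives before the timeout.
def usable?(total_ms)
  total_ms <= USER_TIMEOUT_MS
end

scenarios = {
  "queued (healthy queue)" =>
    queued_path_ms(enqueue_ms: 5, queue_wait_ms: 50, model_ms: 800, delivery_ms: 20),
  "queued (queue backed up)" =>
    queued_path_ms(enqueue_ms: 5, queue_wait_ms: 60_000, model_ms: 800, delivery_ms: 20),
  "direct" =>
    direct_path_ms(model_ms: 800)
}

scenarios.each do |name, total_ms|
  puts "#{name}: #{total_ms} ms, usable=#{usable?(total_ms)}"
end
```

In this model the queued path is only as available as the queue itself: once `queue_wait_ms` exceeds the timeout, the feature is effectively down even though the model service is healthy.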
Tasks
- Research alternative architectures that don't rely on Sidekiq for latency-sensitive AI operations
- Evaluate pros and cons of options like:
  - Direct communication with AI gateway
  - Workhorse offloading
  - ActionController::Live with WebSockets
  - Fibers for lightweight concurrency
- Create proof-of-concept implementations for the one or two most promising approaches
- Measure performance and resilience improvements
- Propose final architecture and implementation plan
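As one illustration of the Fibers option listed above, the following Ruby sketch streams chunks to a client from within the request itself, with no background job involved. It is only a starting point for a proof of concept: the gateway response is represented by a plain array of tokens, `token_fiber` and `stream_completion` are invented names, and the `data:` framing follows the Server-Sent Events format rather than anything this issue prescribes.

```ruby
require "stringio"

# Hypothetical stand-in for a streaming AI gateway response: a real client
# would yield chunks as they arrive over the network.
def token_fiber(tokens)
  Fiber.new do
    tokens.each { |token| Fiber.yield(token) }
    nil # returning nil marks the end of the stream
  end
end

# Pull chunks from the fiber one at a time and write each to the output
# stream in Server-Sent Events framing. Returns the number of chunks written.
def stream_completion(fiber, out)
  written = 0
  while (chunk = fiber.resume)
    out.write("data: #{chunk}\n\n")
    written += 1
  end
  written
end

# StringIO stands in for the HTTP response stream so the sketch runs alone.
out = StringIO.new
count = stream_completion(token_fiber(%w[def hello end]), out)
puts "wrote #{count} chunks"
puts out.string
```

In a Rails controller that includes ActionController::Live, `out` could plausibly be `response.stream`, which also accepts `write`. Whether this fits GitLab's request lifecycle is exactly what the proof of concept would need to establish.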
Related Links
- Code Suggestions Implementation
  - URL: https://gitlab.com/gitlab-org/gitlab/blob/8b47a4e4bbc33858f18d623634f63cd94ad9138d/ee/lib/api/code_suggestions.rb#L101
  - Context: Current implementation of code suggestions using Workhorse offloading.
- Merge Request for Workhorse Implementation
  - MR: Serve completions endpoint through Workhorse (!126957 - merged) • Matthias Käppler • 16.3
  - Context: Implementation of serving completions through Workhorse.
- Related Issue: Completion Worker Delay
  - Issue: https://gitlab.com/gitlab-org/gitlab/-/issues/482625+s
  - Context: Highlights problems with Sidekiq processing delays affecting AI feature responsiveness.
- Code Suggestion Performance Dashboard
  - Epic: &12224
  - Context: Detailed performance analysis for code suggestions, including latency breakdowns.
- Sidekiq Incident Reports
  - Issue: 2024-09-03: The sidekiq_queueing SLI of the sid... (gitlab-com/gl-infra/production#18489 - closed) • Matt Smiley, Rehab+
  - Context: Details of a recent Sidekiq outage affecting AI features.
  - Issue: 2024-08-23: component_saturation_slo_out_of_bou... (gitlab-com/gl-infra/production#18435 - closed) • Steve Xuereb, Devin Sylva+
  - Context: Sidekiq and database CPU were overloaded, delaying job processing.
  - Issue: 2024-09-10: SidekiqServiceSidekiqQueueingApdexS... (gitlab-com/gl-infra/production#18538 - closed) • Vasilii Iakliushin
  - Context: Redis trace chunks ran out of memory, which backed up Sidekiq processes.
- Flamegraph Profiling Guide
  - URL: https://gitlab.com/gitlab-com/runbooks/-/blob/v2.198.1/docs/tutorials/how_to_use_flamegraphs_for_perf_profiling.md
  - Context: Guide on using flamegraphs for performance profiling.
- Latency Monitoring Dashboard
  - URL: https://log.gprd.gitlab.net/app/dashboards#/view/3684dc90-73f6-11ee-ac5b-8f88ebd04638
  - Context: Dashboard showing median values for Sidekiq scheduling latency and worker run duration.
- ActionController::Live Documentation
  - URL: https://api.rubyonrails.org/classes/ActionController/Live.html
  - Context: Official documentation for ActionController::Live, a proposed solution.
- Vertex AI Claude Integration
  - URL: https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-claude
  - Context: Documentation on using Claude models via Google Vertex AI.
Edited by Nathan Weinshenker