Improve resilience of AI-powered features during Sidekiq outages

Description

AI-powered features currently become unavailable during Sidekiq outages, which significantly degrades the user experience. This issue aims to explore and implement ways to make these features more resilient to Sidekiq incidents.

Problem Statement

  • During Sidekiq outages, AI-powered features become unusable
  • The current architecture adds unnecessary round-trip delays through Sidekiq, which increases response times
  • Sidekiq is not the ideal tool for latency-sensitive operations where users expect immediate responses
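
The failure mode listed above can be illustrated with a minimal plain-Ruby sketch. Everything in it is a hypothetical stand-in: `FakeQueue`, `FakeGateway`, `QueueUnavailable`, `complete_via_queue`, and `complete_directly` are not GitLab or Sidekiq APIs, and a real gateway client would make a network call. It only shows that a request routed through a queue fails when the queue is down, while a direct call to the gateway does not depend on the queue at all.

```ruby
# All names here are hypothetical stand-ins for illustration only.
class QueueUnavailable < StandardError; end

# Simulates a job queue that can be switched off to mimic an outage.
class FakeQueue
  attr_accessor :available

  def initialize
    @available = true
    @jobs = []
  end

  # Enqueues a job, or raises if the queue is down.
  def push(job)
    raise QueueUnavailable, "queue is down" unless @available

    @jobs << job
  end

  # Runs all queued jobs in order, as a background worker would.
  def drain
    @jobs.shift.call until @jobs.empty?
  end
end

# Simulates the AI gateway; a real client would issue an HTTP request.
class FakeGateway
  def complete(prompt)
    "completion for: #{prompt}"
  end
end

# Queue-backed path: request -> queue -> worker -> gateway -> result.
def complete_via_queue(queue, gateway, prompt)
  result = nil
  queue.push(-> { result = gateway.complete(prompt) })
  queue.drain
  result
end

# Direct path: request -> gateway -> result, with no queue involved.
def complete_directly(gateway, prompt)
  gateway.complete(prompt)
end

queue = FakeQueue.new
gateway = FakeGateway.new
queue.available = false # simulate the outage

begin
  complete_via_queue(queue, gateway, "def add")
rescue QueueUnavailable => e
  puts "queued path failed: #{e.message}"
end

puts complete_directly(gateway, "def add")
```

The direct path trades the queue's retry and back-pressure behaviour for availability and lower latency, which is the trade-off the alternatives in this issue would need to weigh.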

Tasks

  1. Research alternative architectures that don't rely on Sidekiq for latency-sensitive AI operations
  2. Evaluate the pros and cons of options such as:
    • Direct communication with AI gateway
    • Workhorse offloading
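    • ActionController::Live for HTTP response streaming (for example, Server-Sent Events; WebSockets in Rails are handled by Action Cable instead)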
    • Fibers for lightweight concurrency
  3. Create proof-of-concept implementations for the one or two most promising approaches
  4. Measure performance and resilience improvements
  5. Propose a final architecture and implementation plan
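
As a starting point for evaluating the Fibers option, here is a minimal sketch of cooperative interleaving using Ruby's core `Fiber` class. The helper names `token_fiber` and `interleave`, and the token arrays, are hypothetical; they stand in for streamed AI gateway responses. The sketch only demonstrates that several response streams can make progress on one thread without a background job. Real non-blocking I/O would additionally need a fiber scheduler (Ruby 3.0+), which is not shown here.

```ruby
# Hypothetical helper: wraps a list of tokens in a Fiber that hands them out
# one at a time, standing in for a streamed AI gateway response.
def token_fiber(tokens)
  Fiber.new do
    tokens.each { |token| Fiber.yield token }
    nil # the block's return value signals the end of the stream
  end
end

# Hypothetical helper: resumes each fiber in turn (round-robin) on the current
# thread and records which stream produced which chunk.
def interleave(fibers)
  output = []
  active = fibers.dup
  until active.empty?
    active.reject! do |name, fiber|
      chunk = fiber.resume
      output << [name, chunk] if chunk
      chunk.nil? # drop streams that have finished
    end
  end
  output
end

streams = {
  request_a: token_fiber(%w[def add]),
  request_b: token_fiber(%w[puts])
}

interleave(streams).each do |name, chunk|
  puts "#{name}: #{chunk}"
end
```

Each chunk is available to the caller as soon as its fiber yields it, so a controller could forward chunks to the client incrementally instead of waiting for a worker to finish the whole response.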

Related Links

  1. Code Suggestions Implementation
  2. Merge Request for Workhorse Implementation
  3. Related Issue: Completion Worker Delay
  4. Code Suggestion Performance Dashboard
    • Epic: &12224
    • Context: Detailed performance analysis for code suggestions, including latency breakdowns.
  5. Sidekiq Incident Report
  6. Flamegraph Profiling Guide
  7. Latency Monitoring Dashboard
  8. ActionController::Live Documentation
  9. Vertex AI Claude Integration
Edited by Nathan Weinshenker