Implement a multi-provider strategy to reduce dependency on a single AI service provider
Description
The recent 10+ hour outage with Anthropic highlights the need for a robust fallback strategy to ensure the continuity of our AI-powered features. We aim to implement a multi-layered approach to reduce our dependency on a single AI service and improve the resilience of our system.
This strategy will be implemented in four main steps:
Step 1: Provider Fallback
- Implement fallback mechanisms to switch between different AI service providers.
- Prioritize provider fallback over model version fallback, as it offers more immediate recovery options.
- Focus on providers that offer the same model (e.g., Claude 3.5 on Anthropic, AWS Bedrock, and Google Vertex).
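The provider fallback in Step 1 could look roughly like the sketch below: try the same Claude model through each provider in priority order and only surface an error once every provider has failed. The provider names, model identifiers, and the `invoke()` signature are illustrative assumptions, not our actual client interface.

```python
from dataclasses import dataclass


class ProviderError(Exception):
    """Raised when a provider call fails or times out."""


@dataclass
class Provider:
    name: str
    model_id: str  # the same underlying model, addressed per provider


# Same Claude 3.5 model exposed through three providers, in priority order.
PROVIDERS = [
    Provider("anthropic", "claude-3-5-sonnet"),
    Provider("bedrock", "anthropic.claude-3-5-sonnet"),
    Provider("vertex", "claude-3-5-sonnet"),
]


def invoke(provider: Provider, prompt: str) -> str:
    # Placeholder for the real per-provider client call.
    raise ProviderError(f"{provider.name} unavailable")


def complete_with_fallback(prompt: str) -> str:
    """Try each provider in order; raise only after all of them fail."""
    errors = []
    for provider in PROVIDERS:
        try:
            return invoke(provider, prompt)
        except ProviderError as err:
            errors.append((provider.name, str(err)))
    raise ProviderError(f"all providers failed: {errors}")
```

Because every provider serves the same model, this layer can recover from a provider outage without any change in output quality, which is why it is prioritized over model version fallback.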
Step 2: Enhance Fallback Options and Testing
- Determine what additional strategies we need besides provider fallback.
- Implement model version fallback as a secondary option, with careful consideration of potential risks.
- Identify critical AI-powered features that require immediate fallback support.
- Define feature-specific variance criteria for fallback decisions.
- Explore the use of custom models or alternative models (e.g., OpenAI) as additional fallback options.
- Implement a testing strategy for fallback scenarios, including internal usage patterns and quality assessment.
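One way to express the feature-specific variance criteria above is a per-feature fallback policy: each feature declares its ordered model-version fallbacks and how much quality regression it can tolerate. The feature names, model identifiers, and thresholds below are illustrative assumptions only.

```python
from dataclasses import dataclass


@dataclass
class FallbackPolicy:
    primary_model: str
    fallback_models: list  # ordered; model-version fallback is secondary
    max_quality_variance: float  # acceptable eval-score drop vs primary


POLICIES = {
    # A critical feature gets a deeper chain, including an alternative
    # vendor's model, with a wider acceptable variance.
    "code_suggestions": FallbackPolicy(
        primary_model="claude-3-5-sonnet",
        fallback_models=["claude-3-sonnet", "gpt-4o"],
        max_quality_variance=0.10,
    ),
    # A less critical feature keeps a shorter, same-family chain.
    "issue_summarization": FallbackPolicy(
        primary_model="claude-3-5-sonnet",
        fallback_models=["claude-3-haiku"],
        max_quality_variance=0.05,
    ),
}


def allowed_fallbacks(feature: str, quality_drop: dict) -> list:
    """Return fallback models whose measured quality drop is within bounds.

    `quality_drop` maps model name -> eval-score drop versus the primary,
    as produced by the fallback testing strategy; unknown models are
    treated as failing the check.
    """
    policy = POLICIES[feature]
    return [
        m for m in policy.fallback_models
        if quality_drop.get(m, 1.0) <= policy.max_quality_variance
    ]
```

Tying the policy to measured quality drops keeps model-version fallback gated on the testing strategy rather than on assumptions about model equivalence.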
Step 3: Automatic Fallback
- Develop logic to switch between providers based on availability and performance criteria.
- Create a centralized control mechanism for managing "AI is not working" scenarios across features.
- #478067 (closed)
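The automatic switching logic in Step 3 could be built on a per-provider circuit breaker: after repeated failures a provider is skipped for a cooldown period, and if no provider is available the centralized "AI is not working" handling takes over. This is a minimal sketch; the thresholds and the half-open behaviour are assumptions, not agreed values.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=60):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def available(self):
        if self.opened_at is None:
            return True
        # Allow a retry once the cooldown elapses (half-open behaviour).
        return time.monotonic() - self.opened_at >= self.cooldown_seconds


def pick_provider(breakers):
    """Return the first provider (in priority order) whose circuit is closed."""
    for name, breaker in breakers.items():
        if breaker.available():
            return name
    return None  # signal the centralized "AI is not working" handling
```

Keeping the breakers in one place gives every feature a single answer to "which provider should I use right now", rather than each feature implementing its own health checks.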
Step 4: Improve Model Flexibility and Management
- Move model version, provider selection, and configuration out of the Rails model and into the AI Gateway.
- Centralize model configuration in the AI Gateway, similar to our prompt registry approach.
- Implement a flexible model selection mechanism in the AI Gateway.
- Develop an interface for easily updating and managing model versions and providers across the system.
- Create an AI Framework admin dashboard for controlling model/prompt usage percentages and manual overrides.
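Once model configuration lives in the AI Gateway, the admin dashboard's percentage controls and manual overrides could sit on top of a registry like the sketch below. All config keys, model names, and the bucketing scheme are illustrative assumptions, mirroring the prompt registry approach rather than describing existing code.

```python
import hashlib

# Centralized, per-feature model configuration held in the AI Gateway.
MODEL_CONFIG = {
    "code_suggestions": {
        "stable": {"model": "claude-3-5-sonnet", "provider": "anthropic"},
        "edge": {"model": "claude-3-5-sonnet-new", "provider": "anthropic"},
        "edge_percentage": 10,    # controlled from the admin dashboard
        "manual_override": None,  # e.g. "stable" to force everyone off edge
    },
}


def select_model(feature: str, user_id: str) -> dict:
    """Pick a model tier for a user, honouring manual overrides first."""
    cfg = MODEL_CONFIG[feature]
    if cfg["manual_override"]:
        return cfg[cfg["manual_override"]]
    # Stable hash bucket so a given user consistently sees one variant.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    tier = "edge" if bucket < cfg["edge_percentage"] else "stable"
    return cfg[tier]
```

Because the Rails side no longer hard-codes model or provider choices, updating a model version or shifting rollout percentages becomes a Gateway-side config change rather than an application deploy.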
Additional Considerations
- Define clear criteria and rules for automatic failover, including SLAs and recovery time objectives.
- Establish a cohesive SLA strategy that accounts for multiple providers and their individual SLAs.
- Consider implementing a tiered system (e.g., stable vs. edge) for different models and features.
- Explore options for customer model selection within a limited, vetted set of choices.
- Integrate this strategy with the existing AI Continuity plan and update as necessary.
By implementing this multi-layered approach, we aim to significantly improve the reliability, availability, and flexibility of our AI-powered features. This will minimize the impact of potential outages from any single provider and provide a more robust framework for managing our AI services.
Edited by Michelle Gill