Implement a multi-provider strategy to reduce dependency on a single AI service provider
Description
The recent 10+ hour outage with Anthropic highlights the need for a robust fallback strategy to ensure the continuity of our AI-powered features. We aim to implement a multi-layered approach to reduce our dependency on a single AI service and improve the resilience of our system.
This strategy will be implemented in four main steps:
Step 1: Provider Fallback
- Implement fallback mechanisms to switch between different AI service providers.
- Prioritize provider fallback over model version fallback, as it offers more immediate recovery options.
- Focus on providers that offer the same model (e.g., Claude 3.5 on Anthropic, AWS Bedrock, and Google Vertex).
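The provider fallback in Step 1 could look roughly like the sketch below: try the same Claude model through each provider in priority order and only surface an error once every provider has failed. The provider names, model identifiers, and the `invoke()` signature are illustrative assumptions, not our actual client interface.

```python
from dataclasses import dataclass


class ProviderError(Exception):
    """Raised when a provider call fails or times out."""


@dataclass
class Provider:
    name: str
    model_id: str  # the same underlying model, addressed per provider


# Same Claude 3.5 model exposed through three providers, in priority order.
PROVIDERS = [
    Provider("anthropic", "claude-3-5-sonnet"),
    Provider("bedrock", "anthropic.claude-3-5-sonnet"),
    Provider("vertex", "claude-3-5-sonnet"),
]


def invoke(provider: Provider, prompt: str) -> str:
    # Placeholder for the real per-provider client call.
    raise ProviderError(f"{provider.name} unavailable")


def complete_with_fallback(prompt: str) -> str:
    """Try each provider in order; raise only after all of them fail."""
    errors = []
    for provider in PROVIDERS:
        try:
            return invoke(provider, prompt)
        except ProviderError as err:
            errors.append((provider.name, str(err)))
    raise ProviderError(f"all providers failed: {errors}")
```

Because every provider serves the same model, this layer can recover from a provider outage without any change in output quality, which is why it is prioritized over model version fallback.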
Step 2: Enhance Fallback Options and Testing
- Determine what additional strategies we need besides provider fallback.
- Implement model version fallback as a secondary option, with careful consideration of potential risks.
- Identify critical AI-powered features that require immediate fallback support.
- Define feature-specific variance criteria for fallback decisions.
- Explore the use of custom models or alternative models (e.g., OpenAI) as additional fallback options.
- Implement a testing strategy for fallback scenarios, including internal usage patterns and quality assessment.
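One way to express the feature-specific variance criteria above is a per-feature fallback policy: each feature declares its ordered model-version fallbacks and how much quality regression it can tolerate. The feature names, model identifiers, and thresholds below are illustrative assumptions only.

```python
from dataclasses import dataclass


@dataclass
class FallbackPolicy:
    primary_model: str
    fallback_models: list  # ordered; model-version fallback is secondary
    max_quality_variance: float  # acceptable eval-score drop vs primary


POLICIES = {
    # A critical feature gets a deeper chain, including an alternative
    # vendor's model, with a wider acceptable variance.
    "code_suggestions": FallbackPolicy(
        primary_model="claude-3-5-sonnet",
        fallback_models=["claude-3-sonnet", "gpt-4o"],
        max_quality_variance=0.10,
    ),
    # A less critical feature keeps a shorter, same-family chain.
    "issue_summarization": FallbackPolicy(
        primary_model="claude-3-5-sonnet",
        fallback_models=["claude-3-haiku"],
        max_quality_variance=0.05,
    ),
}


def allowed_fallbacks(feature: str, quality_drop: dict) -> list:
    """Return fallback models whose measured quality drop is within bounds.

    `quality_drop` maps model name -> eval-score drop versus the primary,
    as produced by the fallback testing strategy; unknown models are
    treated as failing the check.
    """
    policy = POLICIES[feature]
    return [
        m for m in policy.fallback_models
        if quality_drop.get(m, 1.0) <= policy.max_quality_variance
    ]
```

Tying the policy to measured quality drops keeps model-version fallback gated on the testing strategy rather than on assumptions about model equivalence.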
Step 3: Automatic Fallback
- Develop logic to switch between providers based on availability and performance criteria.
- Create a centralized control mechanism for managing "AI is not working" scenarios across features.
- #478067 (closed)
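The automatic switching logic in Step 3 could be built on a per-provider circuit breaker: after repeated failures a provider is skipped for a cooldown period, and if no provider is available the centralized "AI is not working" handling takes over. This is a minimal sketch; the thresholds and the half-open behaviour are assumptions, not agreed values.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=60):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def available(self):
        if self.opened_at is None:
            return True
        # Allow a retry once the cooldown elapses (half-open behaviour).
        return time.monotonic() - self.opened_at >= self.cooldown_seconds


def pick_provider(breakers):
    """Return the first provider (in priority order) whose circuit is closed."""
    for name, breaker in breakers.items():
        if breaker.available():
            return name
    return None  # signal the centralized "AI is not working" handling
```

Keeping the breakers in one place gives every feature a single answer to "which provider should I use right now", rather than each feature implementing its own health checks.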
Step 4: Improve Model Flexibility and Management
- Move model version, provider selection, and configuration out of the Rails model and into the AI Gateway.
- Centralize model configuration in the AI Gateway, similar to our prompt registry approach.
- Implement a flexible model selection mechanism in the AI Gateway.
- Develop an interface for easily updating and managing model versions and providers across the system.
- Create an AI Framework admin dashboard for controlling model/prompt usage percentages and manual overrides.
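Once model configuration lives in the AI Gateway, the admin dashboard's percentage controls and manual overrides could sit on top of a registry like the sketch below. All config keys, model names, and the bucketing scheme are illustrative assumptions, mirroring the prompt registry approach rather than describing existing code.

```python
import hashlib

# Centralized, per-feature model configuration held in the AI Gateway.
MODEL_CONFIG = {
    "code_suggestions": {
        "stable": {"model": "claude-3-5-sonnet", "provider": "anthropic"},
        "edge": {"model": "claude-3-5-sonnet-new", "provider": "anthropic"},
        "edge_percentage": 10,    # controlled from the admin dashboard
        "manual_override": None,  # e.g. "stable" to force everyone off edge
    },
}


def select_model(feature: str, user_id: str) -> dict:
    """Pick a model tier for a user, honouring manual overrides first."""
    cfg = MODEL_CONFIG[feature]
    if cfg["manual_override"]:
        return cfg[cfg["manual_override"]]
    # Stable hash bucket so a given user consistently sees one variant.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    tier = "edge" if bucket < cfg["edge_percentage"] else "stable"
    return cfg[tier]
```

Because the Rails side no longer hard-codes model or provider choices, updating a model version or shifting rollout percentages becomes a Gateway-side config change rather than an application deploy.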
Additional Considerations
- Define clear criteria and rules for automatic failover, including SLAs and recovery time objectives.
- Establish a cohesive SLA strategy that accounts for multiple providers and their individual SLAs.
- Consider implementing a tiered system (e.g., stable vs. edge) for different models and features.
- Explore options for customer model selection within a limited, vetted set of choices.
- Integrate this strategy with the existing AI Continuity plan and update as necessary.
By implementing this multi-layered approach, we aim to significantly improve the reliability, availability, and flexibility of our AI-powered features. This will minimize the impact of potential outages from any single provider and provide a more robust framework for managing our AI services.
Edited by Michelle Gill