# Support large YAML definitions in AI Catalog without hitting JSONB size limits
## Problem
The AI Catalog currently stores YAML definitions using a pattern that duplicates data:
```ruby
YAML.safe_load(params[:definition]).merge(yaml_definition: params[:definition])
```
This approach stores both:
1. The parsed YAML data structure
2. The original raw YAML string under `yaml_definition`
This duplication causes two issues:
1. **Size limit constraint**: The JsonSchemaValidator enforces a 64kb limit, but because the data is stored twice, users can only submit YAML files of roughly 32kb before hitting it. (Note: we need to [maintain this constraint](https://docs.gitlab.com/development/migration_style_guide/#storing-json-in-database))
2. **Scalability concern**: AI Catalog definitions could legitimately be much larger than 64kb (potentially 100kb+ for complex workflows), making the current approach unsuitable.
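To see why the duplication roughly halves the effective limit, here is a minimal Ruby sketch using a hypothetical 50-step workflow definition. The merged payload contains both the parsed structure and the escaped raw string, so the JSON stored in JSONB ends up close to twice the size of the YAML alone:

```ruby
require "yaml"
require "json"

# Hypothetical 50-step workflow definition (not a real catalog entry).
raw_yaml = "title: My Workflow\nsteps:\n" +
  Array.new(50) { |i| "  - run: step_#{i}" }.join("\n")

# The pattern from yaml_definition_parser.rb: parsed structure merged
# with the original raw string under "yaml_definition".
parsed = YAML.safe_load(raw_yaml)
merged = parsed.merge("yaml_definition" => raw_yaml)

# What actually lands in JSONB: the parsed data serialized as JSON
# (comparable in size to the YAML) plus the raw YAML as an escaped
# string, so stored_size is roughly twice raw_yaml.bytesize.
stored_size = merged.to_json.bytesize
```

With the 64kb validator limit applied to `stored_size`, the raw YAML can only be about half that before validation fails.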
## Current Implementation
```ruby
# ee/app/services/ai/catalog/concerns/yaml_definition_parser.rb
def definition_parsed
return unless params[:definition].present?
YAML.safe_load(params[:definition]).merge(yaml_definition: params[:definition])
rescue Psych::SyntaxError
nil
end
```
The merged result gets validated with:
```ruby
JsonSchemaValidator.new({
attributes: :definition,
size_limit: 64.kilobytes,
# ...
}).validate(self)
```
## Proposed Solutions
### Option 1: Object Storage (Recommended)
Store large YAML definitions in object storage and keep only metadata + reference in JSONB:
```ruby
# Database stores only:
{
"yaml_definition_url": "https://storage.../definitions/abc123.yml",
"checksum": "sha256:...",
"size": 150000,
"version": "v1",
"title": "My Workflow",
# ... other parsed metadata for querying
}
```
**Benefits:**
- No size limits
- Better performance (don't load large YAML unless needed)
- Follows GitLab patterns for large content
- Maintains audit trail with checksums
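A minimal sketch of building the metadata record for Option 1, assuming the raw YAML has already been uploaded and `storage_url` points at the resulting object; the upload mechanics themselves (an `ObjectStorage`-style carrier) are out of scope here, and `definition_metadata` is a hypothetical helper name:

```ruby
require "digest"
require "yaml"

# Hypothetical helper: build the JSONB record for Option 1, keeping only
# queryable metadata plus a reference to the full YAML in object storage.
def definition_metadata(raw_yaml, storage_url)
  parsed = YAML.safe_load(raw_yaml)

  {
    "yaml_definition_url" => storage_url,
    # Checksum of the raw YAML preserves the audit trail without
    # storing the content itself in JSONB.
    "checksum" => "sha256:#{Digest::SHA256.hexdigest(raw_yaml)}",
    "size" => raw_yaml.bytesize,
    "title" => parsed["title"]
  }
end

meta = definition_metadata(
  "title: My Workflow\n",
  "https://storage.example/definitions/abc123.yml" # assumed upload result
)
```

The JSONB record stays small and constant-sized regardless of how large the definition grows, so the 64kb validator limit is never in play.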
### Option 2: Hybrid Approach
- Small definitions (< 32kb): Keep current inline storage
- Large definitions: Automatically promote to object storage
- Transparent to consumers via accessor methods
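A rough sketch of Option 2's size-based dispatch. `upload_to_object_storage` is an assumed helper standing in for the real upload path and is stubbed below; the shape of the returned hashes is illustrative, not a committed schema:

```ruby
require "yaml"

# Threshold below which definitions stay inline in JSONB (half the 64kb
# validator limit, matching the current effective maximum).
INLINE_LIMIT = 32 * 1024

# Hypothetical dispatch: small definitions keep the current inline
# storage; large ones are promoted to object storage.
def store_definition(raw_yaml)
  if raw_yaml.bytesize < INLINE_LIMIT
    {
      "inline" => true,
      "definition" => YAML.safe_load(raw_yaml),
      "yaml_definition" => raw_yaml
    }
  else
    {
      "inline" => false,
      "yaml_definition_url" => upload_to_object_storage(raw_yaml),
      "size" => raw_yaml.bytesize
    }
  end
end

# Stub standing in for the real upload path (assumed helper).
def upload_to_object_storage(raw_yaml)
  "https://storage.example/definitions/stubbed.yml"
end
```

Consumers would read through an accessor that checks the `"inline"` flag and fetches from object storage only when needed, keeping the promotion transparent.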
## Acceptance Criteria
- [ ] Support YAML definitions larger than 64kb
- [ ] Maintain backward compatibility with existing definitions
- [ ] Preserve original YAML for audit/display purposes
- [ ] No performance regression for small definitions
- [ ] Follow GitLab patterns for large content storage
## Additional Context
- All other JsonSchemaValidator usages in GitLab use exactly the 64kb limit
- This would be the first case requiring larger limits
- AI workflows legitimately need complex, large definitions
- Current duplication pattern is inefficient but serves important purposes (audit trail, display fidelity)