
Draft: Add LRU cache for model.bind_tools to resolve CPU bottleneck

What does this merge request do and why?

This MR implements a production-ready LRU in-memory cache for bind_tools() operations to resolve a performance bottleneck in the Duo Workflow Service.

The Problem

Google Cloud Profiler analysis revealed that bind_tools() operations consume 28.4 ms per request, accounting for 6.03% of CPU time. This operation:

  • Occurs 3-4 times per request (once per agent initialization)
  • Runs synchronously in the asyncio event loop, blocking all other requests
  • Has no caching, causing repeated expensive schema format conversions
  • Limits theoretical maximum throughput to ~12 RPS

During load testing at 15 RPS, the service experienced a 49% failure rate due to this bottleneck.

Root Cause: LangChain's bind_tools() performs expensive schema format conversion (OpenAI ↔️ Anthropic) on every agent initialization, with no built-in caching.
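
For illustration only, the uncached call pattern looks roughly like this (the function and argument names are placeholders, not the actual Duo Workflow Service code):

```python
def build_agent(model, tools):
    # LangChain's bind_tools() converts every tool schema to the provider's
    # format on each call; with 3-4 agent initializations per request, the
    # same conversion is repeated for identical (model, tools) inputs.
    return model.bind_tools(tools)
```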

The Solution

This MR adds a thread-safe LRU cache (sketched after the list below) that:

  • Caches the result of bind_tools() operations by (model_id, tool_signature, tool_choice)
  • Uses order-independent SHA256 hashing for stable cache keys
  • Implements LRU eviction policy with configurable max size (default: 128 entries)
  • Includes Prometheus metrics for monitoring (hits, misses, duration, size, evictions)
  • Provides structured logging for debugging
  • Is fully configurable via environment variables
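
A minimal sketch of the approach, assuming tools are passed as JSON-serialisable schema dicts; the class, method, and counter names are illustrative and not the actual contents of bind_tools_cache.py, which also wires the counters into Prometheus metrics and structured logging:

```python
import hashlib
import json
import threading
from collections import OrderedDict


class BindToolsLRUCache:
    """Thread-safe LRU cache keyed by (model_id, tool signature, tool_choice)."""

    def __init__(self, max_size: int = 128):
        self._max_size = max_size
        self._lock = threading.Lock()
        self._entries = OrderedDict()  # cache key -> bound model
        # Plain counters here; the real implementation exports Prometheus metrics.
        self.hits = self.misses = self.evictions = 0

    @staticmethod
    def _key(model_id: str, tool_schemas: list[dict], tool_choice) -> str:
        # Order-independent key: canonicalise each schema, sort, then hash, so
        # the same tool set yields the same SHA256 digest regardless of order.
        canonical = sorted(json.dumps(s, sort_keys=True) for s in tool_schemas)
        payload = json.dumps([model_id, canonical, tool_choice], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_bind(self, model, model_id: str, tool_schemas: list[dict], tool_choice=None):
        key = self._key(model_id, tool_schemas, tool_choice)
        with self._lock:
            if key in self._entries:
                self.hits += 1
                self._entries.move_to_end(key)   # mark as most recently used
                return self._entries[key]
            self.misses += 1
        # Expensive schema conversion happens outside the lock.
        bound = model.bind_tools(tool_schemas, tool_choice=tool_choice)
        with self._lock:
            self._entries[key] = bound
            if len(self._entries) > self._max_size:
                self._entries.popitem(last=False)  # evict least recently used entry
                self.evictions += 1
        return bound
```

Performing the bind_tools() call outside the lock keeps concurrent requests from serialising on a cache miss, at the cost of occasionally doing the same conversion twice before the first result lands in the cache.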

Implementation Details

Files Created:

  • ai_gateway/prompts/bind_tools_cache.py - Core LRU cache implementation

Files Modified:

  • ai_gateway/prompts/base.py - Integration into the Prompt class (see the integration sketch below)
  • example.env - Configuration options (3 new env vars)
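
Roughly how the integration and configuration could fit together; the environment variable names and Prompt attributes below are assumptions for illustration, and the real keys are the three added to example.env:

```python
import os

# Class name taken from the sketch above; the real module may expose a different name.
from ai_gateway.prompts.bind_tools_cache import BindToolsLRUCache

# Hypothetical env var names -- check example.env for the actual keys.
CACHE_ENABLED = os.environ.get("BIND_TOOLS_CACHE_ENABLED", "true").lower() == "true"
CACHE_MAX_SIZE = int(os.environ.get("BIND_TOOLS_CACHE_MAX_SIZE", "128"))

# Single process-wide cache instance shared by all Prompt objects.
_bind_tools_cache = BindToolsLRUCache(max_size=CACHE_MAX_SIZE)


class Prompt:
    # Only the cache-related path is shown; the real Prompt class in base.py
    # has many more responsibilities.
    def __init__(self, model, model_id: str, tool_schemas: list[dict], tool_choice=None):
        self.model = model
        self.model_id = model_id
        self.tool_schemas = tool_schemas
        self.tool_choice = tool_choice

    def bound_model(self):
        if not CACHE_ENABLED:
            return self.model.bind_tools(self.tool_schemas, tool_choice=self.tool_choice)
        return _bind_tools_cache.get_or_bind(
            self.model, self.model_id, self.tool_schemas, self.tool_choice
        )
```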

Related Issues

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

Merge request checklist

  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.
  • If this change requires executor implementation: verified that issues/MRs exist for both Go executor and Node executor or confirmed that changes are backward-compatible and don't break existing executor functionality.