Implement LRU Cache for model.bind_tools to Resolve CPU Bottleneck

Context

Parent issue: #578158

We've identified that bind_tools() operations consume 28.4 ms and 6.03% of CPU time per call, causing performance degradation at scale. The root cause is that LangChain repeats the same schema format conversion, uncached, on every agent invocation.

When we call model.bind_tools(), LangChain needs to convert our tool schemas into the format that the LLM provider expects. For Anthropic (which we're using), it converts from OpenAI format to Anthropic format. During this conversion, LangChain calls _create_subset_model() which uses pydantic's create_model() to create new model classes on the fly.

The flame graph shows:

model.bind_tools()
  └─ _convert_to_anthropic_tool()
      └─ _create_subset_model()  ← This is where the time goes
          └─ pydantic.create_model()

A few factors make this particularly bad:

  1. No caching: LangChain doesn't cache the converted schemas, so we're doing the same work over and over
  2. Synchronous operation: This happens in the asyncio event loop, blocking all other requests
  3. Large tool sets: We're binding ~35 tools per agent (static registry + MCP tools)

Goal

Implement a thread-safe LRU cache to store bound model instances and eliminate redundant bind_tools operations.

Scope

  • Implement BindToolsCache class with LRU eviction policy
  • Add tool signature computation (order-independent SHA256 hash)
  • Integrate cache into Prompt.__init__ in ai_gateway/prompts/base.py
  • Add Prometheus metrics (hits, misses, duration, size, evictions)

Technical approach

Cache key design:

cache_key = (
    model_id,        # provider model identifier
    tool_signature,  # SHA256 hash of sorted tool names + schemas
    tool_choice,     # "auto" | "required" | None
)
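The order-independent tool signature could be computed as in the sketch below. The function name and the assumption that each tool is a JSON-serializable dict with a "name" key are illustrative, not the actual ai_gateway types:

```python
import hashlib
import json


def tool_signature(tools: list[dict]) -> str:
    """Order-independent SHA256 over tool names + schemas.

    Assumes each tool is a JSON-serializable schema dict
    (hypothetical shape for illustration).
    """
    # Serialize each tool deterministically (sorted keys), then sort the
    # serialized blobs so [tool_a, tool_b] and [tool_b, tool_a] hash
    # to the same value.
    blobs = sorted(
        json.dumps(t, sort_keys=True, separators=(",", ":")) for t in tools
    )
    return hashlib.sha256("\n".join(blobs).encode("utf-8")).hexdigest()
```

Sorting the serialized blobs (rather than the raw dicts) is what makes the signature robust to MCP tools arriving in different orders.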

Implementation details:

  • Use OrderedDict for O(1) LRU operations
  • Use threading.RLock for thread safety
  • Tool signature is order-independent (handles MCP tools added in different orders)
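Putting the details above together, a minimal sketch of the BindToolsCache could look like this. Method names and the inline hit/miss/eviction counters are illustrative; the real implementation would report these through the Prometheus metrics listed in the scope:

```python
import threading
from collections import OrderedDict
from typing import Any, Hashable, Optional


class BindToolsCache:
    """LRU cache for bound model instances (illustrative sketch)."""

    def __init__(self, max_size: int = 128):
        self._max_size = max_size
        self._lock = threading.RLock()  # reentrant, safe across threads
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0
        self.evictions = 0

    def get(self, key: Hashable) -> Optional[Any]:
        with self._lock:
            if key in self._data:
                self._data.move_to_end(key)  # O(1): mark most recently used
                self.hits += 1
                return self._data[key]
            self.misses += 1
            return None

    def put(self, key: Hashable, value: Any) -> None:
        with self._lock:
            if key in self._data:
                self._data.move_to_end(key)
            self._data[key] = value
            if len(self._data) > self._max_size:
                self._data.popitem(last=False)  # O(1): evict LRU entry
                self.evictions += 1
```

OrderedDict's `move_to_end` and `popitem(last=False)` give O(1) LRU bookkeeping without a hand-rolled linked list, and the RLock keeps get/put safe when multiple request threads share the cache.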

Configuration:

  • BIND_TOOLS_CACHE_ENABLED - Feature flag (default: true)
  • BIND_TOOLS_CACHE_MAX_SIZE - Max entries (default: 128)
  • BIND_TOOLS_CACHE_LOG_HITS - Debug logging (default: false)

Expected impact

  • 95%+ latency reduction on cache hits (28.4ms → <1ms)
  • 95%+ cache hit rate expected
  • 65% reduction in CPU time per request
  • ~3x improvement in theoretical max throughput