Implement LRU Cache for model.bind_tools to Resolve CPU Bottleneck

Context

Parent issue: #578158

We've identified that bind_tools() operations consume 28.4 ms and 6.03% of CPU time per call, causing performance degradation at scale. The root cause is that LangChain repeats the same schema format conversion, uncached, on every agent invocation.

When we call model.bind_tools(), LangChain needs to convert our tool schemas into the format that the LLM provider expects. For Anthropic (which we're using), it converts from OpenAI format to Anthropic format. During this conversion, LangChain calls _create_subset_model() which uses pydantic's create_model() to create new model classes on the fly.

The flame graph shows:

model.bind_tools()
  └─ _convert_to_anthropic_tool()
      └─ _create_subset_model()  ← This is where the time goes
          └─ pydantic.create_model()

A few factors make this particularly bad:

  1. No caching: LangChain doesn't cache the converted schemas, so we're doing the same work over and over
  2. Synchronous operation: This happens in the asyncio event loop, blocking all other requests
  3. Large tool sets: We're binding ~35 tools per agent (static registry + MCP tools)

Goal

Implement a thread-safe LRU cache to store bound model instances and eliminate redundant bind_tools operations.

Scope

  • Implement BindToolsCache class with LRU eviction policy
  • Add tool signature computation (order-independent SHA256 hash)
  • Integrate cache into Prompt.__init__ in ai_gateway/prompts/base.py
  • Add Prometheus metrics (hits, misses, duration, size, evictions)

Technical approach

Cache key design:

cache_key = (
    model_id,        # provider model identifier
    tool_signature,  # SHA256 hash of sorted tool names + schemas
    tool_choice,     # "auto" | "required" | None
)
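The order-independent tool signature could be computed as in the sketch below. The function name and the assumption that each tool is a JSON-serializable dict with a "name" key are illustrative, not the actual ai_gateway types:

```python
import hashlib
import json


def tool_signature(tools: list[dict]) -> str:
    """Order-independent SHA256 over tool names + schemas.

    Assumes each tool is a JSON-serializable schema dict
    (hypothetical shape for illustration).
    """
    # Serialize each tool deterministically (sorted keys), then sort the
    # serialized blobs so [tool_a, tool_b] and [tool_b, tool_a] hash
    # to the same value.
    blobs = sorted(
        json.dumps(t, sort_keys=True, separators=(",", ":")) for t in tools
    )
    return hashlib.sha256("\n".join(blobs).encode("utf-8")).hexdigest()
```

Sorting the serialized blobs (rather than the raw dicts) is what makes the signature robust to MCP tools arriving in different orders.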

Implementation details:

  • Use OrderedDict for O(1) LRU operations
  • Use threading.RLock for thread safety
  • Tool signature is order-independent (handles MCP tools added in different orders)
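Putting the details above together, a minimal sketch of the BindToolsCache could look like this. Method names and the inline hit/miss/eviction counters are illustrative; the real implementation would report these through the Prometheus metrics listed in the scope:

```python
import threading
from collections import OrderedDict
from typing import Any, Hashable, Optional


class BindToolsCache:
    """LRU cache for bound model instances (illustrative sketch)."""

    def __init__(self, max_size: int = 128):
        self._max_size = max_size
        self._lock = threading.RLock()  # reentrant, safe across threads
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0
        self.evictions = 0

    def get(self, key: Hashable) -> Optional[Any]:
        with self._lock:
            if key in self._data:
                self._data.move_to_end(key)  # O(1): mark most recently used
                self.hits += 1
                return self._data[key]
            self.misses += 1
            return None

    def put(self, key: Hashable, value: Any) -> None:
        with self._lock:
            if key in self._data:
                self._data.move_to_end(key)
            self._data[key] = value
            if len(self._data) > self._max_size:
                self._data.popitem(last=False)  # O(1): evict LRU entry
                self.evictions += 1
```

OrderedDict's `move_to_end` and `popitem(last=False)` give O(1) LRU bookkeeping without a hand-rolled linked list, and the RLock keeps get/put safe when multiple request threads share the cache.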

Configuration:

  • BIND_TOOLS_CACHE_ENABLED - Feature flag (default: true)
  • BIND_TOOLS_CACHE_MAX_SIZE - Max entries (default: 128)
  • BIND_TOOLS_CACHE_LOG_HITS - Debug logging (default: false)

Expected impact

  • 95%+ latency reduction on cache hits (28.4ms → <1ms)
  • 95%+ cache hit rate expected
  • 65% reduction in CPU time per request
  • ~3x improvement in theoretical max throughput