fix: prompt caching not working on production
This is a blocker for feat: conversation caching on agentic chat (!3722).
## What does this merge request do and why?
This MR fixes prompt caching not working on production. It includes the following changes:

- Fix system prompt caching not working on production. We currently serve Claude on Vertex AI, but that path was not using the prompt caching feature. FWIW, caching still works with Claude on Anthropic, but the majority of requests use Claude on Vertex AI (ref). See the sketch after this list for the underlying mechanism.
- Add `prompt_caching` metadata to the AIGW Prompt Registry. This metadata represents the prompt caching support for each prompt file.
- Add token usage logging. In particular, cache creation and cache read counts are important for measuring caching performance on production. This will later be visualized in a DWS dashboard.
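For context (this is not the MR's diff, just a minimal sketch of the mechanism): Anthropic's prompt caching is opted into by attaching a `cache_control` marker to the stable prefix of the request, typically the system prompt, and Claude on Vertex AI honors the same marker. The snippet below uses the `anthropic` Python SDK; the prompt constant is a placeholder:

```python
# Minimal sketch of Anthropic prompt caching; not this MR's actual code path.
import anthropic

client = anthropic.Anthropic()

# Placeholder: the stable prefix must exceed the model's minimum cacheable
# length (1024 tokens for Sonnet models) to be cached at all.
LONG_SYSTEM_PROMPT = "You are GitLab Duo Chat. Follow the rules below. " * 400

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=64,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks this block as a cache breakpoint; the first call writes
            # the cache, subsequent calls within the TTL read from it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Hi"}],
)

# These usage counters are what the new logging surfaces as
# `cache_creation` and `cache_read`.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```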
Related to https://gitlab.com/gitlab-org/gitlab/-/issues/577544+
## How to set up and validate locally
To test various models, modify `ai_gateway/model_selection/unit_primitives.yml`, e.g.:

```yaml
- feature_setting: "duo_chat"
  default_model: "claude_sonnet_4_20250514"
```
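Per-prompt cache support is declared with the new `prompt_caching` metadata in the AIGW Prompt Registry. The exact schema lives in this MR's diff; the snippet below is a purely hypothetical illustration of the idea, not the real file layout:

```yaml
# Hypothetical prompt-definition excerpt, for illustration only;
# the actual field placement is defined by this MR in the Prompt Registry.
name: duo_chat
prompt_caching: true  # new metadata: this prompt supports prompt caching
```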
Example of cached data with `claude_sonnet_4_20250514_vertex`:
Write (first request; it creates the cache entry, so `cache_creation` > 0):
```json
{
  "event": "LLM call finished with token usage",
  "logger": "prompts",
  "level": "info",
  "correlation_id": "01K9BQ2J221NEHACQ8HENPBGTN",
  "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=",
  "workflow_id": "434",
  "model_engine": "anthropic-chat",
  "model_name": "claude-sonnet-4-20250514",
  "model_provider": "anthropic",
  "input_tokens": 23555,
  "output_tokens": 14,
  "total_tokens": 23569,
  "cache_read": 0,
  "cache_creation": 21911,
  "ephemeral_5m_input_tokens": 21911,
  "ephemeral_1h_input_tokens": 0,
  "timestamp": "2025-11-06T04:33:57.333660Z"
}
```
Read (follow-up request; it hits the cache, so `cache_read` > 0):
```json
{
  "event": "LLM call finished with token usage",
  "logger": "prompts",
  "level": "info",
  "correlation_id": "01K9BQ3FCB4J6S80J7R35CCW6Z",
  "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=",
  "workflow_id": "434",
  "model_engine": "anthropic-chat",
  "model_name": "claude-sonnet-4-20250514",
  "model_provider": "anthropic",
  "input_tokens": 23575,
  "output_tokens": 14,
  "total_tokens": 23589,
  "cache_read": 21911,
  "cache_creation": 0,
  "ephemeral_5m_input_tokens": 0,
  "ephemeral_1h_input_tokens": 0,
  "timestamp": "2025-11-06T04:34:29.106398Z"
}
```
Example of cached data with `claude_sonnet_4_20250514`:
Write:
```json
{
  "event": "LLM call finished with token usage",
  "logger": "prompts",
  "level": "info",
  "correlation_id": "01K9BQ6YGH4QN56TBD2A3NR9N1",
  "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=",
  "workflow_id": "434",
  "model_engine": "litellm-chat",
  "model_name": "claude-sonnet-4@20250514",
  "model_provider": "litellm",
  "input_tokens": 23613,
  "output_tokens": 13,
  "total_tokens": 23626,
  "cache_read": 0,
  "cache_creation": 21911,
  "ephemeral_5m_input_tokens": 0,
  "ephemeral_1h_input_tokens": 0,
  "timestamp": "2025-11-06T04:36:21.521772Z"
}
```
Read:
```json
{
  "event": "LLM call finished with token usage",
  "logger": "prompts",
  "level": "info",
  "correlation_id": "01K9BQ4WRP18RQS3TWXBRBK26X",
  "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=",
  "workflow_id": "434",
  "model_engine": "litellm-chat",
  "model_name": "claude-sonnet-4@20250514",
  "model_provider": "litellm",
  "input_tokens": 23595,
  "output_tokens": 12,
  "total_tokens": 23607,
  "cache_read": 21911,
  "cache_creation": 0,
  "ephemeral_5m_input_tokens": 0,
  "ephemeral_1h_input_tokens": 0,
  "timestamp": "2025-11-06T04:35:14.373850Z"
}
```
Example of cached data with the other selectable models:
"claude_sonnet_3_7_20250219"
{ "event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9BREJMGFA6TYC1BJFTF2G3A", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "434", "model_engine": "anthropic-chat", "model_name": "claude-3-7-sonnet-20250219", "model_provider": "anthropic", "input_tokens": 23632, "output_tokens": 15, "total_tokens": 23647, "cache_read": 0, "cache_creation": 21910, "ephemeral_5m_input_tokens": 21910, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-06T04:58:00.568634Z" }
{ "event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9BRFM7NK7TRTTX5PJ3Y2MJE", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "434", "model_engine": "anthropic-chat", "model_name": "claude-3-7-sonnet-20250219", "model_provider": "anthropic", "input_tokens": 23653, "output_tokens": 12, "total_tokens": 23665, "cache_read": 21910, "cache_creation": 0, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-06T04:58:32.820750Z" }
"claude_sonnet_4_5_20250929"
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9BRP10XNRT9CQKZ289X47T0", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "434", "model_engine": "anthropic-chat", "model_name": "claude-sonnet-4-5-20250929", "model_provider": "anthropic", "input_tokens": 23254, "output_tokens": 13, "total_tokens": 23267, "cache_read": 0, "cache_creation": 21124, "ephemeral_5m_input_tokens": 21124, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-06T05:02:05.709496Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9BRQ3GDR81ER9RK22FMP9PK", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "434", "model_engine": "anthropic-chat", "model_name": "claude-sonnet-4-5-20250929", "model_provider": "anthropic", "input_tokens": 23273, "output_tokens": 13, "total_tokens": 23286, "cache_read": 21124, "cache_creation": 0, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-06T05:02:40.846773Z"}
"claude_haiku_4_5_20251001"
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9BRR8W7BN52G9479TYVR6XS", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "434", "model_engine": "anthropic-chat", "model_name": "claude-haiku-4-5-20251001", "model_provider": "anthropic", "input_tokens": 23292, "output_tokens": 14, "total_tokens": 23306, "cache_read": 0, "cache_creation": 21124, "ephemeral_5m_input_tokens": 21124, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-06T05:03:18.914186Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9BRS8Z0J8EJ5AWNRZYS60MB", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "434", "model_engine": "anthropic-chat", "model_name": "claude-haiku-4-5-20251001", "model_provider": "anthropic", "input_tokens": 23312, "output_tokens": 12, "total_tokens": 23324, "cache_read": 21124, "cache_creation": 0, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-06T05:03:49.106805Z"}
## Merge request checklist
- [ ] Tests added for new functionality. If not, please raise an issue to follow up.
- [ ] Documentation added/updated, if needed.
- [ ] If this change requires executor implementation: verified that issues/MRs exist for both Go executor and Node executor, or confirmed that changes are backward-compatible and don't break existing executor functionality.