fix: prompt caching not working on production
This is a blocker for feat: conversation caching on agentic chat (!3722).
## What does this merge request do and why?
This MR fixes prompt caching not working on production. It includes the following changes:

- Fix system prompt caching not working on production. We currently serve Claude on Vertex AI, but that path was not using the prompt caching feature. FWIW, caching still works with Claude on Anthropic, but the majority of requests use Claude on Vertex AI (ref). See the sketch after this list for the underlying mechanism.
- Add `prompt_caching` metadata to the AIGW Prompt Registry. This metadata represents the prompt caching support for each prompt file.
- Add token usage logging. In particular, cache creation and cache read counts are important for measuring caching performance on production. This will later be visualized in a DWS dashboard.
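For context (this is not the MR's diff, just a minimal sketch of the mechanism): Anthropic's prompt caching is opted into by attaching a `cache_control` marker to the stable prefix of the request, typically the system prompt, and Claude on Vertex AI honors the same marker. The snippet below uses the `anthropic` Python SDK; the prompt constant is a placeholder:

```python
# Minimal sketch of Anthropic prompt caching; not this MR's actual code path.
import anthropic

client = anthropic.Anthropic()

# Placeholder: the stable prefix must exceed the model's minimum cacheable
# length (1024 tokens for Sonnet models) to be cached at all.
LONG_SYSTEM_PROMPT = "You are GitLab Duo Chat. Follow the rules below. " * 400

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=64,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks this block as a cache breakpoint; the first call writes
            # the cache, subsequent calls within the TTL read from it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Hi"}],
)

# These usage counters are what the new logging surfaces as
# `cache_creation` and `cache_read`.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```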
Related to https://gitlab.com/gitlab-org/gitlab/-/issues/577544+
## How to set up and validate locally
To test various models, modify `ai_gateway/model_selection/unit_primitives.yml`, e.g.:

```yaml
- feature_setting: "duo_chat"
  default_model: "claude_sonnet_4_20250514"
```
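Per-prompt cache support is declared with the new `prompt_caching` metadata in the AIGW Prompt Registry. The exact schema lives in this MR's diff; the snippet below is a purely hypothetical illustration of the idea, not the real file layout:

```yaml
# Hypothetical prompt-definition excerpt, for illustration only;
# the actual field placement is defined by this MR in the Prompt Registry.
name: duo_chat
prompt_caching: true  # new metadata: this prompt supports prompt caching
```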
Example of cached data with `claude_sonnet_4_20250514_vertex`:
Write (first request; it creates the cache entry, so `cache_creation` > 0):
```json
{
  "event": "LLM call finished with token usage",
  "logger": "prompts",
  "level": "info",
  "correlation_id": "01K9BQ2J221NEHACQ8HENPBGTN",
  "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=",
  "workflow_id": "434",
  "model_engine": "anthropic-chat",
  "model_name": "claude-sonnet-4-20250514",
  "model_provider": "anthropic",
  "input_tokens": 23555,
  "output_tokens": 14,
  "total_tokens": 23569,
  "cache_read": 0,
  "cache_creation": 21911,
  "ephemeral_5m_input_tokens": 21911,
  "ephemeral_1h_input_tokens": 0,
  "timestamp": "2025-11-06T04:33:57.333660Z"
}
```
Read (follow-up request; it hits the cache, so `cache_read` > 0):
```json
{
  "event": "LLM call finished with token usage",
  "logger": "prompts",
  "level": "info",
  "correlation_id": "01K9BQ3FCB4J6S80J7R35CCW6Z",
  "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=",
  "workflow_id": "434",
  "model_engine": "anthropic-chat",
  "model_name": "claude-sonnet-4-20250514",
  "model_provider": "anthropic",
  "input_tokens": 23575,
  "output_tokens": 14,
  "total_tokens": 23589,
  "cache_read": 21911,
  "cache_creation": 0,
  "ephemeral_5m_input_tokens": 0,
  "ephemeral_1h_input_tokens": 0,
  "timestamp": "2025-11-06T04:34:29.106398Z"
}
```
Example of cached data with `claude_sonnet_4_20250514`:
Write:
```json
{
  "event": "LLM call finished with token usage",
  "logger": "prompts",
  "level": "info",
  "correlation_id": "01K9BQ6YGH4QN56TBD2A3NR9N1",
  "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=",
  "workflow_id": "434",
  "model_engine": "litellm-chat",
  "model_name": "claude-sonnet-4@20250514",
  "model_provider": "litellm",
  "input_tokens": 23613,
  "output_tokens": 13,
  "total_tokens": 23626,
  "cache_read": 0,
  "cache_creation": 21911,
  "ephemeral_5m_input_tokens": 0,
  "ephemeral_1h_input_tokens": 0,
  "timestamp": "2025-11-06T04:36:21.521772Z"
}
```
Read:
```json
{
  "event": "LLM call finished with token usage",
  "logger": "prompts",
  "level": "info",
  "correlation_id": "01K9BQ4WRP18RQS3TWXBRBK26X",
  "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=",
  "workflow_id": "434",
  "model_engine": "litellm-chat",
  "model_name": "claude-sonnet-4@20250514",
  "model_provider": "litellm",
  "input_tokens": 23595,
  "output_tokens": 12,
  "total_tokens": 23607,
  "cache_read": 21911,
  "cache_creation": 0,
  "ephemeral_5m_input_tokens": 0,
  "ephemeral_1h_input_tokens": 0,
  "timestamp": "2025-11-06T04:35:14.373850Z"
}
```
Example of cached data with the other selectable models:
"claude_sonnet_3_7_20250219"
{ "event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9BREJMGFA6TYC1BJFTF2G3A", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "434", "model_engine": "anthropic-chat", "model_name": "claude-3-7-sonnet-20250219", "model_provider": "anthropic", "input_tokens": 23632, "output_tokens": 15, "total_tokens": 23647, "cache_read": 0, "cache_creation": 21910, "ephemeral_5m_input_tokens": 21910, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-06T04:58:00.568634Z" }
{ "event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9BRFM7NK7TRTTX5PJ3Y2MJE", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "434", "model_engine": "anthropic-chat", "model_name": "claude-3-7-sonnet-20250219", "model_provider": "anthropic", "input_tokens": 23653, "output_tokens": 12, "total_tokens": 23665, "cache_read": 21910, "cache_creation": 0, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-06T04:58:32.820750Z" }
"claude_sonnet_4_5_20250929"
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9BRP10XNRT9CQKZ289X47T0", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "434", "model_engine": "anthropic-chat", "model_name": "claude-sonnet-4-5-20250929", "model_provider": "anthropic", "input_tokens": 23254, "output_tokens": 13, "total_tokens": 23267, "cache_read": 0, "cache_creation": 21124, "ephemeral_5m_input_tokens": 21124, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-06T05:02:05.709496Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9BRQ3GDR81ER9RK22FMP9PK", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "434", "model_engine": "anthropic-chat", "model_name": "claude-sonnet-4-5-20250929", "model_provider": "anthropic", "input_tokens": 23273, "output_tokens": 13, "total_tokens": 23286, "cache_read": 21124, "cache_creation": 0, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-06T05:02:40.846773Z"}
"claude_haiku_4_5_20251001"
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9BRR8W7BN52G9479TYVR6XS", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "434", "model_engine": "anthropic-chat", "model_name": "claude-haiku-4-5-20251001", "model_provider": "anthropic", "input_tokens": 23292, "output_tokens": 14, "total_tokens": 23306, "cache_read": 0, "cache_creation": 21124, "ephemeral_5m_input_tokens": 21124, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-06T05:03:18.914186Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9BRS8Z0J8EJ5AWNRZYS60MB", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "434", "model_engine": "anthropic-chat", "model_name": "claude-haiku-4-5-20251001", "model_provider": "anthropic", "input_tokens": 23312, "output_tokens": 12, "total_tokens": 23324, "cache_read": 21124, "cache_creation": 0, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-06T05:03:49.106805Z"}
## Merge request checklist
- [ ] Tests added for new functionality. If not, please raise an issue to follow up.
- [ ] Documentation added/updated, if needed.
- [ ] If this change requires executor implementation: verified that issues/MRs exist for both Go executor and Node executor, or confirmed that changes are backward-compatible and don't break existing executor functionality.