feat: conversation caching on agentic chat
## What does this merge request do and why?
This MR addresses https://gitlab.com/gitlab-org/gitlab/-/issues/577544+:
- Implement a `require_prompt_caching_enabled_in_request` custom field for the `cache_control_injection_points` option to indicate that the cache point requires `X-Gitlab-Model-Prompt-Cache-Enabled` to be enabled on the request (see the sketch after this list).
- Use `require_prompt_caching_enabled_in_request` in the Agentic Chat related AIGW prompts.
- DWS gRPC receives the `X-Gitlab-Model-Prompt-Cache-Enabled` header to determine whether full prompt caching is enabled. This flag represents the user/group preference that can be set on the GitLab instance.
- DWS gRPC receives the `ai_gateway_allow_conversation_caching` feature flag to derisk the gitlab.com deployment. This flag can disable full prompt caching regardless of the `X-Gitlab-Model-Prompt-Cache-Enabled` header.
- Bump the `litellm` version to `1.79.1`. This is required to specify a negative index in the cache control injection, which was introduced by https://github.com/BerriAI/litellm/commit/212a339954e9b1402532a5cf515b5827af41934d.
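The intent of the new field, as a minimal sketch (the function name, argument names, and the exact shape of the option below are illustrative assumptions, not the actual AIGW code): injection points marked with `require_prompt_caching_enabled_in_request` are kept only when the `X-Gitlab-Model-Prompt-Cache-Enabled` header is set and the `ai_gateway_allow_conversation_caching` flag allows it, and the custom field is stripped before the list is handed to litellm.

```python
# Minimal sketch of the gating idea (assumed names, not the actual AIGW code).
# Injection points carrying the custom `require_prompt_caching_enabled_in_request`
# field are dropped unless conversation caching is allowed for this request;
# the custom field is removed before the list is passed to litellm.
from typing import Any


def filter_injection_points(
    points: list[dict[str, Any]],
    prompt_cache_enabled_header: bool,   # X-Gitlab-Model-Prompt-Cache-Enabled
    allow_conversation_caching: bool,    # ai_gateway_allow_conversation_caching flag
) -> list[dict[str, Any]]:
    caching_allowed = prompt_cache_enabled_header and allow_conversation_caching
    filtered = []
    for point in points:
        point = dict(point)
        requires_header = point.pop("require_prompt_caching_enabled_in_request", False)
        if requires_header and not caching_allowed:
            continue  # skip conversation-level cache points
        filtered.append(point)
    return filtered


# Example: the static prompt is always cached; the last message is only cached
# when the request opted in (litellm >= 1.79.1 accepts the negative index).
points = [
    {"location": "message", "role": "system"},
    {"location": "message", "index": -1,
     "require_prompt_caching_enabled_in_request": True},
]
print(filter_injection_points(points, prompt_cache_enabled_header=False,
                              allow_conversation_caching=True))
# -> only the system-prompt injection point remains
```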
Related to https://gitlab.com/gitlab-org/gitlab/-/work_items/577549+
## How to set up and validate locally
- Check out the GitLab-Rails counterpart: Pass user preference about caching to the Workf... (gitlab-org/gitlab!210655 - merged).
- Enable the `ai_gateway_allow_conversation_caching` feature flag on GitLab-Rails (e.g. `::Feature.enable(:ai_gateway_allow_conversation_caching)` in the Rails console).
- Make sure that https://docs.gitlab.com/user/project/repository/code_suggestions/#prompt-caching is enabled.
- Ask a question in Agentic Chat.
Example of a conversation caching result:
1st turn - creating a cache up until the last user message:
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VB7DX59X8VF5TXBNJHFVBV", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "litellm", "model_name": "claude-sonnet-4-5@20250929", "model_provider": "litellm", "input_tokens": 22336, "output_tokens": 13, "total_tokens": 22349, "cache_read": 0, "cache_creation": 22333, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:14:45.710728Z"}
2nd turn - creating a cache between the last user message and the previous cache point:
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VB87VP1FPZE6H0CSBS2DFK", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "litellm", "model_name": "claude-sonnet-4-5@20250929", "model_provider": "litellm", "input_tokens": 22355, "output_tokens": 15, "total_tokens": 22370, "cache_read": 22333, "cache_creation": 19, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:15:10.880147Z"}
3rd turn - creating a cache between the last user message and the previous cache point:
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VB9EJ9KCGB7M7HFANNVG9M", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "litellm", "model_name": "claude-sonnet-4-5@20250929", "model_provider": "litellm", "input_tokens": 22376, "output_tokens": 12, "total_tokens": 22388, "cache_read": 22352, "cache_creation": 21, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:15:50.672705Z"}
Also, notice that `cache_read` at turn N is the sum of `cache_read` and `cache_creation` at turn N-1. This shows that the cache is built up cumulatively.
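This cumulative relationship can be checked mechanically; a small self-contained check using the `cache_read`/`cache_creation` values copied from the three log lines above:

```python
# Token usage (cache_read, cache_creation) copied from the three turns above.
turns = [
    {"cache_read": 0,     "cache_creation": 22333},  # 1st turn
    {"cache_read": 22333, "cache_creation": 19},     # 2nd turn
    {"cache_read": 22352, "cache_creation": 21},     # 3rd turn
]

# cache_read at turn N equals cache_read + cache_creation at turn N-1,
# i.e. everything cached so far is read back and only the delta is written.
for prev, curr in zip(turns, turns[1:]):
    assert curr["cache_read"] == prev["cache_read"] + prev["cache_creation"]
print("cache is built up cumulatively across turns")
```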
More examples of cumulative prompt caching with various models:
claude_sonnet_4_20250514
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VADDNT98TZZZFB0F4VN419", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-sonnet-4-20250514", "model_provider": "anthropic", "input_tokens": 22603, "output_tokens": 22, "total_tokens": 22625, "cache_read": 0, "cache_creation": 22600, "ephemeral_5m_input_tokens": 22600, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:00:33.958935Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAE4P8ADF4BDHRXBVVR5ZR", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-sonnet-4-20250514", "model_provider": "anthropic", "input_tokens": 22631, "output_tokens": 14, "total_tokens": 22645, "cache_read": 22600, "cache_creation": 28, "ephemeral_5m_input_tokens": 28, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:00:56.146740Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAEK3BZW2TAP34R702CJ8P", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-sonnet-4-20250514", "model_provider": "anthropic", "input_tokens": 22651, "output_tokens": 11, "total_tokens": 22662, "cache_read": 22628, "cache_creation": 20, "ephemeral_5m_input_tokens": 20, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:01:10.640688Z"}
claude_haiku_4_5_20251001
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAGM8DZZ2E6C0FMGJD5YDR", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-haiku-4-5-20251001", "model_provider": "anthropic", "input_tokens": 22065, "output_tokens": 23, "total_tokens": 22088, "cache_read": 0, "cache_creation": 22062, "ephemeral_5m_input_tokens": 22062, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:02:17.144523Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAH0ASXD70F121CH7TPK5A", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-haiku-4-5-20251001", "model_provider": "anthropic", "input_tokens": 22094, "output_tokens": 12, "total_tokens": 22106, "cache_read": 22062, "cache_creation": 29, "ephemeral_5m_input_tokens": 29, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:02:28.863718Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAHR6TN0QTRNHKX6WZ3K8C", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-haiku-4-5-20251001", "model_provider": "anthropic", "input_tokens": 22112, "output_tokens": 19, "total_tokens": 22131, "cache_read": 22091, "cache_creation": 18, "ephemeral_5m_input_tokens": 18, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:02:53.198519Z"}
claude_sonnet_4_5_20250929
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAKH72BWFEY6PXVCC3XZSP", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-sonnet-4-5-20250929", "model_provider": "anthropic", "input_tokens": 22137, "output_tokens": 16, "total_tokens": 22153, "cache_read": 0, "cache_creation": 22134, "ephemeral_5m_input_tokens": 22134, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:03:55.205436Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAMADYCHQX13ACCPVVM2EQ", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-sonnet-4-5-20250929", "model_provider": "anthropic", "input_tokens": 22159, "output_tokens": 12, "total_tokens": 22171, "cache_read": 22134, "cache_creation": 22, "ephemeral_5m_input_tokens": 22, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:04:19.046870Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAN09APDAMH1EJS1P8WKGW", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-sonnet-4-5-20250929", "model_provider": "anthropic", "input_tokens": 22177, "output_tokens": 12, "total_tokens": 22189, "cache_read": 22156, "cache_creation": 18, "ephemeral_5m_input_tokens": 18, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:04:41.391139Z"}
claude_sonnet_3_7_20250219
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAQEX377GZRN65WFJD0X2S", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-3-7-sonnet-20250219", "model_provider": "anthropic", "input_tokens": 22798, "output_tokens": 23, "total_tokens": 22821, "cache_read": 0, "cache_creation": 22795, "ephemeral_5m_input_tokens": 22795, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:06:02.959445Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAR2FVQHSYJDEEQDQRC4XP", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-3-7-sonnet-20250219", "model_provider": "anthropic", "input_tokens": 22827, "output_tokens": 13, "total_tokens": 22840, "cache_read": 22795, "cache_creation": 29, "ephemeral_5m_input_tokens": 29, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:06:20.466464Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAS87Z44PYFZDTSGJ1YDTV", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-3-7-sonnet-20250219", "model_provider": "anthropic", "input_tokens": 22846, "output_tokens": 23, "total_tokens": 22869, "cache_read": 22824, "cache_creation": 19, "ephemeral_5m_input_tokens": 19, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:06:59.329241Z"}
claude_sonnet_4_5_20250929_vertex
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VB7DX59X8VF5TXBNJHFVBV", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "litellm", "model_name": "claude-sonnet-4-5@20250929", "model_provider": "litellm", "input_tokens": 22336, "output_tokens": 13, "total_tokens": 22349, "cache_read": 0, "cache_creation": 22333, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:14:45.710728Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VB87VP1FPZE6H0CSBS2DFK", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "litellm", "model_name": "claude-sonnet-4-5@20250929", "model_provider": "litellm", "input_tokens": 22355, "output_tokens": 15, "total_tokens": 22370, "cache_read": 22333, "cache_creation": 19, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:15:10.880147Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VB9EJ9KCGB7M7HFANNVG9M", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "litellm", "model_name": "claude-sonnet-4-5@20250929", "model_provider": "litellm", "input_tokens": 22376, "output_tokens": 12, "total_tokens": 22388, "cache_read": 22352, "cache_creation": 21, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:15:50.672705Z"}
Example showing that cumulative prompt caching is disabled when the feature flag is disabled:
[1] pry(main)> ::Feature.disable(:ai_gateway_allow_conversation_caching)
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VBNDJ92JAXVZ83XET2CX83", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "litellm", "model_name": "claude-sonnet-4@20250514", "model_provider": "litellm", "input_tokens": 22997, "output_tokens": 18, "total_tokens": 23015, "cache_read": 0, "cache_creation": 22111, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:22:24.218731Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VBP1F7TYE7DQHW1HCZ16TY", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "litellm", "model_name": "claude-sonnet-4@20250514", "model_provider": "litellm", "input_tokens": 23021, "output_tokens": 13, "total_tokens": 23034, "cache_read": 22111, "cache_creation": 0, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:22:43.361192Z"}
Notice that `cache_creation` at the second turn is zero, i.e. the conversation is not cached cumulatively.
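For contrast, applying the same kind of check to the two flag-disabled log lines above shows the cache staying flat after the first turn: only the cache created on the first turn (presumably the part of the prompt that does not require the header) is read back, and nothing new is written.

```python
# (cache_read, cache_creation) copied from the two flag-disabled turns above.
turns = [
    {"cache_read": 0,     "cache_creation": 22111},  # 1st turn
    {"cache_read": 22111, "cache_creation": 0},      # 2nd turn
]

# With ai_gateway_allow_conversation_caching disabled, the second turn reuses
# the first turn's cache but writes nothing new: no cumulative build-up.
assert turns[1]["cache_read"] == turns[0]["cache_creation"]
assert turns[1]["cache_creation"] == 0
```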
## Merge request checklist
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.
- If this change requires executor implementation: verified that issues/MRs exist for both Go executor and Node executor, or confirmed that changes are backward-compatible and don't break existing executor functionality.