
feat: conversation caching on agentic chat

What does this merge request do and why?

This MR addresses https://gitlab.com/gitlab-org/gitlab/-/issues/577544+:

  • Add a require_prompt_caching_enabled_in_request custom field to the cache_control_injection_points option to indicate that the cache point requires X-Gitlab-Model-Prompt-Cache-Enabled to be enabled on the request.
  • Use require_prompt_caching_enabled_in_request in the Agentic Chat-related AIGW prompts.
  • The DWS gRPC service receives the X-Gitlab-Model-Prompt-Cache-Enabled header to determine whether full prompt caching is enabled. This header represents the user/group preference that can be set on the GitLab instance.
  • The DWS gRPC service receives the ai_gateway_allow_conversation_caching feature flag to de-risk the gitlab.com deployment. This flag can disable full prompt caching regardless of the X-Gitlab-Model-Prompt-Cache-Enabled header (see the sketch after this list).
  • Bump the litellm version to 1.79.1. This is required to specify a negative index in the cache control injection points, which was introduced by https://github.com/BerriAI/litellm/commit/212a339954e9b1402532a5cf515b5827af41934d.
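
Conceptually, the gating works as in the following minimal Python sketch. The function and variable names are hypothetical (including the header value format), the injection-point shape is intended to mirror litellm's cache_control_injection_points option, and this is not the actual AIGW/DWS code.

# Minimal sketch of the gating described above. Names are hypothetical; only the
# header, feature flag, and custom field names come from this MR.

def filter_cache_control_injection_points(
    injection_points: list[dict],
    headers: dict[str, str],
    feature_flags: set[str],
) -> list[dict]:
    # User/group preference forwarded by GitLab-Rails as a request header.
    header_enabled = headers.get("X-Gitlab-Model-Prompt-Cache-Enabled", "").lower() == "true"
    # Kill switch to de-risk the gitlab.com deployment.
    flag_enabled = "ai_gateway_allow_conversation_caching" in feature_flags

    kept = []
    for point in injection_points:
        # Strip the custom field so only litellm-native keys are forwarded, and
        # drop the point unless both the header and the feature flag allow it.
        if point.pop("require_prompt_caching_enabled_in_request", False):
            if not (header_enabled and flag_enabled):
                continue
        kept.append(point)
    return kept

# Example: a cache point at the last user message (index=-1 requires litellm >= 1.79.1).
points = filter_cache_control_injection_points(
    [{"location": "message", "index": -1, "require_prompt_caching_enabled_in_request": True}],
    headers={"X-Gitlab-Model-Prompt-Cache-Enabled": "true"},
    feature_flags={"ai_gateway_allow_conversation_caching"},
)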

Related to https://gitlab.com/gitlab-org/gitlab/-/work_items/577549+

How to set up and validate locally

  1. Check out the GitLab-Rails counterpart Pass user preference about caching to the Workf... (gitlab-org/gitlab!210655 - merged).
  2. Enable the ai_gateway_allow_conversation_caching feature flag on GitLab-Rails.
  3. Make sure that prompt caching (https://docs.gitlab.com/user/project/repository/code_suggestions/#prompt-caching) is enabled.
  4. Ask a question in Agentic Chat and check the AI Gateway logs for the "LLM call finished with token usage" event (a log-filtering helper is sketched after this list).
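
To check the result, you can filter the AI Gateway's JSON log output for the token-usage event and watch the cache fields. A small helper sketch; how you capture the log stream locally depends on your setup:

import json
import sys

# Pipe the AI Gateway's JSON log output into this script; it prints the
# cache-related token counts for every LLM call. The event and field names
# match the log samples below.
for line in sys.stdin:
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        continue
    if record.get("event") != "LLM call finished with token usage":
        continue
    print(
        f"{record['model_name']}: "
        f"cache_read={record['cache_read']} cache_creation={record['cache_creation']}"
    )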

Example of a conversation caching result:

1st turn - creating a cache up until the last user message:

{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VB7DX59X8VF5TXBNJHFVBV", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "litellm", "model_name": "claude-sonnet-4-5@20250929", "model_provider": "litellm", "input_tokens": 22336, "output_tokens": 13, "total_tokens": 22349, "cache_read": 0, "cache_creation": 22333, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:14:45.710728Z"}

2nd turn - creating a cache between the last user message and the previous cache point:

{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VB87VP1FPZE6H0CSBS2DFK", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "litellm", "model_name": "claude-sonnet-4-5@20250929", "model_provider": "litellm", "input_tokens": 22355, "output_tokens": 15, "total_tokens": 22370, "cache_read": 22333, "cache_creation": 19, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:15:10.880147Z"}

3rd turn - creating a cache between the last user message and the previous cache point:

{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VB9EJ9KCGB7M7HFANNVG9M", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "litellm", "model_name": "claude-sonnet-4-5@20250929", "model_provider": "litellm", "input_tokens": 22376, "output_tokens": 12, "total_tokens": 22388, "cache_read": 22352, "cache_creation": 21, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:15:50.672705Z"}

Also, notice that cache_read at turn N is the sum of cache_read and cache_creation at turn N-1. This shows that the cache is built up cumulatively.
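
As a quick sanity check, that invariant can be verified directly against the values copied from the three log lines above:

# cache_read / cache_creation values copied from the three log lines above.
turns = [
    {"cache_read": 0,     "cache_creation": 22333},  # 1st turn
    {"cache_read": 22333, "cache_creation": 19},     # 2nd turn
    {"cache_read": 22352, "cache_creation": 21},     # 3rd turn
]

# cache_read at turn N equals cache_read + cache_creation at turn N-1:
# the cache grows cumulatively across the conversation.
for prev, curr in zip(turns, turns[1:]):
    assert curr["cache_read"] == prev["cache_read"] + prev["cache_creation"]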

More examples of cumulative prompt caching with various models:

claude_sonnet_4_20250514

{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VADDNT98TZZZFB0F4VN419", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-sonnet-4-20250514", "model_provider": "anthropic", "input_tokens": 22603, "output_tokens": 22, "total_tokens": 22625, "cache_read": 0, "cache_creation": 22600, "ephemeral_5m_input_tokens": 22600, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:00:33.958935Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAE4P8ADF4BDHRXBVVR5ZR", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-sonnet-4-20250514", "model_provider": "anthropic", "input_tokens": 22631, "output_tokens": 14, "total_tokens": 22645, "cache_read": 22600, "cache_creation": 28, "ephemeral_5m_input_tokens": 28, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:00:56.146740Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAEK3BZW2TAP34R702CJ8P", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-sonnet-4-20250514", "model_provider": "anthropic", "input_tokens": 22651, "output_tokens": 11, "total_tokens": 22662, "cache_read": 22628, "cache_creation": 20, "ephemeral_5m_input_tokens": 20, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:01:10.640688Z"}

claude_haiku_4_5_20251001

{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAGM8DZZ2E6C0FMGJD5YDR", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-haiku-4-5-20251001", "model_provider": "anthropic", "input_tokens": 22065, "output_tokens": 23, "total_tokens": 22088, "cache_read": 0, "cache_creation": 22062, "ephemeral_5m_input_tokens": 22062, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:02:17.144523Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAH0ASXD70F121CH7TPK5A", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-haiku-4-5-20251001", "model_provider": "anthropic", "input_tokens": 22094, "output_tokens": 12, "total_tokens": 22106, "cache_read": 22062, "cache_creation": 29, "ephemeral_5m_input_tokens": 29, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:02:28.863718Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAHR6TN0QTRNHKX6WZ3K8C", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-haiku-4-5-20251001", "model_provider": "anthropic", "input_tokens": 22112, "output_tokens": 19, "total_tokens": 22131, "cache_read": 22091, "cache_creation": 18, "ephemeral_5m_input_tokens": 18, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:02:53.198519Z"}

claude_sonnet_4_5_20250929

{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAKH72BWFEY6PXVCC3XZSP", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-sonnet-4-5-20250929", "model_provider": "anthropic", "input_tokens": 22137, "output_tokens": 16, "total_tokens": 22153, "cache_read": 0, "cache_creation": 22134, "ephemeral_5m_input_tokens": 22134, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:03:55.205436Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAMADYCHQX13ACCPVVM2EQ", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-sonnet-4-5-20250929", "model_provider": "anthropic", "input_tokens": 22159, "output_tokens": 12, "total_tokens": 22171, "cache_read": 22134, "cache_creation": 22, "ephemeral_5m_input_tokens": 22, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:04:19.046870Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAN09APDAMH1EJS1P8WKGW", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-sonnet-4-5-20250929", "model_provider": "anthropic", "input_tokens": 22177, "output_tokens": 12, "total_tokens": 22189, "cache_read": 22156, "cache_creation": 18, "ephemeral_5m_input_tokens": 18, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:04:41.391139Z"}

claude_sonnet_3_7_20250219

{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAQEX377GZRN65WFJD0X2S", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-3-7-sonnet-20250219", "model_provider": "anthropic", "input_tokens": 22798, "output_tokens": 23, "total_tokens": 22821, "cache_read": 0, "cache_creation": 22795, "ephemeral_5m_input_tokens": 22795, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:06:02.959445Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAR2FVQHSYJDEEQDQRC4XP", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-3-7-sonnet-20250219", "model_provider": "anthropic", "input_tokens": 22827, "output_tokens": 13, "total_tokens": 22840, "cache_read": 22795, "cache_creation": 29, "ephemeral_5m_input_tokens": 29, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:06:20.466464Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VAS87Z44PYFZDTSGJ1YDTV", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "anthropic", "model_name": "claude-3-7-sonnet-20250219", "model_provider": "anthropic", "input_tokens": 22846, "output_tokens": 23, "total_tokens": 22869, "cache_read": 22824, "cache_creation": 19, "ephemeral_5m_input_tokens": 19, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:06:59.329241Z"}

claude_sonnet_4_5_20250929_vertex

{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VB7DX59X8VF5TXBNJHFVBV", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "litellm", "model_name": "claude-sonnet-4-5@20250929", "model_provider": "litellm", "input_tokens": 22336, "output_tokens": 13, "total_tokens": 22349, "cache_read": 0, "cache_creation": 22333, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:14:45.710728Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VB87VP1FPZE6H0CSBS2DFK", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "litellm", "model_name": "claude-sonnet-4-5@20250929", "model_provider": "litellm", "input_tokens": 22355, "output_tokens": 15, "total_tokens": 22370, "cache_read": 22333, "cache_creation": 19, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:15:10.880147Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VB9EJ9KCGB7M7HFANNVG9M", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "litellm", "model_name": "claude-sonnet-4-5@20250929", "model_provider": "litellm", "input_tokens": 22376, "output_tokens": 12, "total_tokens": 22388, "cache_read": 22352, "cache_creation": 21, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:15:50.672705Z"}

Example of cumulative prompt caching being disabled when the feature flag is disabled:

[1] pry(main)> ::Feature.disable(:ai_gateway_allow_conversation_caching)
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VBNDJ92JAXVZ83XET2CX83", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "litellm", "model_name": "claude-sonnet-4@20250514", "model_provider": "litellm", "input_tokens": 22997, "output_tokens": 18, "total_tokens": 23015, "cache_read": 0, "cache_creation": 22111, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:22:24.218731Z"}
{"event": "LLM call finished with token usage", "logger": "prompts", "level": "info", "correlation_id": "01K9VBP1F7TYE7DQHW1HCZ16TY", "gitlab_global_user_id": "MMwwCtYAHYpnJyB07tz+J1HZ5cYKDB28u0ir4JRm+Gc=", "workflow_id": "447", "model_engine": "litellm", "model_name": "claude-sonnet-4@20250514", "model_provider": "litellm", "input_tokens": 23021, "output_tokens": 13, "total_tokens": 23034, "cache_read": 22111, "cache_creation": 0, "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0, "timestamp": "2025-11-12T06:22:43.361192Z"}

Notice that the cache_creation at the second turn is zero.

Merge request checklist

  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.
  • If this change requires executor implementation: verified that issues/MRs exist for both Go executor and Node executor or confirmed that changes are backward-compatible and don't break existing executor functionality.
