Consider pros/cons of decreasing max_output_tokens for code generation


Response latency depends on output length: duration-by-output

Note that this may differ for streaming (the chart above shows data for non-streamed requests only; from local testing it seems we don't log these durations for streamed responses).
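To fill the logging gap, we could time streamed responses client-side. A minimal sketch, assuming a hypothetical `chunks` iterable standing in for the real streaming client (the function name and return shape are illustrative, not an existing API):

```python
import time

def measure_stream(chunks):
    """Consume a streamed response, recording time-to-first-token and total duration.

    `chunks` is any iterable of response text chunks (a hypothetical stand-in
    for the real streaming client's iterator).
    """
    start = time.monotonic()
    first_token_at = None
    parts = []
    for chunk in chunks:
        if first_token_at is None:
            # Latency until the first chunk arrives.
            first_token_at = time.monotonic() - start
        parts.append(chunk)
    total = time.monotonic() - start
    return {"ttft": first_token_at, "duration": total, "output": "".join(parts)}
```

Logging both time-to-first-token and total duration would let us see whether `max_tokens_to_sample` affects either metric for streamed requests.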

We could:

  • check whether the max_tokens_to_sample value (currently set to 2048) also impacts duration for streamed responses
  • if it does, consider using a smaller value, or using a smaller value conditionally on context (e.g. if the user is generating something inside a method, its content will probably be shorter than generating a whole class or module)
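The conditional approach in the second bullet could be sketched as a simple scope-to-cap lookup. All names and values below (other than the 2048 default, which mirrors the current setting) are hypothetical assumptions for illustration:

```python
# Hypothetical per-scope token caps; only the 2048 default reflects the
# current max_tokens_to_sample setting mentioned above.
DEFAULT_MAX_TOKENS = 2048

SCOPE_MAX_TOKENS = {
    "method_body": 256,            # completing inside an existing method
    "function": 512,               # generating a single function
    "class": 1024,                 # generating a whole class
    "module": DEFAULT_MAX_TOKENS,  # whole-file / module generation
}

def max_tokens_for(scope: str) -> int:
    """Pick a token cap based on what the user appears to be generating.

    Falls back to the current default when the scope is unknown, so behavior
    never becomes more restrictive than today.
    """
    return SCOPE_MAX_TOKENS.get(scope, DEFAULT_MAX_TOKENS)
```

Falling back to the existing default for unrecognized scopes keeps the change safe: at worst we behave exactly as we do now.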