Consider pros/cons of decreasing max_output_tokens for code generation
Response latency is dependent on output length:

Note that this may differ for streaming (the chart above shows data for non-streamed requests only; from local testing it seems we don't log these durations for streamed responses).
We could:
- Check whether the `max_tokens_to_sample` value (currently set to 2048) also impacts duration for streamed responses.
- If it does, consider using a smaller value, or using a smaller value conditionally on context (e.g. if the user wants to generate something inside a method, its content will probably be shorter than generating a whole class/module).
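The conditional approach in the last bullet could be sketched roughly as follows. This is a hypothetical illustration, not GitLab's actual implementation: the scope names, helper function, and the smaller budget of 512 tokens are all assumptions; only the 2048 default comes from the issue.

```python
# Hypothetical sketch: pick a smaller max_tokens_to_sample when the
# completion scope is narrow. All names and the 512 budget are
# illustrative assumptions; 2048 is the current value from the issue.

DEFAULT_MAX_TOKENS = 2048      # current value mentioned above
METHOD_BODY_MAX_TOKENS = 512   # assumed smaller budget for in-method completions


def max_tokens_for_scope(scope: str) -> int:
    """Return a token budget based on what the user is generating.

    A method body is usually much shorter than a whole class or
    module, so it can get a tighter limit to cut response latency.
    """
    if scope == "method_body":
        return METHOD_BODY_MAX_TOKENS
    return DEFAULT_MAX_TOKENS


# Example: a whole-class completion keeps the full budget,
# an in-method completion gets the reduced one.
print(max_tokens_for_scope("class"))        # full budget
print(max_tokens_for_scope("method_body"))  # reduced budget
```

The scope signal itself would have to come from the editor context (e.g. cursor position inside a method), which this sketch leaves out.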