Consider pros/cons of decreasing max_output_tokens for code generation


Response latency depends on output length: duration-by-output

Note that this may differ for streaming (the chart above shows data for non-streamed requests only; from local testing it seems we don't log these durations for streamed responses).
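To fill the logging gap, we could time streamed responses client-side. A minimal sketch, assuming a hypothetical `chunks` iterable standing in for the real streaming client (the function name and return shape are illustrative, not an existing API):

```python
import time

def measure_stream(chunks):
    """Consume a streamed response, recording time-to-first-token and total duration.

    `chunks` is any iterable of response text chunks (a hypothetical stand-in
    for the real streaming client's iterator).
    """
    start = time.monotonic()
    first_token_at = None
    parts = []
    for chunk in chunks:
        if first_token_at is None:
            # Latency until the first chunk arrives.
            first_token_at = time.monotonic() - start
        parts.append(chunk)
    total = time.monotonic() - start
    return {"ttft": first_token_at, "duration": total, "output": "".join(parts)}
```

Logging both time-to-first-token and total duration would let us see whether `max_tokens_to_sample` affects either metric for streamed requests.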

We could:

  • check whether the max_tokens_to_sample value (currently set to 2048) also impacts duration for streamed responses
  • if it does, consider using a smaller value, or using a smaller value conditionally on context (e.g. if the user is generating something inside a method, its content will probably be shorter than generating a whole class or module)
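The conditional approach in the second bullet could be sketched as a simple scope-to-cap lookup. All names and values below (other than the 2048 default, which mirrors the current setting) are hypothetical assumptions for illustration:

```python
# Hypothetical per-scope token caps; only the 2048 default reflects the
# current max_tokens_to_sample setting mentioned above.
DEFAULT_MAX_TOKENS = 2048

SCOPE_MAX_TOKENS = {
    "method_body": 256,            # completing inside an existing method
    "function": 512,               # generating a single function
    "class": 1024,                 # generating a whole class
    "module": DEFAULT_MAX_TOKENS,  # whole-file / module generation
}

def max_tokens_for(scope: str) -> int:
    """Pick a token cap based on what the user appears to be generating.

    Falls back to the current default when the scope is unknown, so behavior
    never becomes more restrictive than today.
    """
    return SCOPE_MAX_TOKENS.get(scope, DEFAULT_MAX_TOKENS)
```

Falling back to the existing default for unrecognized scopes keeps the change safe: at worst we behave exactly as we do now.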