[Prompt engineering] Improve Code Suggestions quality using more accurate token lengths
Currently, code-gecko accepts a maximum of 2048 tokens. The AI Gateway simply takes at most 2048 characters before the cursor and truncates everything before that. If you look at the data, we almost always send the full 2048 characters.
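For illustration, the current behaviour amounts to roughly this (a minimal sketch with hypothetical names, not the actual AI Gateway code):

```python
MAX_PREFIX_CHARS = 2048

def truncate_prefix(prefix: str, limit: int = MAX_PREFIX_CHARS) -> str:
    # Keep only the last `limit` characters before the cursor;
    # anything earlier is dropped from the prompt.
    return prefix[-limit:]
```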
Hypothesis: We can actually send more than 2048 characters, and we'll get better results if we include more data. Currently the GitLab VSCode extension sends all lines before the cursor and 10 lines after the cursor.
The current rule of thumb is that 4 bytes equal 1 token, and most of the time code uses 1 byte per character. The naive approach is to send at most 2048 tokens * 4 bytes/token = 8192 bytes, but then we run the risk of getting an error from the model. We could simply retry with a smaller length if we run into that.
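A minimal sketch of that retry strategy, assuming a hypothetical `complete` callable for the model request and a hypothetical `TokenLimitError` raised when the prompt is too long:

```python
class TokenLimitError(Exception):
    """Hypothetical error raised by the model when the prompt is too long."""

MAX_TOKENS = 2048
BYTES_PER_TOKEN = 4  # rule of thumb; most code characters are 1 byte

def complete_with_retry(complete, prefix: str, suffix: str) -> str:
    # Start from the optimistic 8192-character budget and halve it on failure,
    # but never drop below today's 2048-character behaviour.
    budget = MAX_TOKENS * BYTES_PER_TOKEN
    while budget >= MAX_TOKENS:
        try:
            return complete(prefix[-budget:], suffix)
        except TokenLimitError:
            budget //= 2
    raise RuntimeError("prompt did not fit even at the smallest budget")
```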
A more intelligent approach would be to use language-specific tokenizers. For example, since JavaScript and TypeScript are the languages where code completion is used most, we could use the Python slimit package to count tokens.
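A minimal sketch of counting tokens with slimit, assuming its `Lexer` API (`input()` / `token()`) as shown in the package's documentation. Note that this counts JavaScript lexer tokens, which only approximates the model's own tokenizer:

```python
from slimit.lexer import Lexer

def count_js_tokens(source: str) -> int:
    # Lex the JavaScript source and count the resulting tokens.
    lexer = Lexer()
    lexer.input(source)
    count = 0
    while lexer.token() is not None:
        count += 1
    return count

print(count_js_tokens("var add = function (a, b) { return a + b; };"))
```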