In VS Code, use the existing Code Suggestions UI to show multiple suggestions and allow users to cycle through the suggestions. This will only apply to code completion.
Under Experiment/BETA in VS Code:
Start with code completion (currently using code-gecko) and use candidateCount as proposed.
Add telemetry to measure the effects of the feature, usage, and latency.
Open Questions (see )
What model modifications would we need to do to handle a collection of code suggestions, ensuring that it can support storing and managing multiple suggestions efficiently?
Should we limit the number of suggestions users can cycle through?
How would this impact our latency metrics?
How would this affect how we do caching?
Can we implement telemetry to track which suggestion is accepted (initially shown vs. second or third)?
What model modifications would we need to do to handle a collection of code suggestions, ensuring that it can support storing and managing multiple suggestions efficiently?
For POC, the extension can ask multiple times; however, for production use, we'll need to get the collection of code suggestions from AI Gateway (groupai framework).
Modifying the AI Gateway for code completion to pass candidateCount to code-gecko is far more efficient than calling multiple times; it returned multiple completions with only an extra 10-20% total latency in my local tests.
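For reference, a rough sketch of the request body this implies (parameter names come from the public Vertex AI code-gecko predict API; the exact AI Gateway wiring and the values used here are assumptions):

```typescript
// Hypothetical sketch of the code-gecko predict request body the AI Gateway
// would send; only `candidateCount` is new compared to today's single-candidate call.
const predictRequest = {
  instances: [
    {
      prefix: "def sum_numbers(a, b):\n    ", // content above the cursor
      suffix: "",                             // content below the cursor
    },
  ],
  parameters: {
    candidateCount: 4,       // ask Vertex AI for up to 4 completions in one call
    maxOutputTokens: 64,     // current completion limit mentioned in this thread
    temperature: 0.2,        // illustrative value, not the production setting
    stopSequences: ["\n\n"],
  },
};

// POST to .../publishers/google/models/code-gecko:predict with this body;
// the response contains one prediction per candidate.
console.log(JSON.stringify(predictRequest, null, 2));
```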
Code generation is a different story, which use case are we targeting with this POC?
We'll start with code completion. It's the highest volume and where users interact with the UI. Let me add that to the issue description for clarity.
Looking at the current logic (which may be getting into the weeds on implementation), would we have to adjust temperature, top_p, or top_k to get more results? Is there a way for us to test how many, on average, we can get back with the current thresholds?
@dashaadu I agree with @acook.gitlab that if we want to show multiple options, then we should just use candidateCount to request multiple completions from the LLM.
Note that returning multiple options/variations is a model-specific feature and not all LLMs support it - from a brief check, only code-gecko and code-bison support it, but not Anthropic's Claude models, which we use for code generation. This means that multiple options would be available for code completions only.
From the discussion on this issue, I understand that the intention was to just send multiple code suggestion requests to the AI provider in parallel (both for code completions and code generations) from the frontend, is that correct? Aside from increased pricing and resource use, a major concern would be how you would ensure that you get different variations for the same request. Although the temperature param can be used to tweak the randomness of the response to some extent, I would expect that unless a suggestion request is very open (e.g. generating longer code), the response will be the same if you run the requests in parallel. Also, a downside of increasing the temperature is that it would have a negative impact on response accuracy.
Thought: instead of always preemptively requesting multiple variations, WDYT about just adding a "reload (try again)" button which would trigger another completion request? So the user wouldn't get multiple options in advance, but only if explicitly requested.
Modifying the AI Gateway for code completion to pass candidateCount to code-gecko is far more efficient than calling multiple times; it returned multiple completions with only an extra 10-20% total latency in my local tests.
@acook.gitlab thanks for chiming in, we are on the same page. When I mentioned the extension asking multiple times, I only meant in the POC to get an idea about UX, without the need for everyone who wants to play with it to have the AI Gateway, GDK, and the extension running locally.
From the discussion on this issue, I understand that the intention was to just send multiple code suggestion requests to the AI provider in parallel (both for code completions and code generations) from the frontend, is that correct?
@jprovaznik only for POC, not for production use. For production, the AI Gateway would provide all suggestions in one response.
@jprovaznik From the product side, code completion would be where I'd like for us to start. I don't have a strong stance on the implementation, but the solution provided by Allen seems to be the option with the least friction.
Thought: instead of always preemptively requesting multiple variations, WDYT about just adding a "reload (try again)" button which would trigger another completion request? So the user wouldn't get multiple options in advance, but only if explicitly requested.
Gotcha, so instead of saying "give me X suggestions" every time, we pull a new suggestion when a user requests one?
From the product side, code generation would be where I'd like for us to start.
@dashaadu I think you meant "code completion", right? If so, I agree it would be the best place to start to test this.
It might also be worth measuring the additional latency penalty (caused by requesting multiple completions) - in &12224 we're investing quite a lot of effort in making code completion faster. If this change made each request slower by e.g. 50-100ms or more, it could be a problem.
Gotcha, so instead of saying "give me X suggestions" every time, we pull a new suggestion when a user requests one?
Yes, exactly. I think this would be a better option for "code generations" - I can imagine that on "reload" we could include the previous response(s) in the prompt, which would ensure that the new response is different. This would also be compatible with streaming.
On the other hand, this option is not so suitable for "code completions", because we don't build a specific prompt for them (code-gecko expects just the content above/below the cursor), so I'm not sure we could use it there.
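To illustrate the "reload" idea for code generations, here is a minimal sketch of folding previous responses into the next prompt (the prompt wording and the buildRetryPrompt helper are illustrative, not our actual prompt template):

```typescript
// Illustrative only: how a "reload" request could tell the model to avoid
// repeating earlier generations. The real prompt template would differ.
function buildRetryPrompt(instruction: string, previousResponses: string[]): string {
  const avoid = previousResponses
    .map((response, i) => `Previous suggestion ${i + 1}:\n${response}`)
    .join("\n\n");
  return [
    instruction,
    "Generate a different implementation than the previous suggestions below.",
    avoid,
  ].join("\n\n");
}

// Example: a "reload" after one earlier generation was shown.
const prompt = buildRetryPrompt(
  "Write a function that parses an ISO 8601 date string.",
  ["function parseIso(date: string) { return new Date(date); }"],
);
console.log(prompt);
```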
@jprovaznik Yes, I meant code completion. I amended my comment above.
Does this summary cover it?
First iteration
Under Experiment/BETA in VS Code:
Start with code completion (currently using code-gecko) and use candidateCount as proposed by @acook.gitlab.
Add telemetry to measure the effects of the feature, usage, and latency.
I will coordinate with FinOps to monitor the cost of adding this functionality. If we get positive quantitative and qualitative feedback, this functionality will mature to GA.
Second Iteration
Under Experiment/BETA in VS Code:
Expand to code generation (via Claude 3) and pull a new generation on user request.
Add telemetry to measure usage and effects on latency.
We'll monitor the cost of the change in collaboration with FinOps.
@viktomas That makes sense; doing this now with our current implementation doesn't make sense since we're actively working to skip the monolith. We have two issues in that main epic that we also need to assist with - what about shuffling things around so that we help complete those first:
When you have an opportunity, please update the VS Code Table section of Milestone %17.1 to include these two issues, in support of the Bypass Monolith work.
This is the number 1 priority in our newest priority list (from last night), so we want to look at accelerating delivery.
Latency impact
@jprovaznik I just tested it again with my local setup; they parallelize the creation of suggestions. So with the maximum number of 4 candidates there was no measurable latency penalty across 10 tries (some tries were even faster than with only 1 candidate) for 5 different code completion setups. So that's not a factor. @acook.gitlab also has a test setup to measure latency more broadly, I believe, if we wanted to measure it there again. Official API docs here - https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/code-completion
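For anyone who wants to reproduce this, a rough sketch of that kind of measurement (requestCompletions is a hypothetical stand-in for the real code-gecko predict call; the timings are simulated):

```typescript
// Hypothetical measurement sketch: compare wall-clock latency of a single-candidate
// request vs. a 4-candidate request over a handful of tries.
async function requestCompletions(_candidateCount: number): Promise<void> {
  // Stand-in for the real Vertex AI code-gecko predict call.
  await new Promise((resolve) => setTimeout(resolve, 100 + Math.random() * 50));
}

async function measure(candidateCount: number, tries = 10): Promise<number> {
  const timings: number[] = [];
  for (let i = 0; i < tries; i++) {
    const start = performance.now();
    await requestCompletions(candidateCount);
    timings.push(performance.now() - start);
  }
  return timings.reduce((a, b) => a + b, 0) / timings.length; // mean latency in ms
}

(async () => {
  console.log("candidateCount=1:", await measure(1), "ms");
  console.log("candidateCount=4:", await measure(4), "ms");
})();
```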
@acook.gitlab + @jprovaznik I would still suggest we implement it in both APIs and complete both API endpoints, so we have more flexibility on the IDE side with the implementation (as we might have IDEs that don't implement the code completion bypass straight away but would still want multiple suggestions). Can you get an ETA for both the AI Gateway (or do you need ai-framework help on it) + the monolith extensions? Thanks!
@viktomas Makes sense to go for VS Code with the bypass logic; do you have a suggestion for how to parallelise both things or accelerate? (I am working on bringing more capacity for editor extension work.) The only worry I have is that, due to the dependencies and the lack of a clear rollout plan, we might be blocked for longer than expected and need to clarify the priority of it (we will also try to accelerate delivery there). Could you give a very rough estimate of how much work it would be to do both implementations?
I just tested it again with my local setup; they parallelize the creation of suggestions. So with the maximum number of 4 candidates there was no measurable latency penalty across 10 tries (some tries were even faster than with only 1 candidate) for 5 different code completion setups. So that's not a factor.
@timzallmann cool, this is good news; given the small output token limit, I think it makes sense.
I will get the Rails side up today in an MR; I believe @jfypk has the AI Gateway side according to the issue.
@jprovaznik that's a really good catch. I tested this locally and we don't need any Rails changes at all for this feature. I'll explain and close the Rails issue.
@oregand someone from groupai framework might be the best to answer this. My (noob) assumption is that the cost will scale linearly with getting more suggestions for each request. (To give you context, @dashaadu is thinking about offering X suggestions for each request - see the description.)
I would suggest we start by limiting the number of suggestions users can cycle through; if each request offers N suggestions, this will come with both a cost and a performance impact. I would suggest we scope down to start with the lowest number we're comfortable with and build up from there.
We can always work with FinOps to trend the difference in cost as needed to determine our sweet spot, though ultimately it would be latency I would worry about more than cost.
@oregand When you say you would worry about latency, are you worried that Vertex AI wouldn't be able to generate suggestions in parallel? I assumed the impact on latency would only be caused by waiting for the slowest out of X suggestions.
My thinking more or less aligns with yours perfectly, i.e. the overall response time for a set of suggestions will be constrained by the slowest response among them, as you mentioned. This can introduce significant latency if one or more tasks lag behind the others, especially under high load or if specific suggestions are computationally more intensive.
But we're taking that into consideration already, so we can keep that under OQ 3.
The TL;DR for cost would be:
The total computation time (and therefore cost) will scale linearly with the number of suggestions, as each suggestion is a separate invocation of the model.
@oregand Just to clarify for others, this means the sum of the computation time running in parallel (e.g. if we generate 3 suggestions in parallel and each suggestion takes one second to generate, we get the response with all three suggestions in one second, but the total computation time is 3 seconds).
Correct! When generating multiple suggestions in parallel, the time observed by the user until all suggestions are available (response time) may be equivalent to the time taken to generate the slowest suggestion. However, the total computational cost is indeed the sum of all individual computations.
The cost is based on input + output: "Generative AI on Vertex AI charges by every 1,000 characters of input (prompt) and every 1,000 characters of output (response)."
As the output is much shorter for code completion (max tokens 64 and a stop sequence on the first new line) than the input, having 4 results should have a very small impact on cost. Overall the guidance is also to optimise for the customer.
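As a back-of-the-envelope illustration only (the unit price is a placeholder and the assumption that the prompt is billed once per request is unverified, not confirmed billing behaviour):

```typescript
// Back-of-the-envelope sketch only. The per-character price and the assumption
// that the prompt (input) is billed once per request are placeholders.
const inputChars = 2000;              // typical prompt: content above/below cursor
const outputCharsPerCandidate = 250;  // ~64 tokens, stops at the first blank line
const pricePer1kChars = 1;            // placeholder unit price

function cost(candidates: number): number {
  const billedChars = inputChars + candidates * outputCharsPerCandidate;
  return (billedChars / 1000) * pricePer1kChars;
}

console.log("1 candidate:", cost(1));  // 2.25 units
console.log("4 candidates:", cost(4)); // 3.0 units (~33% more, not 4x)
```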
However, the way the LS works, we need to wait for all suggestions to come back before offering them to the user (I'm 90% certain, but we can investigate more). That means that if we gather 5 suggestions, 4 take 50ms, and the fifth takes 100ms to fetch, we show the user the result after 100ms (the slowest one).
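A minimal sketch of that behaviour (fetchSuggestion is a hypothetical stand-in for a single suggestion request; the timings are simulated):

```typescript
// The user-visible wait is the slowest request; total compute is the sum of all of them.
async function fetchSuggestion(latencyMs: number): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, latencyMs));
  return `suggestion generated in ${latencyMs}ms`;
}

(async () => {
  const start = performance.now();
  // Four fast requests and one slow one, fetched in parallel.
  const suggestions = await Promise.all([50, 50, 50, 50, 100].map(fetchSuggestion));
  const elapsed = performance.now() - start;
  console.log(suggestions.length, "suggestions after ~", Math.round(elapsed), "ms");
  // Wall-clock time is ~100ms (the slowest request), but total compute is ~300ms.
})();
```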
I would think that GitHub would want to measure that, so tracking for multiple suggestions will be available in VS Code in some form. But it's probably best to leave the investigation to the spike.
Unless we start building this in all systems (AI Gateway, GitLab monolith, LS, VS Code) as the POC, in which case we need to answer all of these questions beforehand (which IMO warrants a small issue on its own).
@dashaadu I think one telemetry spike should be enough.
I'd create:
spike to "hack" support for multiple suggestions in Language Server (LS would make multiple API calls to get multiple suggestions for a single request)
spike to find out what telemetry we can add in Language Server and in VS Code
The first spike has to be finished before starting the second. WDYT?
Wouldn't it be simpler to do the change on the AI Gateway side, as suggested here?
@ohoral I should have provided more context in the comment. A few comments above I wrote:
For POC, the extension can ask multiple times; however, for production use, we'll need to get the collection of code suggestions from AI Gateway (groupai framework).
And by that I meant that groupeditor extensions can do a POC without changes to the APIs by asking the API multiple times. This POC would only serve to help us find out the UX of the feature; it would not be production-ready.
We've been getting much more user feedback around not having the option to see more than one code suggestion at a time. Please see this internal issue for additional context from a customer. Can I please get a weigh-in on the open questions above (started them as ), but please also add any questions I did not capture.
Once we start getting multiple suggestions from the API, it's relatively simple to include them in VS Code (VS Code is ready for it; the LS now sends a list with one item, and we can start sending a list with multiple items without any changes to VS Code).
TL;DR: we should probably make an additional request when the user wants to cycle through suggestions
Implementing the UI is a bit more complex than I thought.
My original thought - We'll be sending 4 suggestion options every time we now send one
Reality - That's not how the Language Server Protocol is designed
LSP intended use
The LS is supposed to return a single option as the user types
Only when the user hovers over the inline suggestion (or presses Option+]) does VS Code ask the LS "Give me more suggestions", and the LS is supposed to answer with many (e.g. 10) suggestions
From the docs:
automatic trigger - Completion was triggered automatically while editing. It is sufficient to return a single completion item in this case.
invoked trigger - Completion was triggered explicitly by a user gesture. Return multiple completion items to enable cycling through them.
Docs
```typescript
/**
 * Describes how an {@link InlineCompletionItemProvider inline completion
 * provider} was triggered.
 *
 * @since 3.18.0
 */
export namespace InlineCompletionTriggerKind {
  /**
   * Completion was triggered explicitly by a user gesture.
   * Return multiple completion items to enable cycling through them.
   */
  export const Invoked: 1 = 1;

  /**
   * Completion was triggered automatically while editing.
   * It is sufficient to return a single completion item in this case.
   */
  export const Automatic: 2 = 2;
}
```
Embrace
That means accepting that we'll make an additional request when the user expresses that they want multiple suggestions.
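A minimal sketch of how the LS handler could branch on the trigger kind (the types and fetch helpers below are simplified stand-ins, not the current Language Server code):

```typescript
// Local stand-ins mirroring the LSP types quoted above; the real Language Server
// would use the types from its LSP library instead.
const InlineCompletionTriggerKind = { Invoked: 1, Automatic: 2 } as const;

interface InlineCompletionItem {
  insertText: string;
}

interface InlineCompletionParams {
  context: { triggerKind: 1 | 2 };
}

// Hypothetical fetch helpers; the real LS goes through its own suggestion client.
async function fetchOneSuggestion(_params: InlineCompletionParams): Promise<InlineCompletionItem> {
  return { insertText: "single suggestion" };
}
async function fetchManySuggestions(_params: InlineCompletionParams): Promise<InlineCompletionItem[]> {
  return Array.from({ length: 10 }, (_, i) => ({ insertText: `option ${i + 1}` }));
}

async function handleInlineCompletion(
  params: InlineCompletionParams,
): Promise<InlineCompletionItem[]> {
  if (params.context.triggerKind === InlineCompletionTriggerKind.Automatic) {
    // Automatic trigger (typing): a single option is sufficient per the LSP docs.
    return [await fetchOneSuggestion(params)];
  }
  // Invoked trigger: the user explicitly asked to cycle, so make the extra
  // request for multiple options (this is the added wait noted as the drawback below).
  return fetchManySuggestions(params);
}
```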
Benefits
little to no extra cost because we'll only generate multiple suggestion options when the user wants them
we can generate more suggestions for the user (instead of the planned 4)
(implementation-wise): easy to track that users made the extra request
Drawback
the user has to wait extra time before getting those options
Example user interaction
user typed text
LS fetches and returns a single suggestion
The user indicates they want to cycle through (either hover over the suggestion or pressing Option+])
LS fetches and returns a suggestion with many (10+) options (this step takes extra time because of an API call)
Fight
We'll try to work around the LSP to provide multiple suggestion options instantly. This would involve some combination of caching and tracking the user's cursor (see the rough cache sketch after the example interaction below).
Benefits
the moment the user wants to cycle through the options, they'll be instantly ready
Drawbacks
from streaming suggestions, we know that there are UX gotchas when we start working around the LSP
we won't be able to provide as high a number of suggestions as in the "Embrace" version because we would increase the response size for every suggestion request (even when the user doesn't want to cycle through more)
(implementation-wise): the engineering effort will be significantly higher
Example user interaction
user typed text
LS fetches a suggestion with 4 options and returns a single one
The user indicates they want to cycle through (either hover over the suggestion or pressing Option+])
LS looks into an internal cache, finds the suggestion with 4 options and returns those options (this step is instant)
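A very rough sketch of the cache idea (keying on document URI and cursor position, plus a simple staleness check, are assumptions; real cursor tracking and invalidation would be the hard part):

```typescript
// Rough sketch of the "Fight" option's cache: store all options fetched for a
// position so cycling can be served instantly. Keying and invalidation here are
// assumptions, not the actual design.
interface CachedSuggestions {
  options: string[];
  fetchedAt: number;
}

const cache = new Map<string, CachedSuggestions>();

function cacheKey(documentUri: string, line: number, character: number): string {
  return `${documentUri}:${line}:${character}`;
}

function storeOptions(uri: string, line: number, character: number, options: string[]): void {
  cache.set(cacheKey(uri, line, character), { options, fetchedAt: Date.now() });
}

function getCachedOptions(uri: string, line: number, character: number): string[] | undefined {
  const entry = cache.get(cacheKey(uri, line, character));
  // Drop stale entries; the user has likely typed past this position.
  if (!entry || Date.now() - entry.fetchedAt > 30_000) return undefined;
  return entry.options;
}

// Usage: when the automatic request returns 4 options, show the first and cache
// all of them; on the "Invoked" trigger, try the cache before making a new request.
storeOptions("file:///main.ts", 10, 4, ["option A", "option B", "option C", "option D"]);
console.log(getCachedOptions("file:///main.ts", 10, 4));
```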
@acook.gitlab @jfypk Pinging you since you work on the API, but nothing should change for you (especially since @jfypk already plans to set the candidate limit to some high number (10))
@ohoral, this affects the tracking: Knowing if the user cycled through the suggestions will be trivial. I'm not sure we'll find out how many suggestions they saw.
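To make the tracking point concrete, here is a hypothetical event shape for what we'd want to capture (field names are illustrative, not an existing schema):

```typescript
// Hypothetical telemetry event shape; field names are illustrative and do not
// correspond to an existing schema.
interface SuggestionCycleEvent {
  requestId: string;
  triggerKind: "automatic" | "invoked";
  optionsReturned: number;      // how many options the LS returned
  acceptedOptionIndex?: number; // which one was accepted, if any (0 = first shown)
  optionsViewedCount?: number;  // may be unknowable, per the note above
}

const event: SuggestionCycleEvent = {
  requestId: "req-123",
  triggerKind: "invoked",
  optionsReturned: 10,
  acceptedOptionIndex: 2,
  // optionsViewedCount intentionally omitted: VS Code may not expose it.
};

console.log(JSON.stringify(event));
```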
@viktomas As always, thank you for the fantastic write-up!
We should embrace the language server and go with option 1.
Here's how I'm thinking through it on the product side:
It gets us to our baseline feature faster
Simplifies implementation and makes our product more scalable. Also, future-proofing is a plus.
By generating multiple suggestions when the user explicitly asks, we optimize resource usage, but this is a trade-off between performance and the user experience.
Latency is the big trade-off, but bypassing the monolith should help, and we'll have an overall better user experience than what we offer today.
LMK if I missed anything. I'm really excited about this feature!
This feature has been implemented and merged to main; however, it will only be released once telemetry is implemented. You can follow the telemetry issue or the parent epic for an update on when this will be released.