In VS Code, use the existing Code Suggestions UI to show multiple suggestions and allow users to cycle through the suggestions. This will only apply to code completion.
Under Experiment/BETA in VS Code:
Start with code completion (currently using code-gecko) and use candidateCount as proposed.
Add telemetry to measure the effects of the feature, usage, and latency.
Open Questions (see )
What model modifications would we need to do to handle a collection of code suggestions, ensuring that it can support storing and managing multiple suggestions efficiently?
Should we limit the number of suggestions users can cycle through?
How would this impact our latency metrics?
How would this affect how we do caching?
Can we implement telemetry to track which suggestion is accepted (initially shown vs. second or third)?
What model modifications would we need to do to handle a collection of code suggestions, ensuring that it can support storing and managing multiple suggestions efficiently?
For POC, the extension can ask multiple times; however, for production use, we'll need to get the collection of code suggestions from AI Gateway (groupai framework).
Modifying the AI Gateway for code completion to pass candidateCount to code-gecko is far more efficient than calling multiple times; it returned multiple completions with only an extra 10-20% total latency in my local tests.
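For reference, a rough sketch of the request body this implies (parameter names come from the public Vertex AI code-gecko predict API; the exact AI Gateway wiring and the values used here are assumptions):

```typescript
// Hypothetical sketch of the code-gecko predict request body the AI Gateway
// would send; only `candidateCount` is new compared to today's single-candidate call.
const predictRequest = {
  instances: [
    {
      prefix: "def sum_numbers(a, b):\n    ", // content above the cursor
      suffix: "",                             // content below the cursor
    },
  ],
  parameters: {
    candidateCount: 4,       // ask Vertex AI for up to 4 completions in one call
    maxOutputTokens: 64,     // current completion limit mentioned in this thread
    temperature: 0.2,        // illustrative value, not the production setting
    stopSequences: ["\n\n"],
  },
};

// POST to .../publishers/google/models/code-gecko:predict with this body;
// the response contains one prediction per candidate.
console.log(JSON.stringify(predictRequest, null, 2));
```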
Code generation is a different story, which use case are we targeting with this POC?
We'll start with code completion. It's the highest volume and where users interact with the UI. Let me add that to the issue description for clarity.
Looking at the current logic (which may be getting into the weeds on implementation), would we have to adjust temperature, top_p, or top_k to get more results? Is there a way for us to test how many, on average, we can get back with the current thresholds?
@dashaadu I agree with @acook.gitlab that if we want to show multiple options, then we should just use candidateCount to request multiple completions from the LLM.
Note that returning multiple options/variations is a model-specific feature and not all LLMs support it - from a brief check, only code-gecko and code-bison support it, but not Anthropic's Claude models, which we use for code generation. This means that multiple options would be available for code completions only.
From the discussion on this issue, I understand that the intention was to just send multiple code suggestion requests to the AI provider in parallel (both for code completions and code generations) from the frontend, is that correct? Aside from increased pricing and resource use, a major concern would be how you would ensure that you get different variations for the same request. Although the temperature param can be used to tweak the randomness of the response to some extent, I would expect that unless a suggestion request is very open (e.g. generating longer code), the response will be the same if you run the requests in parallel. Also, a downside of increasing the temperature is that it would have a negative impact on response accuracy.
Thought: instead of always preemptively requesting multiple variations, WDYT about just adding a "reload (try again)" button which would trigger another completion request? So the user wouldn't get multiple options in advance, but only if explicitly requested.
Modifying the AI Gateway for code completion to pass candidateCount to code-gecko is far more efficient than calling multiple times; it returned multiple completions with only an extra 10-20% total latency in my local tests.
@acook.gitlab thanks for chiming in, we are on the same page. When I mentioned the extension asking multiple times, I only meant in the POC to get an idea about UX, without the need for everyone who wants to play with it to have the AI Gateway, GDK, and the extension running locally.
From the discussion on this issue, I understand that the intention was to just send multiple code suggestion requests to the AI provider in parallel (both for code completions and code generations) from the frontend, is that correct?
@jprovaznik only for POC, not for production use. For production, the AI Gateway would provide all suggestions in one response.
@jprovaznik From the product side, code completion would be where I'd like for us to start. I don't have a strong stance on the implementation, but the solution provided by Allen seems to be the option with the least friction.
Thought: instead of always preemptively requesting multiple variations, WDYT about just adding a "reload (try again)" button which would trigger another completion request? So the user wouldn't get multiple options in advance, but only if explicitly requested.
Gotcha, so instead of saying "give me X suggestions" every time, we pull a new suggestion when a user requests one?
From the product side, code generation would be where I'd like for us to start.
@dashaadu I think you meant "code completion", right? If so, I agree it would be the best place to start to test this.
It might also be worth measuring the additional latency penalty (caused by requesting multiple completions) - in &12224 we're investing quite a lot of effort in making code completion faster. If this change made each request slower by e.g. 50-100ms or more, it could be a problem.
Gotcha, so instead of saying "give me X suggestions" every time, we pull a new suggestion when a user requests one?
Yes, exactly. I think this would be a better option for "code generations" - I can imagine that on "reload" we could include the previous response(s) in the prompt, which would ensure that the new response is different. This would also be compatible with streaming.
On the other hand, this option is not so suitable for "code completions", because we don't build a specific prompt for them (code-gecko expects just the content above/below the cursor), so I'm not sure we could use it there.
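To illustrate the "reload" idea for code generations, here is a minimal sketch of folding previous responses into the next prompt (the prompt wording and the buildRetryPrompt helper are illustrative, not our actual prompt template):

```typescript
// Illustrative only: how a "reload" request could tell the model to avoid
// repeating earlier generations. The real prompt template would differ.
function buildRetryPrompt(instruction: string, previousResponses: string[]): string {
  const avoid = previousResponses
    .map((response, i) => `Previous suggestion ${i + 1}:\n${response}`)
    .join("\n\n");
  return [
    instruction,
    "Generate a different implementation than the previous suggestions below.",
    avoid,
  ].join("\n\n");
}

// Example: a "reload" after one earlier generation was shown.
const prompt = buildRetryPrompt(
  "Write a function that parses an ISO 8601 date string.",
  ["function parseIso(date: string) { return new Date(date); }"],
);
console.log(prompt);
```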
@jprovaznik Yes, I meant code completion. I amended my comment above.
Does this summary cover it?
First iteration
Under Experiment/BETA in VS Code:
Start with code completion (currently using code-gecko) and use candidateCount as proposed by @acook.gitlab.
Add telemetry to measure the effects of the feature, usage, and latency.
I will coordinate with FinOps to monitor the cost of adding this functionality. If we get positive quantitative and qualitative feedback, this functionality will mature to GA.
Second Iteration
Under Experiment/BETA in VS Code:
Expand to code generation (via Claude 3) and pull a new generation on user request.
Add telemetry to measure usage and effects on latency.
We'll monitor the cost of the change in collaboration with FinOps.
@viktomas That makes sense; doing this now with our current implementation doesn't make sense since we're actively working to skip the monolith. We have two issues in that main epic that we also need to assist with - what about shuffling things around so that we help complete those first:
When you have an opportunity, please update the VS Code Table section of Milestone %17.1 to include these two issues, in support of the Bypass Monolith work.
This is the number 1 priority in our newest priority list (from last night), so we want to look at accelerating delivery.
Latency impact
@jprovaznik I just tested it again with my local setup; they parallelize the creation of suggestions. So with the maximum number of 4 candidates there was no measurable latency penalty across 10 tries (some tries were even faster than with only 1 candidate) for 5 different code completion setups. So that's not a factor. @acook.gitlab also has a test setup to measure latency more broadly, I believe, if we wanted to measure it there again. Official API docs here - https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/code-completion
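For anyone who wants to reproduce this, a rough sketch of that kind of measurement (requestCompletions is a hypothetical stand-in for the real code-gecko predict call; the timings are simulated):

```typescript
// Hypothetical measurement sketch: compare wall-clock latency of a single-candidate
// request vs. a 4-candidate request over a handful of tries.
async function requestCompletions(_candidateCount: number): Promise<void> {
  // Stand-in for the real Vertex AI code-gecko predict call.
  await new Promise((resolve) => setTimeout(resolve, 100 + Math.random() * 50));
}

async function measure(candidateCount: number, tries = 10): Promise<number> {
  const timings: number[] = [];
  for (let i = 0; i < tries; i++) {
    const start = performance.now();
    await requestCompletions(candidateCount);
    timings.push(performance.now() - start);
  }
  return timings.reduce((a, b) => a + b, 0) / timings.length; // mean latency in ms
}

(async () => {
  console.log("candidateCount=1:", await measure(1), "ms");
  console.log("candidateCount=4:", await measure(4), "ms");
})();
```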
@acook.gitlab + @jprovaznik I would still suggest we implement it in both APIs and complete both API endpoints, so we have more flexibility on the IDE side with the implementation (as we might have IDEs that don't implement the code completion bypass straight away but would still want multiple suggestions). Can you get an ETA for both the AI Gateway (or do you need ai-framework help on it) + the monolith extensions? Thanks!
@viktomas Makes sense to go for VS Code with the bypass logic; do you have a suggestion for how to parallelise both things or accelerate? (I am working on bringing more capacity for editor extension work.) The only worry I have is that, due to the dependencies and the lack of a clear rollout plan, we might be blocked for longer than expected and need to clarify the priority of it (we will also try to accelerate delivery there). Could you give a very rough estimate of how much work it would be to do both implementations?
I just tested it again with my local setup; they parallelize the creation of suggestions. So with the maximum number of 4 candidates there was no measurable latency penalty across 10 tries (some tries were even faster than with only 1 candidate) for 5 different code completion setups. So that's not a factor.
@timzallmann cool, this is good news; given the small output token limit, I think it makes sense.
I will get the Rails side up today in an MR; I believe @jfypk has the AI Gateway side according to the issue.
@jprovaznik that's a really good catch. I tested this locally and we don't need any Rails changes at all for this feature. I'll explain and close the Rails issue.
@oregand someone from groupai framework might be the best to answer this. My (noob) assumption is that the cost will scale linearly with getting more suggestions for each request. (To give you context, @dashaadu is thinking about offering X suggestions for each request - see the description.)
I would suggest we start by limiting the number of suggestions users can cycle through; if each request offers N suggestions, this will come with both a cost and a performance impact. I would suggest we scope down to start with the lowest number we're comfortable with and build up from there.
We can always work with FinOps to trend the difference in cost as needed to determine our sweet spot, though ultimately it would be latency I would worry about more than cost.
@oregand When you say you would worry about latency, are you worried that Vertex AI wouldn't be able to generate suggestions in parallel? I assumed the impact on latency would only be caused by waiting for the slowest out of X suggestions.
My thinking more or less aligns with yours perfectly, i.e. the overall response time for a set of suggestions will be constrained by the slowest response among them, as you mentioned. This can introduce significant latency if one or more tasks lag behind the others, especially under high load or if specific suggestions are computationally more intensive.
But we're taking that into consideration already, so we can keep that under OQ 3.
The TL;DR for cost would be:
The total computation time (and therefore cost) will scale linearly with the number of suggestions, as each suggestion is a separate invocation of the model.
@oregand Just to clarify for others, this means the sum of the computation time running in parallel (e.g. if we generate 3 suggestions in parallel and each suggestion takes one second to generate, we get the response with all three suggestions in one second, but the total computation time is 3 seconds).
Correct! When generating multiple suggestions in parallel, the time observed by the user until all suggestions are available (response time) may be equivalent to the time taken to generate the slowest suggestion. However, the total computational cost is indeed the sum of all individual computations.
The cost is based on input + output: "Generative AI on Vertex AI charges by every 1,000 characters of input (prompt) and every 1,000 characters of output (response)."
As the output is much shorter for code completion (max tokens 64 and a stop sequence on the first new line) than the input, having 4 results should have a very small impact on cost. Overall the guidance is also to optimise for the customer.
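As a back-of-the-envelope illustration only (the unit price is a placeholder and the assumption that the prompt is billed once per request is unverified, not confirmed billing behaviour):

```typescript
// Back-of-the-envelope sketch only. The per-character price and the assumption
// that the prompt (input) is billed once per request are placeholders.
const inputChars = 2000;              // typical prompt: content above/below cursor
const outputCharsPerCandidate = 250;  // ~64 tokens, stops at the first blank line
const pricePer1kChars = 1;            // placeholder unit price

function cost(candidates: number): number {
  const billedChars = inputChars + candidates * outputCharsPerCandidate;
  return (billedChars / 1000) * pricePer1kChars;
}

console.log("1 candidate:", cost(1));  // 2.25 units
console.log("4 candidates:", cost(4)); // 3.0 units (~33% more, not 4x)
```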
However, the way the LS works, we need to wait for all suggestions to come back before offering them to the user (I'm 90% certain, but we can investigate more). That means that if we gather 5 suggestions, 4 take 50ms, and the fifth takes 100ms to fetch, we show the user the result after 100ms (the slowest one).
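A minimal sketch of that behaviour (fetchSuggestion is a hypothetical stand-in for a single suggestion request; the timings are simulated):

```typescript
// The user-visible wait is the slowest request; total compute is the sum of all of them.
async function fetchSuggestion(latencyMs: number): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, latencyMs));
  return `suggestion generated in ${latencyMs}ms`;
}

(async () => {
  const start = performance.now();
  // Four fast requests and one slow one, fetched in parallel.
  const suggestions = await Promise.all([50, 50, 50, 50, 100].map(fetchSuggestion));
  const elapsed = performance.now() - start;
  console.log(suggestions.length, "suggestions after ~", Math.round(elapsed), "ms");
  // Wall-clock time is ~100ms (the slowest request), but total compute is ~300ms.
})();
```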
I would think that GitHub would want to measure that, so tracking for multiple suggestions will be available in VS Code in some form. But it's probably best to leave the investigation to the spike.
Unless we start building this in all systems (AI Gateway, GitLab monolith, LS, VS Code) as the POC, in which case we need to answer all of these questions beforehand (which IMO warrants a small issue on its own).
@dashaadu I think one telemetry spike should be enough.
I'd create:
spike to "hack" support for multiple suggestions in Language Server (LS would make multiple API calls to get multiple suggestions for a single request)
spike to find out what telemetry we can add in Language Server and in VS Code
The first spike has to be finished before starting the second. WDYT?
Wouldn't it be simpler to do the change on the AI Gateway side, as suggested here?
@ohoral I should have provided more context in the comment. A few comments above I wrote:
For POC, the extension can ask multiple times; however, for production use, we'll need to get the collection of code suggestions from AI Gateway (groupai framework).
And by that I meant that groupeditor extensions can do a POC without changes to the APIs by asking the API multiple times. This POC would only serve to help us find out the UX of the feature; it would not be production-ready.
We've been getting much more user feedback around not having the option to see more than one code suggestion at a time. Please see this internal issue for additional context from a customer. Can I please get a weigh-in on the open questions above (started them as ), but please also add any questions I did not capture.
Once we start getting multiple suggestions from the API, it's relatively simple to include them in VS Code (VS Code is ready for it; the LS now sends a list with one item, and we can start sending a list with multiple items without any changes to VS Code).
TL;DR: we should probably make an additional request when the user wants to cycle through suggestions
Implementing the UI is a bit more complex than I thought.
My original thought - We'll be sending 4 suggestion options every time we now send one
Reality - That's not how the Language Server Protocol is designed
LSP intended use
The LS is supposed to return a single option as the user types
Only when the user hovers over the inline suggestion (or presses Option+]) does VS Code ask the LS "Give me more suggestions", and the LS is supposed to answer with many (e.g. 10) suggestions
From the docs:
automatic trigger - Completion was triggered automatically while editing. It is sufficient to return a single completion item in this case.
invoked trigger - Completion was triggered explicitly by a user gesture. Return multiple completion items to enable cycling through them.
Docs
```typescript
/**
 * Describes how an {@link InlineCompletionItemProvider inline completion
 * provider} was triggered.
 *
 * @since 3.18.0
 */
export namespace InlineCompletionTriggerKind {
  /**
   * Completion was triggered explicitly by a user gesture.
   * Return multiple completion items to enable cycling through them.
   */
  export const Invoked: 1 = 1;

  /**
   * Completion was triggered automatically while editing.
   * It is sufficient to return a single completion item in this case.
   */
  export const Automatic: 2 = 2;
}
```
Embrace
That means accepting that we'll make an additional request when the user expresses that they want multiple suggestions.
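A minimal sketch of how the LS handler could branch on the trigger kind (the types and fetch helpers below are simplified stand-ins, not the current Language Server code):

```typescript
// Local stand-ins mirroring the LSP types quoted above; the real Language Server
// would use the types from its LSP library instead.
const InlineCompletionTriggerKind = { Invoked: 1, Automatic: 2 } as const;

interface InlineCompletionItem {
  insertText: string;
}

interface InlineCompletionParams {
  context: { triggerKind: 1 | 2 };
}

// Hypothetical fetch helpers; the real LS goes through its own suggestion client.
async function fetchOneSuggestion(_params: InlineCompletionParams): Promise<InlineCompletionItem> {
  return { insertText: "single suggestion" };
}
async function fetchManySuggestions(_params: InlineCompletionParams): Promise<InlineCompletionItem[]> {
  return Array.from({ length: 10 }, (_, i) => ({ insertText: `option ${i + 1}` }));
}

async function handleInlineCompletion(
  params: InlineCompletionParams,
): Promise<InlineCompletionItem[]> {
  if (params.context.triggerKind === InlineCompletionTriggerKind.Automatic) {
    // Automatic trigger (typing): a single option is sufficient per the LSP docs.
    return [await fetchOneSuggestion(params)];
  }
  // Invoked trigger: the user explicitly asked to cycle, so make the extra
  // request for multiple options (this is the added wait noted as the drawback below).
  return fetchManySuggestions(params);
}
```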
Benefits
little to no extra cost because we'll only generate multiple suggestion options when the user wants them
we can generate more suggestions for the user (instead of the planned 4)
(implementation-wise): easy to track that users made the extra request
Drawback
the user has to wait extra time before getting those options
Example user interaction
user typed text
LS fetches and returns a single suggestion
The user indicates they want to cycle through (either hover over the suggestion or pressing Option+])
LS fetches and returns a suggestion with many (10+) options (this step takes extra time because of an API call)
Fight
We'll try to work around the LSP to provide multiple suggestion options instantly. This would involve some combination of caching and tracking the user's cursor (see the rough cache sketch after the example interaction below).
Benefits
the moment the user wants to cycle through the options, they'll be instantly ready
Drawbacks
from streaming suggestions, we know that there are UX gotchas when we start working around the LSP
we won't be able to provide as high a number of suggestions as in the "Embrace" version because we would increase the response size for every suggestion request (even when the user doesn't want to cycle through more)
(implementation-wise): the engineering effort will be significantly higher
Example user interaction
user typed text
LS fetches a suggestion with 4 options and returns a single one
The user indicates they want to cycle through (either hover over the suggestion or pressing Option+])
LS looks into an internal cache, finds the suggestion with 4 options and returns those options (this step is instant)
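A very rough sketch of the cache idea (keying on document URI and cursor position, plus a simple staleness check, are assumptions; real cursor tracking and invalidation would be the hard part):

```typescript
// Rough sketch of the "Fight" option's cache: store all options fetched for a
// position so cycling can be served instantly. Keying and invalidation here are
// assumptions, not the actual design.
interface CachedSuggestions {
  options: string[];
  fetchedAt: number;
}

const cache = new Map<string, CachedSuggestions>();

function cacheKey(documentUri: string, line: number, character: number): string {
  return `${documentUri}:${line}:${character}`;
}

function storeOptions(uri: string, line: number, character: number, options: string[]): void {
  cache.set(cacheKey(uri, line, character), { options, fetchedAt: Date.now() });
}

function getCachedOptions(uri: string, line: number, character: number): string[] | undefined {
  const entry = cache.get(cacheKey(uri, line, character));
  // Drop stale entries; the user has likely typed past this position.
  if (!entry || Date.now() - entry.fetchedAt > 30_000) return undefined;
  return entry.options;
}

// Usage: when the automatic request returns 4 options, show the first and cache
// all of them; on the "Invoked" trigger, try the cache before making a new request.
storeOptions("file:///main.ts", 10, 4, ["option A", "option B", "option C", "option D"]);
console.log(getCachedOptions("file:///main.ts", 10, 4));
```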
@acook.gitlab @jfypk Pinging you since you work on the API, but nothing should change for you (especially since @jfypk already plans to set the candidate limit to some high number (10))
@ohoral, this affects the tracking: Knowing if the user cycled through the suggestions will be trivial. I'm not sure we'll find out how many suggestions they saw.
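To make the tracking point concrete, here is a hypothetical event shape for what we'd want to capture (field names are illustrative, not an existing schema):

```typescript
// Hypothetical telemetry event shape; field names are illustrative and do not
// correspond to an existing schema.
interface SuggestionCycleEvent {
  requestId: string;
  triggerKind: "automatic" | "invoked";
  optionsReturned: number;      // how many options the LS returned
  acceptedOptionIndex?: number; // which one was accepted, if any (0 = first shown)
  optionsViewedCount?: number;  // may be unknowable, per the note above
}

const event: SuggestionCycleEvent = {
  requestId: "req-123",
  triggerKind: "invoked",
  optionsReturned: 10,
  acceptedOptionIndex: 2,
  // optionsViewedCount intentionally omitted: VS Code may not expose it.
};

console.log(JSON.stringify(event));
```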
@viktomas As always, thank you for the fantastic write-up!
We should embrace the language server and go with option 1.
Here's how I'm thinking through it on the product side:
It gets us to our baseline feature faster
Simplifies implementation and makes our product more scalable. Also, future-proofing is a plus.
By generating multiple suggestions when the user explicitly asks, we optimize resource usage, but this is a trade-off between performance and the user experience.
Latency is the big trade-off, but bypassing the monolith should help, and we'll have an overall better user experience than what we offer today.
LMK if I missed anything. I'm really excited about this feature!
This feature has been implemented and merged to main; however, it will only be released once telemetry is implemented. You can follow the telemetry issue or the parent epic for an update on when this will be released.