Use LLM to extract relevant context information for code generation prompts

Background

Code creation prompts in order to yield best results should provide LLM model with reach and relevant context. By the way of analogy if one would like a human developer to write code that stand up to quality, security and style standards they need to inform the developer what those standards are.

Goal

Use LLM to extract relevant information from codebase, eg: available 3rd party libraries, public methods, used framework etc. That information should be stored and used to enrich prompts when it is relevant.

There is a parallel epic &11568 (closed) that focuses on using Tree-sitter parsing tool to build abstract syntax trees and query them for required information, however that tool creates different technical challenges, therefore this issue is proposed to offer an alternative (or a complementary) solution

An advantage of proposed solution is that it should require less effort to implement then Tree-sitter one, and what is even more important it can provide summarised explanation what different modules do, that can be later passed to code creation prompt.

Technical overview

LLMs are good at both parsing and understanding natural language text input, as well as source code. With correct prompts LLM is not only able to write code, but also to explain and summarise it. This ability can be used to parse and extract relevant information from source code of repositories that opted into code creation tool.

Example prompt for Anthropic

Blow issue present example prompt for claude models from Anthropic that extracts:

  1. All defined constants. Include constants: name, value, type, position in source code indexing lines and columns from 0.
  2. All defined functions. Include functions: name, parameters, returned value, description of behaviour it implements
Human: From a TypeScript code provided in <code src="hello.ts"></code> XML tags.

Extract following information:
1. All defined constants. Include constants: name, value, type, position in source code indexing lines and columns from 0.
1. All defined functions. Include functions: name, parameters, returned value, description of behaviour it implements

Structure your answer using following example
<program src="some_file.ts">
  <constants>
    <constant name="constant name" type="string" line="0" column="0">"hello world"</constant>
    <constant name="constant name1" type="number" line="0" column="1">1</constant>
  </constants>
  <functions>
    <function name="test">
      <parameters>
        <parameter name="greeting" type="string"></parameter>
      </parameters>
      <description>
         This function outputs to console a message passed in with greeting parameter
      </description>
      <return>void</return>
    </function>
  </functions>
</program>

<code src="snowplow.ts">
</code>

With code

Click to expand
import {
  Payload,
  PayloadBuilder,
  SelfDescribingJson,
  StructuredEvent,
  buildStructEvent,
  trackerCore,
  TrackerCore,
} from '@snowplow/tracker-core';
import fetch from 'cross-fetch';
import { v4 as uuidv4 } from 'uuid';
import { Emitter } from './emitter';
import { log } from '../log';
import { SnowplowOptions } from './snowplow_options';

/**
 * Adds the 'stm' paramater with the current time to the payload
 * Stringyfiy all payload values
 * @param payload - The payload which will be mutated
 */
function preparePayload(payload: Payload): Record<string, string> {
  const stringifiedPayload: Record<string, string> = {};

  Object.keys(payload).forEach(key => {
    stringifiedPayload[key] = String(payload[key]);
  });

  stringifiedPayload.stm = new Date().getTime().toString();

  return stringifiedPayload;
}

export class Snowplow {
  private emitter: Emitter;

  private options: SnowplowOptions;

  private tracker: TrackerCore;

  // eslint-disable-next-line no-use-before-define
  private static instance: Snowplow;

  private constructor(options: SnowplowOptions) {
    this.options = options;
    this.emitter = new Emitter(
      this.options.timeInterval,
      this.options.maxItems,
      this.sendEvent.bind(this),
    );
    this.emitter.start();
    this.tracker = trackerCore({ callback: this.emitter.add.bind(this.emitter) });
  }

  public static getInstance(options?: SnowplowOptions): Snowplow {
    if (!this.instance) {
      if (!options) {
        throw new Error('Snowplow should be instantiated');
      }
      const sp = new Snowplow(options);
      Snowplow.instance = sp;
    }

    return Snowplow.instance;
  }

  private async sendEvent(events: PayloadBuilder[]): Promise<void> {
    if (!this.options.enabled()) {
      return;
    }

    const url = `${this.options.endpoint}/com.snowplowanalytics.snowplow/tp2`;
    const response = await fetch(url, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        schema: 'iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4',
        data: events.map(event => {
          const eventId = uuidv4();
          // All values prefieled below are part of snowplow tracker protocol
          // https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/snowplow-tracker-protocol/#common-parameters
          // Values are set according to either common GitLab standard:
          // tna - representing tracker namespace and being set across GitLab to "gl"
          // tv - represents tracker value, to make it aligned with downstream system it has to be prefixed with "js-*""
          // aid - represents app Id is configured via options to gitlab_ide_extension
          // eid - represents uuid for each emitted event
          event.add('eid', eventId);
          event.add('p', 'app');
          event.add('tv', 'js-gitlab');
          event.add('tna', 'gl');
          event.add('aid', this.options.appId);

          return preparePayload(event.build());
        }),
      }),
    });

    if (response.status !== 200) {
      log.warn(`Could not send telmetry to snowplow status=${response.status}`);
    }
  }

  public async trackStructEvent(
    event: StructuredEvent,
    context?: SelfDescribingJson[] | null,
  ): Promise<void> {
    this.tracker.track(buildStructEvent(event), context);
  }

  async stop() {
    await this.emitter.stop();
  }
}

That produced

<program src="snowplow.ts">
  <constants>
    <constant name="timeInterval" type="number" line="7" column="4"></constant>
    <constant name="maxItems" type="number" line="8" column="4"></constant>
  </constants>
  <functions>
    <function name="preparePayload">
      <parameters>
        <parameter name="payload" type="Payload"></parameter>
      </parameters>
      <description>
        Mutates the payload by stringifying all values and adding a timestamp property
      </description>
      <return>Record&lt;string, string&gt;</return> 
    </function>
    <function name="sendEvent">
      <parameters>
        <parameter name="events" type="PayloadBuilder[]"></parameter>
      </parameters>
      <description>
        Sends tracking events to the Snowplow endpoint if enabled
      </description>
      <return>Promise&lt;void&gt;</return>
    </function>
    <function name="trackStructEvent">
      <parameters>
        <parameter name="event" type="StructuredEvent"></parameter>
        <parameter name="context" type="SelfDescribingJson[] | null"></parameter>
      </parameters>
      <description>
        Tracks a Snowplow struct event
      </description>
      <return>Promise&lt;void&gt;</return>
    </function>
    <function name="stop">
      <description>
        Stops the emitter
      </description>
      <return>Promise</return>
    </function>
  </functions>
</program>

Open questions

1. Where and how to store parsed information?

@timzallmann mentioned graph databases being a good candidate to not only store information extracted from the codebase, but also to model relations between different entities like function calls. We need to evaluate what solutions can be used and how they could be integrated with the current GitLab stack

2. Asses performance for large repositories

3. Would any references in the code that aren't defined in the request cause hallucinations?

4. Verify if https://docs.gitlab.com/ee/architecture/blueprints/code_search_with_zoekt/ can be combined with LLM output to build references graph

5. How tolerant is it of invalid code? i.e. code that's still being written / refactored

Edited by Mikołaj Wawrzyniak