Hackathon - Duplicate issues

Get issues + populate local database

We created a rake task that calls the Issues API to fetch issues with `state=opened` for our groups: Pipeline Authoring (Laura and Avielle) and Pipeline Security (Max). We then used that data to populate our local database with issues.

Populate issues (shortened for clarity):

  task populate_pa_issues: :environment do
    access_token = 'YOUR ACCESS TOKEN'
    project = Project.find_by_full_path('PROJECT PATH')
    author = User.first
    page = 1

    loop do
      p "Fetching page #{page}"
      response = Gitlab::HTTP.get(
        "https://gitlab.com/api/v4/projects/278964/issues?labels=group%3A%3Apipeline%20authoring&per_page=100&page=#{page}&state=opened",
        headers: { 'Authorization' => "Bearer #{access_token}" }
      )
      issues_data = Gitlab::Json.parse(response.body)

      # An empty page means everything has been fetched.
      break if issues_data.empty?

      issues_data.each do |issue|
        # The upstream issue id is stored in `weight`, so reruns skip issues that were already imported.
        next if Issue.find_by(project: project, weight: issue['id']).present?

        p "Creating issue #{issue['id']}"

        Issue.create!(
          author: author,
          title: issue['title'],
          description: issue['description'],
          project: project,
          namespace: project.namespace,
          weight: issue['id']
        )
      end

      page += 1
    end
  end
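
The pagination pattern in the task can also be sketched without Rails. This is a minimal sketch, not the task we ran: `issues_url` and `fetch_all_issues` are hypothetical helper names, and the HTTP call is passed in as a callable so the loop can be exercised without network access. The endpoint, project ID, and query parameters are the ones from the task above.

```ruby
require 'uri'

# Build the Issues API URL for one page. `URI.encode_www_form` takes care of
# percent-encoding the label, so it does not have to be written by hand.
def issues_url(project_id:, label:, page:, per_page: 100)
  query = URI.encode_www_form(
    labels: label, state: 'opened', per_page: per_page, page: page
  )
  "https://gitlab.com/api/v4/projects/#{project_id}/issues?#{query}"
end

# Collect issues page by page until an empty page comes back.
# `fetch_page` is any callable that takes a URL and returns an array of hashes.
def fetch_all_issues(fetch_page, project_id:, label:)
  all_issues = []
  page = 1

  loop do
    issues = fetch_page.call(issues_url(project_id: project_id, label: label, page: page))
    break if issues.empty?

    all_issues.concat(issues)
    page += 1
  end

  all_issues
end
```

In the rake task, the callable would wrap the `Gitlab::HTTP.get` and `Gitlab::Json.parse` calls; in a test it can be a lambda that returns canned pages.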

Migration and setup to add an `embedding` vector column to issues

We created a migration that adds an `embedding` column of type `vector` to the issues table, which is what the neighbor gem needs.

class AddEmbeddingToIssues < ActiveRecord::Migration[7.1]
  def change
    add_column :issues, :embedding, :vector
  end
end

We also added the `has_neighbors` declaration to the model:

class Issue < ApplicationRecord
  has_neighbors :embedding
end

We installed pgvector and ran the migration:

rails generate neighbor:vector 
rails db:migrate

Assign vectors to the title (and description)

We created a rake task that generated an embedding vector for each issue title (and description) and assigned it to the issue.

response = Gitlab::Llm::VertexAi::Client.new(
  User.first, unit_primitive: 'documentation_search'
).text_embeddings(content: input)

# One embedding (an array of floats) per prediction in the response
Gitlab::Json.parse(response.body)["predictions"].map { |v| v["embeddings"]['values'] }
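
To make the response handling concrete, here is a small sketch of the extraction step run against a hand-written payload. The `predictions` / `embeddings` / `values` nesting is the one used in the snippet above; the sample numbers and the `extract_embeddings` name are made up for illustration, and real embeddings are much longer vectors.

```ruby
require 'json'

# Pull one embedding vector per prediction out of a response body.
# Returns an empty array when the body has no predictions.
def extract_embeddings(body)
  predictions = JSON.parse(body).fetch('predictions', [])
  predictions.map { |prediction| prediction.dig('embeddings', 'values') }
end

# Hand-written sample body (illustrative values only).
sample_body = JSON.generate(
  'predictions' => [
    { 'embeddings' => { 'values' => [0.1, -0.2, 0.3] } },
    { 'embeddings' => { 'values' => [0.0, 0.5, -0.5] } }
  ]
)

vectors = extract_embeddings(sample_body)
```

Using `dig` instead of chained `[]` means a prediction without an `embeddings` key yields `nil` rather than raising, which makes malformed entries easier to spot before they are written to the database.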

Find duplicate issues with pgvector

We then looked for duplicate issues by fetching each issue's nearest neighbors and trying different distances. The version below uses Euclidean distance and treats neighbors closer than 0.3 as possible duplicates.

Got the neighbors:

discovered_issue_ids = []

# The accumulator is named `groups` so it does not shadow the outer `issues` collection.
neighbors = issues.each_with_object([]) do |issue, groups|
  next if issue.title.starts_with?('[Test]')
  next if discovered_issue_ids.include?(issue.id)

  nearest_neighbors = issue.nearest_neighbors(:embedding, distance: 'euclidean').first(20)
  possible_duplicates = nearest_neighbors.filter { |neighbor| neighbor.neighbor_distance < 0.3 }

  unless possible_duplicates.empty?
    discovered_issue_ids += possible_duplicates.map(&:id)
    groups << ([issue.title] + possible_duplicates.map(&:title))
  end
end
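
The grouping logic can be tried out without a database. The sketch below is a brute-force, in-memory stand-in for the pgvector query, not what the neighbor gem does internally: `euclidean_distance` and `group_possible_duplicates` are hypothetical helpers and the three-dimensional vectors and titles are invented. Only the 0.3 cutoff and the limit of 20 neighbors come from our experiment.

```ruby
# Straight-line (L2) distance between two equal-length vectors.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# `items` is an array of hashes with :id, :title and :embedding.
# Returns groups of titles: the first title is the seed issue, the rest are
# its neighbors closer than `threshold`. Items already reported as a possible
# duplicate are not used as a seed again.
def group_possible_duplicates(items, threshold: 0.3, limit: 20)
  discovered_ids = []

  items.each_with_object([]) do |item, groups|
    next if discovered_ids.include?(item[:id])

    nearest = items
      .reject { |other| other[:id] == item[:id] }
      .map { |other| [other, euclidean_distance(item[:embedding], other[:embedding])] }
      .sort_by { |_, distance| distance }
      .first(limit)

    duplicates = nearest.select { |_, distance| distance < threshold }.map(&:first)
    next if duplicates.empty?

    discovered_ids.concat(duplicates.map { |duplicate| duplicate[:id] })
    groups << ([item[:title]] + duplicates.map { |duplicate| duplicate[:title] })
  end
end

# Invented sample data: two near-identical vectors and one distant one.
items = [
  { id: 1, title: 'Pipeline fails on include', embedding: [0.10, 0.20, 0.30] },
  { id: 2, title: 'include keyword breaks pipeline', embedding: [0.12, 0.21, 0.29] },
  { id: 3, title: 'Add dark mode', embedding: [0.90, 0.10, 0.40] }
]

groups = group_possible_duplicates(items)
```

Raising or lowering `threshold` in a sketch like this is a cheap way to see how the cutoff changes which issues are grouped, before running the same experiment against real embeddings.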

Conclusions:

Matches are definitely better when the description is included. That may seem obvious, but descriptions are sometimes missing or unclear, and including them still seems to help.
Using only the title gave us better "related issues" than "duplicates".
There is a big "context" challenge: gitlab-ci.yml keywords such as `needs` and `include` are ordinary English words, so they can be interpreted as part of the sentence rather than as CI/CD keywords.
Embeddings seem to work better for grouping similar issues than for finding semantically duplicate ones.
We are curious to see how this compares with the current "related issues" feature that pops up when you start writing a new issue.
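
One way to act on the first two observations is to embed the title together with the description whenever a usable description exists, and fall back to the title alone otherwise. The helper below is a hypothetical sketch, not code from the hackathon: the name `embedding_input` and the 2,000-character cap are assumptions chosen for illustration.

```ruby
# Build the text to embed for an issue. A missing or whitespace-only
# description falls back to the title; long descriptions are truncated
# (the cap is an arbitrary illustrative value).
def embedding_input(title, description, max_description_chars: 2000)
  cleaned = description.to_s.strip
  return title.strip if cleaned.empty?

  "#{title.strip}\n\n#{cleaned[0, max_description_chars]}"
end
```

Keeping the title first means both variants start with the same text, so issues with and without a description share that part of the input.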

Ideas for the future

@sunjungp had a great idea: a `/duplicate` quick action that could be used to find other duplicate issues, or to close the issue if it is suspected to be a duplicate.

Outcome

Improving GDK by bumping pgvector to the latest version so that the HNSW index can be used.

Resources

Edited Jun 25, 2024 by Laura Montemayor