Hackathon - Duplicate issues

Get issues + populate local database

We created a rake task that calls the Issues API to fetch the issues with state=opened for our groups: Pipeline Authoring (Laura and Avielle) and Pipeline Security (Max). We then used that data to populate our local database with issues.

Populate issues (shortened for clarity):

  task populate_pa_issues: :environment do
    access_token = 'YOUR ACCESS TOKEN'
    project = Project.find_by_full_path('PROJECT PATH')
    author = User.first
    page = 1

    loop do
      p "Fetching page #{page}"
      response = Gitlab::HTTP.get(
        "https://gitlab.com/api/v4/projects/278964/issues?labels=group%3A%3Apipeline%20authoring&per_page=100&page=#{page}&state=opened",
        headers: { 'Authorization' => "Bearer #{access_token}" }
      )
      issues_data = Gitlab::Json.parse(response.body)

      # An empty page means every issue has been fetched.
      break if issues_data.empty?

      issues_data.each do |issue|
        # The upstream issue ID is stored in `weight`, so reruns skip issues
        # that were already created locally.
        next if Issue.find_by(project: project, weight: issue['id']).present?

        p "Creating issue #{issue['id']}"

        Issue.create!(
          author: author,
          title: issue['title'],
          description: issue['description'],
          project: project,
          namespace: project.namespace,
          weight: issue['id']
        )
      end

      page += 1
    end
  end
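
The fetch-until-empty loop can be tried without network access. The sketch below is only an illustration: fetch_all_pages and the FAKE_PAGES data are hypothetical names, and a lambda stands in for the Issues API call.

```ruby
# Hypothetical helper: collects records from a paginated source.
# `fetch_page` is any callable that takes a page number and returns an Array;
# an empty Array signals that there are no more pages.
def fetch_all_pages(fetch_page, start_page: 1)
  records = []
  page = start_page

  loop do
    batch = fetch_page.call(page)
    break if batch.empty?

    records.concat(batch)
    page += 1
  end

  records
end

# Stand-in for the Issues API: two pages of data, then nothing.
FAKE_PAGES = {
  1 => [{ 'id' => 1, 'title' => 'A' }, { 'id' => 2, 'title' => 'B' }],
  2 => [{ 'id' => 3, 'title' => 'C' }]
}.freeze

fetch_page = ->(page) { FAKE_PAGES.fetch(page, []) }

ids = fetch_all_pages(fetch_page).map { |issue| issue['id'] }
puts ids.inspect
```

Fetching first and breaking on an empty page keeps the stop condition next to the request, so the loop cannot exit before the first page is read.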

Migration and setup to add embedding and vector to issues

We created a migration to add an embedding column of type vector to the issues table, so that we could use the neighbor gem.

class AddEmbeddingToIssues < ActiveRecord::Migration[7.1]
  def change
    add_column :issues, :embedding, :vector
  end
end

We also added has_neighbors to the Issue model:

class Issue < ApplicationRecord
  has_neighbors :embedding
end

We installed pgvector, generated the neighbor gem's vector migration, and ran the migrations:

rails generate neighbor:vector
rails db:migrate

Assign vectors to the title (and description)

We created a rake task that generated an embedding vector from each issue's title (and description) and assigned it to the issue.

response = Gitlab::Llm::VertexAi::Client.new(
  User.first, unit_primitive: 'documentation_search'
).text_embeddings(content: input)

# One embedding vector per prediction in the response.
Gitlab::Json.parse(response.body)["predictions"].map { |v| v["embeddings"]['values'] }
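
To make the response handling concrete, here is a minimal standalone sketch using only the standard library. The response shape is taken from the snippet above. extract_embeddings, embedding_input, and the sample body are hypothetical, and joining title and description with a blank line is an assumption rather than what the rake task necessarily did.

```ruby
require 'json'

# Extracts one embedding vector per prediction from a body shaped like
# {"predictions": [{"embeddings": {"values": [...]}}]}.
def extract_embeddings(body)
  JSON.parse(body).fetch('predictions', []).map do |prediction|
    prediction.dig('embeddings', 'values')
  end
end

# Hypothetical helper: builds the text to embed from a title and an optional
# description, skipping missing or blank parts.
def embedding_input(title, description)
  [title, description].compact.map(&:strip).reject(&:empty?).join("\n\n")
end

# Made-up sample body with the same shape as the real response.
sample_body = JSON.generate(
  {
    'predictions' => [
      { 'embeddings' => { 'values' => [0.1, 0.2, 0.3] } },
      { 'embeddings' => { 'values' => [0.4, 0.5, 0.6] } }
    ]
  }
)

vectors = extract_embeddings(sample_body)
puts vectors.length
puts embedding_input('Pipeline fails on include', nil)
```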

Find duplicate issues with pgvector

We then found possible duplicate issues by looking up each issue's nearest neighbors and trying different distances.

Got the neighbors:

discovered_issue_ids = []

neighbors = issues.each_with_object([]) do |issue, groups|
  next if issue.title.starts_with?('[Test]')
  next if discovered_issue_ids.include?(issue.id)

  nearest_neighbors = issue.nearest_neighbors(:embedding, distance: 'euclidean').first(20)
  possible_duplicates = nearest_neighbors.filter { |neighbor| neighbor.neighbor_distance < 0.3 }

  unless possible_duplicates.empty?
    # Remember matched issues so they do not start a group of their own.
    discovered_issue_ids += possible_duplicates.map(&:id)
    groups << ([issue.title] + possible_duplicates.map(&:title))
  end
end
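
The grouping logic can be illustrated in plain Ruby, without pgvector or the neighbor gem. This is a rough sketch with an assumed data shape (hashes with :id, :title and :embedding) and made-up two-dimensional embeddings. possible_duplicate_groups is a hypothetical name, and the brute-force distance computation stands in for the database query.

```ruby
# Euclidean distance between two equal-length vectors.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Groups each issue with the other issues whose embedding lies within
# `threshold`, looking only at the `limit` closest candidates.
def possible_duplicate_groups(issues, threshold: 0.3, limit: 20)
  discovered_ids = []

  issues.each_with_object([]) do |issue, groups|
    next if discovered_ids.include?(issue[:id])

    nearest = issues
      .reject { |other| other[:id] == issue[:id] }
      .map { |other| [other, euclidean_distance(issue[:embedding], other[:embedding])] }
      .sort_by { |_, distance| distance }
      .first(limit)

    duplicates = nearest.select { |_, distance| distance < threshold }.map(&:first)
    next if duplicates.empty?

    # Matched issues are remembered so they do not start their own group.
    discovered_ids.concat(duplicates.map { |d| d[:id] })
    groups << ([issue[:title]] + duplicates.map { |d| d[:title] })
  end
end

# Toy data: the first two embeddings are close together, the third is far away.
issues = [
  { id: 1, title: 'Pipeline fails on include', embedding: [0.0, 0.0] },
  { id: 2, title: 'include breaks pipeline',   embedding: [0.1, 0.1] },
  { id: 3, title: 'Unrelated UI bug',          embedding: [5.0, 5.0] }
]

groups = possible_duplicate_groups(issues)
puts groups.inspect
```

Raising the threshold pulls in looser matches (closer to "related issues"), while lowering it keeps only near-identical embeddings.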

Conclusions:

  • Matches are definitely better when the description is included. That may seem obvious, but descriptions are sometimes missing or unclear, and including them still seems to help.
  • Using only the title gave us better "related issues" than "duplicates".
  • There is a very big "context" challenge: gitlab-ci.yml keywords such as needs and include are regular words, so they can be interpreted as part of the sentence rather than as "context".
  • Embeddings seem to work better for grouping similar issues than for finding semantically duplicate issues.
  • We are curious to see how this compares with the current "related issues" feature that pops up when you start writing a new issue.

Ideas for the future

  • @sunjungp had a great idea: a /duplicate quick action that could be used to find other duplicate issues, or to close an issue if it is suspected to be a duplicate.

Outcome

Resources

Edited by Laura Montemayor