Hackathon - Duplicate issues
Get issues + populate local database
We created a rake task that calls the Issues API to fetch the issues with state=opened for our groups: Pipeline Authoring (Laura & Avielle) and Pipeline Security (Max). We then used that data to populate our local database with issues.
Populate issues (shortened for clarity):
task populate_pa_issues: :environment do
  access_token = 'YOUR ACCESS TOKEN'
  project = Project.find_by_full_path('PROJECT PATH')
  author = User.first
  page = 1

  # Fetch first, then process; stop as soon as the API returns an empty page.
  loop do
    p "Fetching page #{page}"
    response = Gitlab::HTTP.get(
      "https://gitlab.com/api/v4/projects/278964/issues?labels=group%3A%3Apipeline%20authoring&per_page=100&page=#{page}&state=opened",
      headers: { 'Authorization' => "Bearer #{access_token}" }
    )
    issues_data = Gitlab::Json.parse(response.body)
    break if issues_data.empty?

    issues_data.each do |issue|
      # The upstream issue ID is stored in weight, so re-runs skip issues that already exist locally.
      next if Issue.find_by(project: project, weight: issue['id']).present?

      p "Creating issue #{issue['id']}"
      Issue.create!(
        author: author,
        title: issue['title'],
        description: issue['description'],
        project: project,
        namespace: project.namespace,
        weight: issue['id']
      )
    end

    page += 1
  end
end
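The fetch-until-empty pagination that the task relies on can be sketched in isolation. This is a minimal sketch, not the hackathon code: `fetch_all_pages`, `fake_fetch` and `FAKE_ISSUES` are hypothetical names, and the lambda stands in for the real HTTP call so the example runs without a GitLab instance.

```ruby
# Collects every item from a paginated source.
# `fetch_page` is a hypothetical callable: given a 1-based page number,
# it returns an array of items, or an empty array past the last page.
def fetch_all_pages(fetch_page)
  results = []
  page = 1

  loop do
    batch = fetch_page.call(page)
    break if batch.empty? # an empty page means there is nothing left

    results.concat(batch)
    page += 1
  end

  results
end

# Stand-in for the HTTP call: three fake "issues", two per page.
FAKE_ISSUES = [{ 'id' => 1 }, { 'id' => 2 }, { 'id' => 3 }].freeze
fake_fetch = ->(page) { FAKE_ISSUES.each_slice(2).to_a[page - 1] || [] }

all_issues = fetch_all_pages(fake_fetch)
puts all_issues.map { |issue| issue['id'] }.inspect
```

Fetching before checking for an empty page keeps the loop from depending on the initial value of the accumulator.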
Migration and setup to add embedding and vector to issues
We created a migration to add an embedding column of type vector to the issues table. This was in order to use the neighbor gem.
class AddEmbeddingToIssues < ActiveRecord::Migration[7.1]
  def change
    add_column :issues, :embedding, :vector
  end
end
We also added the has_neighbors declaration to the model:
class Issue < ApplicationRecord
  has_neighbors :embedding
end
We installed pgvector and ran the migration:
rails generate neighbor:vector
rails db:migrate
Assign vectors to the title (and description)
We created a rake task that generated an embedding vector for each issue title (and description) through the Vertex AI client. Excerpt, where input is the text to embed:
response = Gitlab::Llm::VertexAi::Client.new(
  User.first,
  unit_primitive: 'documentation_search'
).text_embeddings(content: input)

# One embedding (an array of floats) per prediction.
Gitlab::Json.parse(response.body)["predictions"].map { |v| v["embeddings"]['values'] }
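To make the response parsing concrete, here is a small standalone sketch. It assumes only the shape implied by the excerpt ("predictions", each with "embeddings" and "values"). The sample body is hand-written with made-up numbers, `extract_embeddings` is a hypothetical helper name, and Ruby's standard JSON library is used in place of Gitlab::Json so the snippet runs outside the GitLab codebase.

```ruby
require 'json'

# Hand-written sample shaped like the body parsed in the excerpt.
# The numbers are invented and far shorter than a real embedding.
sample_body = <<~JSON
  {
    "predictions": [
      { "embeddings": { "values": [0.1, -0.2, 0.3] } },
      { "embeddings": { "values": [0.0, 0.5, -0.5] } }
    ]
  }
JSON

# Hypothetical helper: returns one array of floats per prediction.
def extract_embeddings(body)
  JSON.parse(body).fetch('predictions').map do |prediction|
    prediction.dig('embeddings', 'values')
  end
end

vectors = extract_embeddings(sample_body)
puts vectors.length
puts vectors.first.inspect
```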
Find duplicate issues with pgvector
We then looked for duplicate issues by fetching each issue's nearest neighbors and experimenting with different distances. The excerpt below uses Euclidean distance with a cut-off of 0.3.
Getting the neighbors:
discovered_issue_ids = [] # IDs already reported as a possible duplicate

neighbors = issues.each_with_object([]) do |issue, groups|
  next if issue.title.starts_with?('[Test]')
  next if discovered_issue_ids.include?(issue.id)

  nearest_neighbors = issue.nearest_neighbors(:embedding, distance: 'euclidean').first(20)
  possible_duplicates = nearest_neighbors.filter { |neighbor| neighbor.neighbor_distance < 0.3 }

  unless possible_duplicates.empty?
    discovered_issue_ids += possible_duplicates.map(&:id)
    groups << ([issue.title] + possible_duplicates.map(&:title))
  end
end
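The same grouping idea can be tried without a database. The sketch below is a pure-Ruby approximation, not the pgvector query itself: it computes Euclidean distances in memory, and the names `euclidean_distance`, `group_possible_duplicates`, the sample items and their two-dimensional "embeddings" are all invented for illustration.

```ruby
# Euclidean distance between two equal-length vectors.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# For each item not yet reported as a duplicate, take its `limit` nearest
# neighbors and keep those closer than `threshold`.
def group_possible_duplicates(items, threshold:, limit: 20)
  discovered_ids = []

  items.each_with_object([]) do |item, groups|
    next if discovered_ids.include?(item[:id])

    nearest = items
      .reject { |other| other[:id] == item[:id] }
      .map { |other| [other, euclidean_distance(item[:embedding], other[:embedding])] }
      .sort_by { |_other, distance| distance }
      .first(limit)

    possible_duplicates = nearest.select { |_other, distance| distance < threshold }.map(&:first)
    next if possible_duplicates.empty?

    discovered_ids.concat(possible_duplicates.map { |dup| dup[:id] })
    groups << ([item[:title]] + possible_duplicates.map { |dup| dup[:title] })
  end
end

# Made-up data: the first two items sit close together, the third is far away.
items = [
  { id: 1, title: 'CI include fails', embedding: [0.0, 0.0] },
  { id: 2, title: 'include keyword broken', embedding: [0.1, 0.1] },
  { id: 3, title: 'Runner out of disk', embedding: [5.0, 5.0] }
]

groups = group_possible_duplicates(items, threshold: 0.3)
puts groups.inspect
```

Because items already reported as duplicates are skipped as starting points, each cluster is listed once, at the cost of making the result depend on iteration order.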
Conclusions:
- Matches are definitely better when you include description. This may seem obvious, but even though descriptions can sometimes be non-existent or unclear, including them still seems to help.
- Just using title gave us better "related issues" than "duplicates".
- There is a very big "context" challenge, since keywords for gitlab-ci.yml are regular words (needs, include) and can sometimes be interpreted as part of the sentence rather than as "context".
- Using embeddings seems to work better for grouping similar issues than for finding semantically duplicate issues.
- We are curious to see how this compares with the current "related issues" feature that pops up when you start writing a new issue.
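The first conclusion suggests embedding the title together with the description whenever one exists. A minimal sketch of building that input follows. The helper name `embedding_input` and the blank-line separator are assumptions, since the write-up does not show how the two fields were combined.

```ruby
# Hypothetical helper: builds the text sent to the embedding model.
# Falls back to the title alone when the description is nil or blank.
def embedding_input(title, description)
  parts = [title, description].map { |text| text.to_s.strip }.reject(&:empty?)
  parts.join("\n\n")
end

puts embedding_input('Pipeline fails on include', nil)
puts embedding_input('Pipeline fails on include', 'The include keyword cannot resolve a remote file.')
```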
Ideas for the future
- @sunjungp had a great idea to add a /duplicate quick action. It could be used to find other duplicate issues, or to close the issue if it is suspected to be a duplicate.
Outcome
- Improving GDK by bumping pgvector to the latest version in order to leverage the HNSW index.
Resources
- https://docs.gitlab.com/ee/development/ai_features/index.html
- https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings
- https://mikulskibartosz.name/text-search-and-duplicate-detection-with-word-embeddings-and-vector-databases
- https://en.wikipedia.org/wiki/Semantic_similarity
- https://www.timescale.com/learn/postgresql-extensions-pgvector
- @mksionek was the best resource of all 💃
Edited by Laura Montemayor