Skip to content

Use amatch Levenshtein for calculate_diff_ratio

John McDonnell requested to merge jmd/improve-speed-of-calculate-diff-ratio into main

What does this MR do and why?

Update calculate_diff_ratio to use the gem amatch to calculate the diff ratio.
The current implementation is very slow for longer strings, so switching up to use this gem which should provide a much needed bump in speed and performance when doing comparisons of longer strings.

How to set up and validate locally

The output should remain identical as the only difference is the library used to generate the diff ratio.
I ran some tests locally using long strings, which helped generate some idea of the scale of improvement we might be able to achieve here.

Sample diff to run test on old vs new
diff --git a/lib/gitlab_quality/test_tooling/report/relate_failure_issue.rb b/lib/gitlab_quality/test_tooling/report/relate_failure_issue.rb
index 3f38c75..1a82aad 100644
--- a/lib/gitlab_quality/test_tooling/report/relate_failure_issue.rb
+++ b/lib/gitlab_quality/test_tooling/report/relate_failure_issue.rb
@@ -201,12 +201,22 @@ module GitlabQuality
         def compare_stack_traces(stack_trace_first, stack_trace_second)
           calculate_diff_ratio(stack_trace_first, stack_trace_second)
         end
-
         def calculate_diff_ratio(stack_trace_first, stack_trace_second)
           distance = Levenshtein.new(stack_trace_first).match(stack_trace_second)
           distance.zero? ? 0.0 : (distance.to_f / stack_trace_first.size).round(3)
         end
 
+        def calculate_diff_ratio_fast(stack_trace_first, stack_trace_second)
+          distance = Levenshtein.new(stack_trace_first).match(stack_trace_second)
+          distance.zero? ? 0.0 : (distance.to_f / stack_trace_first.size).round(3)
+        end
+
+        def calculate_diff_ratio_slow(stack_trace_first, stack_trace_second)
+          ld = Class.new.extend(Gem::Text).method(:levenshtein_distance)
+          distance = ld.call(stack_trace_first, stack_trace_second)
+          distance.zero? ? 0.0 : (distance.to_f / stack_trace_first.size).round(3)
+        end
+
         def find_relevant_failure_issues(test) # rubocop:disable Metrics/AbcSize
           clean_first_test_failure_stacktrace = cleaned_stacktrace_from_test(test)
           # Search with the `search` param returns 500 errors, so we filter by `base_issue_labels` and then filter further in Ruby
diff --git a/spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb b/spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb
index f0be2e5..e1ffa62 100644
--- a/spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb
+++ b/spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb
@@ -864,5 +864,17 @@ RSpec.describe GitlabQuality::TestTooling::Report::RelateFailureIssue do
         expect(ratio).to eq(0.286)
       end
     end
+
+    describe 'check speed' do
+      it 'computes a diff for a long string old' do
+        ratio = subject.send(:calculate_diff_ratio_slow, 'pattern'*2000, 'pattren'*2000)
+        expect(ratio).to eq(0.286)
+      end
+
+      it 'computes a diff for a long string new' do
+        ratio = subject.send(:calculate_diff_ratio_fast, 'pattern'*2000, 'pattren'*2000)
+        expect(ratio).to eq(0.286)
+      end
+    end
   end
 end

On longer strings (7000/14000 chars) -- 26.91/0.19116 = 140x / 6.69/0.04877 = 137x

  GitlabQuality::TestTooling::Report::RelateFailureIssue#invoke! check speed computes a diff for a long string old 14000
    26.91 seconds ./spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb:889
  GitlabQuality::TestTooling::Report::RelateFailureIssue#invoke! check speed computes a diff for a long string new 14000
    0.19116 seconds ./spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb:894

  GitlabQuality::TestTooling::Report::RelateFailureIssue#invoke! check speed computes a diff for a long string old 7000
    6.69 seconds ./spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb:879
  GitlabQuality::TestTooling::Report::RelateFailureIssue#invoke! check speed computes a diff for a long string new 7000
    0.04877 seconds ./spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb:884

On shorter and mid range length strings (less than 1000 chars long) the perf boost seems to be around 13x (0.20195/0.0148)

  GitlabQuality::TestTooling::Report::RelateFailureIssue#invoke! check speed computes a diff for a long string old 1000 chars
    0.20195 seconds ./spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb:869
  GitlabQuality::TestTooling::Report::RelateFailureIssue#invoke! check speed computes a diff for a long string new 1000 chars
    0.0148 seconds ./spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb:874

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by John McDonnell

Merge request reports