Use amatch Levenshtein for calculate_diff_ratio
What does this MR do and why?
Update calculate_diff_ratio
to use the gem amatch
to calculate the diff ratio.
The current implementation is very slow for longer strings, so switching up to use this gem which should provide a much needed bump in speed and performance when doing comparisons of longer strings.
How to set up and validate locally
The output should remain identical as the only difference is the library used to generate the diff ratio.
I ran some tests locally using long strings, which helped generate some idea of the scale of improvement we might be able to achieve here.
Sample diff to run test on old vs new
diff --git a/lib/gitlab_quality/test_tooling/report/relate_failure_issue.rb b/lib/gitlab_quality/test_tooling/report/relate_failure_issue.rb
index 3f38c75..1a82aad 100644
--- a/lib/gitlab_quality/test_tooling/report/relate_failure_issue.rb
+++ b/lib/gitlab_quality/test_tooling/report/relate_failure_issue.rb
@@ -201,12 +201,22 @@ module GitlabQuality
def compare_stack_traces(stack_trace_first, stack_trace_second)
calculate_diff_ratio(stack_trace_first, stack_trace_second)
end
-
def calculate_diff_ratio(stack_trace_first, stack_trace_second)
distance = Levenshtein.new(stack_trace_first).match(stack_trace_second)
distance.zero? ? 0.0 : (distance.to_f / stack_trace_first.size).round(3)
end
+ def calculate_diff_ratio_fast(stack_trace_first, stack_trace_second)
+ distance = Levenshtein.new(stack_trace_first).match(stack_trace_second)
+ distance.zero? ? 0.0 : (distance.to_f / stack_trace_first.size).round(3)
+ end
+
+ def calculate_diff_ratio_slow(stack_trace_first, stack_trace_second)
+ ld = Class.new.extend(Gem::Text).method(:levenshtein_distance)
+ distance = ld.call(stack_trace_first, stack_trace_second)
+ distance.zero? ? 0.0 : (distance.to_f / stack_trace_first.size).round(3)
+ end
+
def find_relevant_failure_issues(test) # rubocop:disable Metrics/AbcSize
clean_first_test_failure_stacktrace = cleaned_stacktrace_from_test(test)
# Search with the `search` param returns 500 errors, so we filter by `base_issue_labels` and then filter further in Ruby
diff --git a/spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb b/spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb
index f0be2e5..e1ffa62 100644
--- a/spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb
+++ b/spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb
@@ -864,5 +864,17 @@ RSpec.describe GitlabQuality::TestTooling::Report::RelateFailureIssue do
expect(ratio).to eq(0.286)
end
end
+
+ describe 'check speed' do
+ it 'computes a diff for a long string old' do
+ ratio = subject.send(:calculate_diff_ratio_slow, 'pattern'*2000, 'pattren'*2000)
+ expect(ratio).to eq(0.286)
+ end
+
+ it 'computes a diff for a long string new' do
+ ratio = subject.send(:calculate_diff_ratio_fast, 'pattern'*2000, 'pattren'*2000)
+ expect(ratio).to eq(0.286)
+ end
+ end
end
end
On longer strings (7000/14000 chars) -- 26.91/0.19116 = 140x / 6.69/0.04877 = 137x
GitlabQuality::TestTooling::Report::RelateFailureIssue#invoke! check speed computes a diff for a long string old 14000
26.91 seconds ./spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb:889
GitlabQuality::TestTooling::Report::RelateFailureIssue#invoke! check speed computes a diff for a long string new 14000
0.19116 seconds ./spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb:894
GitlabQuality::TestTooling::Report::RelateFailureIssue#invoke! check speed computes a diff for a long string old 7000
6.69 seconds ./spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb:879
GitlabQuality::TestTooling::Report::RelateFailureIssue#invoke! check speed computes a diff for a long string new 7000
0.04877 seconds ./spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb:884
On shorter and mid range length strings (less than 1000 chars long) the perf boost seems to be around 13x (0.20195/0.0148)
GitlabQuality::TestTooling::Report::RelateFailureIssue#invoke! check speed computes a diff for a long string old 1000 chars
0.20195 seconds ./spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb:869
GitlabQuality::TestTooling::Report::RelateFailureIssue#invoke! check speed computes a diff for a long string new 1000 chars
0.0148 seconds ./spec/gitlab_quality/test_tooling/report/relate_failure_issue_spec.rb:874
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.