Implement smart cobertura class path correction
What does this MR do?
This is based on !47097 (closed) and also addresses the feedback.
Solves #217664 (closed)
How it works
The parser will try to extract the right portion after the project's fullpath from each source
node.
For example, if the project full_path
is foo/bar
<sources>
<source>/builds/foo/bar/MyLib1</source>
<source>/builds/foo/bar/Some/foo/bar/here</source>
</sources>
We will extract ["MyLib1", "Some/foo/bar/here"]
and use these as basis to detemine the actual file path of each class relative to the project root, combining extracted source and the class filename.
Let's say we have a class with the filename=User.cs
, and the worktree paths look like ["MyLib1/User.cs", "Some/foo/bar/here/User.cs"]
, we will take the first path that matches which is Mylib1/User.cs
.
You may also notice that we have to delete
the matched class paths from the worktree context[:paths]
after we finish parsing a package's collection of classes. This is best-effort assumption to avoid conflicts with class filenames that have duplicates across multiple packages and sources.
I got the coverage MR diffs to work locally on a C# and a Java project. You can fork these projects locally to test this branch on them.
Some benchmark results
Just copied over from !47097 (closed).
pipeline.all_worktree_paths
performance on GitLab project on staging
From testing on staging, loading all_worktree_paths
on a large project like ours seems to have no performance implications.
More info
pipeline = Ci::Pipeline.find 12878474
Benchmark.bmbm do |x|
x.report("fetch gitlab worktree") { pipeline.all_worktree_paths }
end
Rehearsal ---------------------------------------------------------
fetch gitlab worktree 0.525779 0.034583 0.560362 ( 1.108381)
------------------------------------------------ total: 0.560362sec
user system total real
fetch gitlab worktree 0.000119 0.000032 0.000151 ( 0.000132)
[ gstg ] production> s=pipeline.all_worktree_paths.to_set
[ gstg ] production> s.size
=> 39495
all_worktree_paths
on a project with 500_001
files on staging
Cost of calling
[ gstg ] production> Benchmark.bmbm {|b| b.report('500_001 files') { p.repository.ls_files('master') }; }
Rehearsal -------------------------------------------------
500_001 files 0.529791 0.124296 0.654087 ( 0.716400)
---------------------------------------- total: 0.654087sec
user system total real
500_001 files 0.470483 0.028320 0.498803 ( 0.630161)
[ gstg ] production> p.repository.ls_files('master').size
=> 500001
Comparing performance between old and new implementation
The added parser behavior in this proposal, which is cross checking between sources and worktree paths,
will only be triggered when there are any extracted parts after the first project.full_path
occurrence. So
for current cobertura reports that don't have any issues, for example, RSpec, there shouldn't be any significant
change on performance.
More info
xml_data = File.read("/Users/kratos/Documents/GitLab/test/cobertura/cobertura-coverage-rspec-gitlab.xml");
Benchmark.bmbm do |x|
x.report("old parser") { Gitlab::Ci::Parsers::Coverage::CoberturaOld.new.parse!(xml_data, Gitlab::Ci::Reports::CoverageReports.new) }
x.report("new parser") { Gitlab::Ci::Parsers::Coverage::Cobertura.new.parse!(xml_data, Gitlab::Ci::Reports::CoverageReports.new, project_path: 'gitlab-org/gitlab', worktree_paths: []) }
end
Rehearsal ----------------------------------------------
old parser 5.591941 1.136074 6.728015 ( 6.756588)
new parser 5.537035 0.176511 5.713546 ( 5.718334)
------------------------------------ total: 12.441561sec
user system total real
old parser 5.999207 0.148476 6.147683 ( 6.149939)
new parser 6.331641 0.197576 6.529217 ( 6.537377)
A bit weird here. The there seems to be more time taken when parsing the same XML file on the 2nd run (new parser). Just to prove it's not because of the added code in the parser:
[7] pry(main)> Benchmark.bmbm {|x| x.report('old') { Hash.from_xml(xml_data)}; x.report('new') { Hash.from_xml(xml_data)}; }
Rehearsal ---------------------------------------
old 5.123571 0.145755 5.269326 ( 5.272794)
new 5.531147 0.186506 5.717653 ( 5.719541)
----------------------------- total: 10.986979sec
user system total real
old 5.048480 0.142026 5.190506 ( 5.192629)
new 6.089451 0.130987 6.220438 ( 6.222050)
The possible cause for concern in this implementation is when there are too many sources
, assuming all those sources have extractable parts, because we have to iterate through each source
for each class
node and cross-check the combined path to the given worktree paths.
Here's an example for a single source and 10_000
classes. As you will see, the performance difference between old and new is insignificant.
More info
Rehearsal ----------------------------------------------
old parser 0.574016 0.044187 0.618203 ( 0.619036)
new parser 0.566651 0.022968 0.589619 ( 0.589773)
------------------------------------- total: 1.207822sec
user system total real
old parser 0.382012 0.002439 0.384451 ( 0.384550)
new parser 0.458621 0.002361 0.460982 ( 0.461330)
Now here's an example for 100
extracted sources and 10_000
classes. This is where the parsing time is significantly slower with the new implementation. I am using 10_000
because this is close to the number of classes we currently have in our RSpec cobertura report which is around 8_000
.
More info
Rehearsal ----------------------------------------------
old parser 0.497342 0.006429 0.503771 ( 0.503547)
new parser 0.723245 0.011851 0.735096 ( 0.734848)
------------------------------------- total: 1.238867sec
user system total real
old parser 0.352771 0.000223 0.352994 ( 0.352829)
new parser 0.813122 0.018858 0.831980 ( 0.832343)
This is why we added a MAX_SOURCES
limit on the number of extracted sources we can iterate through. Beyond the limit, if there is still no match for the given file path, the class will not be included in the report.
Screenshots
Does this MR meet the acceptance criteria?
Conformity
-
Changelog entry -
Documentation (if required) -
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides -
Separation of EE specific content
Availability and Testing
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. -
Tested in all supported browsers -
Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
-
Label as security and @ mention @gitlab-com/gl-security/appsec
-
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods -
Security reports checked/validated by a reviewer from the AppSec team