Skip to content

Implement smart cobertura class path correction

Erick Bajao requested to merge eb-cobertura-background-fix into master

What does this MR do?

This is based on !47097 (closed) and also addresses the feedback.

Solves #217664 (closed)

How it works

The parser will try to extract the right portion after the project's fullpath from each source node.

For example, if the project full_path is foo/bar

<sources>
  <source>/builds/foo/bar/MyLib1</source>
  <source>/builds/foo/bar/Some/foo/bar/here</source>
</sources>

We will extract ["MyLib1", "Some/foo/bar/here"] and use these as basis to detemine the actual file path of each class relative to the project root, combining extracted source and the class filename.

Let's say we have a class with the filename=User.cs, and the worktree paths look like ["MyLib1/User.cs", "Some/foo/bar/here/User.cs"], we will take the first path that matches which is Mylib1/User.cs.

You may also notice that we have to delete the matched class paths from the worktree context[:paths] after we finish parsing a package's collection of classes. This is best-effort assumption to avoid conflicts with class filenames that have duplicates across multiple packages and sources.

I got the coverage MR diffs to work locally on a C# and a Java project. You can fork these projects locally to test this branch on them.

Some benchmark results

Just copied over from !47097 (closed).

pipeline.all_worktree_paths performance on GitLab project on staging

From testing on staging, loading all_worktree_paths on a large project like ours seems to have no performance implications.

More info

pipeline = Ci::Pipeline.find 12878474
Benchmark.bmbm do |x|
x.report("fetch gitlab worktree") { pipeline.all_worktree_paths }
end

Rehearsal ---------------------------------------------------------
fetch gitlab worktree   0.525779   0.034583   0.560362 (  1.108381)
------------------------------------------------ total: 0.560362sec
                            user     system      total        real
fetch gitlab worktree   0.000119   0.000032   0.000151 (  0.000132)

[ gstg ] production> s=pipeline.all_worktree_paths.to_set
[ gstg ] production> s.size
=> 39495

Cost of calling all_worktree_paths on a project with 500_001 files on staging


[ gstg ] production> Benchmark.bmbm {|b| b.report('500_001 files') { p.repository.ls_files('master') }; }
Rehearsal -------------------------------------------------
500_001 files   0.529791   0.124296   0.654087 (  0.716400)
---------------------------------------- total: 0.654087sec
                user     system      total        real

500_001 files   0.470483   0.028320   0.498803 (  0.630161)
[ gstg ] production> p.repository.ls_files('master').size
=> 500001

Comparing performance between old and new implementation

The added parser behavior in this proposal, which is cross checking between sources and worktree paths, will only be triggered when there are any extracted parts after the first project.full_path occurrence. So for current cobertura reports that don't have any issues, for example, RSpec, there shouldn't be any significant change on performance.

More info

xml_data = File.read("/Users/kratos/Documents/GitLab/test/cobertura/cobertura-coverage-rspec-gitlab.xml");
Benchmark.bmbm do |x|
x.report("old parser") { Gitlab::Ci::Parsers::Coverage::CoberturaOld.new.parse!(xml_data, Gitlab::Ci::Reports::CoverageReports.new) }
x.report("new parser") { Gitlab::Ci::Parsers::Coverage::Cobertura.new.parse!(xml_data, Gitlab::Ci::Reports::CoverageReports.new, project_path: 'gitlab-org/gitlab', worktree_paths: []) }
end
Rehearsal ----------------------------------------------
old parser   5.591941   1.136074   6.728015 (  6.756588)
new parser   5.537035   0.176511   5.713546 (  5.718334)
------------------------------------ total: 12.441561sec
             user     system      total        real

old parser   5.999207   0.148476   6.147683 (  6.149939)
new parser   6.331641   0.197576   6.529217 (  6.537377)
A bit weird here. The there seems to be more time taken when parsing the same XML file on the 2nd run (new parser). Just to prove it's not because of the added code in the parser:
[7] pry(main)> Benchmark.bmbm {|x| x.report('old') { Hash.from_xml(xml_data)}; x.report('new') { Hash.from_xml(xml_data)}; }
Rehearsal ---------------------------------------
old   5.123571   0.145755   5.269326 (  5.272794)
new   5.531147   0.186506   5.717653 (  5.719541)
----------------------------- total: 10.986979sec
      user     system      total        real

old   5.048480   0.142026   5.190506 (  5.192629)
new   6.089451   0.130987   6.220438 (  6.222050)

The possible cause for concern in this implementation is when there are too many sources, assuming all those sources have extractable parts, because we have to iterate through each source for each class node and cross-check the combined path to the given worktree paths.

Here's an example for a single source and 10_000 classes. As you will see, the performance difference between old and new is insignificant.

More info

Rehearsal ----------------------------------------------
old parser   0.574016   0.044187   0.618203 (  0.619036)
new parser   0.566651   0.022968   0.589619 (  0.589773)
------------------------------------- total: 1.207822sec
             user     system      total        real

old parser   0.382012   0.002439   0.384451 (  0.384550)
new parser   0.458621   0.002361   0.460982 (  0.461330)

Now here's an example for 100 extracted sources and 10_000 classes. This is where the parsing time is significantly slower with the new implementation. I am using 10_000 because this is close to the number of classes we currently have in our RSpec cobertura report which is around 8_000.

More info

Rehearsal ----------------------------------------------
old parser   0.497342   0.006429   0.503771 (  0.503547)
new parser   0.723245   0.011851   0.735096 (  0.734848)
------------------------------------- total: 1.238867sec
             user     system      total        real

old parser   0.352771   0.000223   0.352994 (  0.352829)
new parser   0.813122   0.018858   0.831980 (  0.832343)

This is why we added a MAX_SOURCES limit on the number of extracted sources we can iterate through. Beyond the limit, if there is still no match for the given file path, the class will not be included in the report.

Screenshots

Screen_Shot_2020-11-20_at_1.02.27_AM

Screen_Shot_2020-11-20_at_1.12.59_AM

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team
Edited by Erick Bajao

Merge request reports