
WIP: [PoC] Fix cobertura paths in the background job

Erick Bajao requested to merge eb-poc-cobertura-background-fix into master

This is the PoC work for #273540 (closed)

This is a proposal on how we could fix the cobertura report's file paths and also support multiple sources.

How it works

The parser will try to extract the relevant portion after the project's full_path from each source node.

For example, if the project full_path is foo/bar:

<sources>
  <source>/builds/foo/bar/MyLib1</source>
  <source>/builds/foo/bar/Some/foo/bar/here</source>
</sources>

We will extract ["MyLib1", "Some/foo/bar/here"] and use these as the basis to determine the actual file path of each class relative to the project root, combining the extracted source and the class filename.

Let's say we have a class with filename=User.cs and the worktree paths look like ["MyLib1/User.cs", "Some/foo/bar/here/User.cs"]; we will take the first path that matches, which is MyLib1/User.cs.

You may also notice that we delete the matched class paths from the worktree context[:paths] after we finish parsing a package's collection of classes. This is a best-effort measure to avoid conflicts with class filenames that are duplicated across multiple packages and sources.
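To make this concrete, here's a minimal Ruby sketch of the extraction and matching described above. The helper names (extract_sources, resolve_path) are illustrative only, not the actual parser API:

require 'set'

# Keep only the portion of each <source> that follows the first
# occurrence of the project full_path.
def extract_sources(source_paths, project_path)
  source_paths.filter_map do |source|
    _prefix, _separator, extracted = source.partition("/#{project_path}/")
    extracted unless extracted.empty?
  end
end

# Combine each extracted source with the class filename and take the
# first combination that exists in the worktree paths.
def resolve_path(filename, extracted_sources, worktree_paths)
  extracted_sources
    .map { |source| File.join(source, filename) }
    .find { |candidate| worktree_paths.include?(candidate) }
end

sources = extract_sources(
  ['/builds/foo/bar/MyLib1', '/builds/foo/bar/Some/foo/bar/here'],
  'foo/bar'
)
worktree = Set.new(['MyLib1/User.cs', 'Some/foo/bar/here/User.cs'])

resolve_path('User.cs', sources, worktree)
# => "MyLib1/User.cs"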

I added specs to make sure the parser properly handles the edge cases in this PoC.

I got the coverage MR diffs to work locally on both a C# and a Java project.

What about Go support?

Given that the Go issue seems to require its own special parser logic, for now the parser will fall back to ignoring sources, and Go users will still need the sed workaround. I propose we open a separate issue for handling the Go cobertura format.

Some benchmark results

pipeline.all_worktree_paths performance on the GitLab project on staging

From testing on staging, loading all_worktree_paths on a large project like ours seems to have no significant performance impact.

More info

pipeline = Ci::Pipeline.find(12878474)

Benchmark.bmbm do |x|
  x.report("fetch gitlab worktree") { pipeline.all_worktree_paths }
end

Rehearsal ---------------------------------------------------------
fetch gitlab worktree   0.525779   0.034583   0.560362 (  1.108381)
------------------------------------------------ total: 0.560362sec
                            user     system      total        real
fetch gitlab worktree   0.000119   0.000032   0.000151 (  0.000132)

[ gstg ] production> s=pipeline.all_worktree_paths.to_set
[ gstg ] production> s.size
=> 39495

Cost of calling all_worktree_paths on a project with 500_001 files on staging


[ gstg ] production> Benchmark.bmbm {|b| b.report('500_001 files') { p.repository.ls_files('master') }; }
Rehearsal -------------------------------------------------
500_001 files   0.529791   0.124296   0.654087 (  0.716400)
---------------------------------------- total: 0.654087sec
                user     system      total        real

500_001 files   0.470483   0.028320   0.498803 (  0.630161)
[ gstg ] production> p.repository.ls_files('master').size
=> 500001

Comparing performance between the old and new implementations

The parser behavior added in this proposal, cross-checking sources against worktree paths, is only triggered when there are extracted parts after the first project.full_path occurrence. So for current cobertura reports that don't have any issues (for example, RSpec), there shouldn't be any significant change in performance.
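In other words, the cross-check only kicks in when extraction yields something. A hypothetical guard illustrating that fast path:

# When no <source> contains the project full_path, nothing is extracted
# and the parser keeps the old behavior of using the class filename as-is.
def determine_path(filename, extracted_sources, worktree_paths)
  return filename if extracted_sources.empty?

  extracted_sources
    .map { |source| File.join(source, filename) }
    .find { |candidate| worktree_paths.include?(candidate) }
end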

More info

xml_data = File.read("/Users/kratos/Documents/GitLab/test/cobertura/cobertura-coverage-rspec-gitlab.xml");

Benchmark.bmbm do |x|
  x.report("old parser") { Gitlab::Ci::Parsers::Coverage::CoberturaOld.new.parse!(xml_data, Gitlab::Ci::Reports::CoverageReports.new) }
  x.report("new parser") { Gitlab::Ci::Parsers::Coverage::Cobertura.new.parse!(xml_data, Gitlab::Ci::Reports::CoverageReports.new, project_path: 'gitlab-org/gitlab', worktree_paths: []) }
end
Rehearsal ----------------------------------------------
old parser   5.591941   1.136074   6.728015 (  6.756588)
new parser   5.537035   0.176511   5.713546 (  5.718334)
------------------------------------ total: 12.441561sec
             user     system      total        real

old parser   5.999207   0.148476   6.147683 (  6.149939)
new parser   6.331641   0.197576   6.529217 (  6.537377)
A bit weird here. There seems to be more time taken when parsing the same XML file on the 2nd run (new parser). Just to prove it's not because of the added code in the parser:
[7] pry(main)> Benchmark.bmbm {|x| x.report('old') { Hash.from_xml(xml_data)}; x.report('new') { Hash.from_xml(xml_data)}; }
Rehearsal ---------------------------------------
old   5.123571   0.145755   5.269326 (  5.272794)
new   5.531147   0.186506   5.717653 (  5.719541)
----------------------------- total: 10.986979sec
      user     system      total        real

old   5.048480   0.142026   5.190506 (  5.192629)
new   6.089451   0.130987   6.220438 (  6.222050)

The possible cause for concern in this implementation is when there are too many sources, assuming all of those sources have extractable parts, because we have to iterate through each source for each class node and cross-check the combined path against the given worktree paths.

Here's an example with a single source and 10_000 classes. As you can see, the performance difference between the old and new parsers is insignificant.

More info

Rehearsal ----------------------------------------------
old parser   0.574016   0.044187   0.618203 (  0.619036)
new parser   0.566651   0.022968   0.589619 (  0.589773)
------------------------------------- total: 1.207822sec
             user     system      total        real

old parser   0.382012   0.002439   0.384451 (  0.384550)
new parser   0.458621   0.002361   0.460982 (  0.461330)

Now here's an example with 100 extracted sources and 10_000 classes. This is where parsing is significantly slower with the new implementation. I am using 10_000 because it is close to the number of classes we currently have in our RSpec cobertura report, which is around 8_000.

More info

Rehearsal ----------------------------------------------
old parser   0.497342   0.006429   0.503771 (  0.503547)
new parser   0.723245   0.011851   0.735096 (  0.734848)
------------------------------------- total: 1.238867sec
             user     system      total        real

old parser   0.352771   0.000223   0.352994 (  0.352829)
new parser   0.813122   0.018858   0.831980 (  0.832343)

This is why, in this PoC, I am proposing a MAX_SOURCES limit on the number of extracted sources we iterate through. Beyond the limit, if there is still no match for the given file path, the class will not be included in the report.
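A rough sketch of that cap (the value 100 here is a placeholder, not a decided limit):

MAX_SOURCES = 100 # placeholder value, not a decided limit

def resolve_path_with_limit(filename, extracted_sources, worktree_paths)
  extracted_sources.first(MAX_SOURCES).each do |source|
    candidate = File.join(source, filename)
    return candidate if worktree_paths.include?(candidate)
  end

  nil # no match within the limit: the class is excluded from the report
end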

My final thoughts and opinions

My personal opinion about the sources concern is that we may not realistically get projects with that many sources, but having a MAX_SOURCES limit will at least guard against the worst case.

I think the parsing logic we do here (comparing extracted sources against worktree paths) would have to be done the same way if we ever decide to go the CLI tool route. But I am worried it might be less performant with the CLI tool, because then we would have to actually check whether each file exists on disk, which is slower than Set#include?.
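For reference, here's a rough micro-benchmark of that lookup cost difference. It's illustrative only; absolute numbers depend on the machine and filesystem, and it assumes the repository is checked out locally:

require 'benchmark'
require 'set'

paths = Set.new(Dir.glob('**/*'))
lookup = 'app/models/user.rb' # any path present in the checkout

Benchmark.bmbm do |x|
  x.report('Set#include?') { 100_000.times { paths.include?(lookup) } }
  x.report('File.exist?')  { 100_000.times { File.exist?(lookup) } }
end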

I also think it's better to put the burden of this extra work on our background jobs instead of consuming more of our users' CI minutes.

Looking forward to hearing your thoughts and suggestions.

