Parse Git attribute files using Ruby (340e111e) · Commits · GitLab.org / gitlab_git

Verified Commit 340e111e authored Aug 30, 2016 by Yorick Peterse
Parse Git attribute files using Ruby

Rugged provides a way of parsing Git attribute files such as the one
located in $GIT_DIR/info/attributes. Per GitLab's performance monitoring
tools, quite a lot of time can be spent parsing and retrieving these
attributes.

This commit introduces a pure Ruby parser for gitlab_git that performs
drastically better than the one provided by Rugged.

== Production Timings

As an example, take the commit nrclark/dummy_project@81ebdea5
(as taken from https://gitlab.com/gitlab-org/gitlab-ce/issues/10785).
When loading this commit we spend between 4 and 6 seconds in
Rugged::Repository#fetch_attributes. This method is called around 1100
times. This is the result of two problems:

1. For every diff we call Gitlab::Git::Repository#diffable? and pass it
   a blob. This method in turn returns a boolean (based on the Git
   attributes for the blob's path) indicating if the content is
   diffable.

2. For every diff we use the GitLab class Gitlab::Highlight which calls
   Repository#gitattribute in the #custom_language method. This is used
   to determine what language to use for highlighting a diff.

As a result, in the worst case we'll end up with 2 calls per diff to
Gitlab::Git::Repository#attributes (previously delegated to
Rugged::Repository#attributes).

== Rugged Implementation

Rugged in turn implements the "attributes" method in a rather
inefficient way. The first time this method is called it will run at
least a single open() call to open the file. On top of that it appears
to run 2 stat() calls for every call to Rugged::Repository#attributes.
In other words, if you call it 100 times you will end up with 201 IO
calls:

* 200 stat() calls
* 1 open() call

== Rugged IO Overhead

To confirm the IO overhead of Rugged I created the following script
(saved as "confirm.rb"):

    require 'rugged'

    path = '/tmp/test/.git'
    repo = Rugged::Repository.new(path)

    10.times do
      repo.attributes('README.md')['gitlab-language']
    end

I then ran this as follows:

    strace -f ruby confirm.rb 2>&1 | grep -i 'info/attributes' | wc -l

This counts the number of IO calls that refer to the
"$GIT_DIR/info/attributes" file. The output is "21", meaning 21 IO calls
were executed: 2 stat() calls for each of the 10 iterations plus a
single open() call.

While this may not be a big problem when using physical storage (even
less so when using SSDs), this _will_ be a problem when using network
storage. For example, say every operation takes 2 milliseconds to
complete. The 201 IO calls from the 100-call example above would then
result in _at least_ 400 milliseconds being spent in _just_ the IO
operations.
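
The arithmetic above can be sketched as a small cost model. This is only
an illustration: the method names are hypothetical, the "one open() plus
two stat() calls per lookup" rule is the behaviour observed above rather
than a documented guarantee of Rugged, and the 2 millisecond latency is
the example figure from the text.

```ruby
# Hypothetical cost model based on the observations above: a single open()
# on first use, plus two stat() calls for every attribute lookup.
def rugged_io_calls(lookups)
  return 0 if lookups.zero?

  1 + (2 * lookups)
end

# Lower-bound estimate of time spent in IO, in milliseconds, assuming every
# IO call takes the same (assumed) latency.
def estimated_io_ms(io_calls, latency_ms)
  io_calls * latency_ms
end

[10, 100].each do |lookups|
  calls = rugged_io_calls(lookups)

  puts "#{lookups} lookups: #{calls} IO calls, " \
       "at least #{estimated_io_ms(calls, 2)} ms at 2 ms per call"
end
```

With 10 lookups the model gives the 21 IO calls counted by strace, and
with 100 lookups it gives the 201 IO calls mentioned earlier.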

The Ruby parser on the other hand only uses a single open() IO call.
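
To illustrate the idea of reading the file once and answering every
later lookup from memory, here is a minimal sketch. It is _not_ the
Gitlab::Git::Attributes implementation from this commit: the class name
SketchAttributes and all of its methods are hypothetical, and it assumes
that File.fnmatch is a good enough approximation of Git's pattern
matching, which does not hold for every pattern Git supports.

```ruby
# Hypothetical sketch of a pure Ruby attributes parser. The file is read on
# the first lookup only; every later lookup reuses the parsed patterns.
class SketchAttributes
  def initialize(git_dir)
    @path = File.join(git_dir, 'info', 'attributes')
    @patterns = nil
  end

  # Returns a Hash of attribute names and values for the given file path.
  # Later lines in the attributes file override earlier ones.
  def attributes(file_path)
    basename = File.basename(file_path)

    patterns.each_with_object({}) do |(pattern, attrs), result|
      # Assumption: patterns containing a slash match the full path, all
      # other patterns match the file name only.
      target = pattern.include?('/') ? file_path : basename

      result.merge!(attrs) if File.fnmatch(pattern, target)
    end
  end

  private

  def patterns
    @patterns ||= parse(read_file)
  end

  # A missing attributes file is treated as an empty one.
  def read_file
    File.read(@path)
  rescue Errno::ENOENT
    ''
  end

  def parse(content)
    content.each_line.map { |line| parse_line(line.strip) }.compact
  end

  # Skips blank lines and comments; splits on spaces as well as tabs.
  def parse_line(line)
    return nil if line.empty? || line.start_with?('#')

    pattern, *raw = line.split(/\s+/)

    [pattern, raw.map { |attr| parse_attribute(attr) }.to_h]
  end

  # "-text" => false, "eol=lf" => "lf", "text" => true
  def parse_attribute(attr)
    if attr.start_with?('-')
      [attr[1..-1], false]
    elsif attr.include?('=')
      attr.split('=', 2)
    else
      [attr, true]
    end
  end
end
```

Such a class would be used the same way as the Rugged call in the
script above, for example
SketchAttributes.new('/tmp/test/.git').attributes('README.md')['gitlab-language'].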

== Benchmarking

To measure the performance of this code I wrote the following benchmark:

    require 'rugged'
    require 'benchmark/ips'

    require_relative 'lib/gitlab_git/attributes'

    repo = Rugged::Repository.new('/tmp/test/.git')
    attr = Gitlab::Git::Attributes.new(repo.path)

    Benchmark.ips(time: 10) do |bench|
      bench.report 'Rugged' do
        repo.attributes('test.haml.html')['gitlab-language']
      end

      bench.report 'gitlab_git' do
        attr.attributes('test.haml.html')['gitlab-language']
      end

      bench.compare!
    end

The contents of /tmp/test/.git/info/attributes are as follows:

    # This is a comment, it should be ignored.

    *.txt     text
    *.jpg     -text
    *.sh      eol=lf gitlab-language=shell
    *.haml.*  gitlab-language=haml
    foo/bar.* foo
    *.cgi     key=value?p1=v1&p2=v2

    # This uses a tab instead of spaces to ensure the parser also supports this.
    *.md	gitlab-language=markdown

Running this benchmark on my development environment produces the
following output:

    Warming up --------------------------------------
                  Rugged     9.543k i/100ms
              gitlab_git    43.277k i/100ms
    Calculating -------------------------------------
                  Rugged    100.261k (± 2.0%) i/s -      1.012M in  10.093380s
              gitlab_git    482.186k (± 1.7%) i/s -      4.847M in  10.055286s

    Comparison:
              gitlab_git:   482185.6 i/s
                  Rugged:   100260.6 i/s - 4.81x  slower

The exact output varies with system load, but usually the new Ruby based
parser is between 4 and 6 times faster than Rugged.

To further test this I wrote the following benchmark:

    require 'benchmark'
    require 'rugged'

    require_relative 'lib/gitlab_git/attributes'

    amount = 5000
    repo = Rugged::Repository.new('/var/opt/gitlab/git-data-ceph/repositories/gitlab-org/gitlab-ce.git')
    attrs = Gitlab::Git::Attributes.new(repo.path)

    # Wall-clock time of every individual call, in milliseconds.
    rugged_timings = amount.times.map do
      timing = Benchmark.measure do
        repo.attributes('README.md').to_h
      end

      timing.real * 1000.0
    end

    ruby_timings = amount.times.map do
      timing = Benchmark.measure do
        attrs.attributes('README.md')
      end

      timing.real * 1000.0
    end

    puts "Rugged: #{rugged_timings.inject(:+)} ms"
    puts "Ruby: #{ruby_timings.inject(:+)} ms"

This script uses Rugged and the new attributes parser, requests the
attributes of the same path 5000 times with each, and then sums the
total processing time. Running this script on worker1 produced the
following output:

    Rugged: 131.95287296548486 ms
    Ruby: 30.17003694549203 ms

Here the Ruby based solution is roughly 4.4 times faster than Rugged.
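
The ratios can be checked directly from the figures quoted above. The
snippet below is only a convenience for that arithmetic; the helper name
"ratio" is hypothetical and the numbers are copied from the two
benchmark outputs.

```ruby
# Divides one measurement by another to express a speedup factor.
def ratio(numerator, denominator)
  numerator / denominator
end

# benchmark-ips output: iterations per second, higher is better.
ips_speedup = ratio(482_185.6, 100_260.6)

# Second benchmark: total milliseconds for 5000 calls, lower is better.
total_speedup = ratio(131.95287296548486, 30.17003694549203)

puts "benchmark-ips: #{ips_speedup.round(2)}x"
puts "5000 calls:    #{total_speedup.round(2)}x"
```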

== Further Improvements

At some point GitLab may decide to cache the parsed data structures in,
for example, Redis. This is now possible because they are proper Ruby
data structures. Note that this is only really beneficial in cases where
Git attributes are requested for the same file path in different
requests. This also requires careful cache invalidation. For example, we
don't want to invalidate the entire cache when modifying some unrelated
file.

Because of the complexity involved it's best to leave this for later and
only implement it once we're certain it will actually be beneficial.
parent b205c79e
Douwe Maan @DouweM
mentioned in commit 62927165
· Aug 31, 2016

Sean McGivern 🔴 @smcgivern
mentioned in issue gitaly#158 (closed)
· Jun 16, 2017
