chore(chunker): extract sizer factory

What does this MR do and why?

  • chore(chunker): extract sizer factory

Using a factory to build the sizers should allow us to amortize the tokenization cost for each file.

  • chore(chunker): add benchmarks

These benchmarks are based on the corresponding benchmarks in gitlab-elasticsearch-indexer.
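
To illustrate the idea behind the first commit, here is a minimal sketch of a sizer factory. It is not the code in this MR: `Sizer`, `SizerFactory`, `Strategy`, `TokenSizer`, and the whitespace tokenizer are all hypothetical names chosen for the example, and it assumes that "amortizing" means tokenizing each file once and answering every later size query from the cached result.

```rust
use std::ops::Range;

/// Hypothetical trait: reports the size of a byte range within one file.
trait Sizer {
    fn size(&self, range: Range<usize>) -> usize;
}

/// Byte-based sizing needs no per-file precomputation.
struct ByteSizer;

impl Sizer for ByteSizer {
    fn size(&self, range: Range<usize>) -> usize {
        range.len()
    }
}

/// Token-based sizing backed by token start offsets computed once per file.
struct TokenSizer {
    /// Sorted byte offsets at which each token begins.
    token_starts: Vec<usize>,
}

impl Sizer for TokenSizer {
    fn size(&self, range: Range<usize>) -> usize {
        // Count the tokens that start inside [range.start, range.end)
        // with two binary searches instead of re-tokenizing the range.
        let lo = self.token_starts.partition_point(|&s| s < range.start);
        let hi = self.token_starts.partition_point(|&s| s < range.end);
        hi.saturating_sub(lo)
    }
}

/// Hypothetical strategy selector, standing in for the real chunk strategies.
enum Strategy {
    Bytes,
    Tokens,
}

/// The factory holds the configuration and builds one sizer per file.
struct SizerFactory {
    strategy: Strategy,
}

impl SizerFactory {
    /// Pays the tokenization cost once here, so every later `size` call is cheap.
    fn build(&self, source: &str) -> Box<dyn Sizer> {
        match self.strategy {
            Strategy::Bytes => Box::new(ByteSizer),
            Strategy::Tokens => Box::new(TokenSizer {
                token_starts: whitespace_token_starts(source),
            }),
        }
    }
}

/// Stand-in tokenizer (an assumption): one token per whitespace-separated word.
fn whitespace_token_starts(source: &str) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut in_token = false;
    for (i, c) in source.char_indices() {
        if c.is_whitespace() {
            in_token = false;
        } else if !in_token {
            starts.push(i);
            in_token = true;
        }
    }
    starts
}
```

With this shape, a chunker would call `factory.build(source)` once per file and then query `sizer.size(start..end)` for each candidate chunk. Each query costs two binary searches over the cached offsets rather than a fresh tokenization pass.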

Testing

  1. Check out this branch and build the release version of gitlab-code-parser:
    $ git checkout jf_chunker_optimisations
    $ cargo build --release
  2. Copy the library output to gitlab-elasticsearch-indexer:
    $ cp target/release/libparser_c_bindings.a ../gitlab-elasticsearch-indexer/tmp/libparser/lib/libparser_c_bindings.a
  3. Remove the old binary and rebuild the indexer:
    $ cd ../gitlab-elasticsearch-indexer
    $ rm bin/gitlab-elasticsearch-indexer
    $ make

Default chunking strategy:

$ RUST_BACKTRACE=1 GITLAB_INDEXER_MODE=chunk GITLAB_INDEXER_DEBUG_LOGGING=1 time ./bin/gitlab-elasticsearch-indexer -adapter "elasticsearch" -connection '{"url": ["http://localhost:9200"]}' -options '{
  "timeout": "30m",
  "gitaly_batch_size": 1000,
  "from_sha": "",
  "to_sha": "",
  "project_id": 19,
  "partition_name": "gitlab_active_context_code",
  "partition_number": 0,
  "gitaly_config": {
    "address": "unix:/Users/james/src/gitlab-org/gdk/praefect.socket",
    "storage": "default",
    "relative_path": "@hashed/94/00/9400f1b21cb527d7fa3d3eabba93557a18ebe7a2ca4e471cfe5e4c5b4ca7f767.git",
    "project_path": "gitlab-duo/gitlab"
  }
}'
...
      112.94 real        81.94 user         2.50 sys

code_pre_bert chunking strategy:

$ RUST_BACKTRACE=1 GITLAB_INDEXER_MODE=chunk GITLAB_INDEXER_DEBUG_LOGGING=1 time ./bin/gitlab-elasticsearch-indexer -adapter "elasticsearch" -connection '{"url": ["http://localhost:9200"]}' -options '{
  "timeout": "30m",
  "chunk_size": 500,
  "chunk_strategy": "code_pre_bert",
  "gitaly_batch_size": 1000,
  "from_sha": "",
  "to_sha": "",
  "project_id": 19,
  "partition_name": "gitlab_active_context_code",
  "partition_number": 0,
  "gitaly_config": {
    "address": "unix:/Users/james/src/gitlab-org/gdk/praefect.socket",
    "storage": "default",
    "relative_path": "@hashed/94/00/9400f1b21cb527d7fa3d3eabba93557a18ebe7a2ca4e471cfe5e4c5b4ca7f767.git",
    "project_path": "gitlab-duo/gitlab"
  }
}'
...
      116.03 real       103.45 user         3.62 sys

gitlab-elasticsearch-indexer Benchmarks

goos: darwin
goarch: arm64
pkg: gitlab.com/gitlab-org/gitlab-elasticsearch-indexer/internal/mode/chunk/chunker
cpu: Apple M4 Max
                         │ before.txt  │              after.txt              │
                         │   sec/op    │   sec/op     vs base                │
Chunker/size-16            10.91µ ± 1%   10.92µ ± 1%        ~ (p=0.869 n=10)
Chunker/code_bytes-16      6.787m ± 1%   6.723m ± 0%   -0.94% (p=0.000 n=10)
Chunker/code_pre_bert-16   17.36m ± 1%   12.45m ± 1%  -28.29% (p=0.000 n=10)
geomean                    1.087m        970.5µ       -10.73%
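
For readers who want to sanity-check the table, the sketch below shows how the `vs base` and `geomean` figures can be reproduced from the rounded values above. The helper names `geomean` and `percent_change` are made up for this example, and it assumes the geomean row is the geometric mean of the three per-benchmark times. Because the inputs are the rounded numbers from the table, results should be expected to differ slightly from the reported percentages.

```rust
/// Geometric mean of strictly positive values, computed in log space.
fn geomean(values: &[f64]) -> f64 {
    let log_sum: f64 = values.iter().map(|v| v.ln()).sum();
    (log_sum / values.len() as f64).exp()
}

/// Relative change from `before` to `after`, in percent (negative means faster).
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Per-benchmark times from the table, converted to microseconds:
/// size, code_bytes, code_pre_bert.
const BEFORE_US: [f64; 3] = [10.91, 6787.0, 17360.0];
const AFTER_US: [f64; 3] = [10.92, 6723.0, 12450.0];
```

Calling `percent_change(BEFORE_US[2], AFTER_US[2])` gives the change for `code_pre_bert`, and `percent_change(geomean(&BEFORE_US), geomean(&AFTER_US))` gives the overall change.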

Performance Analysis

  • This merge request does not introduce any performance regression. If a performance regression is expected, explain why.
Edited by James Fargher
