chore(chunker): extract sizer factory
What does this MR do and why?
- chore(chunker): extract sizer factory
Using a factory to build the sizers should allow us to amortize the tokenization cost for each file.
- chore(chunker): add benchmarks
These benchmarks are based on the corresponding benchmarks in gitlab-elasticsearch-indexer.
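One possible reading of the first commit, as a minimal sketch: the factory tokenizes a file once and hands back a sizer that answers every later chunk-size query from the precomputed token offsets. All names here (`Sizer`, `SizerFactory`, `Strategy`, `token_starts`) are hypothetical and not taken from gitlab-code-parser, and the whitespace tokenizer is a stand-in for whatever tokenizer the `code_pre_bert` strategy actually uses.

```rust
use std::ops::Range;

/// Hypothetical sizer interface: reports the size of a byte range of one file.
trait Sizer {
    fn size(&self, range: Range<usize>) -> usize;
}

/// Sizes a range by its length in bytes; needs no per-file state.
struct ByteSizer;

impl Sizer for ByteSizer {
    fn size(&self, range: Range<usize>) -> usize {
        range.len()
    }
}

/// Sizes a range by counting the tokens that start inside it.
/// `token_starts` is sorted because it is built in a single left-to-right pass.
struct TokenSizer {
    token_starts: Vec<usize>,
}

impl Sizer for TokenSizer {
    fn size(&self, range: Range<usize>) -> usize {
        // Two binary searches instead of re-tokenizing the chunk text.
        let lo = self.token_starts.partition_point(|&s| s < range.start);
        let hi = self.token_starts.partition_point(|&s| s < range.end);
        hi.saturating_sub(lo)
    }
}

/// Stand-in tokenizer (assumption): whitespace-separated words.
/// Returns the byte offset at which each token starts.
fn token_starts(content: &str) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut in_token = false;
    for (i, ch) in content.char_indices() {
        if ch.is_whitespace() {
            in_token = false;
        } else if !in_token {
            in_token = true;
            starts.push(i);
        }
    }
    starts
}

/// Hypothetical chunking strategies, mirroring the two sizing modes above.
enum Strategy {
    Bytes,
    Tokens,
}

/// Builds one sizer per file; the tokenization cost is paid once, in `build`.
struct SizerFactory {
    strategy: Strategy,
}

impl SizerFactory {
    fn build(&self, content: &str) -> Box<dyn Sizer> {
        match self.strategy {
            Strategy::Bytes => Box::new(ByteSizer),
            Strategy::Tokens => Box::new(TokenSizer {
                token_starts: token_starts(content),
            }),
        }
    }
}

fn main() {
    let content = "a bb ccc";
    for strategy in [Strategy::Bytes, Strategy::Tokens] {
        let sizer = SizerFactory { strategy }.build(content);
        // Many chunk candidates can be sized against the same file cheaply.
        println!("{} {}", sizer.size(0..content.len()), sizer.size(2..5));
    }
}
```

With this shape, the chunker can probe many candidate chunk boundaries per file while the tokenizer runs only once, which is the kind of saving the `code_pre_bert` benchmark would be expected to reflect.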
Related Issues
Testing
- Checkout this branch and build release version of gitlab-code-parser:
$ git checkout jf_chunker_optimisations
$ cargo build --release
- Copy library output to gitlab-elasticsearch-indexer:
$ cp target/release/libparser_c_bindings.a ../gitlab-elasticsearch-indexer/tmp/libparser/lib/libparser_c_bindings.a
- Remove old binary and build indexer:
$ cd ../gitlab-elasticsearch-indexer
$ rm bin/gitlab-elasticsearch-indexer
$ make
Default chunking strategy:
$ RUST_BACKTRACE=1 GITLAB_INDEXER_MODE=chunk GITLAB_INDEXER_DEBUG_LOGGING=1 time ./bin/gitlab-elasticsearch-indexer -adapter "elasticsearch" -connection '{"url": ["http://localhost:9200"]}' -options '{
"timeout": "30m",
"gitaly_batch_size": 1000,
"from_sha": "",
"to_sha": "",
"project_id": 19,
"partition_name": "gitlab_active_context_code",
"partition_number": 0,
"gitaly_config": {
"address": "unix:/Users/james/src/gitlab-org/gdk/praefect.socket",
"storage": "default",
"relative_path": "@hashed/94/00/9400f1b21cb527d7fa3d3eabba93557a18ebe7a2ca4e471cfe5e4c5b4ca7f767.git",
"project_path": "gitlab-duo/gitlab"
}
}'
...
112.94 real 81.94 user 2.50 sys
code_pre_bert chunking strategy:
$ RUST_BACKTRACE=1 GITLAB_INDEXER_MODE=chunk GITLAB_INDEXER_DEBUG_LOGGING=1 time ./bin/gitlab-elasticsearch-indexer -adapter "elasticsearch" -connection '{"url": ["http://localhost:9200"]}' -options '{
"timeout": "30m",
"chunk_size": 500,
"chunk_strategy": "code_pre_bert",
"gitaly_batch_size": 1000,
"from_sha": "",
"to_sha": "",
"project_id": 19,
"partition_name": "gitlab_active_context_code",
"partition_number": 0,
"gitaly_config": {
"address": "unix:/Users/james/src/gitlab-org/gdk/praefect.socket",
"storage": "default",
"relative_path": "@hashed/94/00/9400f1b21cb527d7fa3d3eabba93557a18ebe7a2ca4e471cfe5e4c5b4ca7f767.git",
"project_path": "gitlab-duo/gitlab"
}
}'
...
116.03 real 103.45 user 3.62 sys
gitlab-elasticsearch-indexer Benchmarks
goos: darwin
goarch: arm64
pkg: gitlab.com/gitlab-org/gitlab-elasticsearch-indexer/internal/mode/chunk/chunker
cpu: Apple M4 Max
                         │ before.txt  │             after.txt              │
                         │   sec/op    │   sec/op     vs base               │
Chunker/size-16            10.91µ ± 1%   10.92µ ± 1%        ~ (p=0.869 n=10)
Chunker/code_bytes-16      6.787m ± 1%   6.723m ± 0%   -0.94% (p=0.000 n=10)
Chunker/code_pre_bert-16   17.36m ± 1%   12.45m ± 1%  -28.29% (p=0.000 n=10)
geomean                    1.087m        970.5µ       -10.73%
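As a sanity check on the table, the `vs base` and `geomean` figures can be approximately reproduced from the rounded `sec/op` values. This is only a sketch: the helper names `pct_change` and `geomean` are made up here, and because the benchmark tool works from the raw samples rather than the rounded numbers shown, the last digit can differ slightly.

```rust
/// Relative change from `before` to `after`, in percent.
fn pct_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Geometric mean, computed through logarithms to avoid tiny products.
fn geomean(values: &[f64]) -> f64 {
    let log_sum: f64 = values.iter().map(|v| v.ln()).sum();
    (log_sum / values.len() as f64).exp()
}

fn main() {
    // sec/op values copied from the table, converted to seconds.
    let before = [10.91e-6, 6.787e-3, 17.36e-3];
    let after = [10.92e-6, 6.723e-3, 12.45e-3];

    // Per-benchmark change for code_pre_bert (table reports -28.29%).
    println!("code_pre_bert: {:.2}%", pct_change(before[2], after[2]));

    // Geometric means (table reports 1.087m and 970.5µ).
    let (gb, ga) = (geomean(&before), geomean(&after));
    println!("geomean before: {:.4e} s", gb);
    println!("geomean after:  {:.4e} s", ga);

    // Overall change (table reports -10.73%).
    println!("geomean change: {:.2}%", pct_change(gb, ga));
}
```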
Performance Analysis
- This merge request does not introduce any performance regression. If a performance regression is expected, explain why.
Edited by James Fargher