Initial implementation
We need (and have in !1 (merged)):
- Feature parity with elastic_repo_indexer
- Programming language detection is more primitive
- Encoding detection / conversion is on par
- Need to check our behaviour with various repo edge cases. index_blobs in the Ruby version has lots of code we don't have.
- Proof of performance improvements
- Indexing the v4.4 repository takes 1/4 the time
- Indexing the
- Proof of reduced CPU, IO and memory pressure
- Memory usage is currently broadly comparable :/ - but RAM MiB-seconds is significantly reduced
- Need to check I/O patterns
- Need to check CPU usage
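To make the "RAM MiB-seconds" figure concrete, here is a minimal sketch of how it could be computed from periodic memory samples. The `Sample` type, its fields and `MiBSeconds` are hypothetical names, and the numbers in `main` are made up for illustration, not measurements from either indexer.

```go
package main

import (
	"fmt"
	"time"
)

// Sample is one observation of resident memory during a run.
// The type and its field names are hypothetical.
type Sample struct {
	At     time.Duration // time since the indexer started
	RSSMiB float64       // resident set size in MiB at that moment
}

// MiBSeconds integrates memory over time with the trapezoidal rule,
// returning the area under the RSS curve in MiB-seconds.
func MiBSeconds(samples []Sample) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		dt := (samples[i].At - samples[i-1].At).Seconds()
		avg := (samples[i].RSSMiB + samples[i-1].RSSMiB) / 2
		total += avg * dt
	}
	return total
}

func main() {
	// Made-up runs: both hold about 100 MiB, but one finishes in a quarter of the time.
	slow := []Sample{{0, 100}, {40 * time.Second, 100}}
	fast := []Sample{{0, 100}, {10 * time.Second, 100}}
	fmt.Println(MiBSeconds(slow), MiBSeconds(fast))
}
```

With comparable peak memory, a run that takes 1/4 the time occupies roughly 1/4 the MiB-seconds, which is why that metric can drop even when memory usage itself does not.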
We have two choices of Git backend:
- https://github.com/src-d/go-git: https://gitlab.com/gitlab-org/es-git-go/merge_requests/1
- libgit2 bindings (git2go, like rugged): https://gitlab.com/gitlab-org/es-git-go/merge_requests/2
The latter is more difficult to build against, but a little faster. Since we're likely moving to Gitaly to pull repo data at some point (#4 (closed)), it makes sense to use the first, but I'll keep !2 (closed) up to date.
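One way to keep the backend choice cheap to revisit is to hide it behind a small interface, so go-git, git2go or a later Gitaly client could be swapped in without touching the indexing code. This is only a sketch: `Repository`, `Blob`, `EachBlob` and `CountBlobs` are hypothetical names, not the API of either merge request, and `memRepo` is an in-memory stand-in rather than a real Git backend.

```go
package main

import (
	"errors"
	"fmt"
)

// Blob is the minimal data an indexer needs for one file. Hypothetical type.
type Blob struct {
	Path string
	OID  string
	Data []byte
}

// Repository is a hypothetical abstraction over the Git backend.
// Each real backend would provide its own implementation.
type Repository interface {
	// EachBlob calls fn once per blob reachable from the given revision.
	EachBlob(rev string, fn func(Blob) error) error
}

// memRepo is an in-memory stand-in used only to exercise the interface.
type memRepo struct {
	revs map[string][]Blob
}

func (m memRepo) EachBlob(rev string, fn func(Blob) error) error {
	blobs, ok := m.revs[rev]
	if !ok {
		return errors.New("unknown revision: " + rev)
	}
	for _, b := range blobs {
		if err := fn(b); err != nil {
			return err
		}
	}
	return nil
}

// CountBlobs depends only on the interface, not on a concrete backend.
func CountBlobs(r Repository, rev string) (int, error) {
	n := 0
	err := r.EachBlob(rev, func(Blob) error { n++; return nil })
	return n, err
}

func main() {
	repo := memRepo{revs: map[string][]Blob{
		"HEAD": {{Path: "README.md", OID: "a1"}, {Path: "main.go", OID: "b2"}},
	}}
	n, err := CountBlobs(repo, "HEAD")
	fmt.Println(n, err)
}
```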
We have two choices of encode-to-UTF8 support:
- github.com/saintfish/chardet is in !1 (merged) and is a pure-Go implementation of some of libicu.
- icu4c: !3 (merged)
Again, the latter needs cgo, but is a bit faster. icu4c is also more universally available. The former is buggy and needs us to also pull in pure-Go text conversion from golang.org/x/text. I don't like it much.
So, let's use !3 (merged).
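For context, the overall shape of the encode-to-UTF8 step is "detect, then convert". The sketch below is a deliberately naive standard-library stand-in, not chardet and not ICU: it keeps input that is already valid UTF-8 and otherwise assumes ISO-8859-1, which is an assumption made purely for illustration. `ToUTF8` is a hypothetical name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// ToUTF8 returns the input as a UTF-8 string.
// Valid UTF-8 is passed through unchanged. Anything else is assumed to be
// ISO-8859-1 (Latin-1), where each byte maps directly to the Unicode code
// point with the same value. A real detector would consider many more encodings.
func ToUTF8(b []byte) string {
	if utf8.Valid(b) {
		return string(b)
	}
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	latin1 := []byte{'c', 'a', 'f', 0xe9} // "café" encoded as ISO-8859-1
	fmt.Println(ToUTF8(latin1))
}
```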