Initial implementation

We need (and have in !1 (merged)):

  • Feature parity with elastic_repo_indexer
    • Programming language detection is more primitive
    • Encoding detection / conversion is on par
    • Need to check our behaviour with various repo edge cases. index_blobs in the Ruby version has lots of code we don't have.
  • Proof of performance improvements
    • Indexing the v4.4 repository takes 1/4 the time
  • Proof of reduced CPU, IO and memory pressure
    • Memory usage is currently broadly comparable :/ - but RAM MiB-seconds is significantly reduced
    • Need to check I/O patterns
    • Need to check CPU usage
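To make the "RAM MiB-seconds" figure concrete, here is a minimal sketch of how it could be computed from periodic memory samples. The `Sample` type, the `MiBSeconds` function and the numbers in `main` are all made up for illustration; they are not measurements from either indexer.

```go
package main

import "fmt"

// Sample is a hypothetical memory reading: resident memory in MiB
// observed a given number of seconds after the indexer started.
type Sample struct {
	AtSeconds float64
	RSSMiB    float64
}

// MiBSeconds integrates memory over time with the trapezoidal rule,
// so a run that holds the same memory for less time scores lower.
func MiBSeconds(samples []Sample) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		dt := samples[i].AtSeconds - samples[i-1].AtSeconds
		avg := (samples[i].RSSMiB + samples[i-1].RSSMiB) / 2
		total += avg * dt
	}
	return total
}

func main() {
	// Illustrative figures only: identical memory usage, but the
	// second run finishes in a quarter of the time.
	slow := []Sample{{0, 100}, {40, 100}}
	fast := []Sample{{0, 100}, {10, 100}}
	fmt.Printf("slow: %.0f MiB-seconds\n", MiBSeconds(slow))
	fmt.Printf("fast: %.0f MiB-seconds\n", MiBSeconds(fast))
}
```

This shows why comparable memory usage can still mean less memory pressure overall: the footprint is held for a shorter time.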

We have two choices of Git backend:

The latter is more difficult to build against, but a little faster. Since we're likely moving to Gitaly to pull repo data at some point (#4 (closed)), it makes sense to use the first, but I'll keep !2 (closed) up to date.

We have two choices of encode-to-UTF8 support:

  • github.com/saintfish/chardet is in !1 (merged) and is a pure-Go implementation of some of libicu.
  • icu4c: !3 (merged)

Again, the latter needs cgo, but is a bit faster. It's also more universally available. The former is buggy and needs us to also pull in pure-Go text conversion from golang.org/x/text. I don't like it much.

So, let's use !3 (merged).
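For readers unfamiliar with the encode-to-UTF8 step, here is a minimal standard-library sketch of the general shape: check whether a blob is already valid UTF-8, and convert it if not. It uses neither chardet nor icu4c. Real detection chooses among many encodings, whereas this hypothetical `ToUTF8` helper simply assumes ISO-8859-1 for anything that is not valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// ToUTF8 is a hypothetical helper, not part of either merge request.
// It returns the blob unchanged if it is already valid UTF-8.
// Otherwise it ASSUMES the blob is ISO-8859-1 and converts it; a real
// detector would guess the source encoding instead of assuming one.
func ToUTF8(blob []byte) (text string, converted bool) {
	if utf8.Valid(blob) {
		return string(blob), false
	}
	// ISO-8859-1 maps each byte to the Unicode code point with the
	// same numeric value, so the conversion is a direct widening.
	runes := make([]rune, len(blob))
	for i, b := range blob {
		runes[i] = rune(b)
	}
	return string(runes), true
}

func main() {
	latin1 := []byte{'c', 'a', 'f', 0xE9} // "café" encoded as ISO-8859-1
	text, converted := ToUTF8(latin1)
	fmt.Println(text, converted)
}
```

The single-encoding fallback is only there to keep the sketch dependency-free; it will mis-decode blobs in any other legacy encoding, which is the gap a detection library is meant to close.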