Min chars per embedding?

Problem to solve

The gitlab-docs ingestion script skips documents shorter than 100 characters, as specified by MIN_CHARS_PER_EMBEDDING here.

This may be a dumb question, but while rewriting the ingestion parser in Python (!874), I want to know: what are the consequences of removing this limitation?
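To make the question concrete, here is a minimal sketch of the kind of filter being discussed. It is not the actual Ruby or Python parser code: the function name `select_documents` and the choice to measure the raw content length (no whitespace stripping) are assumptions for illustration; only the 100-character threshold and the name MIN_CHARS_PER_EMBEDDING come from the script.

```python
# Threshold named in the issue; everything else in this sketch is hypothetical.
MIN_CHARS_PER_EMBEDDING = 100


def select_documents(docs, min_chars=MIN_CHARS_PER_EMBEDDING):
    """Split a {source_path: content} mapping into kept and skipped paths.

    A document is skipped when its content has fewer than `min_chars`
    characters. Assumption: length is measured on the raw content.
    """
    kept, skipped = [], []
    for source, content in docs.items():
        if len(content) < min_chars:
            skipped.append(source)
        else:
            kept.append(source)
    return kept, skipped
```

Calling `select_documents(docs, min_chars=0)` models removing the limitation: every document is kept, including very short index pages.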

Files with content under 100 chars.

architecture/blueprints/database/scalability/patterns/index.md
development/data_science/index.md

Both are missing from the generated JSONL file (available as an ingest-job artifact). The greps below print nothing, and the last one exits with status 1.

$ jq .metadata.source < ../testdata/docs-v17.0.1-ee.jsonl | grep patterns/index.md
$ jq .metadata.source < ../testdata/docs-v17.0.1-ee.jsonl | grep science/index.md
$ echo $?
1
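The same check could be scripted for the Python rewrite, for example as a regression test. This is a rough sketch: the helper name `missing_sources` is hypothetical, and it assumes one JSON object per line with a `metadata.source` field, as the jq filter above implies.

```python
import json


def missing_sources(jsonl_lines, fragments):
    """Return the path fragments that match no record's metadata.source.

    `jsonl_lines` is any iterable of JSONL lines (an open file works).
    Matching is by substring, mirroring the grep calls above.
    """
    sources = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        source = (record.get("metadata") or {}).get("source")
        if source is not None:
            sources.append(source)
    return [f for f in fragments if not any(f in s for s in sources)]


# Example usage against the artifact from the issue:
# with open("../testdata/docs-v17.0.1-ee.jsonl") as fh:
#     print(missing_sources(fh, ["patterns/index.md", "science/index.md"]))
```

If both fragments come back in the returned list, the two short files were skipped during ingestion.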

Files with content above 100 chars.

To do: check what the Ruby parser does when the content tail after the last chunk is less than 100 chars.
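The open question can be illustrated with a small sketch. It does not describe what the Ruby parser actually does, which is still to be checked. It assumes plain fixed-size character chunking (the `chunk_size` parameter and the `chunk_text` name are hypothetical) and compares two possible policies for chunks below the threshold.

```python
def chunk_text(text, chunk_size, min_chars=100, keep_short=False):
    """Split `text` into fixed-size character chunks.

    With keep_short=False, any chunk shorter than `min_chars` is dropped,
    which for long documents means the short tail is lost.
    With keep_short=True, every chunk is kept, including the tail.
    Both policies are assumptions, not the confirmed parser behaviour.
    """
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if keep_short:
        return chunks
    return [c for c in chunks if len(c) >= min_chars]


# A 250-character document with 100-character chunks leaves a 50-character tail.
text = "a" * 250
dropped_tail = chunk_text(text, chunk_size=100)
kept_tail = chunk_text(text, chunk_size=100, keep_short=True)
print(len(dropped_tail), len(kept_tail))  # prints "2 3"
```

Under the dropping policy, the last 50 characters of the document never reach the embeddings, so the limitation would affect long files as well as short ones.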

Edited Sep 19, 2024 by Anatoli Babenia