Min chars per embedding?
Problem to solve
The gitlab-docs ingestion script skips documents shorter than 100 chars, as specified by MIN_CHARS_PER_EMBEDDING here.
Maybe this is a dumb question, but for rewriting the ingestion parser in Python (!874), I want to know what the consequences of removing this limitation are.
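To make the question concrete, here is a minimal Python sketch of such a length filter. The function name `should_embed` and the use of `len()` on the raw string are assumptions for illustration. Only the constant name and the 100-char threshold come from the existing script, and the Ruby code may measure length differently.

```python
# Hypothetical sketch; not the actual gitlab-docs ingestion code.
MIN_CHARS_PER_EMBEDDING = 100  # threshold named in the existing script


def should_embed(content: str, min_chars: int = MIN_CHARS_PER_EMBEDDING) -> bool:
    """Return True if the content is long enough to be embedded.

    Assumption: length is measured with len() on the raw string;
    the Ruby script may strip whitespace or count bytes instead.
    """
    return len(content) >= min_chars


# Made-up documents, used only to exercise the filter.
docs = {
    "short.md": "A short index page.",
    "long.md": "x" * 150,
}
kept = [name for name, text in docs.items() if should_embed(text)]
print(kept)  # ['long.md']
```

In this sketch, removing the limitation amounts to calling `should_embed(text, min_chars=0)`, which would let very short pages through to the embedding step.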
Files with content under 100 chars.
They are missing from the generated JSONL file (available as an ingest-job artifact).
$ jq .metadata.source < ../testdata/docs-v17.0.1-ee.jsonl | grep patterns/index.md
$ jq .metadata.source < ../testdata/docs-v17.0.1-ee.jsonl | grep science/index.md
$ echo $?
1
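The same check could be sketched in Python, which may be handy in tests for the rewritten parser. The helper name `sources_matching` is hypothetical. The only thing assumed about the file is what the jq filter above relies on: one JSON object per line with a `metadata.source` field.

```python
import json
from typing import Iterable, List


def sources_matching(lines: Iterable[str], needle: str) -> List[str]:
    """Return every metadata.source value that contains `needle`."""
    matches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Treat a missing metadata/source field as an empty string.
        source = (record.get("metadata") or {}).get("source") or ""
        if needle in source:
            matches.append(source)
    return matches


# Made-up sample records, not taken from the real artifact.
sample = [
    '{"content": "...", "metadata": {"source": "ci/index.md"}}',
    '{"content": "...", "metadata": {"source": "user/index.md"}}',
]
print(sources_matching(sample, "science/index.md"))  # []
```

For a real artifact, the file object can be passed directly, for example `sources_matching(open(path, encoding="utf-8"), "patterns/index.md")`; an empty list corresponds to grep exiting with status 1.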
Files with content above 100 chars.
-
Check what the Ruby parser does when the content tail after the last chunk is less than 100 chars.
Links / references
Edited by Anatoli Babenia