
feat(ingest): Rewrite Ruby code in Python

  • Please check this box if this contribution uses AI-generated content (including content generated by GitLab Duo features) as outlined in the GitLab DCO & CLA

What does this merge request do and why?

Rewrite Ruby docs ingestion in Python (#447).

How to set up and validate locally

Validation runs in CI: both the Ruby output and the Python output are produced and can be compared side by side. To validate locally:

  1. Build the image with the scripts and dependencies:
       cd scripts/ingest
       podman build ../.. -f Dockerfile -t gitlab-ingest
  2. Run the test script in the container:
       podman run -it --rm --env-file testparse.env gitlab-ingest scripts/ingest/gitlab-docs/test_parse.sh
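For reviewers who want to compare the two outputs side by side outside CI, a minimal sketch follows. It assumes both outputs are available as text; the function name `diff_outputs` and the `ruby`/`python` labels are hypothetical and are not part of this merge request.

```python
import difflib


def diff_outputs(ruby_text: str, python_text: str) -> list[str]:
    """Return unified-diff lines between two ingestion outputs.

    An empty list means the outputs are identical line by line.
    """
    return list(
        difflib.unified_diff(
            ruby_text.splitlines(),
            python_text.splitlines(),
            fromfile="ruby",    # label only, not a real path
            tofile="python",    # label only, not a real path
            lineterm="",        # input lines carry no trailing newline
        )
    )
```

The function takes strings rather than paths so it can be pointed at whatever files the test script produces, for example by reading each file with `pathlib.Path(...).read_text()` first.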

Questions for maintainers

The main question is: why do embeddings have a minimum size? Because of this minimum size, some content is stripped. To me, cutting off the tails is a critical loss of information. So why does the RAG content splitter do it? (#562)
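To make the concern concrete, here is a small illustrative sketch. It is not the Ruby splitter or the code in this merge request: it assumes a naive fixed-width splitter, and the function names and the `max_size`/`min_size` parameters are hypothetical. The first function drops any chunk shorter than the minimum, which loses the tail; the second merges a short tail into the previous chunk instead.

```python
def split_dropping_short(text: str, max_size: int, min_size: int) -> list[str]:
    """Fixed-width split that discards chunks shorter than min_size."""
    chunks = [text[i:i + max_size] for i in range(0, len(text), max_size)]
    return [c for c in chunks if len(c) >= min_size]


def split_merging_tail(text: str, max_size: int, min_size: int) -> list[str]:
    """Fixed-width split that appends a short final chunk to the previous one."""
    chunks = [text[i:i + max_size] for i in range(0, len(text), max_size)]
    if len(chunks) > 1 and len(chunks[-1]) < min_size:
        tail = chunks.pop()
        chunks[-1] += tail  # the last chunk may now exceed max_size
    return chunks


text = "abcdefghijkl"  # 12 characters
dropped = split_dropping_short(text, max_size=5, min_size=3)
merged = split_merging_tail(text, max_size=5, min_size=3)
```

With these inputs, `dropped` no longer contains the trailing `"kl"`, while joining `merged` reproduces the original text. The trade-off in the merging variant is that the final chunk can grow past `max_size`, which may or may not be acceptable for the embedding model.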

I am marking this as ready because, after months of trying, I was unable to work out the Ruby chunking logic on my own. Hopefully somebody can continue from here.

Merge request checklist

  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.
Edited by Anatoli Babenia
