Fetch GitLab project files and upload to BigQuery

Alexander Chueshev requested to merge upload-codebase-to-bq into main

This MR provides a basic Python script to fetch public GitLab project files and upload them to BigQuery.

Please check the issue for the requirements used to select projects.

The final dataset consists of about 800K text files. Output table (private GCP project) - link

How to run

  1. poetry install
  2. poetry shell
  3. poetry run promptlib/

If you want to rerun the script without duplicating rows, either change the output table defined in _PROJECT_FILE_BQ_TABLE or put a bq_projects.csv file in the project root folder with the following content: bq_projects.csv. The Python script uses the bq_projects.csv file to build an index of already processed projects.
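As a rough illustration of the deduplication idea, the sketch below builds an index of already processed projects from a CSV file and filters them out of a candidate list. It is not the script's actual code: the function names and the `project_id` column header are assumptions, since the layout of bq_projects.csv is not shown here.

```python
import csv
from pathlib import Path


def load_processed_projects(csv_path, id_column="project_id"):
    """Return the set of project IDs listed in the CSV.

    The column name "project_id" is an assumption for illustration;
    adjust it to match the real header of bq_projects.csv.
    """
    path = Path(csv_path)
    if not path.exists():
        # No index file: treat every project as unprocessed.
        return set()
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Skip rows where the ID column is missing or empty.
        return {row[id_column] for row in reader if row.get(id_column)}


def filter_unprocessed(project_ids, processed):
    """Keep only the project IDs that are not in the processed index."""
    # CSV values are strings, so compare on the string form of each ID.
    return [pid for pid in project_ids if str(pid) not in processed]
```

With an index like this, a rerun could call `filter_unprocessed(candidates, load_processed_projects("bq_projects.csv"))` before fetching any files, so projects that are already in the output table are skipped.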

We can make the Python script generic in a follow-up MR. Please note that this is the first step in building an evaluation prompt pipeline.

