Fetch GitLab project files and upload to BigQuery

Alexander Chueshev requested to merge upload-codebase-to-bq into main

This MR provides a basic Python script to fetch public GitLab project files and upload them to BigQuery.

See issue https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/175 for the requirements used to select projects.

The final dataset consists of about 800K text files. Output table (private GCP project) - link

How to run

  1. poetry install
  2. poetry shell
  3. poetry run promptlib/load_codebase.py

To rerun the script without duplicating rows, either change the output table defined in _PROJECT_FILE_BQ_TABLE or put a bq_projects.csv file in the project root folder with the following content: bq_projects.csv. The Python script uses the bq_projects.csv file to build an index of already processed projects and skips them.
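
The deduplication step based on bq_projects.csv could look roughly like the sketch below. This is an illustration, not the code in this MR: the column name project_id and the helper names load_processed_projects and filter_unprocessed are assumptions, since the layout of bq_projects.csv is not shown here.

```python
import csv
from pathlib import Path


def load_processed_projects(csv_path="bq_projects.csv", column="project_id"):
    """Return the set of project identifiers listed in the CSV file.

    The column name "project_id" is a hypothetical choice; adjust it to
    match the real header of bq_projects.csv.
    """
    path = Path(csv_path)
    if not path.exists():
        # No index file: treat every project as not yet processed.
        return set()
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Skip rows where the column is missing or empty.
        return {row[column].strip() for row in reader if row.get(column)}


def filter_unprocessed(candidates, processed):
    """Keep only the candidate project ids that are not in the index."""
    return [project for project in candidates if str(project) not in processed]
```

With such an index, the script would only fetch and upload projects returned by filter_unprocessed, so a rerun against the same output table does not insert the same files twice.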

The Python script can be made generic in a follow-up MR. Note that this is the first step in building an evaluation prompt pipeline.

Closes https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/175