Fetch GitLab project files and upload to BigQuery
This MR provides a basic Python script to fetch public GitLab project files and upload them to BigQuery.
See issue https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/175 for the requirements used to select projects.
The final dataset consists of about 800K text files. Output table (private GCP project) - link
How to run
poetry install
poetry shell
poetry run promptlib/load_codebase.py
If you want to rerun the script without duplicating data, either change the output table defined in _PROJECT_FILE_BQ_TABLE or put a bq_projects.csv file in the project root folder with the following content:
bq_projects.csv
The Python script uses the bq_projects.csv file to build an index of already processed projects.
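To illustrate the idea of an index of already processed projects, here is a minimal sketch, not the code in this MR. It assumes bq_projects.csv has a header row with a column named `project_id`; that column name and the helper names `load_processed_projects` and `filter_unprocessed` are hypothetical and may not match the actual script.

```python
import csv
from pathlib import Path


def load_processed_projects(csv_path, id_column="project_id"):
    """Return the set of project IDs listed in the CSV file.

    The column name `project_id` is an assumption made for this sketch.
    A missing file is treated as "nothing processed yet".
    """
    path = Path(csv_path)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Skip rows where the ID column is absent or empty.
        return {row[id_column].strip() for row in reader if row.get(id_column)}


def filter_unprocessed(project_ids, processed):
    """Keep only the project IDs that are not in the processed index."""
    return [pid for pid in project_ids if str(pid) not in processed]
```

With an index like this, a rerun could skip any project whose ID is already in the set, so only new projects are fetched and uploaded.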
We can make the Python script generic in a follow-up MR. Note that this is the first step toward building an evaluation prompt pipeline.
Closes https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/175