Expand the raw dataset by additionally checking the number of commits for each repo

Alexander Chueshev requested to merge expand-raw-dataset into main

This MR expands the raw dataset by introducing a new heuristic: the number of commits in a repo. The goal is to grow the dataset while keeping source code quality high (quality is estimated empirically). After this MR is merged, we will rely on the number of stars, watchers, and commits when selecting repos to include in the raw dataset.
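
As a rough illustration of the selection heuristic, here is a minimal Python sketch. It is not the SQL used in this MR. The `RepoStats` type, the threshold values, and the choice to accept a repo when any one metric passes are all assumptions made for the example; the actual scripts may combine the signals differently.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class RepoStats:
    """Hypothetical per-repo metrics; field names are illustrative."""
    name: str
    stars: int
    watchers: int
    commits: int


# Hypothetical thresholds, chosen only for this example.
MIN_STARS = 10
MIN_WATCHERS = 5
MIN_COMMITS = 100


def is_candidate(repo: RepoStats) -> bool:
    # Assumption: a repo qualifies if ANY metric meets its threshold,
    # so adding the commit check can only grow the selected set.
    return (
        repo.stars >= MIN_STARS
        or repo.watchers >= MIN_WATCHERS
        or repo.commits >= MIN_COMMITS
    )


def select_repos(repos: Iterable[RepoStats]) -> List[RepoStats]:
    """Return the repos that pass the heuristic, preserving input order."""
    return [repo for repo in repos if is_candidate(repo)]
```

Under these assumptions, a repo with few stars and watchers but a long commit history is now included, which is how the extra signal would enlarge the raw dataset.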

The unreview-poc-390200e5.gl_code_suggestions.repo_contents_v2 table contains the results of running the updated SQL scripts.

Ref: ai-assist#22 (comment 1312171627)

Edited by Alexander Chueshev