Add additional preprocessing to the DF pipeline (!7) · Merge requests · GitLab.org / ModelOps / AI Assisted (formerly Applied ML) / Code Suggestions / Model Development

Alexander Chueshev requested to merge preprocess-dataset-v2 into main Mar 14, 2023

This MR adds the following jobs to the DF preprocessing pipeline:

Cleaning copyrights. The job inspects the first 100 rows and the first 10 comments to identify copyright notices by keywords, e.g., license.
Cleaning comment decorations such as ---- or ===, which developers often use.
Removing empty lines when there are 3 or more empty lines in a row.
Filtering out files that contain mostly hexadecimal values.
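
The four jobs above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this MR: all function names, the `#` comment prefix, the `copyright` keyword, the decoration pattern, and the 0.5 hexadecimal threshold are hypothetical. Only the `license` keyword, the 100-row and 10-comment limits, and the 3-empty-line rule come from the description. The sketch also assumes that a run of 3 or more empty lines is dropped entirely rather than collapsed to a shorter run.

```python
import re
from itertools import groupby

# From the MR description: "license" keyword, 100 rows, 10 comments, 3 empty lines.
# Everything else below (names, "copyright" keyword, thresholds) is hypothetical.
COPYRIGHT_KEYWORDS = ("copyright", "license")
MAX_ROWS = 100
MAX_COMMENTS = 10
DECORATION = re.compile(r"[-=]{3,}")          # assumed: runs of 3+ '-' or '='
HEX_TOKEN = re.compile(r"(?:0[xX])?[0-9a-fA-F]+")


def is_comment(line: str) -> bool:
    # Assumption: only '#'-style line comments are considered.
    return line.lstrip().startswith("#")


def strip_copyright_comments(text: str) -> str:
    """Drop keyword-bearing comment lines within the first rows/comments."""
    kept, comments_seen = [], 0
    for row, line in enumerate(text.split("\n")):
        comment = is_comment(line)
        if comment:
            comments_seen += 1
        in_scope = row < MAX_ROWS and comments_seen <= MAX_COMMENTS
        if comment and in_scope and any(k in line.lower() for k in COPYRIGHT_KEYWORDS):
            continue
        kept.append(line)
    return "\n".join(kept)


def strip_decorations(text: str) -> str:
    """Remove decoration runs such as ---- or === from comment lines only."""
    return "\n".join(
        DECORATION.sub("", line).rstrip() if is_comment(line) else line
        for line in text.split("\n")
    )


def drop_long_blank_runs(text: str, min_run: int = 3) -> str:
    """Drop every run of `min_run` or more consecutive empty lines."""
    kept = []
    for blank, group in groupby(text.split("\n"), key=lambda l: l.strip() == ""):
        lines = list(group)
        if blank and len(lines) >= min_run:
            continue
        kept.extend(lines)
    return "\n".join(kept)


def is_mostly_hex(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: share of hex-looking tokens (containing a digit) exceeds threshold."""
    tokens = re.findall(r"\w+", text)
    if not tokens:
        return False
    hex_count = sum(
        1 for t in tokens
        if HEX_TOKEN.fullmatch(t) and any(c.isdigit() for c in t)
    )
    return hex_count / len(tokens) > threshold
```

Requiring a digit in each hexadecimal token keeps ordinary identifiers such as `def` or `add` from being counted. Plain decimal numbers still count as hexadecimal here, so a real filter would likely need a stricter rule.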

Edited Mar 14, 2023 by Alexander Chueshev
