Add additional preprocessing to the DF pipeline

Alexander Chueshev requested to merge preprocess-dataset-v2 into main

This MR adds additional jobs to the DF preprocessing pipeline:

  • cleaning copyrights: the job scans the first 100 rows and 10 comments to identify copyright notices by keywords, e.g., license
  • cleaning comment decorations such as ---- or ===, which developers often use
  • removing empty lines when 3 or more occur in a row
  • filtering out files that contain mostly hexadecimal values
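
For reviewers who want a concrete picture of the four jobs, here is a minimal Python sketch. It is not the code in this MR. All function names, the keyword list, the comment prefixes (`#`, `//`), the 0.5 hex threshold, and the choice to collapse a run of empty lines into a single empty line are assumptions made for illustration. The copyright step reads "first 100 rows and 10 comments" as "the first 10 comment lines within the first 100 lines", which is also an assumption.

```python
import re

# Hypothetical keyword list; the MR description only names "license" as an example.
COPYRIGHT_KEYWORDS = ("copyright", "license")
# Assumed single-line comment prefixes; a real pipeline would be language-aware.
COMMENT_PREFIXES = ("#", "//")

DECORATION_RE = re.compile(r"[-=]{3,}")                 # runs such as ---- or ===
HEX_TOKEN_RE = re.compile(r"(?:0[xX])?[0-9a-fA-F]{2,}")  # e.g. 0xDE, 1f, DEADBEEF


def is_comment(line):
    return line.strip().startswith(COMMENT_PREFIXES)


def remove_copyright_comments(text, max_lines=100, max_comments=10):
    """Drop early comment lines that mention a copyright keyword."""
    kept, comments_seen = [], 0
    for i, line in enumerate(text.split("\n")):
        if i < max_lines and comments_seen < max_comments and is_comment(line):
            comments_seen += 1
            if any(k in line.lower() for k in COPYRIGHT_KEYWORDS):
                continue  # skip the copyright/license comment
        kept.append(line)
    return "\n".join(kept)


def strip_decorations(text):
    """Remove decoration runs from comment lines only, leaving code untouched."""
    out = []
    for line in text.split("\n"):
        if is_comment(line):
            line = DECORATION_RE.sub("", line).rstrip()
        out.append(line)
    return "\n".join(out)


def collapse_blank_lines(text, min_run=3):
    """Replace each run of min_run or more empty lines with one empty line."""
    out, run = [], []
    for line in text.split("\n"):
        if line.strip() == "":
            run.append(line)
            continue
        out.extend([""] if len(run) >= min_run else run)
        run = []
        out.append(line)
    out.extend([""] if len(run) >= min_run else run)
    return "\n".join(out)


def is_mostly_hex(text, threshold=0.5):
    """True when more than `threshold` of the alphanumeric tokens look like hex."""
    tokens = re.findall(r"[0-9A-Za-z_]+", text)
    if not tokens:
        return False
    hex_count = sum(1 for t in tokens if HEX_TOKEN_RE.fullmatch(t))
    return hex_count / len(tokens) > threshold
```

In this sketch a file would first be checked with `is_mostly_hex` and dropped if it returns `True`; the three cleaning functions would then be applied in sequence to the remaining files. Restricting `strip_decorations` to comment lines avoids rewriting operators such as `===` in code.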

Ref: ai-assist#22 (closed)

Edited by Alexander Chueshev