Add additional preprocessing to the DF pipeline
This MR adds additional jobs to the DF preprocessing pipeline:
- cleaning copyrights: the job inspects the first 100 rows and 10 comments of a file to identify copyright notices by keywords, e.g., `license`
- cleaning comment decorations like `----` or `===` often used by developers
- remove empty lines if there are 3 or more empty lines in a row
- filtering files that contain mostly hexadecimal values
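
For reviewers, here is a minimal Python sketch of how the jobs listed above could behave on a single file's text. It is an illustration under stated assumptions, not the code in this MR. The function names, the keyword list, the comment prefixes (`#`, `//`), the 0.5 hex threshold, and the choice to collapse a run of blank lines to one line rather than drop it are all hypothetical. The reading of "first 100 rows and 10 comments" as "comment lines among the first 100 rows, up to 10 of them" is also an assumption.

```python
import re

# All constants below are assumptions; the MR only names "license" as a keyword.
COPYRIGHT_KEYWORDS = ("copyright", "license")
COMMENT_PREFIXES = ("#", "//")
MAX_HEADER_LINES = 100
MAX_HEADER_COMMENTS = 10
DECORATION_RE = re.compile(r"-{3,}|={3,}")


def strip_copyright_comments(text: str) -> str:
    """Drop keyword-bearing comment lines found near the top of the file."""
    kept, comments_seen = [], 0
    for index, line in enumerate(text.split("\n")):
        is_comment = line.lstrip().startswith(COMMENT_PREFIXES)
        in_scope = index < MAX_HEADER_LINES and comments_seen < MAX_HEADER_COMMENTS
        if is_comment and in_scope:
            comments_seen += 1
            if any(keyword in line.lower() for keyword in COPYRIGHT_KEYWORDS):
                continue  # skip the copyright/license line
        kept.append(line)
    return "\n".join(kept)


def strip_comment_decorations(text: str) -> str:
    """Remove runs like ---- or === from comment lines only."""
    cleaned = []
    for line in text.split("\n"):
        if line.lstrip().startswith(COMMENT_PREFIXES):
            line = DECORATION_RE.sub("", line).rstrip()
        cleaned.append(line)
    return "\n".join(cleaned)


def collapse_blank_runs(text: str, min_run: int = 3) -> str:
    """Collapse runs of min_run or more blank lines to a single blank line."""
    out, run = [], 0
    for line in text.split("\n"):
        if line.strip() == "":
            run += 1
            continue
        out.extend([""] * (1 if run >= min_run else run))
        run = 0
        out.append(line)
    out.extend([""] * (1 if run >= min_run else run))  # trailing blank lines
    return "\n".join(out)


def _looks_hex(token: str) -> bool:
    # 0x-prefixed literals, or bare hex tokens that contain at least one digit
    # (so ordinary words such as "def" are not counted).
    if re.fullmatch(r"0[xX][0-9a-fA-F]+", token):
        return True
    return bool(re.fullmatch(r"[0-9a-fA-F]{2,}", token)) and any(c.isdigit() for c in token)


def is_mostly_hex(text: str, threshold: float = 0.5) -> bool:
    """Return True when more than `threshold` of the tokens look hexadecimal."""
    tokens = re.findall(r"[0-9A-Za-z_]+", text)
    if not tokens:
        return False
    return sum(_looks_hex(t) for t in tokens) / len(tokens) > threshold


def preprocess(text: str):
    """Return the cleaned text, or None when the file should be filtered out."""
    if is_mostly_hex(text):
        return None
    text = strip_copyright_comments(text)
    text = strip_comment_decorations(text)
    return collapse_blank_runs(text)
```

In this sketch the hex filter runs first so that files which would be discarded anyway are not cleaned. The actual job order in the pipeline may differ.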
Edited by Alexander Chueshev