For files with .vcf extension, check file contents for 'true' VCF-ness.

Jules Kerssemakers requested to merge taste-vcf into master

Hi @vinjana ! After our chat today, I felt inspired to come back to my beloved project.

Here is the basic implementation to distinguish 'true' VCF from the DKFZ quasi-VCF.

  • if the file has VCF extension ..
  • .. additionally open the file to check for VCF-spec header.
    • if not found, it's a DKFZ quasi-VCF, skip it
    • If we can't read it, log the read error and skip it.
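
For illustration, the check described above could be sketched roughly like this. This is not the code from this MR: the helper name `taste_vcf` and its three return values are made up for the example. It relies only on the VCF spec's requirement that the first line is a `##fileformat=VCFv…` header, and it assumes plain-text (uncompressed) `.vcf` files.

```perl
use strict;
use warnings;

# Hypothetical helper (name and return values are assumptions, not from this MR).
# Returns:
#   'vcf'        - first line is a VCF-spec "##fileformat=VCFv<major>.<minor>" header
#   'quasi-vcf'  - file is readable but lacks that header (treated as DKFZ quasi-VCF)
#   'unreadable' - file could not be opened; the error is logged via warn
sub taste_vcf {
    my ($path) = @_;

    open(my $fh, '<', $path) or do {
        warn "cannot read '$path': $!\n";
        return 'unreadable';
    };

    # Only the first line is read, to keep the extra I/O per file small.
    my $first_line = <$fh>;
    close($fh);

    return (defined $first_line && $first_line =~ /^##fileformat=VCFv\d+\.\d+/)
        ? 'vcf'
        : 'quasi-vcf';
}
```

The crawler would then keep the file only when the result is `'vcf'` and skip it in the other two cases.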

The file-opening is a violation of an old design principle. Before, I always insisted on using only stat-level info (file path, permissions, modification time, etc.) to avoid excess filesystem operations.
This is the first time we're actually opening files in the data folders, which will probably affect performance and crawl duration, and will be noisy for the file-access audit log.

Additionally, I'm not sure if this compiles and/or runs, since I don't have a Perl install with all the dependencies available, and I don't have a test data set.

You should be able to run cd test; ./test-crawl.sh on the crawler-server (or locally, if you install the used deps). I haven't added a test case for VCF files yet 👼 😟

I probably won't have much time for this MR, with the baby and the new job, so please take this code and run with it 😄

Legal stuff: This contribution happens as a private person in my free time. I contribute this under the MIT License.