Use libmagic for file type detection
We currently rely solely on the file extension, which is prone to both false positives and false negatives. We should try switching to a libmagic binding. There are a few, I think; this one seems to be actively maintained: https://github.com/ahupp/python-magic
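A minimal sketch of what the switch could look like, assuming the python-magic binding linked above and its `magic.from_file(path, mime=True)` call. The names `detect_mime` and `HAVE_MAGIC` are hypothetical, and the fallback to the extension-based guess when libmagic is not installed is an assumption, not something we have decided on.

```python
import mimetypes

try:
    # python-magic (https://github.com/ahupp/python-magic); needs the libmagic shared library.
    import magic
    # Other packages also install a module named "magic" with a different API,
    # so check for the function we actually use.
    HAVE_MAGIC = hasattr(magic, "from_file")
except ImportError:
    HAVE_MAGIC = False


def detect_mime(path):
    """Return a MIME type for path: content-based if libmagic is usable, else by extension."""
    if HAVE_MAGIC:
        return magic.from_file(path, mime=True)
    # Assumed fallback: the current extension-only behaviour.
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or "application/octet-stream"
```

With libmagic available, a shell script saved as `build.txt` or a text file with no extension at all would be classified by its content instead of its name.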
We'll have to see how this performs on big source trees. I think libmagic is rather fast, but I'm not sure how quickly we can check something like 500k files.
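To get a rough number for this, something like the following timing harness could be pointed at a large checkout. It is only a sketch: `time_detection` and `by_extension` are hypothetical names, and the default detector is the extension-based baseline so the script runs without libmagic installed.

```python
import mimetypes
import os
import time


def by_extension(path):
    # Baseline: guess from the file name only, as we do today.
    return mimetypes.guess_type(path)[0]


def time_detection(root, detect=by_extension):
    """Walk root, call detect(path) on every regular file, return (file_count, seconds)."""
    count = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks, sockets, etc.
            detect(path)
            count += 1
    return count, time.perf_counter() - start
```

To compare against libmagic, pass a content-based detector instead, for example `time_detection(root, magic.Magic(mime=True).from_file)` with python-magic installed, and compare the two timings on the same tree.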