Misnamed .nxml files that are not named by their PMCID
I noticed that PMC3568465.nxml is not in download/pmc-articles-xml.zip.
Running:
import pathlib
import zipfile

zip_path = pathlib.Path('download/pmc-articles-xml.zip')
zip_file = zipfile.ZipFile(zip_path)
pmcid = 'PMC3568465'
root = utils.read_article(zip_file, pmcid + '.nxml')
results in
KeyError: "There is no item named 'PMC3568465.nxml' in the archive"
I got the ID PMC3568465 from data/article-cite-styles.tsv.gz. So how did we process that article without PMC3568465.nxml existing in pmc-articles-xml.zip? My guess is that this file is named differently in the archive. It's even possible that its name is a completely different PMCID from the one the XML article-meta gives.
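One way to test this guess would be to scan the archive and compare each member's filename with the PMCID recorded inside its XML. The following is only a rough sketch under assumptions I have not verified against our archive: that each .nxml file carries an article-id element with pub-id-type="pmc", that its value may or may not include the "PMC" prefix, and that the files parse with the standard library's ElementTree. The helper names normalize_pmcid and find_misnamed_members are hypothetical and not part of utils.

```python
import pathlib
import xml.etree.ElementTree as ET
import zipfile


def normalize_pmcid(value):
    """Return a PMCID in 'PMC1234567' form, whether or not the prefix is present."""
    value = value.strip().upper()
    return value if value.startswith('PMC') else 'PMC' + value


def find_misnamed_members(zip_file):
    """Map archive member name -> PMCID found in its XML, for members where they differ."""
    misnamed = {}
    for name in zip_file.namelist():
        if not name.endswith('.nxml'):
            continue
        with zip_file.open(name) as handle:
            try:
                root = ET.parse(handle).getroot()
            except ET.ParseError:
                continue  # assumption: unparseable files are skipped, not reported
        # Assumption: the first article-id with pub-id-type="pmc" is the article's own ID.
        for elem in root.iter('article-id'):
            if elem.get('pub-id-type') == 'pmc' and elem.text:
                pmcid = normalize_pmcid(elem.text)
                if pathlib.PurePosixPath(name).stem != pmcid:
                    misnamed[name] = pmcid
                break
    return misnamed


# Possible usage against our archive (path taken from the snippet above):
# with zipfile.ZipFile('download/pmc-articles-xml.zip') as zip_file:
#     for name, pmcid in find_misnamed_members(zip_file).items():
#         print(name, pmcid)
```

If PMC3568465 shows up as a value in the returned mapping, the corresponding key would be the archive member that actually holds that article.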
Edited by Daniel Himmelstein