Commit 53ea6a98 authored by Daniel Himmelstein

07.rephetio-stats.ipynb: analyze rephetio cite stats

parent afbcbe68
Pipeline #18701092 passed with stage
in 6 minutes and 38 seconds
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Stats on references from Project Rephetio\n",
"\n",
"The Project Rephetio manuscript is available from many places:\n",
"\n",
"+ [eLife](https://elifesciences.org/articles/26726#references) (version of record) DOI: 10.7554/eLife.26726\n",
"+ [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5640425/) (deposited by eLife) PMC5640425\n",
"+ [Manubot](http://git.dhimmel.com/rephetio-manuscript/) (source on [GitHub](https://github.com/dhimmel/rephetio-manuscript))\n",
"+ [Thinklab](https://think-lab.github.io/p/rephetio/report/) (defunct science discussion platform) DOI: 10.15363/thinklab.a7\n",
"+ [bioRxiv](https://www.biorxiv.org/content/early/2017/08/31/087619) (preprint) DOI: 10.1101/087619\n",
"\n",
"The Manubot, Thinklab, and bioRxiv versions use numeric-style citations. The Thinklab and bioRxiv versions lag behind the eLife version of record. The eLife production version descends from the Manubot version."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import zipfile\n",
"\n",
"import lxml.html \n",
"import pandas\n",
"import requests\n",
"\n",
"import utils"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyze references via Manubot metadata"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"241"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Read references from the Manubot's CSL JSON items\n",
"url = 'https://github.com/dhimmel/rephetio-manuscript/raw/9f19eeae25984af78529af346d7f37f1578335b7/references.json'\n",
"refs = requests.get(url).json()\n",
"len(refs)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>DOI</th>\n",
" <th>first_author</th>\n",
" <th>venue</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>10.1001/archneur.61.8.1254</td>\n",
" <td>Mirsattari</td>\n",
" <td>Archives of Neurology</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>10.1002/14651858.cd006103.pub7</td>\n",
" <td>Cahill</td>\n",
" <td>Cochrane Database of Systematic Reviews</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" DOI first_author \\\n",
"0 10.1001/archneur.61.8.1254 Mirsattari \n",
"1 10.1002/14651858.cd006103.pub7 Cahill \n",
"\n",
" venue \n",
"0 Archives of Neurology \n",
"1 Cochrane Database of Systematic Reviews "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rows = list()\n",
"for ref in refs:\n",
"    row = dict()\n",
"    row['first_author'] = ref['author'][0]['family']\n",
"    for key in ['DOI']:\n",
"        row[key] = ref.get(key)\n",
"    row['venue'] = None\n",
"    for key in 'container-title', 'container-title-short', 'publisher':\n",
"        if ref.get(key):\n",
"            row['venue'] = ref[key]\n",
"            break\n",
"    rows.append(row)\n",
"ref_df = pandas.DataFrame(rows)\n",
"ref_df.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"93"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Limit to references where the first author is Himmelstein\n",
"himmel_df = ref_df.query(\"first_author == 'Himmelstein'\")\n",
"len(himmel_df)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ThinkLab 62\n",
"Zenodo 24\n",
"Figshare 5\n",
"Cold Spring Harbor Laboratory 1\n",
"PLOS Computational Biology 1\n",
"Name: venue, dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Venues of self-cites\n",
"himmel_df.venue.value_counts(dropna=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## From Manubot HTML output, calculate stats on numeric-style cites"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"url = 'https://github.com/dhimmel/rephetio-manuscript/raw/9f19eeae25984af78529af346d7f37f1578335b7/manuscript.html'\n",
"response = requests.get(url)\n",
"manubot_html = lxml.html.document_fromstring(response.text)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['1', '2', '3']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cites_numeric = list()\n",
"for href in manubot_html.findall('body//a[@href]'):\n",
"    if not href.get('href').startswith('#ref-'):\n",
"        continue\n",
"    cites_numeric.append(href.text)\n",
"cites_numeric[:3]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"353"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This number is lower than the PMC author-style cite count because ranges like [16–18] omit in-text cites for the intermediate references\n",
"len(cites_numeric)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"903"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Total number of characters devoted to in-text citation strings (numeric-style)\n",
"sum(map(len, cites_numeric))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2.558073654390935"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Average characters per citation string (numeric-style)\n",
"sum(map(len, cites_numeric)) / len(cites_numeric)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyze PMC XML author-style citations"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Project Rephetio on PMC \n",
"pmcid = 'PMC5640425'"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"zip_path = 'download/pmc-articles-xml.zip'\n",
"with zipfile.ZipFile(zip_path) as zip_file:\n",
"    root = utils.read_article(zip_file, pmcid + '.nxml')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['DiMasi et al., 2016', 'Reichert, 2003', 'Hay et al., 2014']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cites_author = root.xpath(\"/article/body//xref[@ref-type='bibr']\")\n",
"cites_author = [''.join(cite.itertext()) for cite in cites_author]\n",
"cites_author[:3]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"394"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(cites_author)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"8542"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Total number of characters devoted to in-text citation strings (author-style)\n",
"sum(map(len, cites_author))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"21.68020304568528"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Average characters per citation string (author-style)\n",
"sum(map(len, cites_author)) / len(cites_author)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extract years that use letter-suffix disambiguation"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['2015a',\n",
" '2015b',\n",
" '2015c',\n",
" '2015d',\n",
" '2015e',\n",
" '2015f',\n",
" '2015g',\n",
" '2015h',\n",
" '2015i',\n",
" '2015j',\n",
" '2015k',\n",
" '2015l',\n",
" '2015m',\n",
" '2015n',\n",
" '2015q',\n",
" '2015r',\n",
" '2015s',\n",
" '2015u',\n",
" '2015v',\n",
" '2015z',\n",
" '2016a',\n",
" '2016b',\n",
" '2016c',\n",
" '2016d',\n",
" '2016e',\n",
" '2016f',\n",
" '2016g',\n",
" '2016h',\n",
" '2016i',\n",
" '2016j',\n",
" '2016k',\n",
" '2016l',\n",
" '2016m',\n",
" '2016n',\n",
" '2016o',\n",
" '2016p',\n",
" '2016q',\n",
" '2016r',\n",
" '2016s',\n",
" '2016t',\n",
" '2016u',\n",
" '2016v',\n",
" '2016w',\n",
" '2017a',\n",
" '2017b',\n",
" '2017d']"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"years = set()\n",
"for cite in cites_author:\n",
"    authors, year = cite.rsplit(', ', 1)\n",
"    if any(x.isalpha() for x in year):\n",
"        # Year contains a letter like \"2015a\", but not \"2015\"\n",
"        years.add(year)\n",
"sorted(years)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:pmc-citation-styles]",
"language": "python",
"name": "conda-env-pmc-citation-styles-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
%% Cell type:markdown id: tags:
# Stats on references from Project Rephetio
The Project Rephetio manuscript is available from many places:
+ [eLife](https://elifesciences.org/articles/26726#references) (version of record) DOI: 10.7554/eLife.26726
+ [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5640425/) (deposited by eLife) PMC5640425
+ [Manubot](http://git.dhimmel.com/rephetio-manuscript/) (source on [GitHub](https://github.com/dhimmel/rephetio-manuscript))
+ [Thinklab](https://think-lab.github.io/p/rephetio/report/) (defunct science discussion platform) DOI: 10.15363/thinklab.a7
+ [bioRxiv](https://www.biorxiv.org/content/early/2017/08/31/087619) (preprint) DOI: 10.1101/087619
The Manubot, Thinklab, and bioRxiv versions use numeric-style citations. The Thinklab and bioRxiv versions lag behind the eLife version of record. The eLife production version descends from the Manubot version.
%% Cell type:code id: tags:
``` python
import zipfile
import lxml.html
import pandas
import requests
import utils
```
%% Cell type:markdown id: tags:
## Analyze references via Manubot metadata
%% Cell type:code id: tags:
``` python
# Read references from the Manubot's CSL JSON items
url = 'https://github.com/dhimmel/rephetio-manuscript/raw/9f19eeae25984af78529af346d7f37f1578335b7/references.json'
refs = requests.get(url).json()
len(refs)
```
%%%% Output: execute_result
241
%% Cell type:code id: tags:
``` python
rows = list()
for ref in refs:
    row = dict()
    row['first_author'] = ref['author'][0]['family']
    for key in ['DOI']:
        row[key] = ref.get(key)
    row['venue'] = None
    for key in 'container-title', 'container-title-short', 'publisher':
        if ref.get(key):
            row['venue'] = ref[key]
            break
    rows.append(row)
ref_df = pandas.DataFrame(rows)
ref_df.head(2)
```
%%%% Output: execute_result
                              DOI first_author  \
0      10.1001/archneur.61.8.1254   Mirsattari
1  10.1002/14651858.cd006103.pub7       Cahill

                                     venue
0                    Archives of Neurology
1  Cochrane Database of Systematic Reviews
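%% Cell type:markdown id: tags:
The loop above indexes `ref['author'][0]['family']` directly, which works for this reference list but would raise an error on a CSL JSON item with no author or with an institutional author. A minimal sketch of a more defensive variant follows. The helper names `first_author_family` and `venue_of` and the second and third sample records are hypothetical. Falling back to the CSL `literal` name field is an assumption about how such items should be handled, not something the notebook does.
```python
def first_author_family(ref):
    """Return the first author's family name, or None when unavailable."""
    authors = ref.get('author') or []
    if not authors:
        return None
    first = authors[0]
    # CSL names may carry 'family' (personal) or 'literal' (institutional)
    return first.get('family') or first.get('literal')


def venue_of(ref, keys=('container-title', 'container-title-short', 'publisher')):
    """Return the first non-empty venue-like field, mirroring the fallback order above."""
    for key in keys:
        if ref.get(key):
            return ref[key]
    return None


# First record mirrors the table above; the other two are hypothetical
sample_refs = [
    {
        'DOI': '10.1001/archneur.61.8.1254',
        'author': [{'family': 'Mirsattari'}],
        'container-title': 'Archives of Neurology',
    },
    {'author': [{'literal': 'Example Consortium'}], 'publisher': 'Zenodo'},
    {'title': 'Item without author or venue'},
]

sample_rows = [
    {
        'DOI': ref.get('DOI'),
        'first_author': first_author_family(ref),
        'venue': venue_of(ref),
    }
    for ref in sample_refs
]
```
The resulting `sample_rows` list can be passed to `pandas.DataFrame` in the same way as `rows`.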
%% Cell type:code id: tags:
``` python
# Limit to references where the first author is Himmelstein
himmel_df = ref_df.query("first_author == 'Himmelstein'")
len(himmel_df)
```
%%%% Output: execute_result
93
%% Cell type:code id: tags:
``` python
# Venues of self-cites
himmel_df.venue.value_counts(dropna=False)
```
%%%% Output: execute_result
ThinkLab                         62
Zenodo                           24
Figshare                          5
Cold Spring Harbor Laboratory     1
PLOS Computational Biology        1
Name: venue, dtype: int64
%% Cell type:markdown id: tags:
## From Manubot HTML output, calculate stats on numeric-style cites
%% Cell type:code id: tags:
``` python
url = 'https://github.com/dhimmel/rephetio-manuscript/raw/9f19eeae25984af78529af346d7f37f1578335b7/manuscript.html'
response = requests.get(url)
manubot_html = lxml.html.document_fromstring(response.text)
```
%% Cell type:code id: tags:
``` python
cites_numeric = list()
for href in manubot_html.findall('body//a[@href]'):
    if not href.get('href').startswith('#ref-'):
        continue
    cites_numeric.append(href.text)
cites_numeric[:3]
```
%%%% Output: execute_result
['1', '2', '3']
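%% Cell type:markdown id: tags:
The cell above selects anchors whose `href` starts with `#ref-` and keeps their link text. For readers without `lxml`, the same selection can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not what the notebook runs, and the class name `RefLinkCollector` and the sample HTML are hypothetical.
```python
from html.parser import HTMLParser


class RefLinkCollector(HTMLParser):
    """Collect the text of <a> elements whose href starts with '#ref-'."""

    def __init__(self):
        super().__init__()
        self._in_ref_link = False
        self._buffer = []
        self.cites = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if href.startswith('#ref-'):
                self._in_ref_link = True
                self._buffer = []

    def handle_data(self, data):
        # Accumulate text only while inside a matching anchor
        if self._in_ref_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_ref_link:
            self.cites.append(''.join(self._buffer))
            self._in_ref_link = False


# Hypothetical snippet: two citation anchors and one ordinary link
sample_html = (
    '<body><p>Text <a href="#ref-abc">1</a> and '
    '<a href="https://example.org">link</a> plus '
    '<a href="#ref-def">12</a>.</p></body>'
)
collector = RefLinkCollector()
collector.feed(sample_html)
collector.close()
sample_cites = collector.cites
```
Unlike `href.text` in the `lxml` version, this collector joins all text inside the anchor, including text nested in child elements.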
%% Cell type:code id: tags:
``` python
# This number is lower than the PMC author-style cite count because ranges like [16–18] omit in-text cites for the intermediate references
len(cites_numeric)
```
%%%% Output: execute_result
353
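%% Cell type:markdown id: tags:
The comment above attributes the lower count to ranges such as [16–18], where the intermediate references have no link of their own. A minimal sketch of how such a bracketed string could be expanded into individual reference numbers is shown here. The function name `expand_numeric_cites` is hypothetical, and the assumption that ranges appear as two integers joined by an en dash or hyphen is not verified against the Manubot HTML.
```python
import re


def expand_numeric_cites(bracket_text):
    """Expand a string like '[16–18, 20]' into a list of reference numbers."""
    inner = bracket_text.strip().strip('[]')
    numbers = []
    for part in inner.split(','):
        part = part.strip()
        if not part:
            continue
        # A range is two integers joined by an en dash or a hyphen
        match = re.fullmatch(r'(\d+)\s*[–-]\s*(\d+)', part)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(part))
    return numbers


expanded = expand_numeric_cites('[16–18]')
mixed = expand_numeric_cites('[3, 16-18, 20]')
```
Counting the expanded numbers rather than the linked anchors would make the numeric-style total more directly comparable to the author-style total.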
%% Cell type:code id: tags:
``` python
# Total number of characters devoted to in-text citation strings (numeric-style)
sum(map(len, cites_numeric))
```
%%%% Output: execute_result
903
%% Cell type:code id: tags:
``` python
# Average characters per citation string (numeric-style)
sum(map(len, cites_numeric)) / len(cites_numeric)
```
%%%% Output: execute_result
2.558073654390935
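%% Cell type:markdown id: tags:
The count, total, and mean computed in the three cells above can be bundled into one helper so the same statistics are available for any list of citation strings. This is a small sketch. The name `citation_length_stats` is hypothetical, and the sample list reuses the first three numeric cites shown earlier.
```python
def citation_length_stats(cites):
    """Return (count, total characters, mean characters) for citation strings."""
    count = len(cites)
    total = sum(len(cite) for cite in cites)
    # Avoid dividing by zero when no citations were found
    mean = total / count if count else float('nan')
    return count, total, mean


# First three numeric-style cites from the output above
numeric_sample = ['1', '2', '3']
numeric_stats = citation_length_stats(numeric_sample)

# Reproduce the reported mean from the totals above: 903 characters over 353 cites
reported_mean = 903 / 353
```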
%% Cell type:markdown id: tags:
## Analyze PMC XML author-style citations
%% Cell type:code id: tags:
``` python
# Project Rephetio on PMC
pmcid = 'PMC5640425'
```
%% Cell type:code id: tags:
``` python
zip_path = 'download/pmc-articles-xml.zip'
with zipfile.ZipFile(zip_path) as zip_file:
    root = utils.read_article(zip_file, pmcid + '.nxml')
```
%% Cell type:code id: tags:
``` python
cites_author = root.xpath("/article/body//xref[@ref-type='bibr']")
cites_author = [''.join(cite.itertext()) for cite in cites_author]
cites_author[:3]
```
%%%% Output: execute_result