---
title: Reproducibility aspects of the Swedish COVID-19 estimate report
layout: post
image:
  path: /images/blog.jpg
hidden: true
---
**Researchers have called for more transparency from [The Public Health Agency of Sweden](https://www.folkhalsomyndigheten.se/) regarding the COVID-19 estimates for Sweden. Recently, a report has been released covering such estimates for the Stockholm region. Along with the report, the code used for these estimates was uploaded to GitHub, which makes it possible for others to review
and critique the work. In this post we will take a look at the reproducibility aspects of this release. We find that it is possible, to some extent, to reproduce some of the figures in the report, and we suggest many improvements to the repository.**
<!--more-->
{% include toc %}
## Introduction
To strengthen and validate our scientific claims, _replication_ by a completely
independent study is important --- but this can be time-consuming and costly
and is hard in practice.
Researchers in Computer Science have therefore since the 1980s called for
methods and tools for making scientific work reproducible. The principle of
_Reproducible Research_ {% cite buckheit1995wavelab claerbout1992electronic
Artifact18:online %} is to make data and computer code available for others to
analyze and criticize.
Reproducible research is a minimum standard when full replication
of a study by independent researchers is not possible {% cite Peng2011 %}.
[The Public Health Agency of Sweden](https://www.folkhalsomyndigheten.se/)
released a report {% cite ThePublicHealthAgencyofSweden2020 %} on the 21st of April with
accompanying
[code on GitHub](https://github.com/FohmAnalys/SEIR-model-Stockholm) (committed on the 23rd). It is of course very positive that the
code has been made available. In this post we review
and evaluate this code from a reproducibility point of view. We use the
requirements from {% cite Monperrus2018 %} and {% cite Leek2016 %} for this evaluation, applied to commit [`bb616e9`](https://github.com/FohmAnalys/SEIR-model-Stockholm/tree/bb616e970d6cb3f8b01edfe735b1a418480c6af2) (2020-04-24) of the repository.
It is important to note that we will not review or critique this code from a
health perspective. There will not be a single exponential curve in this post.
This is the structure of the repository:
{% highlight bash linenos %}
├── Data
│   ├── Data_2020-04-10Ny.txt
│   └── Sverige_population_2019.Rdata
├── LICENSE
├── README.md
├── Results
│   ├── Figures
│   │   ├── Incidence_number_infected_14Days_CI_non-reported_98.7perc_less_inf_100perc.pdf
│   │   ├── ...
│   └── Tables
│       ├── Raw_data_fitted_model_para_p_asymp0.9873infect0.11.txt
│       ├── ...
└── Script
    └── Estimate_SEIR_for_sharing_new_incidence.R
{% endhighlight %}
## Evaluation
### Findable and downloadable
The first requirement on a good data science repository is that it is findable
and downloadable {% cite Monperrus2018 %}.
The report {% cite ThePublicHealthAgencyofSweden2020 %} itself does not
contain a link to the code, but there is a link on another page of the [Public Health Agency of Sweden](https://www.folkhalsomyndigheten.se/smittskydd-beredskap/utbrott/aktuella-utbrott/covid-19/analys-och-prognoser/)
website. We didn't find the repository by Googling the name of the report, but
the repository does include the name of the report and a link to it.
### Version control and license
The code has been released on [GitHub](http://www.github.com) with a [GPLv2](https://github.com/FohmAnalys/SEIR-model-Stockholm/blob/master/LICENSE) license. We note that the license appears to cover the results stored in the repository as well. Results that are generated by this code should not be covered by this license, and the report states that all figures are copyrighted and that permission must be given by the copyright holder to publish them.
Best practices for licensing scientific code are a larger topic, and one that
could be debated, so we leave that for future work.
The first commit is [verified](https://help.github.com/en/github/authenticating-to-github/about-commit-signature-verification), while the following commits are not. Verifying commits is a way to allow people to see that the content comes from a trusted source. The commits are from an individual without a [GitHub](http://www.github.com) account.
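Signature information can also be inspected locally with `git log`. The following is a minimal sketch; the helper name `list_commit_signatures` is made up for this post. The `%G?` placeholder prints `G` for a good signature and `N` for a commit without a signature. Other letters mean the signature is bad or could not be checked, for example because the signer's public key is not in the local keyring. Note that this is a local check and not necessarily the same thing as the badge shown in the web interface.

```shell
# List abbreviated hash, signature status, author and subject for each commit.
# $1 is the path to a local clone (defaults to the current directory).
list_commit_signatures() {
  git -C "${1:-.}" log --format='%h %G? %an %s'
}
```

Running `list_commit_signatures path/to/clone` prints one line per commit, with the signature status in the second column.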
### Documented
The `README.md` file contains no instructions on how to run this code. The folders are self-explanatory, with names such as `Script` and `Data`.
### Exercisable
We can see that there is an R script in the `Script` folder, so we will attempt to run this as is.
{% highlight bash linenos %}
Rscript Script/Estimate_SEIR_for_sharing_new_incidence.R
{% endhighlight %}
We get an error related to character encoding (`Error: invalid multibyte character in parser at line 16`). The file seems to be encoded in [ISO-8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1), which we can check with:
{% highlight bash linenos %}
file Script/Estimate_SEIR_for_sharing_new_incidence.R
{% endhighlight %}
We can easily save this in [UTF-8](https://en.wikipedia.org/wiki/UTF-8)
instead.
{% highlight bash linenos %}
iconv -f iso-8859-1 -t utf-8 < Script/Estimate_SEIR_for_sharing_new_incidence.R > Script/Estimate_SEIR_for_sharing_new_incidence_utf8.R
{% endhighlight %}
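If the conversion is to be scripted, it helps to convert only files that are not already valid UTF-8, so that running it twice does not garble the text. The sketch below is one way to do that. The function name `to_utf8` is made up, and the source encoding is assumed to be ISO-8859-1, which is what `file` suggested above but is still a guess.

```shell
# Write a UTF-8 copy of $1 to $2.
# A file that already decodes as UTF-8 is copied unchanged;
# anything else is assumed (!) to be ISO-8859-1 and converted.
to_utf8() {
  src="$1"
  dst="$2"
  if iconv -f utf-8 -t utf-8 "$src" > /dev/null 2>&1; then
    cp "$src" "$dst"
  else
    iconv -f iso-8859-1 -t utf-8 "$src" > "$dst"
  fi
}
```

It would be called as `to_utf8 Script/Estimate_SEIR_for_sharing_new_incidence.R Script/Estimate_SEIR_for_sharing_new_incidence_utf8.R`.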
Since there is no documentation, we don't know what the required environment
is. In the code itself, there is a note that `R 3.5.2` has been used and the loaded packages are listed in one place. We
don't have a full description of the session or environment, so we
must figure out ourselves which versions of the dependencies were used. Luckily, there are not many dependencies and we can install them as follows.
{% highlight R linenos %}
install.packages(c("reshape2", "openxlsx", "RColorBrewer", "rootSolve","deSolve"))
{% endhighlight %}
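When a script does not document its dependencies, a rough list can be extracted by scanning for `library()` and `require()` calls. The following is a sketch only, with a made-up helper name `list_r_packages`. It is a plain text search, so it also picks up calls in commented-out lines and misses packages that are only used through `pkg::function` syntax.

```shell
# Print the unique package names passed to library() or require() in an R script.
list_r_packages() {
  grep -oE '(library|require)\([^),]+' "$1" \
    | sed -E 's/^[a-z]+\(//' \
    | tr -d "\"' " \
    | sort -u
}
```

For example, `list_r_packages Script/Estimate_SEIR_for_sharing_new_incidence_utf8.R` prints one package name per line.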
Another alternative, a better one IMHO, is to use the
[`checkpoint` package](https://cran.r-project.org/web/packages/checkpoint/vignettes/checkpoint.html).
It allows one to set a checkpoint in time, so that
another user will get the packages and versions that were available at that time. We add the following to the top of the script.
{% highlight R linenos %}
install.packages("checkpoint")
library(checkpoint)
checkpoint("2017-04-22")
{% endhighlight %}
An even better alternative would have been to use, and make available, a
Docker image {% cite Merkel2014 Boettiger2014 %} that has R and all the dependencies installed.
In any case, it is often a good idea to include a `Makefile` so that someone can simply run `make` to execute the script.
### Input data
The data used to produce the result is included in the repository. The file `./Data/Data_2020-04-10Ny.txt` contains values up to 2020-04-10.
{% highlight csv linenos %}
"Datum" "Incidens"
2020-02-17 1
2020-02-18 0
2020-02-19 0
{% endhighlight %}
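A quick sanity check of such a file is to count the rows and sum the incidence column. The sketch below assumes the layout shown in the excerpt above: one header line followed by whitespace-separated `date count` rows. The helper name `summarize_incidence` is made up.

```shell
# Print "<number of days> <total cases>" for a whitespace-separated
# incidence file with a single header line.
summarize_incidence() {
  awk 'NR > 1 && NF >= 2 { days++; total += $2 } END { print days + 0, total + 0 }' "$1"
}
```

Calling `summarize_incidence Data/Data_2020-04-10Ny.txt` gives two numbers that can be compared against the totals stated in the report.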
Following the requirements set by {% cite Leek2016 %} we see that we should
have received: the raw data, a tidy data set, a code book and a script to
translate the raw data to tidy data. Immediately we see that this repository
only contains a tidy dataset that was edited. There is a comment in the code
that says that the dataset differs from the reported case data in that the
imported cases were removed.
Normally, it is important that the rawest form of the data that the
researchers have access to is included in the repository. However, here it is
likely that this data contains privacy-sensitive information, so it is
understandable that it is not included. It would be good to at the very least
include a dataset with the reported and imported cases as two columns in the
same file. This would allow someone to rerun the code with new data.
While there was no code book included in the repository, the dataset is simple and the cleaning and the analysis seem to be well explained in the report.
### Complete
The repository is complete if all numbers and figures from the paper can be re-computed from the code {% cite Monperrus2018 %}.
The script produces a long list of figures (12 of them) and tables. The numbers have not been checked in detail, but they seem reasonable. The report, on the other hand, contains additional figures that are not generated by the code.
To be completely reproducible, the code must generate the report.
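A more systematic check of the numbers would be to keep a copy of the committed tables, rerun the script, and compare the regenerated tables file by file. The sketch below does this for `.txt` files only, since PDF figures typically embed timestamps and will rarely be byte-identical. The helper name `compare_tables` and the idea of a separate reference directory are assumptions made for illustration.

```shell
# Compare every .txt file in a reference directory ($1) with the file of the
# same name in a regenerated directory ($2). Prints the files that are missing
# or differ, and returns non-zero if there are any.
compare_tables() {
  ref_dir="$1"
  new_dir="$2"
  status=0
  for ref in "$ref_dir"/*.txt; do
    [ -e "$ref" ] || continue
    name=$(basename "$ref")
    if [ ! -f "$new_dir/$name" ]; then
      echo "missing: $name"
      status=1
    elif ! cmp -s "$ref" "$new_dir/$name"; then
      echo "differs: $name"
      status=1
    fi
  done
  return "$status"
}
```

One way to use it is to copy `Results/Tables` to a scratch directory before running the script, and then call `compare_tables` with the scratch copy and `Results/Tables`.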
### Durable
The specific commit used for the report should be archived and referenced from within the report. [Zenodo](https://zenodo.org/) makes it extremely simple to archive from a [GitHub](http://www.github.com) repository.
## Related work
[BenjaK](https://github.com/BenjaK/SEIR-model-Stockholm/) identified and corrected issues related to character encoding, packages as well as many other issues.
[consideRatio](https://github.com/consideRatio) submitted a [pull request](https://github.com/FohmAnalys/SEIR-model-Stockholm/pull/4) to make the code runnable on [mybinder.org](https://mybinder.org/). This means that the code is runnable in RStudio online - [try it out here](https://gke.mybinder.org/v2/gh/consideratio/SEIR-model-Stockholm-1/master?urlpath=%2Frstudio).
## Conclusion
It is very positive that this code was made available to the public to be
reviewed and critiqued by anyone. It had some issues that have since been corrected
by others, but at the time of writing these corrections have not been included in the original repository.