Commit 0a0ccb34 authored by Jozef Hajnala

Initialize repo

# Benchmarking reading data from csv files with base R, readr and data.table
## Getting the data
The data source is the Airline on-time performance dataset from http://stat-computing.org/dataexpo/2009/the-data.html.
The following will download and extract the data into `~/dataexpo`. The download size is approximately 868 MB in bz2 files; the extracted size is approximately 5.34 GB in csv files.
```
Rscript data_prep/data_prep.R
```
## Running the benchmarks
### For base R
```
bash bench/bench.sh rscripts/01_base.R &> results/out_base.txt
```
### For data.table::fread
```
bash bench/bench.sh rscripts/02_fread.R &> results/out_fread.txt
```
### For readr::read_csv
```
bash bench/bench.sh rscripts/03_readr.R &> results/out_readr.txt
```
### For data.table::fread with grep
```
bash bench/bench.sh rscripts/04_fread_grep.R &> results/out_fread_grep.txt
```
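The four invocations above follow one pattern: each script in `rscripts/` is passed to `bench/bench.sh` and the output goes to a matching file in `results/`. A minimal sketch of driving them from a loop, assuming it is run from the repository root; `result_file` and `run_all` are hypothetical helper names that do not exist in the repository:

```bash
#!/bin/bash
# Hypothetical helpers; neither function is part of the repository.

# Map a script path to its results file, e.g.
# rscripts/02_fread.R -> results/out_fread.txt
result_file() {
  local name
  name=$(basename "$1" .R)            # e.g. 02_fread
  echo "results/out_${name#*_}.txt"   # drop the numeric prefix up to the first "_"
}

# Run bench.sh for every script in rscripts/, one after another.
run_all() {
  local script
  for script in rscripts/*.R; do
    # "> file 2>&1" captures stdout and stderr, like "&>" in the commands above
    bash bench/bench.sh "$script" > "$(result_file "$script")" 2>&1
  done
}
```

Calling `run_all` would then execute all four benchmarks sequentially, which keeps the runs from competing for memory.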
## Overview of the results
| method | max. memory | avg. time |
|-------------------------------------|------------:|----------:|
| `utils::read.csv` + `base::rbind` | 21.70 GB | 8.13 m |
| `readr::read_csv` + `purrr::map_dfr` | 27.02 GB | 3.43 m |
| `data.table::fread` + `rbindlist` | 15.25 GB | 1.40 m |
| `data.table::fread` from `grep` | 1.68 GB | 0.40 m |
- max. memory = Maximum resident set size as reported by `/usr/bin/time -v`
- avg. time = Average real time per run as measured by `time` over 10 consecutive runs
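The raw files in `results/` report memory in kbytes and the total real time of the whole loop, so the table values need a unit conversion. A minimal sketch, assuming the GB figures are kbytes divided by 1024 * 1024 and the average is the total real time divided by the 10 runs; `kb_to_gb` and `avg_minutes` are hypothetical helper names, not part of the repository:

```bash
#!/bin/bash
# Hypothetical helpers for turning raw benchmark output into the table's units.

# Convert kbytes (as printed by /usr/bin/time -v) to GB with two decimals.
kb_to_gb() {
  awk -v kb="$1" 'BEGIN { printf "%.2f\n", kb / (1024 * 1024) }'
}

# Convert a total such as "14m1.247s" into average minutes per run.
# $1: total real time as printed by `time`, $2: number of runs
avg_minutes() {
  echo "$1" | awk -v runs="$2" -F'[ms]' '{ printf "%.2f\n", ($1 + $2 / 60) / runs }'
}

kb_to_gb 22758860          # base R max. memory, prints 21.70
avg_minutes 14m1.247s 10   # fread avg. time, prints 1.40
```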
## SessionInfo
- R version 3.4.4 (2018-03-15)
- Platform: x86_64-pc-linux-gnu (64-bit)
- Running under: Ubuntu 18.04.1 LTS
Used packages:
- data.table_1.12.2
- readr_1.3.1
- magrittr_1.5
- purrr_0.2.4
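#!/bin/bash
# Usage: bash bench/bench.sh <path to R script>
scriptf=$1
printf '%s\n\n' "$scriptf"
# Peak memory: GNU time writes its report to stderr, so stderr goes into the
# pipe while the script's own stdout is discarded.
/usr/bin/time -v Rscript "$scriptf" \
  2>&1 >/dev/null | \
  grep -E 'Maximum resident'
# Total wall-clock time of 10 consecutive runs; divide by 10 for the average.
time for i in {1..10}; do Rscript "$scriptf" >/dev/null; done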
destDir <- path.expand("~/dataexpo")
years <- 2000:2008
baseUrl <- "http://stat-computing.org/dataexpo/2009"
bz2Names <- file.path(destDir, paste0(years, ".csv.bz2"))
dlUrls <- file.path(baseUrl, paste0(years, ".csv.bz2"))
if (!dir.exists(destDir)) {
  dir.create(destDir, recursive = TRUE)
}
# download files
mapply(download.file, dlUrls, bz2Names)
# extract
system(paste0(
  "cd ", destDir, "; ",
  "bzip2 -d -k ", paste(bz2Names, collapse = " ")
))
Version: 1.0
RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX
rscripts/01_base.R
Maximum resident set size (kbytes): 22758860
real 81m20.590s * 5
user 77m39.694s * 5
sys 3m40.893s * 5
rscripts/02_fread.R
Maximum resident set size (kbytes): 15994100
real 14m1.247s
user 13m13.518s
sys 0m47.707s
rscripts/04_fread_grep.R
Maximum resident set size (kbytes): 1763528
real 2m31.939s
user 3m22.879s
sys 0m24.448s
rscripts/03_readr.R
Maximum resident set size (kbytes): 28335232
real 34m15.088s
user 32m54.354s
sys 1m20.681s
dataDir <- path.expand("~/dataexpo")
dataFls <- dir(dataDir, pattern = "csv$", full.names = TRUE)
df <- do.call(rbind, lapply(dataFls, read.csv))
suppressPackageStartupMessages({
  library(data.table)
})
dataDir <- path.expand("~/dataexpo")
dataFls <- dir(dataDir, pattern = "csv$", full.names = TRUE)
dt <- data.table::rbindlist(
  lapply(dataFls, data.table::fread, showProgress = FALSE)
)
suppressPackageStartupMessages({
  library(readr)
  library(purrr)
  library(magrittr)
})
dataDir <- path.expand("~/dataexpo")
dataFiles <- dir(dataDir, pattern = "csv$", full.names = TRUE)
# bind_rows() won't coerce mismatched column types, so predefine them
col_types <- readr::cols(
  .default = col_double(),
  UniqueCarrier = col_character(),
  TailNum = col_character(),
  Origin = col_character(),
  Dest = col_character(),
  CancellationCode = col_character(),
  CarrierDelay = col_double(),
  WeatherDelay = col_double(),
  NASDelay = col_double(),
  SecurityDelay = col_double(),
  LateAircraftDelay = col_double()
)
df <- dataFiles %>%
  purrr::map_dfr(
    readr::read_csv,
    col_types = col_types,
    progress = FALSE
  )
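suppressPackageStartupMessages({
  library(data.table)
})
dataDir <- path.expand("~/dataexpo")
dataFiles <- dir(dataDir, pattern = "csv$", full.names = TRUE)
# All flights by American Airlines
# -h stops grep from prefixing each matched line with its file name, which it
# does by default when given several files. The header rows do not match the
# pattern, so fread assigns default column names (V1, V2, ...).
command <- sprintf(
  "grep -h --text ',AA,' %s",
  paste(dataFiles, collapse = " ")
)
dt <- data.table::fread(cmd = command)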