Commit 3cddd147 authored by Jozef Hajnala

Add grep version with readr

parent 0acfe167
......@@ -38,17 +38,24 @@ bash bench/bench.sh rscripts/03_readr.R &> results/out_readr.txt
bash bench/bench.sh rscripts/04_fread_grep.R &> results/out_fread_grep.txt
```
### For readr::read_csv with grep
```
bash bench/bench.sh rscripts/05_readr_grep.R &> results/out_readr_grep.txt
```
## Overview of the results:
| method | max. memory | avg. time |
|-------------------------------------|------------:|----------:|
| `utils::read.csv` + `base::rbind` | 21.70 GB | 8.13 m |
| `readr::read_csv` + `purrr:map_dfr` | 27.02 GB | 3.43 m |
| `data.table::fread` + `rbindlist` | 15.25 GB | 1.40 m |
| `data.table::fread` from `grep` | 1.68 GB | 0.40 m |
| method                                   | max. memory | avg. time |
|------------------------------------------|------------:|----------:|
| `utils::read.csv` + `base::rbind`        |    21.70 GB |    8.13 m |
| `readr::read_csv` + `purrr::map_dfr`     |    27.02 GB |    3.43 m |
| `data.table::fread` + `rbindlist`        |    15.25 GB |    1.40 m |
| `data.table::fread` from `grep`          |     1.68 GB |    0.34 m |
| `readr::read_csv` + `pipe()` from `grep` |     1.70 GB |    0.88 m |
- max. memory = Maximum resident set size
- avg. time = Average real time as measured by `time`
- avg. time = Average maximum of real time and user time as measured by `time`
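
As a rough check of how the table values relate to the raw benchmark output, the sketch below converts the figures recorded for `rscripts/05_readr_grep.R` (1786616 kbytes, real 8m47.001s, user 8m39.124s) using the updated definition of avg. time. It assumes the recorded times are totals over 10 runs and that GB means kbytes divided by 1024^2. Neither assumption is stated in this diff, although together they reproduce the 1.70 GB and 0.88 m shown in the table. The helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. N_RUNS=10 is an assumption, not taken from bench.sh.
N_RUNS=10

# Convert a `time`-style value such as 8m47.001s into seconds.
to_seconds() {
  awk -v t="$1" 'BEGIN {
    sub(/s$/, "", t)          # drop the trailing "s"
    split(t, parts, "m")      # parts[1] = minutes, parts[2] = seconds
    printf "%.3f\n", parts[1] * 60 + parts[2]
  }'
}

# Average minutes per run, based on the larger of real and user time.
avg_minutes() {
  awk -v r="$(to_seconds "$1")" -v u="$(to_seconds "$2")" -v n="$N_RUNS" 'BEGIN {
    m = (r > u) ? r : u
    printf "%.2f\n", m / n / 60
  }'
}

# Maximum resident set size: kbytes to GB (assumed divisor 1024^2).
kb_to_gb() {
  awk -v kb="$1" 'BEGIN { printf "%.2f\n", kb / 1024 / 1024 }'
}

kb_to_gb 1786616                 # prints 1.70
avg_minutes 8m47.001s 8m39.124s  # prints 0.88
```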
## SessionInfo
......
rscripts/05_readr_grep.R
Maximum resident set size (kbytes): 1786616
real 8m47.001s
user 8m39.124s
sys 0m37.470s
\ No newline at end of file
suppressPackageStartupMessages({
  library(readr)
})
dataDir <- path.expand("~/dataexpo")
dataFiles <- dir(dataDir, pattern = "csv$", full.names = TRUE)
# All flights by American Airlines
command <- sprintf(
  "grep --text ',AA,' %s",
  paste(dataFiles, collapse = " ")
)
# the default would convert the first row of the `grep` output into column names
col_names <- FALSE
# the default type guessing would wrongly treat some columns as logical and
# convert all their values, so define the column types explicitly
col_types <- readr::cols(
  col_character(), # first field includes the file name prefixed by `grep`
  col_integer(),
  col_integer(),
  col_integer(),
  col_integer(),
  col_integer(),
  col_integer(),
  col_integer(),
  col_character(),
  col_integer(),
  col_character(),
  col_integer(),
  col_integer(),
  col_integer(),
  col_integer(),
  col_integer(),
  col_character(),
  col_character(),
  col_integer(),
  col_integer(),
  col_integer(),
  col_integer(),
  col_character(),
  col_integer(),
  col_integer(),
  col_integer(),
  col_integer(),
  col_integer(),
  col_integer()
)
df <- readr::read_csv(
  file = pipe(command),
  col_types = col_types,
  col_names = col_names
)
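
To illustrate why the script sets `col_names = FALSE` and reads the first column as character, the sketch below runs the same `grep --text ',AA,'` filter on two tiny made-up CSV files. The file contents, the four-column layout, and the temporary directory are assumptions for illustration only. With several input files, `grep` prefixes each match with the file name, and the header rows do not match the pattern, so they never reach the reader. The `-h` variant is not used by the script and is shown only for comparison.

```shell
#!/usr/bin/env bash
# Hypothetical miniature of the setup: two tiny CSV files in a temp directory.
tmp_dir="$(mktemp -d)"
printf 'Year,Month,UniqueCarrier,FlightNum\n1987,10,AA,101\n1987,10,UA,202\n' \
  > "$tmp_dir/1987.csv"
printf 'Year,Month,UniqueCarrier,FlightNum\n1988,1,AA,303\n' \
  > "$tmp_dir/1988.csv"

# Same filter as in the R script: with more than one file, every match is
# prefixed with "<file name>:", which ends up inside the first CSV field.
with_names="$(cd "$tmp_dir" && grep --text ',AA,' 1987.csv 1988.csv)"

# For comparison only: -h suppresses the file name prefix.
without_names="$(cd "$tmp_dir" && grep --text -h ',AA,' 1987.csv 1988.csv)"

printf '%s\n' "$with_names"
# 1987.csv:1987,10,AA,101
# 1988.csv:1988,1,AA,303

printf '%s\n' "$without_names"
# 1987,10,AA,101
# 1988,1,AA,303

rm -r "$tmp_dir"
```

Because the R script passes full paths to `grep`, the prefix in its output would be the full path of each file rather than the bare file name shown here.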