Commit 96054f3b authored by Frie

rm renv

parent db65df6c
https://colorhunt.co/palette/220248
<div>
<a href="mailto:fr1e@pm.me" title="Email me">
<span class="fa-stack fa-lg">
<i class="fas fa-envelope fa-stack-1x fa-inverse"></i>
</span>
</a>
<a href="https://gitlab.com/friep" title="GitLab">
<span class="fa-stack fa-lg">
<i class="fab fa-gitlab fa-stack-1x fa-inverse"></i>
</span>
</a>
<a href="https://github.com/friep" title="GitHub">
<span class="fa-stack fa-lg">
<i class="fab fa-github fa-stack-1x fa-inverse"></i>
</span>
</a>
<a href="https://twitter.com/ameisen_strasse" title="Twitter">
<span class="fa-stack fa-lg">
<i class="fab fa-twitter fa-stack-1x fa-inverse"></i>
</span>
</a>
<a href="https://linkedin.com/in/friedrike-preu-a2bb46a7/" title="LinkedIn">
<span class="fa-stack fa-lg">
<i class="fab fa-linkedin fa-stack-1x fa-inverse"></i>
</span>
</a>
<a href="https://keybase.io/friep" title="Keybase">
<span class="fa-stack fa-lg">
<i class="fas fa-key fa-stack-1x fa-inverse"></i>
</span>
</a>
</div>
\ No newline at end of file
---
title: "Automate the boring stuff of your amateur photographer life"
description: |
  Being able to program makes you lazy - or rather it gives you the ability to be lazy by just automating everything. This is what I did in this post.
date: 2019-10-19
preview: before_preview.png
output:
  distill::distill_article:
    self_contained: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(eval = TRUE)
knitr::opts_chunk$set(include = TRUE)
knitr::opts_chunk$set(echo = TRUE)
```
Recently, I went on holiday to the Vosges mountains in northeastern France. While one or two days were definitely too rainy to take electronics outside, I was able to take some pics with my Micro-Four-Thirds (MFT) camera: of the beautiful autumn landscape, of our family dog (Team #rdogs!) and of the many, many fly agarics.
![](fliegenpilz.jpg)
Back home and with a free weekend all to myself, I set out to sort the photos and send the best ones to the family and friends who were with me on the trip. This is always my least favorite part because I take a lot of pictures and a lot of them are...well...not worth the time it takes to look at them.
So I opened the photo viewer on my Linux laptop, went through the photos and deleted the ones I didn't like. "Done", you'd think. Well, no. Why? Because, some months ago, I decided I really *needed* to have RAW files - just in case I'd ever want to seriously edit something (spoiler: I'm too lazy for that). Soo, whenever I push the shutter button nowadays, two files with the same name are stored on my SD card: a normal `JPG` file and a RAW file with the `RW2` extension. So, for example, `P1120006.JPG` and `P1120006.RW2`.
However, the Linux photo viewer only shows me the `JPG` files. So after an hour of deleting `JPG`s, I still needed to delete the corresponding `RW2` files of the JPGs I had deleted. And my dislike for doing stuff in the explorer / finder was big enough that I decided to automate this. Because the offending files are already deleted, I set up a little test case for this post, but I'll include some screenshots that show how much time - and how many nerves - this little R exercise saved me.
## Step 1: Get the data
First up is actually getting the file paths. For this, I use the good old `list.files` command which will give you all files in a given folder. I get both the simple path and the full path to the file.^[The double call could be avoided by splitting the full path using something like `tidyr::separate` but I was lazy.]
```{r, message=FALSE}
# delete RAW files where the jpg is deleted
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)
library(here)
# FOLDER <- "/home/frie/Pictures/2019/2019-10_vogesen/"
FOLDER <- "data"
full_paths <- list.files(FOLDER, full.names = TRUE)
file_names <- list.files(FOLDER)
df <- tibble::tibble(full_path = full_paths, file_name = file_names)
df %>% select(-full_path)
```
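The double `list.files` call is not strictly needed: base R's `basename()` strips the directory part from a path. A minimal sketch, assuming a throwaway folder created with `tempfile()` (the folder and the dummy file names are made up for illustration and are not the data used in this post):

```{r eval=FALSE}
# hypothetical scratch folder with three empty dummy files
demo_folder <- tempfile("photos_")
dir.create(demo_folder)
file.create(file.path(demo_folder, c("P1120001.RW2", "P1120002.JPG", "P1120002.RW2")))

# one call to list.files, then derive the file name from the full path
demo_paths <- list.files(demo_folder, full.names = TRUE)
demo_df <- data.frame(
  full_path = demo_paths,
  file_name = basename(demo_paths),
  stringsAsFactors = FALSE
)
demo_df$file_name
```

Both columns then come from the same listing, so they cannot get out of sync.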
There are `r nrow(df)` files in the folder. By manually looking at the data, I can easily see that I want to delete `P1120001.RW2` and `P1120008.RW2`.
In the real case, there were 942 files `r emo::ji("scream")`. No way to see that at a glance!
![](before.png)
## Step 2: Determine which files need to be deleted
Fortunately, the `RW2` and `JPG` versions have the same file name, except for the extension. I first extract this "common" element of the file name using `tidyr::separate`, which splits a character column at a certain pattern (the `sep` argument) and puts the resulting pieces directly into new columns (hard to explain `r emo::ji("smile")`, just see the result and compare with before!). This is honestly one of my favorite functions ever because it's such a common task that would otherwise be really annoying. ^[Sidenote: There's also `tidyr::separate_rows` which is even more awesome!]
```{r}
df <- df %>%
  tidyr::separate(file_name, into = c("file_name_without_ext", "ext"), sep = "\\.")
df
```
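Splitting on every dot works here because the camera's file names contain exactly one. As a hedged alternative for names with more than one dot, base R's `tools` package has helpers that only look at the last dot. The second file name below is invented for illustration:

```{r eval=FALSE}
example_names <- c("P1120006.JPG", "holiday.v2.RW2")  # second name is hypothetical

# everything before the last dot
stems <- tools::file_path_sans_ext(example_names)
# everything after the last dot
exts <- tools::file_ext(example_names)

data.frame(file_name_without_ext = stems, ext = exts)
```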
Now I count how many files exist for each `file_name_without_ext` by grouping by that variable and counting the number of rows using the little magic `n()` function from dplyr. This is such a common pattern and I love that dplyr makes this so easy - I remember doing this for my Bachelor thesis without the *tidyverse* and it was soo difficult for me.
```{r}
# could be replaced by shorthand: dplyr::add_count(file_name_without_ext)
df <- df %>%
  dplyr::group_by(file_name_without_ext) %>%
  dplyr::mutate(n = n())
df
```
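The shorthand mentioned in the code comment would look roughly like this. It is a sketch on a made-up toy tibble, not on the data from this post:

```{r eval=FALSE}
library(dplyr)

# made-up stand-in for the data frame built above
toy_df <- tibble::tibble(
  file_name_without_ext = c("P1120001", "P1120002", "P1120002"),
  ext = c("RW2", "JPG", "RW2")
)

# add_count() appends the group size as column `n`
counted <- toy_df %>%
  dplyr::add_count(file_name_without_ext)
counted
```

One difference worth knowing: `group_by() %>% mutate()` leaves the result grouped, while `add_count()` on an ungrouped tibble returns an ungrouped one.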
Now I filter those rows where `n == 1` - those are the `RW2` files that are the leftover companions of the `JPG`s I deleted manually. Just to be sure, I also add the `ext == "RW2"` condition to the filter statement.^[If I did my manual deletion process how I described it, this should not be necessary as a JPG should always have a "partner" RAW file. But who knows? `r emo::ji("shrug")`]
```{r}
delete_df <- df %>%
  dplyr::filter(n == 1 & ext == "RW2")
nrow(delete_df) # only 2 files left
```
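The same orphaned RAW files can also be found without counting, by comparing the set of RAW stems with the set of JPG stems. A base R sketch on invented file names (this is an alternative technique, not what the post uses):

```{r eval=FALSE}
# made-up file names standing in for the folder listing
example_files <- c("P1120001.RW2", "P1120002.JPG", "P1120002.RW2", "P1120008.RW2")

file_stems <- tools::file_path_sans_ext(example_files)
is_raw <- tools::file_ext(example_files) == "RW2"

# RAW stems that have no JPG with the same stem
orphan_stems <- setdiff(file_stems[is_raw], file_stems[!is_raw])
orphans <- example_files[is_raw & file_stems %in% orphan_stems]
orphans
```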
## Step 3: delete, delete, delete!
I use `dplyr::pull` to get the `full_path` variable from the data frame.^[`pull` is just like `$` - it just integrates better into pipe workflows. As I broke up the pipe for "educational" purposes, it does not really make sense here but I thought I'd leave it in just in case someone did not know about it yet.] I also add a small check that I indeed have only `RW2` files - all this making-sure business is getting a bit out of hand but better safe than sorry. `r emo::ji("wink")`
And finally: delete, delete, delete that sh*t with `file.remove`!
```{r}
delete_paths <- delete_df %>%
  dplyr::pull(full_path)
print(delete_paths)
# some quick check
# don't delete JPG
stopifnot(all(stringr::str_ends(delete_paths, "RW2")))
stopifnot(length(delete_paths) == 2)
# delete!
# i commented this out to make it easier to reproduce this.
# file.remove(delete_paths)
```
<div style="width:100%;height:0;padding-bottom:60%;position:relative;"><iframe src="https://giphy.com/embed/vohOR29F78sGk" width="100%" height="100%" style="position:absolute" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div><p><a href="https://giphy.com/gifs/black-and-white-jim-carrey-edited-vohOR29F78sGk">via GIPHY</a></p>
This deletes the two files that do not have a `JPG` companion. In real use, my script successfully deleted *`r 942 - 684`* files, as can be seen by comparing the before (shown earlier in this post) and after screenshots of my explorer.
![](after.png)
Hurray for the power of computers! `r emo::ji("tada")`
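Since `file.remove` cannot be undone, it can be worth rehearsing on a scratch folder first. A small sketch (folder and file names are invented); `file.remove` returns one logical value per path, which makes the outcome easy to check:

```{r eval=FALSE}
# hypothetical scratch folder with two dummy RAW files
scratch <- tempfile("delete_demo_")
dir.create(scratch)
scratch_paths <- file.path(scratch, c("P1120001.RW2", "P1120008.RW2"))
file.create(scratch_paths)

# TRUE for every file that was actually deleted
removed <- file.remove(scratch_paths)
stopifnot(all(removed), !any(file.exists(scratch_paths)))
```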
## The end
I don't know whether this brought any considerable insight to anyone. `r emo::ji("smile")` After all, this is not the usual use case for R - a well-written shell command would've achieved the same. Or... actually manually deleting the files... But no, this was never an alternative.
The takeaway from this? Being able to program makes you lazy - or rather it gives you the ability to be lazy by just automating everything away. `r emo::ji("sunglasses")` `r emo::ji("tongue")` And in my opinion, this is just another excellent reason to keep coding. `r emo::ji("heart")`
\ No newline at end of file
// @license magnet:?xt=urn:btih:d3d9a9a6595521f9666a5e94cc830dab83b65699&dn=expat.txt Expat
//
// AnchorJS - v4.2.2 - 2019-11-14
// https://www.bryanbraun.com/anchorjs/
// Copyright (c) 2019 Bryan Braun; Licensed MIT
//
// @license magnet:?xt=urn:btih:d3d9a9a6595521f9666a5e94cc830dab83b65699&dn=expat.txt Expat
!function(A,e){"use strict";"function"==typeof define&&define.amd?define([],e):"object"==typeof module&&module.exports?module.exports=e():(A.AnchorJS=e(),A.anchors=new A.AnchorJS)}(this,function(){"use strict";return function(A){function f(A){A.icon=A.hasOwnProperty("icon")?A.icon:"",A.visible=A.hasOwnProperty("visible")?A.visible:"hover",A.placement=A.hasOwnProperty("placement")?A.placement:"right",A.ariaLabel=A.hasOwnProperty("ariaLabel")?A.ariaLabel:"Anchor",A.class=A.hasOwnProperty("class")?A.class:"",A.base=A.hasOwnProperty("base")?A.base:"",A.truncate=A.hasOwnProperty("truncate")?Math.floor(A.truncate):64,A.titleText=A.hasOwnProperty("titleText")?A.titleText:""}function p(A){var e;if("string"==typeof A||A instanceof String)e=[].slice.call(document.querySelectorAll(A));else{if(!(Array.isArray(A)||A instanceof NodeList))throw new Error("The selector provided to AnchorJS was invalid.");e=[].slice.call(A)}return e}this.options=A||{},this.elements=[],f(this.options),this.isTouchDevice=function(){return!!("ontouchstart"in window||window.DocumentTouch&&document instanceof DocumentTouch)},this.add=function(A){var e,t,i,n,o,s,a,r,c,h,l,u,d=[];if(f(this.options),"touch"===(l=this.options.visible)&&(l=this.isTouchDevice()?"always":"hover"),0===(e=p(A=A||"h2, h3, h4, h5, h6")).length)return this;for(!function(){if(null!==document.head.querySelector("style.anchorjs"))return;var A,e=document.createElement("style");e.className="anchorjs",e.appendChild(document.createTextNode("")),void 0===(A=document.head.querySelector('[rel="stylesheet"], style'))?document.head.appendChild(e):document.head.insertBefore(e,A);e.sheet.insertRule(" .anchorjs-link { opacity: 0; text-decoration: none; -webkit-font-smoothing: antialiased; -moz-osx-font-smoothing: grayscale; }",e.sheet.cssRules.length),e.sheet.insertRule(" *:hover > .anchorjs-link, .anchorjs-link:focus { opacity: 1; }",e.sheet.cssRules.length),e.sheet.insertRule(" [data-anchorjs-icon]::after { content: attr(data-anchorjs-icon); 
}",e.sheet.cssRules.length),e.sheet.insertRule(' @font-face { font-family: "anchorjs-icons"; src: url(data:n/a;base64,AAEAAAALAIAAAwAwT1MvMg8yG2cAAAE4AAAAYGNtYXDp3gC3AAABpAAAAExnYXNwAAAAEAAAA9wAAAAIZ2x5ZlQCcfwAAAH4AAABCGhlYWQHFvHyAAAAvAAAADZoaGVhBnACFwAAAPQAAAAkaG10eASAADEAAAGYAAAADGxvY2EACACEAAAB8AAAAAhtYXhwAAYAVwAAARgAAAAgbmFtZQGOH9cAAAMAAAAAunBvc3QAAwAAAAADvAAAACAAAQAAAAEAAHzE2p9fDzz1AAkEAAAAAADRecUWAAAAANQA6R8AAAAAAoACwAAAAAgAAgAAAAAAAAABAAADwP/AAAACgAAA/9MCrQABAAAAAAAAAAAAAAAAAAAAAwABAAAAAwBVAAIAAAAAAAIAAAAAAAAAAAAAAAAAAAAAAAMCQAGQAAUAAAKZAswAAACPApkCzAAAAesAMwEJAAAAAAAAAAAAAAAAAAAAARAAAAAAAAAAAAAAAAAAAAAAQAAg//0DwP/AAEADwABAAAAAAQAAAAAAAAAAAAAAIAAAAAAAAAIAAAACgAAxAAAAAwAAAAMAAAAcAAEAAwAAABwAAwABAAAAHAAEADAAAAAIAAgAAgAAACDpy//9//8AAAAg6cv//f///+EWNwADAAEAAAAAAAAAAAAAAAAACACEAAEAAAAAAAAAAAAAAAAxAAACAAQARAKAAsAAKwBUAAABIiYnJjQ3NzY2MzIWFxYUBwcGIicmNDc3NjQnJiYjIgYHBwYUFxYUBwYGIwciJicmNDc3NjIXFhQHBwYUFxYWMzI2Nzc2NCcmNDc2MhcWFAcHBgYjARQGDAUtLXoWOR8fORYtLTgKGwoKCjgaGg0gEhIgDXoaGgkJBQwHdR85Fi0tOAobCgoKOBoaDSASEiANehoaCQkKGwotLXoWOR8BMwUFLYEuehYXFxYugC44CQkKGwo4GkoaDQ0NDXoaShoKGwoFBe8XFi6ALjgJCQobCjgaShoNDQ0NehpKGgobCgoKLYEuehYXAAAADACWAAEAAAAAAAEACAAAAAEAAAAAAAIAAwAIAAEAAAAAAAMACAAAAAEAAAAAAAQACAAAAAEAAAAAAAUAAQALAAEAAAAAAAYACAAAAAMAAQQJAAEAEAAMAAMAAQQJAAIABgAcAAMAAQQJAAMAEAAMAAMAAQQJAAQAEAAMAAMAAQQJAAUAAgAiAAMAAQQJAAYAEAAMYW5jaG9yanM0MDBAAGEAbgBjAGgAbwByAGoAcwA0ADAAMABAAAAAAwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABAAH//wAP) format("truetype"); }',e.sheet.cssRules.length)}(),t=document.querySelectorAll("[id]"),i=[].map.call(t,function(A){return A.id}),o=0;o<e.length;o++)if(this.hasAnchorJSLink(e[o]))d.push(o);else{if(e[o].hasAttribute("id"))n=e[o].getAttribute("id");else if(e[o].hasAttribute("data-anchor-id"))n=e[o].getAttribute("data-anchor-id");else{for(c=r=this.urlify(e[o].textContent),a=0;void 0!==s&&(c=r+"-"+a),a+=1,-1!==(s=i.indexOf(c)););s=void 0,i.push(c),e[o].setAttribute("id",c),n=c}(h=document.createElement("a")).className="anchorjs-link 
"+this.options.class,h.setAttribute("aria-label",this.options.ariaLabel),h.setAttribute("data-anchorjs-icon",this.options.icon),this.options.titleText&&(h.title=this.options.titleText),u=document.querySelector("base")?window.location.pathname+window.location.search:"",u=this.options.base||u,h.href=u+"#"+n,"always"===l&&(h.style.opacity="1"),""===this.options.icon&&(h.style.font="1em/1 anchorjs-icons","left"===this.options.placement&&(h.style.lineHeight="inherit")),"left"===this.options.placement?(h.style.position="absolute",h.style.marginLeft="-1em",h.style.paddingRight="0.5em",e[o].insertBefore(h,e[o].firstChild)):(h.style.paddingLeft="0.375em",e[o].appendChild(h))}for(o=0;o<d.length;o++)e.splice(d[o]-o,1);return this.elements=this.elements.concat(e),this},this.remove=function(A){for(var e,t,i=p(A),n=0;n<i.length;n++)(t=i[n].querySelector(".anchorjs-link"))&&(-1!==(e=this.elements.indexOf(i[n]))&&this.elements.splice(e,1),i[n].removeChild(t));return this},this.removeAll=function(){this.remove(this.elements)},this.urlify=function(A){return this.options.truncate||f(this.options),A.trim().replace(/\'/gi,"").replace(/[& +$,:;=?@"#{}|^~[`%!'<>\]\.\/\(\)\*\\\n\t\b\v]/g,"-").replace(/-{2,}/g,"-").substring(0,this.options.truncate).replace(/^-+|-+$/gm,"").toLowerCase()},this.hasAnchorJSLink=function(A){var e=A.firstChild&&-1<(" "+A.firstChild.className+" ").indexOf(" anchorjs-link "),t=A.lastChild&&-1<(" "+A.lastChild.className+" ").indexOf(" anchorjs-link ");return e||t||!1}}});
// @license-end
\ No newline at end of file
/*!
* Bowser - a browser detector
* https://github.com/ded/bowser
* MIT License | (c) Dustin Diaz 2015
*/
!function(e,t,n){typeof module!="undefined"&&module.exports?module.exports=n():typeof define=="function"&&define.amd?define(t,n):e[t]=n()}(this,"bowser",function(){function t(t){function n(e){var n=t.match(e);return n&&n.length>1&&n[1]||""}function r(e){var n=t.match(e);return n&&n.length>1&&n[2]||""}function N(e){switch(e){case"NT":return"NT";case"XP":return"XP";case"NT 5.0":return"2000";case"NT 5.1":return"XP";case"NT 5.2":return"2003";case"NT 6.0":return"Vista";case"NT 6.1":return"7";case"NT 6.2":return"8";case"NT 6.3":return"8.1";case"NT 10.0":return"10";default:return undefined}}var i=n(/(ipod|iphone|ipad)/i).toLowerCase(),s=/like android/i.test(t),o=!s&&/android/i.test(t),u=/nexus\s*[0-6]\s*/i.test(t),a=!u&&/nexus\s*[0-9]+/i.test(t),f=/CrOS/.test(t),l=/silk/i.test(t),c=/sailfish/i.test(t),h=/tizen/i.test(t),p=/(web|hpw)os/i.test(t),d=/windows phone/i.test(t),v=/SamsungBrowser/i.test(t),m=!d&&/windows/i.test(t),g=!i&&!l&&/macintosh/i.test(t),y=!o&&!c&&!h&&!p&&/linux/i.test(t),b=r(/edg([ea]|ios)\/(\d+(\.\d+)?)/i),w=n(/version\/(\d+(\.\d+)?)/i),E=/tablet/i.test(t)&&!/tablet pc/i.test(t),S=!E&&/[^-]mobi/i.test(t),x=/xbox/i.test(t),T;/opera/i.test(t)?T={name:"Opera",opera:e,version:w||n(/(?:opera|opr|opios)[\s\/](\d+(\.\d+)?)/i)}:/opr\/|opios/i.test(t)?T={name:"Opera",opera:e,version:n(/(?:opr|opios)[\s\/](\d+(\.\d+)?)/i)||w}:/SamsungBrowser/i.test(t)?T={name:"Samsung Internet for Android",samsungBrowser:e,version:w||n(/(?:SamsungBrowser)[\s\/](\d+(\.\d+)?)/i)}:/coast/i.test(t)?T={name:"Opera Coast",coast:e,version:w||n(/(?:coast)[\s\/](\d+(\.\d+)?)/i)}:/yabrowser/i.test(t)?T={name:"Yandex Browser",yandexbrowser:e,version:w||n(/(?:yabrowser)[\s\/](\d+(\.\d+)?)/i)}:/ucbrowser/i.test(t)?T={name:"UC 
Browser",ucbrowser:e,version:n(/(?:ucbrowser)[\s\/](\d+(?:\.\d+)+)/i)}:/mxios/i.test(t)?T={name:"Maxthon",maxthon:e,version:n(/(?:mxios)[\s\/](\d+(?:\.\d+)+)/i)}:/epiphany/i.test(t)?T={name:"Epiphany",epiphany:e,version:n(/(?:epiphany)[\s\/](\d+(?:\.\d+)+)/i)}:/puffin/i.test(t)?T={name:"Puffin",puffin:e,version:n(/(?:puffin)[\s\/](\d+(?:\.\d+)?)/i)}:/sleipnir/i.test(t)?T={name:"Sleipnir",sleipnir:e,version:n(/(?:sleipnir)[\s\/](\d+(?:\.\d+)+)/i)}:/k-meleon/i.test(t)?T={name:"K-Meleon",kMeleon:e,version:n(/(?:k-meleon)[\s\/](\d+(?:\.\d+)+)/i)}:d?(T={name:"Windows Phone",osname:"Windows Phone",windowsphone:e},b?(T.msedge=e,T.version=b):(T.msie=e,T.version=n(/iemobile\/(\d+(\.\d+)?)/i))):/msie|trident/i.test(t)?T={name:"Internet Explorer",msie:e,version:n(/(?:msie |rv:)(\d+(\.\d+)?)/i)}:f?T={name:"Chrome",osname:"Chrome OS",chromeos:e,chromeBook:e,chrome:e,version:n(/(?:chrome|crios|crmo)\/(\d+(\.\d+)?)/i)}:/edg([ea]|ios)/i.test(t)?T={name:"Microsoft Edge",msedge:e,version:b}:/vivaldi/i.test(t)?T={name:"Vivaldi",vivaldi:e,version:n(/vivaldi\/(\d+(\.\d+)?)/i)||w}:c?T={name:"Sailfish",osname:"Sailfish OS",sailfish:e,version:n(/sailfish\s?browser\/(\d+(\.\d+)?)/i)}:/seamonkey\//i.test(t)?T={name:"SeaMonkey",seamonkey:e,version:n(/seamonkey\/(\d+(\.\d+)?)/i)}:/firefox|iceweasel|fxios/i.test(t)?(T={name:"Firefox",firefox:e,version:n(/(?:firefox|iceweasel|fxios)[ \/](\d+(\.\d+)?)/i)},/\((mobile|tablet);[^\)]*rv:[\d\.]+\)/i.test(t)&&(T.firefoxos=e,T.osname="Firefox OS")):l?T={name:"Amazon Silk",silk:e,version:n(/silk\/(\d+(\.\d+)?)/i)}:/phantom/i.test(t)?T={name:"PhantomJS",phantom:e,version:n(/phantomjs\/(\d+(\.\d+)?)/i)}:/slimerjs/i.test(t)?T={name:"SlimerJS",slimer:e,version:n(/slimerjs\/(\d+(\.\d+)?)/i)}:/blackberry|\bbb\d+/i.test(t)||/rim\stablet/i.test(t)?T={name:"BlackBerry",osname:"BlackBerry 
OS",blackberry:e,version:w||n(/blackberry[\d]+\/(\d+(\.\d+)?)/i)}:p?(T={name:"WebOS",osname:"WebOS",webos:e,version:w||n(/w(?:eb)?osbrowser\/(\d+(\.\d+)?)/i)},/touchpad\//i.test(t)&&(T.touchpad=e)):/bada/i.test(t)?T={name:"Bada",osname:"Bada",bada:e,version:n(/dolfin\/(\d+(\.\d+)?)/i)}:h?T={name:"Tizen",osname:"Tizen",tizen:e,version:n(/(?:tizen\s?)?browser\/(\d+(\.\d+)?)/i)||w}:/qupzilla/i.test(t)?T={name:"QupZilla",qupzilla:e,version:n(/(?:qupzilla)[\s\/](\d+(?:\.\d+)+)/i)||w}:/chromium/i.test(t)?T={name:"Chromium",chromium:e,version:n(/(?:chromium)[\s\/](\d+(?:\.\d+)?)/i)||w}:/chrome|crios|crmo/i.test(t)?T={name:"Chrome",chrome:e,version:n(/(?:chrome|crios|crmo)\/(\d+(\.\d+)?)/i)}:o?T={name:"Android",version:w}:/safari|applewebkit/i.test(t)?(T={name:"Safari",safari:e},w&&(T.version=w)):i?(T={name:i=="iphone"?"iPhone":i=="ipad"?"iPad":"iPod"},w&&(T.version=w)):/googlebot/i.test(t)?T={name:"Googlebot",googlebot:e,version:n(/googlebot\/(\d+(\.\d+))/i)||w}:T={name:n(/^(.*)\/(.*) /),version:r(/^(.*)\/(.*) /)},!T.msedge&&/(apple)?webkit/i.test(t)?(/(apple)?webkit\/537\.36/i.test(t)?(T.name=T.name||"Blink",T.blink=e):(T.name=T.name||"Webkit",T.webkit=e),!T.version&&w&&(T.version=w)):!T.opera&&/gecko\//i.test(t)&&(T.name=T.name||"Gecko",T.gecko=e,T.version=T.version||n(/gecko\/(\d+(\.\d+)?)/i)),!T.windowsphone&&(o||T.silk)?(T.android=e,T.osname="Android"):!T.windowsphone&&i?(T[i]=e,T.ios=e,T.osname="iOS"):g?(T.mac=e,T.osname="macOS"):x?(T.xbox=e,T.osname="Xbox"):m?(T.windows=e,T.osname="Windows"):y&&(T.linux=e,T.osname="Linux");var C="";T.windows?C=N(n(/Windows ((NT|XP)( \d\d?.\d)?)/i)):T.windowsphone?C=n(/windows phone (?:os)?\s?(\d+(\.\d+)*)/i):T.mac?(C=n(/Mac OS X (\d+([_\.\s]\d+)*)/i),C=C.replace(/[_\s]/g,".")):i?(C=n(/os (\d+([_\s]\d+)*) like mac os x/i),C=C.replace(/[_\s]/g,".")):o?C=n(/android[ 
\/-](\d+(\.\d+)*)/i):T.webos?C=n(/(?:web|hpw)os\/(\d+(\.\d+)*)/i):T.blackberry?C=n(/rim\stablet\sos\s(\d+(\.\d+)*)/i):T.bada?C=n(/bada\/(\d+(\.\d+)*)/i):T.tizen&&(C=n(/tizen[\/\s](\d+(\.\d+)*)/i)),C&&(T.osversion=C);var k=!T.windows&&C.split(".")[0];if(E||a||i=="ipad"||o&&(k==3||k>=4&&!S)||T.silk)T.tablet=e;else if(S||i=="iphone"||i=="ipod"||o||u||T.blackberry||T.webos||T.bada)T.mobile=e;return T.msedge||T.msie&&T.version>=10||T.yandexbrowser&&T.version>=15||T.vivaldi&&T.version>=1||T.chrome&&T.version>=20||T.samsungBrowser&&T.version>=4||T.firefox&&T.version>=20||T.safari&&T.version>=6||T.opera&&T.version>=10||T.ios&&T.osversion&&T.osversion.split(".")[0]>=6||T.blackberry&&T.version>=10.1||T.chromium&&T.version>=20?T.a=e:T.msie&&T.version<10||T.chrome&&T.version<20||T.firefox&&T.version<20||T.safari&&T.version<6||T.opera&&T.version<10||T.ios&&T.osversion&&T.osversion.split(".")[0]<6||T.chromium&&T.version<20?T.c=e:T.x=e,T}function r(e){return e.split(".").length}function i(e,t){var n=[],r;if(Array.prototype.map)return Array.prototype.map.call(e,t);for(r=0;r<e.length;r++)n.push(t(e[r]));return n}function s(e){var t=Math.max(r(e[0]),r(e[1])),n=i(e,function(e){var n=t-r(e);return e+=(new Array(n+1)).join(".0"),i(e.split("."),function(e){return(new Array(20-e.length)).join("0")+e}).reverse()});while(--t>=0){if(n[0][t]>n[1][t])return 1;if(n[0][t]!==n[1][t])return-1;if(t===0)return 0}}function o(e,r,i){var o=n;typeof r=="string"&&(i=r,r=void 0),r===void 0&&(r=!1),i&&(o=t(i));var u=""+o.version;for(var a in e)if(e.hasOwnProperty(a)&&o[a]){if(typeof e[a]!="string")throw new Error("Browser version in the minVersion map should be a string: "+a+": "+String(e));return s([u,e[a]])<0}return r}function u(e,t,n){return!o(e,t,n)}var e=!0,n=t(typeof navigator!="undefined"?navigator.userAgent||"":"");return n.test=function(e){for(var t=0;t<e.length;++t){var r=e[t];if(typeof r=="string"&&r in 
n)return!0}return!1},n.isUnsupportedBrowser=o,n.compareVersions=s,n.check=u,n._detect=t,n.detect=t,n})
\ No newline at end of file
// Pandoc 2.9 adds attributes on both header and div. We remove the former (to
// be compatible with the behavior of Pandoc < 2.8).
document.addEventListener('DOMContentLoaded', function(e) {
var hs = document.querySelectorAll("div.section[class*='level'] > :first-child");
var i, h, a;
for (i = 0; i < hs.length; i++) {
h = hs[i];
if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6
a = h.attributes;
while (a.length > 0) h.removeAttribute(a[0].name);
}
});
---
title: "curl vs RCurl or: how to choose a package"
description: |
  In the first entry of the #dstexts series, I ditch old timer RCurl for the new, shiny curl and talk about my five criteria for choosing R packages.
date: 2019-05-21
preview: preview.jpeg
output:
  distill::distill_article:
    self_contained: false
---
## The text message
Today's text message is from my good friend [Pablo](https://twitter.com/pablocalv). Pablo is currently in the last months of his PhD in Survey Research at the University of Salamanca, Spain. I know him from my Erasmus year at the University of Essex where we were flatmates and both took classes in survey research. Originally an SPSS / Stata guy, he has been using R more and more over the last few years and I've been his personal "R guru". Which is probably my dream job, tbh.
Anyway, to the text message (excuse the weird highlighting, still figuring that one out):
--------------------
```
Pablo Cabrera Alvarez, [17.05.19 12:42]
Hi Frie
Pablo Cabrera Alvarez, [17.05.19 12:43]
I'm desperate with something I need your help
Pablo Cabrera Alvarez, [17.05.19 12:43]
😭😭😭😭
Frie, [17.05.19 12:44]
Oh no what
Frie, [17.05.19 12:44]
Is happening
Pablo Cabrera Alvarez, [17.05.19 12:45]
look, I have this webpage from which I want to download content: download.files() That's ok
[SOME UNHELPFUL BANTER FROM MY SIDE]
Pablo Cabrera Alvarez, [17.05.19 12:45]
My problem is that the webpage needs "authentication"
Frie, [17.05.19 12:45]
Oh OK
Pablo Cabrera Alvarez, [17.05.19 12:45]
I have the credentials
Frie, [17.05.19 12:45]
Yes
Frie, [17.05.19 12:45]
Ah
Frie, [17.05.19 12:45]
Mh
Pablo Cabrera Alvarez, [17.05.19 12:45]
I have tried with Rcurl
Frie, [17.05.19 12:45]
And?
Pablo Cabrera Alvarez, [17.05.19 12:46]
but it looks like the SSL protocol is different
Pablo Cabrera Alvarez, [17.05.19 12:46]
look, this si the error
Frie, [17.05.19 12:46]
Yes
Frie, [17.05.19 12:46]
Can you send me the command?
Frie, [17.05.19 12:46]
I have a bit time to look into it
Pablo Cabrera Alvarez, [17.05.19 12:46]
x <- getURL("https://THISWEBSITE/THISFILE.zip", userpwd="USER:PASSWORD6", httpauth = 4)
Error in function (type, msg, asError = TRUE) :
error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
```
------------------------------------------------
In summary, Pablo wanted to use R to download a zip file from the Internet. Of course, he could've just downloaded it manually via the browser and put it into his `data` directory. But doing this in code is actually nice because it increases reproducibility and at the same time documents where the data is coming from.
Usually you can achieve this in R by simply using `download.file`. However, when the file is in any way protected, things get a little bit more complicated. In this case, the file was protected with so-called "basic auth". [Basic authentication](https://en.wikipedia.org/wiki/Basic_access_authentication) just means plain old username and password. If you have ever had an ugly looking popup asking you for username and password, that was probably Basic Auth. In those cases, you often have to use a `curl` wrapper in R. `curl` is, broadly speaking, software for "transferring data in various protocols" ([Wikipedia](https://en.wikipedia.org/wiki/CURL)). It consists of a C library called `libcurl` and a command-line tool called `curl`.
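To make "plain old username and password" a bit more concrete: with Basic Auth, the client sends `user:password`, Base64-encoded, in an `Authorization` header. A small sketch, assuming the `jsonlite` package is installed (it is used here only for its Base64 helpers; the solution in this post does not need it):

```{r eval=FALSE}
credentials <- "USER:PASSWORD"  # placeholder, as in the chat above

# Base64-encode the credentials and build the header line
encoded <- jsonlite::base64_enc(charToRaw(credentials))
auth_header <- paste("Authorization: Basic", encoded)
auth_header

# Base64 is an encoding, not encryption: decoding gives the credentials back
decoded <- rawToChar(jsonlite::base64_dec(encoded))
decoded
```

Because the credentials can be decoded this easily, Basic Auth should only travel over HTTPS.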
Enough background info. Let's get to how I solved it.
## My answer
(If you want to skip the story, go straight to the [solution](#solution).)
My initial reaction was: "oh boy, this looks nasty." I had never seen any error like this before. I knew that a `tlsv1 alert protocol version` error was probably not coming from a simple mistake that would be easy to a) debug and b) fix. At least not for me.
What I did know was that the last time I personally had used the `RCurl` package had been in 2014. Since then, I had managed with just using `httr`. But I also remembered that there was a newer R package called `curl`.
In the end, my debugging strategy was:
1) Try with command line `curl` to rule out server-side errors or errors at the system library level.
2) If command line `curl` is successful, use R package `curl`.
As this conversation happened right at the end of my lunch break (hi, boss, if you ever read this :wave:) and I did not have much time left, I decided to skip 1) and go straight to 2).
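A quick way to peek at the "system library level" from within R is `curl::curl_version()`, which reports the libcurl version and the SSL backend that the `curl` R package is linked against. This is only a diagnostic sketch; whether an outdated SSL library was really behind the `RCurl` error is an assumption, not something the chat confirms:

```{r eval=FALSE}
info <- curl::curl_version()

info$version      # libcurl version
info$ssl_version  # SSL/TLS backend that libcurl was built with
"https" %in% info$protocols  # is https among the supported protocols?
```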
(Editing Frie: The following is how I **think** my process was. Maybe it was totally different?!?! Next time, I'll screen-record.)
I installed the `curl` R package on my machine. Next up was probably googling "curl R package" which led me to its [website](https://cran.r-project.org/web/packages/curl/vignettes/intro.html). Right at the start is a summary of the most important functions:
```
curl_fetch_memory() saves response in memory
curl_download() or curl_fetch_disk() writes response to disk
curl() or curl_fetch_stream() streams response data
curl_fetch_multi() (Advanced) process responses via callback functions
```
It took me some minutes of *not* very carefully reading to comprehend that what I needed was `curl_download`. After I had realized this, I headed back to RStudio and typed `?curl::curl_download` in the console to open the help.
From the Description:
> Libcurl implementation of C_download (the "internal" download method) with added support for https, ftps, gzip, etc. Default behavior is identical to download.file, but request can be fully configured by passing a custom handle.
"fully configured" sounded good, so I had a look at the Usage section:
```{r eval=FALSE}
curl_download(url, destfile, quiet = TRUE, mode = "wb",
              handle = new_handle())
```
From this, it was clear to me where I would need to insert the URL (`url`) and how I could specify the destination file (`destfile`). What was not so clear to me was how I could pass the username and password required for basic authentication.
But by process of elimination, it became clear to me that it probably had to go into the `handle` argument:
- `url`: probably the URL we want to download from
- `destfile`: probably the file we want to write to
- `quiet`: no idea but a boolean will not work for username/password. Plus, "quiet" has nothing to do with authentication
- `mode`: from looking at the default argument (`"wb"`), probably something with the file mode.
So, `handle` was the only one left. Plus, I vaguely remembered configuring so-called handle objects back when using `RCurl`.
What I had found out so far:
```{r eval=FALSE}
# destfile.zip will be in the current working directory of Pablo
curl::curl_download("https://THISWEBSITE/THISFILE.zip",
                    destfile = "destfile.zip",
                    handle = new_handle()) # this is not clear yet!
```
I went back to Firefox to find out more about the `handle`, specifically how to pass basic authentication details to it. Because I couldn't find the needed information on the detailed project website just by skimming (why read carefully if you can just jump around?), I tried the project's [GitHub page](https://github.com/jeroen/curl). Still, no luck as the "Hello World" examples only covered setting HTTP request headers but not authentication. So finally, I took the time to more *carefully* read the package website and lo and behold, there was a section on ["Configuring a handle"](https://cran.r-project.org/web/packages/curl/vignettes/intro.html#configuring_a_handle).
> Creating a new handle is done using new_handle. After creating a handle object, we can set the libcurl options and http request headers.
> Use the curl_options() function to get a list of the options supported by your version of libcurl. The libcurl documentation explains what each option does. Option names are not case sensitive.
"Curl options" sounded good: Over the course of the last 1.5 years, I have written a lot of `curl` requests in the terminal, e.g. to do quick checks on databases. From this experience, I know that there are command line options for setting basic authentication in the terminal `curl` command, so there should be underlying `libcurl` equivalents because after all, terminal `curl` relies on `libcurl`.
Does this even make sense?
Anyway, I got the options:
```{r}
length(curl::curl_options())
```
Of course, I entered `curl::curl_options()` to see all the options. But because there are quite a lot and I want to save you from endlessly scrolling, I have added the `length` for the purpose of this blog post. Getting all options printed out is left as an exercise to the reader. :wink:
Because I didn't have time to read all those 251 options, I decided to take the Google route again and try to find the name of the option on the Internet:
![Googling options](google_options.png)
Nice! Especially the `CURLOPT_USERPWD` immediately appealed to me because in his original `RCurl` command, Pablo had a `userpwd` argument as well. Without even checking the links, I headed back to R to find out whether there were any options matching those I found:
```{r}
options <- curl::curl_options()
tail(options, 10)
```
Bingo for `userpwd`!
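Instead of eyeballing the tail of the list, the option names can also be searched directly. A short sketch; which options come back depends on the libcurl version installed, so the exact result on another machine is an assumption:

```{r eval=FALSE}
# only options whose name contains "userpwd"
userpwd_options <- curl::curl_options("userpwd")
names(userpwd_options)

# a broader search over all option names
all_options <- curl::curl_options()
grep("user|pass|auth", names(all_options), value = TRUE)
```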
## Final solution {#solution}
Now I was ready to set up my handle. From the package website, I knew that setting options was done with `curl::handle_setopt`:
```{r solution, eval=FALSE}
# install.packages("curl")
library(curl)
# in the handle we can specify options available to the underlying libcurl system library.
# ?curl::curl_options() -> display all options
h <- curl::new_handle()
curl::handle_setopt(h, userpwd = "USER:PASSWORD")
# ?curl::curl_download()
curl::curl_download("https://THISWEBSITE/THISFILE.zip",
                    destfile = "destfile.zip", handle = h)
```
I crossed my fingers and executed the command. And it just worked - not something that usually happens to me. I saved the code in a file and sent it to Pablo, still not sure it'd work on his computer as well. But it did! How cool!
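For repeated use, the same three calls could be wrapped in a small helper that keeps the password out of the script. This is a sketch only: the function name and the environment variable names `DOWNLOAD_USER` and `DOWNLOAD_PASSWORD` are made up for illustration and are not part of the `curl` package.

```{r eval=FALSE}
library(curl)

# hypothetical helper: credentials default to (made-up) environment variables
download_with_basic_auth <- function(url, destfile,
                                     user = Sys.getenv("DOWNLOAD_USER"),
                                     password = Sys.getenv("DOWNLOAD_PASSWORD")) {
  # fail early if no credentials were provided
  stopifnot(nzchar(user), nzchar(password))
  h <- curl::new_handle()
  curl::handle_setopt(h, userpwd = paste0(user, ":", password))
  curl::curl_download(url, destfile = destfile, handle = h)
}

# usage (placeholder URL from this post, it will not resolve):
# download_with_basic_auth("https://THISWEBSITE/THISFILE.zip", "destfile.zip")
```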
```
Frie, [17.05.19 13:01]
well does it work for starters? ;)
Frie, [17.05.19 13:01]
(as it depends on system library, could also not work on your machine)
Pablo Cabrera Alvarez, [17.05.19 13:01]
I owe you more than one dinner, believe me
Pablo Cabrera Alvarez, [17.05.19 13:01]
yes yes, I just tried
Pablo Cabrera Alvarez, [17.05.19 13:02]
it's perfect
```
After approximately 15 minutes, issue solved. :muscle:
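One side note on the solution: the `"USER:PASSWORD"` string sits in plain text in the script, which is not ideal if the script gets shared. Here is a minimal sketch of one way around that, assuming the credentials are stored in environment variables. The helper `userpwd_from_env()` and the variable names `DOWNLOAD_USER` and `DOWNLOAD_PASSWORD` are made up for illustration:

```r
# Hypothetical helper: build the "user:password" string from environment
# variables instead of hard-coding it in the script.
# DOWNLOAD_USER and DOWNLOAD_PASSWORD are made-up variable names.
userpwd_from_env <- function(user_var = "DOWNLOAD_USER",
                             password_var = "DOWNLOAD_PASSWORD") {
  user <- Sys.getenv(user_var, unset = NA)
  password <- Sys.getenv(password_var, unset = NA)
  if (is.na(user) || is.na(password)) {
    stop("Please set ", user_var, " and ", password_var,
         ", e.g. in your .Renviron file")
  }
  paste0(user, ":", password)
}

# Usage with the handle from the solution (not run here):
# h <- curl::new_handle()
# curl::handle_setopt(h, userpwd = userpwd_from_env())
```

The rest of the solution stays the same, only the `userpwd` value is now looked up at runtime.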
## Non-technical knowledge or: how to choose a package
However, there was still an open question:
```
Pablo Cabrera Alvarez, [17.05.19 13:01]
how did you know?? I have been three hours visiting forums and stuff
```
By that time, I really had to get back to work so my answer was a bit short and curt. But it's a good question that points to the importance of what I like to call "non-technical knowledge". What I mean by this is having the knowledge to answer questions like:
- what packages exist for solving problem z?
- which package do I use for solving z? x or y?
- is this Stackoverflow answer worth trying out?
- how do I google my problem?
- where can I find good information?
- ...
Of course, technical skills help with answering those questions but it is not quite the same.
While I could talk about each of those questions for ages, let's focus on the first two for the moment: How did I know about the `curl` package and why did I prefer it over `RCurl`?
For me personally, the answer to the first question boils down to keeping up with the latest developments in R. I use Twitter for that purpose because the R community is quite active there (under the hashtag #rstats, not #R!) and I follow many many R users and developers. For all people who do not want to ruin their phone usage statistics, Maëlle Salmon has written a good blog post on ["Keeping up to date with R news"](https://masalmon.eu/2019/01/25/uptodate/). Among her recommendations are mailing lists, news aggregators like [R-Bloggers](https://www.r-bloggers.com/) or [R Weekly](https://rweekly.org), attending meetups and conferences and much more.
As for the second question - "do I use package x or y?" -, I think the following "rules" feed into my decision:
1. **use the tidyverse (or rOpenSci) version if there is one:** The [tidyverse](https://www.tidyverse.org/) is probably *the* biggest change the R language has experienced in the last ~5 years. Thanks to the core developers actually being employed by RStudio to do this work, tidyverse packages, the official ones in particular, are very well maintained and up to date. Similarly, the non-profit initiative [rOpenSci](https://ropensci.org/) maintains a [list of packages](https://ropensci.org/packages/) that are "carefully vetted, staff- and community-contributed R software tools that lower barriers to working with scientific data sources and data that support research applications on the web."
So, if I have to choose between a package that is part of tidyverse or rOpenSci and one that is not, I'll always choose the former.
2. **use the more popular package (e.g. CRAN downloads):** There are almost 15,000 R packages on CRAN^[Source: [https://cran.r-project.org/web/packages/](https://cran.r-project.org/web/packages/)], a massive number. Of course, each package has its value but in my experience, the more downloads a package has, the more likely it is to work. Another indicator of importance / popularity is the number of GitHub stars.
Popular packages are just too important to be left without updates and bug fixes (at this point, let's have a round of applause for all the open source developers who put a lot of work and heart - often in their free time - into developing R packages! :clap::clap::clap:).
3. **use the newer package / don't use an unmaintained package**: Newer is not always better but if the publication date of the package I've encountered during my Google search is a few years back, I'll try to google again. You can find the publication date of a package on its CRAN page. Especially given that Stackoverflow answers go back over 10 years, I find it worth checking the date of the answer and the publication date of the recommended package. **Update 2019-05-22, 20:37**: This is particularly relevant because old, unmaintained packages can have serious security issues which can be disastrous. It is not just a matter of whether it works or not. Even if it works, it could still be the case that it is not properly protected against newer types of vulnerabilities.
4. **use the package with the better documentation**: This is just out of convenience. I am not the biggest fan of using the built-in help because it often does not provide enough context for me to get started. This is why I really love me a good GitHub Readme or even package website like [https://rplumber.io](https://rplumber.io) (again, round of applause for those writing docs :clap::clap::clap:). If in doubt, I'll choose the package with more / better documentation. This does not mean that a package with fewer docs is necessarily worse at doing its job. But it's just easier to start out with an example from the Readme than to be left alone with `?`.
5. **use packages from people that I trust to be good developers:** This final "rule" feeds back nicely to "staying updated". If I see that a package has been developed by someone I "know" from Twitter, I'm more likely to trust that it is good. Which is a bit silly because someone without Twitter could be as good a developer as this Twitter person with 10,000 followers. For me personally, it just serves as an additional way of establishing trust in the quality of the package.
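Rules 2 and 3 can also be checked from within R instead of clicking through CRAN pages. The following is only a rough sketch and not what I did back then. It assumes that the `cranlogs` package is installed and that `cranlogs::cran_downloads()` returns a data frame with the columns `package` and `count`. The helper `total_downloads()` is a hypothetical name of my own:

```r
# Hypothetical helper: sum the daily download counts per package and
# sort the packages from most to least downloaded.
total_downloads <- function(downloads) {
  totals <- vapply(
    split(downloads$count, downloads$package),
    function(x) sum(as.numeric(x)),
    numeric(1)
  )
  sort(totals, decreasing = TRUE)
}

# Needs the cranlogs package and an internet connection.
if (requireNamespace("cranlogs", quietly = TRUE)) {
  downloads <- cranlogs::cran_downloads(
    packages = c("curl", "RCurl"),
    when = "last-month"
  )
  print(total_downloads(downloads))
}

# Publication date of an installed CRAN package (NA if the field is missing):
# utils::packageDescription("curl", fields = "Date/Publication")
```

Download numbers alone do not settle the question, of course, but they are a quick first filter before looking at documentation and maintainers.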