Drake: Native tracking of URLs

Created on 20 May 2019 · 15 Comments · Source: ropensci/drake

Prework

[x] Read and abide by drake's code of conduct.
[x] Search for duplicates among the existing issues, both open and closed.

Description

I am considering special url_in() and url_out() functions to track remote data sources. Power users like @aedobbyn and @ldecicco-USGS have made routine use of triggers for this, and I feel like drake should make it easier than that. Related:

As I see it, the purpose of url_in() and url_out() should be to tell drake how to track changes. Otherwise, I think URLs should be treated as files, e.g. with the file trigger. Internally, we could represent URLs as strings of the form "u-BASE64_ENCODED_STRING", similar to "p-..." for files and "n-..." for namespaced functions. I introduced reusable infrastructure for this in version 7.0.0: https://github.com/ropensci/drake/blob/master/R/utils-encoding.R. The hashing logic would need some sensitive refactoring, but not a lot of grunt work.

To track a dataset, we could get an etag if available (https://github.com/ropensci/drake/issues/252#issuecomment-365342753) and fall back on the time stamp if we cannot find an etag. I am not knowledgeable about tracking data from websites, so I would be grateful for advice. cc @noamross and @sckott from #252.

new feature

wlandau

👍2

All 15 comments

Even easier: what if we just made file_in() and file_out() (and maybe also knitr_in()) handle URLs? Seems safe to assume strings beginning with "http://", "ftp://", etc. are not the names of files people save locally. This simplifying assumption would make it so much easier for me to implement.

wlandau on 20 May 2019

👍1

Even easier: what if we just made file_in() and file_out() (and maybe also knitr_in()) handle URLs?

This would be consistent with base::file() behavior, so I don't think it would surprise users.

bpbond on 20 May 2019

👍2

That's great precedent. file() supports file://, http://, https:// and ftp://. So all drake needs to do is detect those prefixes.

wlandau on 20 May 2019
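
The prefix check described in the comment above could be sketched as follows. This is a minimal illustration in base R, not drake's actual implementation; the helper name is_url() is hypothetical, and the list of schemes is taken from the comment.

```r
# Hypothetical helper (not part of drake): TRUE for strings that start with
# one of the URL schemes mentioned above as supported by base::file().
is_url <- function(x) {
  grepl("^(file|http|https|ftp)://", x)
}

# Vectorized over candidate file_in() arguments.
is_url(c("https://site.com/dataset.csv", "data/local.csv", "ftp://host/file.txt"))
```

Note that file:// still points at a local file, so a real implementation might want to treat it differently from the remote schemes.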

From some initial experimentation, it looks like time stamps will not be enough. In fact, https://github.com/ropensci/drake/archive/v7.3.0.tar.gz has an ETag but no "Last-Modified" time stamp. Below, the "Date" is the access date.

library(curl)
url <- "https://github.com/ropensci/drake/archive/v7.3.0.tar.gz"
req <- curl_fetch_memory(url)
parse_headers(req$headers)
#>  [1] "HTTP/1.1 200 OK"                                                                
#>  [2] "Transfer-Encoding: chunked"                                                     
#>  [3] "Access-Control-Allow-Origin: https://render.githubusercontent.com"              
#>  [4] "Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'; sandbox"
#>  [5] "Strict-Transport-Security: max-age=31536000"                                    
#>  [6] "Vary: Authorization,Accept-Encoding"                                            
#>  [7] "X-Content-Type-Options: nosniff"                                                
#>  [8] "X-Frame-Options: deny"                                                          
#>  [9] "X-XSS-Protection: 1; mode=block"                                                
#> [10] "ETag: \"6c06be9ff0baceebc160558249c5c156d139c533\""                             
#> [11] "Content-Type: application/x-gzip"                                               
#> [12] "Content-Disposition: attachment; filename=drake-7.3.0.tar.gz"                   
#> [13] "X-Geo-Block-List: "                                                             
#> [14] "Date: Mon, 20 May 2019 16:18:34 GMT"                                            
#> [15] "X-GitHub-Request-Id: D4A5:6995:F56F6:1D5D62:5CE2D359"

Created on 2019-05-20 by the reprex package (v0.3.0)

Proposal:

Try to get the ETag.
If it cannot be found, simply warn the user, return NA_character_, and move on.
If there is no internet, warn the user and try to get the ETag from the last make().

wlandau on 20 May 2019
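
The first two steps of the proposal above could be sketched like this. It assumes the headers are already available as a character vector of lines, like the parse_headers() output shown in the comment; the helper name etag_from_headers() is hypothetical, and the offline fallback to the last make() is not covered.

```r
# Hypothetical helper (not part of drake or curl): extract the ETag from a
# character vector of header lines such as the parse_headers() output above.
etag_from_headers <- function(headers) {
  # Header names are case-insensitive, so match "ETag:" loosely.
  hits <- grep("^etag:", headers, ignore.case = TRUE, value = TRUE)
  if (length(hits) == 0L) {
    # As proposed: warn, return NA_character_, and move on.
    warning("no ETag found in the response headers", call. = FALSE)
    return(NA_character_)
  }
  # Drop the header name and keep the opaque value, quotes included.
  trimws(sub("^etag:", "", hits[1], ignore.case = TRUE))
}

headers <- c(
  "HTTP/1.1 200 OK",
  "ETag: \"6c06be9ff0baceebc160558249c5c156d139c533\"",
  "Content-Type: application/x-gzip"
)
etag_from_headers(headers)
```

The value is kept verbatim because an ETag is an opaque token; only equality with the previously stored value matters.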

In my experience ETag and Last-Modified are not super common. Sites like GitHub have them, but many government data sources don't provide them. Common dataset sites in my field have a mix: Dryad uses only Last-Modified, while Figshare uses both ETag and Last-Modified.

Is the idea that if you can't detect whether the remote file has changed, then you won't return anything at all?

sckott on 20 May 2019

Thanks, @sckott. I am trying to build in a fully automated way for drake to check if remote data dependencies change. Sketch of a relevant plan:

plan <- drake_plan(
  data = munge(file_in("https://site.com/dataset.csv")),
  analysis = other_stuff(data),
  ...
)

The first make(plan) should download the data and munge it. The next make(plan) should check the website and determine if dataset.csv changed since last time. If there are no changes (and no changes to the user's code) then make(plan) should skip this step.

Is there a more reliable way to check for changes than the modification time or ETag?

wlandau on 20 May 2019

I think one needs to leave it to the user what behavior to use. One option is something like url_in(..., update = c("etag", "date", "size")), where the update argument sets the hierarchy of things to check.

noamross on 20 May 2019
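
One way to read the update hierarchy suggested above is "use the first listed piece of metadata that the server actually provides". The sketch below illustrates that reading only; url_in() and its update argument are a proposal in this thread, not an existing drake API, and pick_fingerprint() and the meta list are hypothetical names.

```r
# Hypothetical helper: `meta` is a named list of metadata already extracted
# from the response headers, with NA_character_ for anything the server
# did not send. `update` gives the order in which fields are tried.
pick_fingerprint <- function(meta, update = c("etag", "date", "size")) {
  for (field in update) {
    value <- meta[[field]]
    if (!is.null(value) && !is.na(value)) {
      # Prefix with the field name so an ETag never collides with a size.
      return(paste(field, value, sep = ":"))
    }
  }
  # Nothing usable was found.
  NA_character_
}

# Illustrative values: no ETag, but a modification date and a size.
meta <- list(
  etag = NA_character_,
  date = "Sat, 18 May 2019 10:00:00 GMT",
  size = "1024"
)
pick_fingerprint(meta)
pick_fingerprint(meta, update = c("size", "date"))
```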

Size is another option, as Noam said, though in my experience some servers unfortunately don't give a Content-Length response header.

Following redirects is something to think about. A dataset at http://foo.com/bar.csv may throw a 301 and point to another URL. It is easy to follow redirects, of course.

sckott on 20 May 2019

Thank you both. The more I learn, the more I begin to think a fully black-boxed solution would be brittle. I will probably close #877.

What if we just made it easier for users to supply their own change triggers based on case-specific knowledge about the data? We could have simple utility functions like url_etag(), url_last_modified(), and url_size(), which could use curl and jsonlite to supply the desired metadata.

wlandau on 21 May 2019

Thinking it over more, the proposed url_*() utilities are not as magical as internet-friendly file_in()/file_out() functions, so the extra API complexity and the dependency on curl do not seem worth it. More for the manual: https://github.com/ropenscilabs/drake-manual/issues/93.

wlandau on 22 May 2019

Reopening. Just found out about BiocFileCache from @LiNk-NY. Should investigate its URL tracking code.

wlandau on 2 Jun 2019

👍1

Looks like BiocFileCache uses the ETag, modification date, and the "expires" header: https://github.com/Bioconductor/BiocFileCache/blob/07968ad12c55bba0e85ec1a3ac2126479863dbad/R/httr.R#L10-L22. For drake, I think I would prefer to stick to the ETag and throw an error if the ETag is missing. Seems simpler to implement and explain to users, and the simplicity does not detract from existing functionality.

wlandau on 6 Jun 2019

Well, maybe we can splice the ETag and modification time together in a composite hash-like string and throw an error if neither is available. The "expires" header does not seem important. It is probably out of scope for drake to judge whether a resource is stale.

wlandau on 6 Jun 2019
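
The composite string described above might look roughly like the sketch below. This is an assumption about the idea, not the code merged in #901; url_fingerprint() and the "|" separator are hypothetical choices made for illustration.

```r
# Hypothetical helper: combine the ETag and the Last-Modified time into one
# string that changes whenever either piece of metadata changes.
url_fingerprint <- function(etag = NA_character_, mtime = NA_character_) {
  if (is.na(etag) && is.na(mtime)) {
    # As suggested above: fail loudly when neither is available.
    stop("no ETag or Last-Modified header available", call. = FALSE)
  }
  # Replace a missing component with an empty string before splicing.
  etag <- if (is.na(etag)) "" else etag
  mtime <- if (is.na(mtime)) "" else mtime
  paste(etag, mtime, sep = "|")
}

url_fingerprint(etag = "abc123", mtime = "Sat, 18 May 2019 10:00:00 GMT")
url_fingerprint(etag = "abc123")
```

Comparing this string between successive make() calls would then be enough to decide whether the remote resource looks different.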

I am much more confident in #901 than my first attempt to implement this. file_in() URL functionality should be available soon. Thanks for all your help!

wlandau on 6 Jun 2019

Implemented in #901.

wlandau on 6 Jun 2019
