Drake: Native tracking of URLs

Created on 20 May 2019 · 15 Comments · Source: ropensci/drake

Description

I am considering special url_in() and url_out() functions to track remote data sources. Power users like @aedobbyn and @ldecicco-USGS have made routine use of triggers for this, and I feel drake should make it easier than that.

As I see it, the purpose of url_in() and url_out() should be to tell drake how to track changes. Otherwise, I think URLs should be treated as files, e.g. with the file trigger. Internally, we could represent URLs as strings of the form "u-BASE64_ENCODED_STRING", similar to "p-..." for files and "n-..." for namespaced functions. I introduced reusable infrastructure for this in version 7.0.0: https://github.com/ropensci/drake/blob/master/R/utils-encoding.R. The hashing logic would need some sensitive refactoring, but not a lot of grunt work.

To track a dataset, we could get an etag if available (https://github.com/ropensci/drake/issues/252#issuecomment-365342753) and fall back on the time stamp if we cannot find an etag. I am not knowledgeable about tracking data from websites, so I would be grateful for advice. cc @noamross and @sckott from #252.

new feature

All 15 comments

Even easier: what if we just made file_in() and file_out() (and maybe also knitr_in()) handle URLs? Seems safe to assume strings beginning with "http://", "ftp://", etc. are not the names of files people save locally. This simplifying assumption would make it so much easier for me to implement.

> Even easier: what if we just made file_in() and file_out() (and maybe also knitr_in()) handle URLs?

This would be consistent with base::file() behavior, so I don't think it would surprise users.

That's great precedent. file() supports file://, http://, https:// and ftp://. So all drake needs to do is detect those prefixes.
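A minimal sketch of that prefix check in base R. The helper name is_url() is hypothetical and is not necessarily what drake uses internally; the list of schemes is just the one mentioned above.

```r
# Hypothetical helper: TRUE for strings that start with one of the
# URL schemes listed above, FALSE for everything else.
is_url <- function(x) {
  grepl("^(file|http|https|ftp)://", x)
}

is_url(c("https://site.com/dataset.csv", "data/dataset.csv"))
#> [1]  TRUE FALSE
```

Anything that fails the check would keep going through the existing local-file code path.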

From some initial experimentation, it looks like time stamps will not be enough. In fact, https://github.com/ropensci/drake/archive/v7.3.0.tar.gz has an ETag but no "Last-Modified" time stamp. Below, the "Date" is the access date.

library(curl)
url <- "https://github.com/ropensci/drake/archive/v7.3.0.tar.gz"
req <- curl_fetch_memory(url)
parse_headers(req$headers)
#>  [1] "HTTP/1.1 200 OK"                                                                
#>  [2] "Transfer-Encoding: chunked"                                                     
#>  [3] "Access-Control-Allow-Origin: https://render.githubusercontent.com"              
#>  [4] "Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'; sandbox"
#>  [5] "Strict-Transport-Security: max-age=31536000"                                    
#>  [6] "Vary: Authorization,Accept-Encoding"                                            
#>  [7] "X-Content-Type-Options: nosniff"                                                
#>  [8] "X-Frame-Options: deny"                                                          
#>  [9] "X-XSS-Protection: 1; mode=block"                                                
#> [10] "ETag: \"6c06be9ff0baceebc160558249c5c156d139c533\""                             
#> [11] "Content-Type: application/x-gzip"                                               
#> [12] "Content-Disposition: attachment; filename=drake-7.3.0.tar.gz"                   
#> [13] "X-Geo-Block-List: "                                                             
#> [14] "Date: Mon, 20 May 2019 16:18:34 GMT"                                            
#> [15] "X-GitHub-Request-Id: D4A5:6995:F56F6:1D5D62:5CE2D359"

Created on 2019-05-20 by the reprex package (v0.3.0)
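For illustration, here is one way to pull a single field out of header lines like the ones printed above, using only base R. header_value() is a hypothetical helper, and the headers vector is a shortened copy of the output above so the example runs offline.

```r
# Hypothetical helper: return the value of one header from a character
# vector of "Name: value" lines, or NA_character_ if it is absent.
header_value <- function(headers, name) {
  pattern <- paste0("^", name, ":[[:space:]]*")
  # Header names are case-insensitive.
  hit <- grep(pattern, headers, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0L) {
    return(NA_character_)
  }
  # If the header appears more than once, keep the last occurrence.
  trimws(sub(pattern, "", hit[length(hit)], ignore.case = TRUE))
}

headers <- c(
  "HTTP/1.1 200 OK",
  "ETag: \"6c06be9ff0baceebc160558249c5c156d139c533\"",
  "Date: Mon, 20 May 2019 16:18:34 GMT"
)
header_value(headers, "ETag")
#> [1] "\"6c06be9ff0baceebc160558249c5c156d139c533\""
header_value(headers, "Last-Modified")
#> [1] NA
```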

Proposal:

  1. Try to get the ETag.
  2. If it cannot be found, simply warn the user, return NA_character_, and move on.
  3. If there is no internet, warn the user and try to get the ETag from the last make().
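The three steps could look roughly like the sketch below. Everything named here is an assumption for illustration: url_etag() is a hypothetical helper, `previous` stands in for whatever ETag the last make() recorded, and `online` and `fetch_headers` are hooks so the logic can be exercised without a network connection. The defaults rely on the curl package.

```r
# Hypothetical helper sketching the three proposal steps.
url_etag <- function(
  url,
  previous = NA_character_,          # assumed: ETag stored by the last make()
  online = curl::has_internet(),
  fetch_headers = function(url) {
    handle <- curl::new_handle(nobody = TRUE)  # request headers only
    req <- curl::curl_fetch_memory(url, handle = handle)
    curl::parse_headers(req$headers)
  }
) {
  # Step 3: no internet, so warn and fall back on the last recorded ETag.
  if (!online) {
    warning("No internet. Reusing the ETag from the last make().", call. = FALSE)
    return(previous)
  }
  headers <- fetch_headers(url)
  etag <- grep("^etag:", headers, ignore.case = TRUE, value = TRUE)
  # Step 2: no ETag, so warn, return NA_character_, and move on.
  if (length(etag) == 0L) {
    warning("No ETag found for ", url, call. = FALSE)
    return(NA_character_)
  }
  # Step 1: return the ETag value without the header name.
  trimws(sub("^etag:", "", etag[length(etag)], ignore.case = TRUE))
}
```

Taking the last matching line matters when a request is redirected, because the header lines of every response in the chain are returned together.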

In my experience, ETag and Last-Modified are not super common. Sites like GitHub have them, but many government data sources don't provide them. Common dataset sites in my field have a mix: Dryad uses only Last-Modified, while Figshare uses both ETag and Last-Modified.

Is the idea that if you can't detect whether the remote file has changed, then you won't return anything at all?

Thanks, @sckott. I am trying to build in a fully automated way for drake to check if remote data dependencies change. Sketch of a relevant plan:

plan <- drake_plan(
  data = munge(file_in("https://site.com/dataset.csv")),
  analysis = other_stuff(data),
  ...
)

The first make(plan) should download the data and munge it. The next make(plan) should check the website and determine if dataset.csv changed since last time. If there are no changes (and no changes to the user's code) then make(plan) should skip this step.

Is there a more reliable way to check for changes than the modification time or ETag?

I think one needs to leave it to the user what behavior to use. One option is something like url_in(..., update = c("etag", "date", "size")), where the update argument sets the hierarchy of things to check.
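A rough sketch of how such a hierarchy might be resolved, given header lines that have already been fetched. url_fingerprint() is a hypothetical name, and the mapping from "date" to Last-Modified and from "size" to Content-Length is my assumption about what those options would mean.

```r
# Hypothetical sketch of the `update` hierarchy: walk the requested
# checks in order and return the first header value that is present.
url_fingerprint <- function(headers, update = c("etag", "date", "size")) {
  update <- match.arg(update, several.ok = TRUE)
  # Assumed mapping from each option to an HTTP response header.
  fields <- c(etag = "ETag", date = "Last-Modified", size = "Content-Length")
  for (check in update) {
    pattern <- paste0("^", fields[[check]], ":[[:space:]]*")
    hit <- grep(pattern, headers, ignore.case = TRUE, value = TRUE)
    if (length(hit) > 0L) {
      value <- sub(pattern, "", hit[length(hit)], ignore.case = TRUE)
      # Prefix the value with the check that produced it.
      return(paste(check, value, sep = "="))
    }
  }
  # None of the requested headers was found.
  NA_character_
}

headers <- c(
  "HTTP/1.1 200 OK",
  "Last-Modified: Mon, 20 May 2019 16:18:34 GMT",
  "Content-Length: 1024"
)
url_fingerprint(headers)
#> [1] "date=Mon, 20 May 2019 16:18:34 GMT"
url_fingerprint(headers, update = c("size", "date"))
#> [1] "size=1024"
```

Prefixing the value with the name of the check keeps a size from ever being compared against a date if the server starts or stops sending one of the headers.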

Size is another option, as Noam said, though in my experience some servers unfortunately don't send a Content-Length response header.

Following redirects is also something to think about. A dataset at http://foo.com/bar.csv may return a 301 and point to another URL. Redirects are easy to follow, of course.

Thank you both. The more I learn, the more I begin to think a fully black-boxed solution would be brittle. I will probably close #877.

What if we just made it easier for users to supply their own change triggers based on case-specific knowledge about the data? We could have simple utility functions like url_etag(), url_last_modified(), and url_size(), which could use curl and jsonlite to supply the desired metadata.

Thinking it over more, the proposed url_*() utilities are not as magical as internet-friendly file_in()/file_out() functions, so the extra API complexity and the dependency on curl do not seem worth it. More for the manual: https://github.com/ropenscilabs/drake-manual/issues/93.

Reopening. Just found out about BiocFileCache from @LiNk-NY. Should investigate its URL tracking code.

Looks like BiocFileCache uses the ETag, the modification date, and the "expires" header: https://github.com/Bioconductor/BiocFileCache/blob/07968ad12c55bba0e85ec1a3ac2126479863dbad/R/httr.R#L10-L22. For drake, I think I would prefer to stick to the ETag and throw an error if the ETag is missing. Seems simpler to implement and explain to users, and the simplicity does not detract from existing functionality.

Well, maybe we can splice the ETag and modification time together in a composite hash-like string and throw an error if neither is available. The "expires" header does not seem important. It is probably out of scope for drake to judge whether a resource is stale.
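As a sketch of the composite idea only, not necessarily what #901 does: url_hash() below is a hypothetical helper that takes already-fetched header lines, pastes the ETag and Last-Modified values together, and stops if both are missing.

```r
# Hypothetical helper: splice ETag and Last-Modified into one string
# and throw an error if neither header is available.
url_hash <- function(headers) {
  value <- function(name) {
    pattern <- paste0("^", name, ":[[:space:]]*")
    hit <- grep(pattern, headers, ignore.case = TRUE, value = TRUE)
    if (length(hit) == 0L) {
      return("")  # a missing header contributes an empty string
    }
    sub(pattern, "", hit[length(hit)], ignore.case = TRUE)
  }
  etag <- value("ETag")
  mtime <- value("Last-Modified")
  if (!nzchar(etag) && !nzchar(mtime)) {
    stop("No ETag or Last-Modified header, so the URL cannot be tracked.", call. = FALSE)
  }
  paste(etag, mtime)
}

url_hash(c("ETag: \"abc\"", "Last-Modified: Mon, 20 May 2019 16:18:34 GMT"))
#> [1] "\"abc\" Mon, 20 May 2019 16:18:34 GMT"
```

A change in either header changes the string, so comparing it with the value from the previous make() is enough to decide whether the target is out of date.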

I am much more confident in #901 than my first attempt to implement this. file_in() URL functionality should be available soon. Thanks for all your help!

Implemented in #901.
