Dvc: Issues with dvc get and dvc get-url after 1.0

Created on 29 Jun 2020  Β·  5Comments  Β·  Source: iterative/dvc

Hi,

I've been excited to explore DVC as I think it can offer a lot for some of the things I'm working on. I've had a couple of issues with dvc get and dvc get-url that I think are probably bugs. I've put them together to avoid creating lots of issues; apologies if you'd rather they were separate, I can move them into different threads.

I am testing as of v1.1.1.

dvc get does not recreate directories correctly

After v1.0 dvc get pointed to a directory does not result in correctly placed files. All of the files are placed in the root, rathe than subdirectories. I've only tested this with a git repository that is not a DVC repository. It worked in v0.94, but does not work correctly in v1.0 or v1.1. dvc import works correctly.

The following example gets this directory:

$ dvc get https://github.com/explosion/projects nel-emerson
$ tree nel-emerson/
nel-emerson/
β”œβ”€β”€ README.md
β”œβ”€β”€ __init__.py
β”œβ”€β”€ el_recipe.py
β”œβ”€β”€ el_tutorial.py
β”œβ”€β”€ emerson_annotated_text.jsonl
β”œβ”€β”€ emerson_input_text.txt
β”œβ”€β”€ entities.csv
β”œβ”€β”€ nel_schema.png
β”œβ”€β”€ notebook_video.ipynb
└── requirements.txt
0 directories, 10 files

And here's dvc import, which works correctly:

$ dvc import https://github.com/explosion/projects nel-emerson
$ tree nel-emerson/
nel-emerson/
β”œβ”€β”€ README.md
β”œβ”€β”€ __init__.py
β”œβ”€β”€ input
β”‚Β Β  └── entities.csv
β”œβ”€β”€ prodigy
β”‚Β Β  β”œβ”€β”€ emerson_annotated_text.jsonl
β”‚Β Β  └── emerson_input_text.txt
β”œβ”€β”€ requirements.txt
└── scripts
    β”œβ”€β”€ el_recipe.py
    β”œβ”€β”€ el_tutorial.py
    β”œβ”€β”€ nel_schema.png
    └── notebook_video.ipynb
3 directories, 10 files

dvc get-url fails on URLs with weak ETAGs

I understand why import-url needs the strong ETag, but I don't think it makes sense for get-url, so I suspect this is a bug. If so it's a regression after v1.0, as it didn't happen in v0.94.

$ dvc get-url https://raw.githubusercontent.com/explosion/projects/master/nel-emerson/requirements.txt
ERROR: failed to get 'https://raw.githubusercontent.com/explosion/projects/master/nel-emerson/requirements.txt' - Weak ETags are not supported. (Etag: 'W/"24c3b7f31292b04ac8fa9938c6e14baed94a089944652e8abf7dbb9459ae5b56"', URL: 'https://raw.githubusercontent.com/explosion/projects/master/nel-emerson/requirements.txt')

dvc get-url does not follow redirects

This is maybe more of a feature request than a bug report, although if it's expected behaviour, I think at least the error message needs to be different. I also think it's an important feature, as it's pretty common to have an https URL sitting in front of an AWS bucket or other file storage, as direct storage URLs don't allow much flexibility for moving data.

There's also a specific, very common scenario where this comes up: binary data such as trained models are often attached to Github releases as release artefacts. These artefacts are addressed as URLs that redirect to S3. For instance:

$ dvc get-url https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz
ERROR: failed to get 'https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz' - dependency 'https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz' does not exist

While wget works fine:

wget https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz
--2020-06-29 19:29:31--  https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz
Resolving github.com (github.com)... 140.82.118.4
Connecting to github.com (github.com)|140.82.118.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
...
Saving to: β€˜ro_core_news_sm-2.3.1.tar.gz
bug feature request p1-important

Most helpful comment

For the record: releasing 1.5.1 right now. Thank you @pared ! :pray:

All 5 comments

First issue (dvc get not preserving directory tree) should now be fixed in master.

For weak etags we need to just remove this https://github.com/iterative/dvc/blob/0275a0d8d876e230e4b287b52cead53b5b492005/dvc/remote/http.py#L138 . Doesn't look like we shouldn't forbid it anymore, considering that we don't support external outputs scenario for HTTP.

I tried tackling redirect issue.
Here is what I found.
By default, we are appending allow_redirects: True to our requests. And it seems to be working fine in most cases. However, in this one the situation looks a bit different.

https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz

redirects us to some aws address. The problem is that sending HEAD request to original link returns 302
(eg. curl -I https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz) while doing the same with allow_redirects (curl -IL) results in 403. It seems to me that it might be a server configuration that makes it respond that way.

It seems like sometimes admins block HEAD to prevent robot traffic so I think that this case is not one-time incident.

The only solution that I see is to actually GET the link on exists. I don't think we want to do that by default.

@honnibal
Workaround for now would be to use dvc run wget... and freeze the stage, if you want to preserve the information how the data was obtained.

EDIT:
Well, if we use streamed GET, we could try to use it instead of HEAD request. Trying out this approach in #4390

@honnibal It seems that all issues included in this thread have been fixed. They (fixes) should be available from the next release.

For the record: releasing 1.5.1 right now. Thank you @pared ! :pray:

Was this page helpful?
0 / 5 - 0 ratings

Related issues

dnabanita7 picture dnabanita7  Β·  3Comments

gregfriedland picture gregfriedland  Β·  3Comments

TezRomacH picture TezRomacH  Β·  3Comments

ghost picture ghost  Β·  3Comments

mdscruggs picture mdscruggs  Β·  3Comments