Hi,
I've been excited to explore DVC as I think it can offer a lot for some of the things I'm working on. I've had a couple of issues with dvc get
and dvc get-url
that I think are probably bugs. I've put them together to avoid creating lots of issues; apologies if you'd rather they were separate, I can move them into different threads.
I am testing as of v1.1.1.
After v1.0 dvc get
pointed to a directory does not result in correctly placed files. All of the files are placed in the root, rathe than subdirectories. I've only tested this with a git repository that is not a DVC repository. It worked in v0.94, but does not work correctly in v1.0 or v1.1. dvc import
works correctly.
The following example gets this directory:
$ dvc get https://github.com/explosion/projects nel-emerson
$ tree nel-emerson/
nel-emerson/
βββ README.md
βββ __init__.py
βββ el_recipe.py
βββ el_tutorial.py
βββ emerson_annotated_text.jsonl
βββ emerson_input_text.txt
βββ entities.csv
βββ nel_schema.png
βββ notebook_video.ipynb
βββ requirements.txt
0 directories, 10 files
And here's dvc import
, which works correctly:
$ dvc import https://github.com/explosion/projects nel-emerson
$ tree nel-emerson/
nel-emerson/
βββ README.md
βββ __init__.py
βββ input
βΒ Β βββ entities.csv
βββ prodigy
βΒ Β βββ emerson_annotated_text.jsonl
βΒ Β βββ emerson_input_text.txt
βββ requirements.txt
βββ scripts
βββ el_recipe.py
βββ el_tutorial.py
βββ nel_schema.png
βββ notebook_video.ipynb
3 directories, 10 files
I understand why import-url
needs the strong ETag, but I don't think it makes sense for get-url
, so I suspect this is a bug. If so it's a regression after v1.0, as it didn't happen in v0.94.
$ dvc get-url https://raw.githubusercontent.com/explosion/projects/master/nel-emerson/requirements.txt
ERROR: failed to get 'https://raw.githubusercontent.com/explosion/projects/master/nel-emerson/requirements.txt' - Weak ETags are not supported. (Etag: 'W/"24c3b7f31292b04ac8fa9938c6e14baed94a089944652e8abf7dbb9459ae5b56"', URL: 'https://raw.githubusercontent.com/explosion/projects/master/nel-emerson/requirements.txt')
This is maybe more of a feature request than a bug report, although if it's expected behaviour, I think at least the error message needs to be different. I also think it's an important feature, as it's pretty common to have an https URL sitting in front of an AWS bucket or other file storage, as direct storage URLs don't allow much flexibility for moving data.
There's also a specific, very common scenario where this comes up: binary data such as trained models are often attached to Github releases as release artefacts. These artefacts are addressed as URLs that redirect to S3. For instance:
$ dvc get-url https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz
ERROR: failed to get 'https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz' - dependency 'https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz' does not exist
While wget
works fine:
wget https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz
--2020-06-29 19:29:31-- https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz
Resolving github.com (github.com)... 140.82.118.4
Connecting to github.com (github.com)|140.82.118.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
...
Saving to: βro_core_news_sm-2.3.1.tar.gz
First issue (dvc get
not preserving directory tree) should now be fixed in master.
For weak etags we need to just remove this https://github.com/iterative/dvc/blob/0275a0d8d876e230e4b287b52cead53b5b492005/dvc/remote/http.py#L138 . Doesn't look like we shouldn't forbid it anymore, considering that we don't support external outputs scenario for HTTP.
I tried tackling redirect
issue.
Here is what I found.
By default, we are appending allow_redirects: True
to our requests. And it seems to be working fine in most cases. However, in this one the situation looks a bit different.
https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz
redirects us to some aws address. The problem is that sending HEAD
request to original link returns 302
(eg. curl -I https://github.com/explosion/spacy-models/releases/download/ro_core_news_sm-2.3.1/ro_core_news_sm-2.3.1.tar.gz
) while doing the same with allow_redirects
(curl -IL
) results in 403
. It seems to me that it might be a server configuration that makes it respond that way.
It seems like sometimes admins block HEAD
to prevent robot traffic so I think that this case is not one-time incident.
The only solution that I see is to actually GET
the link on exists
. I don't think we want to do that by default.
@honnibal
Workaround for now would be to use dvc run wget...
and freeze the stage, if you want to preserve the information how the data was obtained.
EDIT:
Well, if we use streamed GET, we could try to use it instead of HEAD
request. Trying out this approach in #4390
@honnibal It seems that all issues included in this thread have been fixed. They (fixes) should be available from the next release.
For the record: releasing 1.5.1 right now. Thank you @pared ! :pray:
Most helpful comment
For the record: releasing 1.5.1 right now. Thank you @pared ! :pray: