Here's a small repro for the issue. I made two remotes, but, only pushed to the default one. get_url()
returns the path for local_remote2
, but, the file does not exist at the path as it's not pushed yet.
#! /usr/bin/env sh
set -ex
cd "$(mktemp -d)" || exit 1
git init
dvc init
dvc remote add -d local_remote "$(mktemp -d)"
dvc remote add local_remote2 "$(mktemp -d)"
echo "foo" > foo
dvc add foo
dvc push
python -c 'import dvc.api; print(dvc.api.get_url("foo"));'
ls -la $(python -c 'import dvc.api; print(dvc.api.get_url("foo", remote="local_remote"));')
ls -la $(python -c 'import dvc.api; print(dvc.api.get_url("foo", remote="local_remote2"));') # fails as it's not pushed yet
I am divided on what should actually happen. I mean, should DVC guarantee that the file exists or just show possible checksum path?
But, since this is a part of public API, looking from the user's perspective, get_url()
should make the guarantee that the file exists and should be reliable.
EDIT: Another perspective is that user should be careful themselves when using get_url()
and hence, no guarantee is essential (the files might get deleted by someone between get_url()
and actual read/write).
cc @iterative/engineering
@skshetry get_url
's implementation is pretty simple and it is meant just as a helper, that would save you hand-parsing remote url and adding md5[0:2]/md5[2:]
to it. It was never meant to do anything more than that. If you need to check if the file exists, you could api.open()
first, that would raise if cache doesn't exists, and then you could get_url
.
I agree "get URL" doesn't imply "check existence" but
If you need to check if the file exists, you could api.open() first, that would raise if cache doesn't exists
Wouldn't calling open
first defeat part of the purpose of get_url
? I mean to avoid downloading the whole file.
I guess we can just make sure the documentation of get_url
emphasizes there's no guarantee that the file exists at that location.
I.e. https://github.com/iterative/dvc.org/commit/89734fd326ad91fba1fc11673555c0b06cda7d85 (part of iterative/dvc.org/pull/908)
Wouldn't calling open first defeat part of the purpose of get_url? I mean to avoid downloading the whole file.
@jorgeorpinel , yep, however, I think @efiop was trying to say that you'll know if the file exists after you want to do something with it.
I think it shouldn't verify if the URL leads to a file, or if it is accessible. I understand that the name could be misleading, since you are _generating_ a URL instead of _getting_ it from somewhere.
How can you _get_ something that doesn't exist? :sweat_smile:
I think it shouldn't verify if the URL leads to a file, or if it is accessible. I understand that the name could be misleading, since you are generating a URL instead of getting it from somewhere.
Very good point by @mroutis ! get_url doesn't even imply that dvc has access to a remote, so it many times it won't even be able to check if the file exists anyway.
@skshetry Looks like we could close it now? Or is there anything left? 馃檪
Most helpful comment
@jorgeorpinel , yep, however, I think @efiop was trying to say that you'll know if the file exists after you want to do something with it.
I think it shouldn't verify if the URL leads to a file, or if it is accessible. I understand that the name could be misleading, since you are _generating_ a URL instead of _getting_ it from somewhere.
How can you _get_ something that doesn't exist? :sweat_smile: