Dvc: api: get_url() does not guarantee if the file exists

Created on 15 Jan 2020  路  5Comments  路  Source: iterative/dvc

Here's a small repro for the issue. I made two remotes, but, only pushed to the default one. get_url() returns the path for local_remote2, but, the file does not exist at the path as it's not pushed yet.

#! /usr/bin/env sh
set -ex

cd "$(mktemp -d)" || exit 1
git init
dvc init

dvc remote add -d local_remote "$(mktemp -d)"
dvc remote add local_remote2 "$(mktemp -d)"

echo "foo" > foo
dvc add foo

dvc push

python -c 'import dvc.api; print(dvc.api.get_url("foo"));'
ls -la $(python -c 'import dvc.api; print(dvc.api.get_url("foo", remote="local_remote"));')
ls -la $(python -c 'import dvc.api; print(dvc.api.get_url("foo", remote="local_remote2"));') # fails as it's not pushed yet

I am divided on what should actually happen. I mean, should DVC guarantee that the file exists or just show possible checksum path?

But, since this is a part of public API, looking from the user's perspective, get_url() should make the guarantee that the file exists and should be reliable.

EDIT: Another perspective is that user should be careful themselves when using get_url() and hence, no guarantee is essential (the files might get deleted by someone between get_url() and actual read/write).

cc @iterative/engineering

discussion

Most helpful comment

Wouldn't calling open first defeat part of the purpose of get_url? I mean to avoid downloading the whole file.

@jorgeorpinel , yep, however, I think @efiop was trying to say that you'll know if the file exists after you want to do something with it.

I think it shouldn't verify if the URL leads to a file, or if it is accessible. I understand that the name could be misleading, since you are _generating_ a URL instead of _getting_ it from somewhere.

How can you _get_ something that doesn't exist? :sweat_smile:

All 5 comments

@skshetry get_url's implementation is pretty simple and it is meant just as a helper, that would save you hand-parsing remote url and adding md5[0:2]/md5[2:] to it. It was never meant to do anything more than that. If you need to check if the file exists, you could api.open() first, that would raise if cache doesn't exists, and then you could get_url.

I agree "get URL" doesn't imply "check existence" but

If you need to check if the file exists, you could api.open() first, that would raise if cache doesn't exists

Wouldn't calling open first defeat part of the purpose of get_url? I mean to avoid downloading the whole file.

I guess we can just make sure the documentation of get_url emphasizes there's no guarantee that the file exists at that location.

I.e. https://github.com/iterative/dvc.org/commit/89734fd326ad91fba1fc11673555c0b06cda7d85 (part of iterative/dvc.org/pull/908)

Wouldn't calling open first defeat part of the purpose of get_url? I mean to avoid downloading the whole file.

@jorgeorpinel , yep, however, I think @efiop was trying to say that you'll know if the file exists after you want to do something with it.

I think it shouldn't verify if the URL leads to a file, or if it is accessible. I understand that the name could be misleading, since you are _generating_ a URL instead of _getting_ it from somewhere.

How can you _get_ something that doesn't exist? :sweat_smile:

I think it shouldn't verify if the URL leads to a file, or if it is accessible. I understand that the name could be misleading, since you are generating a URL instead of getting it from somewhere.

Very good point by @mroutis ! get_url doesn't even imply that dvc has access to a remote, so it many times it won't even be able to check if the file exists anyway.

@skshetry Looks like we could close it now? Or is there anything left? 馃檪

Was this page helpful?
0 / 5 - 0 ratings

Related issues

mfrata picture mfrata  路  3Comments

prihoda picture prihoda  路  3Comments

jorgeorpinel picture jorgeorpinel  路  3Comments

GildedHonour picture GildedHonour  路  3Comments

shcheklein picture shcheklein  路  3Comments