Dvc: Overlapping functionality between add, import-url, import, and maybe get/get-url?

Created on 30 Aug 2019 · 24 Comments · Source: iterative/dvc

I just realized dvc import and dvc import-url mainly perform a download + dvc add (plus slightly different tracking of the data, which is produced with import stages).

Wouldn't it be less confusing to just have dvc add support URLs as targets, and detect whether the target is a local/external file path, a remote URL, or a Git URL (external DVC project)? And then apply whichever logic corresponds to each target.
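To make the proposal concrete, here is a rough sketch of what such target detection could look like. The `classify_target` helper and its matching rules are hypothetical assumptions for illustration only, not anything DVC implements.

```shell
# Hypothetical helper: guess what kind of target a string is.
# The patterns below are illustrative assumptions, not DVC behavior.
classify_target() {
    case "$1" in
        *.git|git@*|https://github.com/*)
            echo "git-url" ;;
        *://*)
            echo "remote-url" ;;
        *)
            echo "local-path" ;;
    esac
}

classify_target /tmp/foo
classify_target s3://bucket/path
classify_target https://github.com/iterative/example-get-started
```

Note that a plain HTTPS URL can point at either a file or a Git repository, so any rule like the `github.com` pattern above is a guess rather than a reliable test.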

cc @iterative/engineering for thoughts

p.s. similarly, dvc get could include dvc get-url by detecting whether the URL provided is a Git URL. In total this would reduce 5 commands into 2.

question

jorgeorpinel

All 24 comments

dvc add supports URLs, but considers them as external outputs, while dvc import considers them as external dependencies. I don't really see any way to implicitly detect which one the user actually wants. Adding implies that you are adding something within your workspace, while importing is about bringing something from outside, hence the dvc add/import-url separation.

Regarding the detection, we try to keep things explicit to avoid creating a mess, where some commands end up with flags unrelated to the current use. For example, the --rev flag for dvc import doesn't really apply to dvc import-url. You might see that mess in the dvc status and dvc status -c commands already, which we plan to separate into data and pipeline parts, and we definitely don't want to create more such confusing commands by mashing different commands together.

efiop on 30 Aug 2019

@jorgeorpinel You are right, they _seem_ to do the same thing (and this is confusing), but actually they do different things. Consider this test:

mkdir test
cd test/

dvc init --no-scm

touch /tmp/foo /tmp/bar

dvc add /tmp/foo
dvc import-url /tmp/bar

cat foo.dvc 
md5: 64ceba0f567f921105b5fad3374c8419
wdir: .
outs:
- md5: d41d8cd98f00b204e9800998ecf8427e
  path: /tmp/foo
  cache: true
  metric: false
  persist: false

cat bar.dvc
md5: f7bb21ba1f93f9c8100968883a71bbf5
wdir: .
locked: true
deps:
- md5: d41d8cd98f00b204e9800998ecf8427e
  path: /tmp/bar
outs:
- md5: d41d8cd98f00b204e9800998ecf8427e
  path: bar
  cache: true
  metric: false
  persist: false

So, dvc add /tmp/foo creates a _file-tracking_ DVC-file (which tracks an external/remote file), while dvc import-url /tmp/bar creates a _stage_ DVC-file, which represents a process that downloads an external/remote file to the workspace, and then tracks this local file. The field locked: true means that this stage is locked and this process will never run again (unless we unlock it). So, basically it is the same as doing:

cp /tmp/bar .       # or 'wget url', etc.
dvc add bar

But dvc import-url also records the origin of the file (where you got it from), so that you don't forget it.
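As a small illustration of that difference, the sketch below recreates trimmed versions of the two DVC-files from the test and tells them apart by the presence of a top-level `deps:` key. The `dvcfile_kind` helper is hypothetical, and checking for `deps:` is only an assumption that matches these two examples.

```shell
# Recreate trimmed versions of the two DVC-files shown in the test above.
workdir=$(mktemp -d)
cat > "$workdir/foo.dvc" <<'EOF'
outs:
- md5: d41d8cd98f00b204e9800998ecf8427e
  path: /tmp/foo
EOF
cat > "$workdir/bar.dvc" <<'EOF'
locked: true
deps:
- md5: d41d8cd98f00b204e9800998ecf8427e
  path: /tmp/bar
outs:
- md5: d41d8cd98f00b204e9800998ecf8427e
  path: bar
EOF

# Hypothetical helper: treat a file with a top-level "deps:" key as a stage.
dvcfile_kind() {
    if grep -q '^deps:' "$1"; then
        echo "stage"
    else
        echo "file-tracking"
    fi
}

dvcfile_kind "$workdir/foo.dvc"
dvcfile_kind "$workdir/bar.dvc"
```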

Please let me know if I have some mistakes (which is quite possible).

dashohoxha on 31 Aug 2019

@efiop

dvc add supports urls...

I didn't know that 😮 (not documented in https://dvc.org/doc/commands-reference/add). How does it work? I just tried dvc add http://code.dvc.org/get-started/code.zip and it threw an error.

...but considers them as _external outputs_, while dvc import considers them as _external dependencies_.

True, they're different as Dashamir also explained.

adding implies that you are adding something within your workspace, and then tracks this local file

In this case I disagree. An external output is not in the workspace. I guess you mean "within your local file system"?

Thanks

jorgeorpinel on 1 Sep 2019

@jorgeorpinel It is an external output. See https://dvc.org/doc/user-guide/external-outputs . It doesn't work for http, because we only support external dependencies for it.

In this case I disagree. An external output is not in the workspace. I guess you mean "within your local file system"?

Not really, it could be an s3 file, which is not part of your local fs, but is part of your workspace. E.g. dvc add s3://bucket/path.

efiop on 1 Sep 2019

@jorgeorpinel When we use dvc add, a copy of the file is definitely stored in the cache, no matter whether the file is local, external or remote. So it makes sense to say that we are including it in the workspace. We also track (store in cache) all the versions of that file, whenever it changes, with dvc commit. So I would rather call it a "tracked file" instead of "output file".

dashohoxha on 1 Sep 2019

Not really, it could be an s3 file, which is not part of your local fs, but is part of your workspace. E.g. dvc add s3://bucket/path.

@efiop hmmm... How do you define "workspace"? This is what we have in the glossary now:

Do you say the added data is "part of the workspace" because it's downloaded into the local cache as @dashohoxha mentioned?

p.s. I also agree there's ambiguity between the terms "tracked" and "output"... Maybe we should favor "tracked file" most of the time in the docs, while only talking about "outputs" in the context of the internal DVC-file (YAML) structure... What do you guys think?

jorgeorpinel on 2 Sep 2019

When we use dvc add, a copy of the file is definitely stored in the cache, no matter whether the file is local, external or remote.

Actually @dashohoxha are you sure about this? Let's try it.

jorgeorpinel on 2 Sep 2019

When we use dvc add, a copy of the file is definitely stored in the cache, no matter whether the file is local, external or remote.

@dashohoxha I tried your dvc add example above and indeed the file is copied to the .dvc/cache dir (as well as remaining in /tmp/foo). No file links supported in this case right @efiop?

So given this, I'm not sure the difference between add and import-url is that clear anymore guys, other than how the DVC-file is written of course (but that is too technical for regular users to remember). Basically, both download/copy an external file and then track it...

And why does import-url even support local file system paths? (That's not a URL.) There seems to be some confusing overlapping in the way these commands work.

jorgeorpinel on 2 Sep 2019

No file links supported in this case right @efiop?

They are supported, but you probably have the default link types: reflink, then copy.

So given this, I'm not sure the difference between add and import-url is that clear anymore guys, other than how the DVC-file is written of course (but that is too technical for regular users to remember). Basically, both download/copy an external file and then track it...

The difference is that dvc add /some/path/outside/of/repo doesn't bring the copy to your in-repo workspace, like dvc import-url does. And import-url doesn't touch the original, caching only the in-repo downloaded copy.

And why does import-url even support local file system paths? (That's not a URL.) There seems to be some confusing overlapping in the way these commands work.

Think about it as a corner case of a URL.

efiop on 2 Sep 2019

Thanks @efiop. Firstly, I don't think you noticed my questions in https://github.com/iterative/dvc/issues/2455#issuecomment-526961346:

How do you define "workspace"?
Do you say the added data is "part of the workspace" because it's downloaded into the local cache? (Assuming it's a local cache and not an external one)

And sorry I'm still confused 🙁

No file links supported in this case right?

They are supported...
And import-url doesn't touch the original...

So in the dvc add example above /tmp/foo is moved to .dvc/cache/d4/1d8cd98f00b204e9800998ecf8427e and then reflinked back to /tmp/foo? I tried verifying this on my system but it's really hard to check reflinks on Mac.
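The cache path quoted above appears to be the file's checksum split after its first two characters. A minimal sketch of that mapping follows; the `cache_path` helper is hypothetical, and the two-character split is inferred from the path in this thread rather than from DVC's code.

```shell
# Hypothetical helper: map a checksum to a path inside the cache,
# assuming the first two characters name the subdirectory.
cache_path() {
    prefix=$(printf '%s' "$1" | cut -c1-2)
    rest=$(printf '%s' "$1" | cut -c3-)
    echo ".dvc/cache/$prefix/$rest"
}

cache_path d41d8cd98f00b204e9800998ecf8427e
# prints .dvc/cache/d4/1d8cd98f00b204e9800998ecf8427e
```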

jorgeorpinel on 3 Sep 2019

So in the dvc add example above /tmp/foo is moved to .dvc/cache/d4/1d8cd98f00b204e9800998ecf8427e and then reflinked back to /tmp/foo?

UPDATE: @mroutis confirmed this. (See this and this messages.)

Couldn't this cause problems since the external file could be used by other applications in the same system? I.e. wouldn't it make more sense to not touch the file, and just reflink it into cache?

Cc @shcheklein

jorgeorpinel on 3 Sep 2019

@jorgeorpinel, dvc import-url targeting a local file is the same as cp src dest && mv src .dvc/cache/${hash} && ln .dvc/cache/${hash} dest, while dvc add skips the initial cp.
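The two sequences can be simulated in a scratch directory without DVC. This is only a sketch: the directory layout and the cache file names are made up, and it assumes (following the earlier statement that import-url doesn't touch the original) that import-url moves the in-repo copy into the cache rather than the source.

```shell
# Scratch layout standing in for an external location and a DVC project.
root=$(mktemp -d)
mkdir -p "$root/external" "$root/repo/.dvc/cache"
echo "data" > "$root/external/bar"
echo "data" > "$root/external/foo"

# import-url style: copy into the repo first, then cache and link the copy.
cp "$root/external/bar" "$root/repo/bar"
mv "$root/repo/bar" "$root/repo/.dvc/cache/bar_hash"   # placeholder cache name
ln "$root/repo/.dvc/cache/bar_hash" "$root/repo/bar"

# add style: no initial copy, so the external file itself is moved
# into the cache and then linked back to its original location.
mv "$root/external/foo" "$root/repo/.dvc/cache/foo_hash"   # placeholder cache name
ln "$root/repo/.dvc/cache/foo_hash" "$root/external/foo"

ls -l "$root/external" "$root/repo" "$root/repo/.dvc/cache"
```

In the listing, `external/bar` keeps a link count of 1 (it was never touched), while `external/foo` now shares its data with a cache entry.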

As you noticed, it will cause problems to move files around when they are used in other contexts, this is when dvc import-url becomes useful.

If dvc add became ln src .dvc/cache/${hash} && ln .dvc/cache/${hash} dest, then with symlinks, removing the src would result in data loss (since there's no way to check out the original file). It would also complicate uploading the cache to S3 or other data storage: how should it handle symlinks? (The same goes for dvc pull and _hardlinks_.)
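The data-loss concern with symlinks can be reproduced with plain files. The sketch below uses made-up file names in a scratch directory; it shows general filesystem behavior and is not a trace of what DVC does.

```shell
scratch=$(mktemp -d)
echo "payload" > "$scratch/src_hard"
echo "payload" > "$scratch/src_sym"

ln "$scratch/src_hard" "$scratch/cache_hard"     # hard link to the source
ln -s "$scratch/src_sym" "$scratch/cache_sym"    # symbolic link to the source

# Remove both original files.
rm "$scratch/src_hard" "$scratch/src_sym"

# The hard link still holds the data.
cat "$scratch/cache_hard"

# The symlink now points at nothing, so the data is gone.
if [ -e "$scratch/cache_sym" ]; then
    echo "symlink still resolves"
else
    echo "symlink is dangling"
fi
```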

I could argue that the import-url name doesn't contemplate the case of local files, since the src is not in a URL format; however, you could think of it as an implicit file://

https://en.wikipedia.org/wiki/File_URI_scheme

ghost on 3 Sep 2019

As you noticed, it will cause problems to move files around when they are used in other contexts, this is when dvc import-url becomes useful.

So if the way dvc add behaves with external files is problematic, why does it exist? ~~I think dvc add should do what import-url does for external files. Then import-url wouldn't need to support file paths at all. But maybe~~ I'm missing something.

jorgeorpinel on 3 Sep 2019

@jorgeorpinel, could you explain a little bit more how dvc add should behave in the following case?

dvc init --no-scm
echo "foo" > foo
dvc add foo

ghost on 3 Sep 2019

@jorgeorpinel I think, it exists for the same reason dvc add s3://something exists and serves the same purpose, but when you have a second large "local" partition instead of s3. Think NAS/NFS mounted somewhere. import-url brings the data to your project (yes, we can utilize symlinks here, I'm not sure we are doing this by default).

Also, the key difference here is that add is expanding your workspace. In case of S3 url or file:// external URL you can think about your workspace which consists of your project dir + some remote location (or locations). dvc checkout can manage multiple locations in this scenario if you have external cache for them.

dvc import-url, on the other hand, is about the same project workspace: it brings files to you. It can be really painful in some cases.

Also, the key difference that follows from the "explanation" above is that in the case of add you address files in your scripts the way they were added in the first place. It does not change the "address". In case of import-url it changes - it becomes /path/to/project/something.

shcheklein on 3 Sep 2019

In my comprehension, dvc get-url some-url does the same thing as wget some-url (or similar, depending on the type of some-url). So it is superfluous.

In the same way, dvc import-url some-url practically does the same thing as dvc get-url some-url; dvc add downloaded-file. So, this one seems to be superfluous too.

I am not sure about the advantages of having these commands (get-url and import-url), but one of the disadvantages seems to be that they create some confusion with dvc get, dvc import, and even dvc add. So, maybe they should be deprecated.

dashohoxha on 3 Sep 2019

So it is superfluous.

depending on the type of some-url - this is the key. It is an omni-cloud interface. Yes, it's not extremely useful in my opinion, and was added mostly for symmetry, but I think it's not superfluous.

dvc import-url some-url practically does the same thing as dvc get-url some-url; dvc add downloaded-file

Again, not extremely useful command and we have been discussing this a lot. The most important difference though is that dvc import-url _keeps_ the connection (the URL an artifact was downloaded from) and can check for updates after that.
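A rough illustration of what "keeping the connection" makes possible: record a checksum of the source at import time and compare it later. The `source_status` helper is hypothetical and uses `cksum` purely for the demonstration; it is not how DVC computes or stores checksums.

```shell
# Stand-in for the external source of an imported file.
src=$(mktemp)
echo "v1" > "$src"

# Checksum recorded at "import" time.
recorded=$(cksum < "$src")

# Hypothetical helper: compare the source's current checksum with the recorded one.
source_status() {
    current=$(cksum < "$1")
    if [ "$current" = "$2" ]; then
        echo "up to date"
    else
        echo "changed"
    fi
}

source_status "$src" "$recorded"   # source unchanged so far
echo "v2" > "$src"                 # the source is updated upstream
source_status "$src" "$recorded"   # the change is detected
```

With a plain download followed by dvc add, there would be no recorded source to compare against.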

So, maybe they should be deprecated.

Considering all of the above, that was the reason to make get-url and import-url separate commands, with these "ugly" names containing a -, and to focus get and import on the actually important stuff and keep them as clean as possible. Since we don't advertise them (get-url/import-url) very actively in get started, I think it's fine to keep them for now.

shcheklein on 3 Sep 2019

depending on the type of some-url - this is the key. It is an omni-cloud interface.

From the developers' point of view, the cloud can be of any type and they should be prepared to handle all of them (the supported ones). So, an omni-cloud interface does make sense. From a user's point of view, he uses only one type of remote store, maybe two, and he knows how to download from them. So, an omni-cloud interface will only add complexity and confusion for him.

Since we don't advertise them (get-url/import-url) very actively in get started, I think it's fine to keep them for now.

As I understand it, that is the meaning of _deprecation_: not advertising them and advising users to use better alternatives instead. So, that's fine.

dashohoxha on 3 Sep 2019

By the way, I think that commands get and import work in a way that I find a bit obscure and not transparent (and this makes them confusing).

Let's take the example presented on the man page (https://dvc.org/doc/commands-reference/get):

dvc get https://github.com/iterative/example-get-started model.pkl

I understand that https://github.com/iterative/example-get-started is the GitHub URL of a DVC project. But if I go there I cannot find model.pkl. Where on earth is model.pkl? To find it I have to check the files evaluate.dvc, featurize.dvc, prepare.dvc, train.dvc and ... aha... it is one of the outputs of train.dvc. The description of the path option gives no clue at all about this and the first time I tried it I had a hard time figuring out what this model.pkl is and where it is located. It only says "Path to data within DVC repository." when actually it is one of the outputs in some stage file. What if the project has tens of stage files in several branches, how should I find it out?

I am not sure how this can be improved, but maybe specifying the stage file as well may help. For example:

dvc get https://github.com/iterative/example-get-started \
        train.dvc/model.pkl

# or this:

dvc get https://github.com/iterative/example-get-started/train.dvc \
        model.pkl

The same goes for dvc import.

dashohoxha on 3 Sep 2019

@dashohoxha I don't see how specifying the stage file can improve that, to be honest. It makes it easier to read for sure. But if I want to actually import or get something I have to find the stage file that corresponds to the data I'd like to get anyway. And I have to remember the name of the data I want to import/get.

A solution could be to provide an additional command, dvc list, to show all available artifacts.
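To give a feel for what such a listing might report, the sketch below scans DVC-files in a directory and prints the paths found under `outs:`. Everything here is hypothetical: the `list_outputs` helper, the sample train.dvc contents, and the placeholder checksums are made up, and it assumes the simple YAML layout shown earlier in this thread.

```shell
# Sample project with one made-up stage file (checksums are placeholders).
repo=$(mktemp -d)
cat > "$repo/train.dvc" <<'EOF'
deps:
- md5: 00000000000000000000000000000000
  path: train.py
outs:
- md5: 11111111111111111111111111111111
  path: model.pkl
EOF

# Hypothetical helper: print every path listed under "outs:" in each DVC-file.
list_outputs() {
    for f in "$1"/*.dvc; do
        awk -v stage="$(basename "$f")" '
            /^[a-z]+:/ { in_outs = ($1 == "outs:") }   # track the current top-level key
            in_outs && $1 == "path:" { print $2 " (from " stage ")" }
        ' "$f"
    done
}

list_outputs "$repo"
```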

shcheklein on 4 Sep 2019

@mroutis the case you ask me about in https://github.com/iterative/dvc/issues/2455#issuecomment-527294251 is not an external file path.

@shcheklein:

it exists for the same reason dvc add s3://something exists and serves the same purpose.

Not the same. dvc add doesn't modify s3://something.

but when you have a second large "local" partition ... import-url brings the data to your project

dvc add also brings the data into the project cache.

the key difference here is that add is expanding your workspace... you can think about your workspace which consists of your project dir + some remote locations. dvc checkout can manage multiple locations...
in the case of add... It does not change the address...

Great explanation, thanks! (This is why I was asking Ruslan for his definition of "workspace".) I updated the glossary entry and related docs in https://github.com/iterative/dvc.org/pull/601/commits/5a562a2cbd316bb7e107019815279e6101c2274c. (Please review in the corresponding PR if necessary.)

I still think that the behavior of dvc add is dangerous/intrusive as it basically removes the original external file (moving it into the project cache) and then links it back... This may break 3rd party processes or applications. But not sure there's a solution for this problem! (That's a question.)

p.s. off-topic reply:

@dashohoxha

From the developers' point of view, the cloud can be of any type and they should be prepared to handle all of them...

get-url/import-url can handle the same protocols as the types of remotes that DVC supports so it may make sense to have the commands IMO. Not critical commands though, for sure. Deprecation would mean removing the commands in a future release though... Someone could just look at DVC analytics and decide this later based on their usage.

jorgeorpinel on 4 Sep 2019

@dashohoxha on the topic of import files not existing in the DVC repos (https://github.com/iterative/dvc/issues/2455#issuecomment-527387858), this is explained in the same example you mention, immediately after the command you pasted:

Note that the model.pkl file doesn't actually exist in the data directory of the external Git repo. Instead, the corresponding DVC-file train.dvc is found.

But are there any other examples that need a similar note?

Solution can be to provide an additional command dvc list to show all available artifacts.

I like this idea though! Maybe dvc get --list and dvc import --list though? (Aliases for the same command) Should we create a separate ticket for this?

jorgeorpinel on 4 Sep 2019

I guess this issue can be closed at this point... If anyone has additional thoughts on the conclusion below or any other recent comments/questions, please let us know or open separate issues (:

I still think that the behavior of dvc add is dangerous/intrusive as it basically removes the original external file (moving it into the project cache) and then links it back... This may break 3rd party processes or applications. But not sure there's a solution for this problem!

Thanks all 🙇

jorgeorpinel on 4 Sep 2019

I still think that the behavior of dvc add is dangerous/intrusive as it basically removes the original external file (moving it into the project cache) and then links it back... This may break 3rd party processes or applications.

It should not be a problem with reflinks and copy because they don't move anything, they just copy. Maybe it is not a problem with hardlinks as well. It may be a problem only with symlinks, and only in marginal/special cases.

dashohoxha on 4 Sep 2019
