Dvc: add: from external FS or local external outputs

Created on 29 Jan 2019  路  14Comments  路  Source: iterative/dvc

$ dvc add /Volumes/Seagate/storage/data/reddit-May2015_z.tsv.zip

It copies the file into the cache, removes the original file in the external drive and creates symlink instead (to the dvc cache).

Based on some offline discussions it confuses people. One of the biggest concerns - after removing DVC project (the entire dir with the code as well as DVC-cache) user sees a symlink to a file which does not exist.

  1. Why should DVC work with external files? Is there any meaningful scenario? If not - let's throw an error when adding external files.
  2. Even if we have a scenario for (1) let's not touch\remove external files?
  3. As a result of the command, I have reddit-May2015_z.tsv.zip.dvc file but not the actual file in the workspace (like reddit-May2015_z.tsv.zip). It is not clear how to instantiate the file in the workspace?
enhancement p0-critical ui

All 14 comments

This is just a usual "external output" case, where you want to extend your project's workspace with the workspace of your entire machine. I don't think it is a bug, but rather a miscommunication. I guess people expect dvc add to copy a file to their project?

The guy saw dvc add command and interpreted that as adding a file under DVC control. The dataset was deleted from the external drive after the project was deleted.

I expect this dvc add external-file scenario is quite common and we should do something about this. The external output scenario might be a good one but we should think more carefully about this scenario and how it intersects with the basic scenario. I'll think about the intersection and I'll try to propose some solution.

@dmpetrov I see, this is the guy from ODS you are talking about. Agreed. I think we could simply add a confirmation prompt for external local files along with something like --external flag to suppress it. Will take a closer look at this later.

Yeah, that's what I was thinking about. But let's think a bit more about this experience.

Or, as discussed with @shcheklein , we could require a special external cache setup for external outputs, same as we do with s3/gs/etc. That way it would error-out automatically, without us having to introduce --external flag, that is breaking the symmetry for s3/gs/etc external outputs.

More context on this - https://discuss.dvc.org/t/share-nas-data-in-server/180 . It looks it's a really common thing and the way our "external outputs" behave in this case is very confusing.

The scenario with external outputs makes seen but it is too advanced for dvc add and it confuses people and especially beginners. This is a usual error in the first-day experience no matter what is the user's background - hardcore users can easily make this mistake.

Let's prevent using external outputs in dvc add. Advanced users can use dvc run -o ~/Download/myfile (no command) instead of dvc add ~/Download/myfile.

It is also related to #2955. DVC needs to resolve paths, exits with an error if the path is outside the current repo (no external outputs). The paths inside the repo need to be converted to relevant paths.

Had a great discussion with @dmpetrov and @shcheklein that it might make sense to introduce a config option instead of --external for now, as it will be much less annoying to the advanced users.

But a proper solution here, is probably to introduce a notion of workspaces, that will enforce proper use. E.g. maybe something like

dvc workspace add mys3 s3://bucket/ruslan  # this might even automatically create .dvc/cache, similar to local workspace
dvc add s3://bucket/ruslan/path # will understand that the file is within the legal workspace

but it should carefully evaluated. We will do that later. For now we need to prevent new users from shooting themselves in the leg.

I have some questions about the new --external option (noticed in https://github.com/iterative/dvc.org/pull/1387#pullrequestreview-425721931):

  • Does dvc add <external_path> (without --external) error out now? (With a message indicating you need to use that option explicitly, I would expect)
  • Since it was also introduced to dvc run, why don't external dependencies need the same option? (Seems confusing UI-wise)

Does dvc add (without --external) error out now? (With a message indicating you need to use that option explicitly, I would expect)

@jorgeorpinel Yep, https://github.com/iterative/dvc/pull/3929/files#diff-43298ed5b9e6a1345c43c8aedf6971c3R92

Since it was also introduced to dvc run, why don't external dependencies need the same option? (Seems confusing UI-wise)

Because using them is safe and we haven't seen people misusing it.

OK. But --external seems like it could apply or affect any external stuff (both deps and outs). Should the option be --enable-external-outs or something shorter like --x-outs ?

@jorgeorpinel Agreed, the name could be made more explicit, but --external was added to simply prevent the misuse, it is only meant for advanced users. The external outs scenario is going to be revisited in the future, so I wouldn't bother renaming it again until then (at which point we will likely get rid of it).

Sounds good.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

mfrata picture mfrata  路  3Comments

shcheklein picture shcheklein  路  3Comments

jorgeorpinel picture jorgeorpinel  路  3Comments

shcheklein picture shcheklein  路  3Comments

robguinness picture robguinness  路  3Comments