Dvc: import: fetch from source project cache if url is a local file system path

Created on 12 Oct 2019 · 20 Comments · Source: iterative/dvc

UPDATE: Jump to https://github.com/iterative/dvc/issues/2599#issuecomment-566346857 (notice also https://github.com/iterative/dvc/issues/2599#issuecomment-700337141)

DVC version: 0.62.1
Python version: 3.7.3
Platform: Darwin-18.7.0-x86_64-i386-64bit
Binary: False
Filesystem type (workspace): ('apfs', '/dev/disk1s1')

I create a DVC project in /path/to/data-reg and dvc add a data/ directory in it.

Then I create another DVC project in /path/to/project in the same file system and try to import data/ but get:

$ dvc import /path/to/data-reg data
Importing 'data (../data-reg-test)' -> 'data'
Missing cache for directory '../../../../private/var/folders/_c/3mt_xn_d4xl2ddsx2m98h_r40000gn/T/tmpkq6tyek3dvc-repo/data'.
Cache for files inside will be lost. Would you like to continue? Use '-f' to force. [y/n] y
ERROR: failed to import 'data' from '/path/to/data-reg'. - config file error: no remote specified. Setup default remote with
    dvc config core.remote <name>
or use:
    dvc pull -r <name>


Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

A few problems:

  • The first part of the message is completely mysterious to me. I don't understand what cache dir is missing and obviously that path isn't friendly.
  • I also don't know what it's asking me to confirm so I just say y and move on 😋
  • The "no remote specified" is also kind of cryptic (doesn't specify which project is missing a remote) but OK, I get it: I didn't push data/ to a remote in the first project, so the new project can't download it. However (main point of this issue) it could simply copy it from the source project's cache, since they're on the same file system. This would be a nice-to-have feature I think. In fact it should probably try this before the default remote, for performance.
  • There are two newlines after dvc pull -r <name> in the error output.

UPDATE: Moved all the UI-related stuff to #2602 (per Ivan's https://github.com/iterative/dvc/issues/2599#issuecomment-541292359 below).

c8-full-day enhancement p1-important

All 20 comments

@jorgeorpinel I would probably split this into two separate issues - UI and optimization. It would be easier to track and discuss them.

Let's start with local, as it makes the most sense and is a natural thing to do. We need to at least show a warning that we are automatically using project's cache as a remote, or maybe we need an explicit flag that would tell import/get that this is okay to do. On one hand, a warning is a low-friction approach, but it might be dangerous if user doesn't realize that he is not using a remote and then some colleague on another machine tries to dvc pull and there is no remote to pull imported data from. But in that case the Git url is a local one that is not accessible on another machine, so the user will run into that anyway. So maybe a warning is enough. On the other hand, an explicit flag would be a pain on dvc pull, as we would have to explicitly allow pulling from cache. Seems like the first approach is better.

Let's start with local

You mean use local repo and/or cache as default place to get/import from? I would agree with this.

show a warning that we are automatically using project's cache... might be dangerous if user doesn't realize that he is not using a remote...

Maybe either way output where the import is coming from as a regular INFO message. But honestly, the possibility that people won't pay attention to the output is there anyway. Also, in some other recent discussion about adding output warnings someone brought up that they can break shell scripts and cause other problems unless hidden under the -v output.

For the record: to implement this, we need to look into https://github.com/iterative/dvc/blob/0.71.0/dvc/external_repo.py#L68, where we need to check whether the url we are cloning from is a local directory and whether a default remote (core.remote in config) is set up. If it is a local directory and there is no default remote, we add a default remote that points to the original repo's cache location. Here is a quick test for this feature:

#!/bin/bash

set -x
set -e

rm -rf mytest
mkdir mytest
cd mytest

mkdir erepo
pushd erepo
git init
dvc init
echo foo > foo
dvc add foo
git add .
git commit -m "init"
popd

mkdir repo
pushd repo
git init
dvc init
dvc get ../erepo foo
popd
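
The check described above (a local url with no default remote falls back to the source repo's cache) could be sketched roughly as follows. This is only an illustration, not DVC's actual code: fallback_cache_remote and the source_config dict are hypothetical names, and the sketch assumes the source project keeps its cache in the default .dvc/cache location.

```python
import os
from typing import Optional


def fallback_cache_remote(url: str, source_config: dict) -> Optional[str]:
    """Return the source repo's cache path to use as a stand-in remote, or None.

    Hypothetical helper, not part of DVC. `source_config` mimics the parsed
    config of the source project, e.g. {"core": {"remote": "origin"}}.
    """
    # Only local directories qualify; network Git urls do not.
    if not os.path.isdir(url):
        return None
    # Respect an explicitly configured default remote (core.remote).
    if source_config.get("core", {}).get("remote"):
        return None
    # Assumption: the cache lives at the default location, .dvc/cache.
    return os.path.join(os.path.abspath(url), ".dvc", "cache")
```

With the test script above, fallback_cache_remote("../erepo", {}) would point at erepo's .dvc/cache, since erepo never configures a remote.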

This isn't completely closed by #2915, but it's great progress, from what I can see. I haven't actually tried the functionality but it seems like DVC is now getting/importing from local repo cache dir but only when no default remote is set in the source repo. See https://github.com/iterative/dvc/pull/2915#pullrequestreview-332983712 for context.

I tried to pick this one up. Big mistake. Spent a long time trying to figure out what was going on.

@fabiosantoscode What @jorgeorpinel means is a remote cascade of sorts, which is not currently supported in any shape or form. We might hack together something specifically for get/import, but it is better to solve it in general.

And in general the feature looks like: you have remote1 and remote2, and some files might be on one or another remote and we need dvc pull to poll both remotes, kinda like pkg mirrors in linux distros. General mechanism and configuration need to be discussed first.
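
To make the mirror analogy concrete, here is a minimal sketch of such a cascade. It is not an existing DVC mechanism: plain directories stand in for remotes, and fetch_from_cascade is a hypothetical name. Each source is tried in order and the first one that has the file wins.

```python
import os
import shutil
from typing import Iterable, Optional


def fetch_from_cascade(
    rel_path: str, sources: Iterable[str], dest_dir: str
) -> Optional[str]:
    """Copy `rel_path` from the first source that has it.

    Returns the source that served the file, or None if no source has it.
    Hypothetical sketch: directories stand in for caches and remotes.
    """
    for source in sources:
        candidate = os.path.join(source, rel_path)
        if os.path.isfile(candidate):
            target = os.path.abspath(os.path.join(dest_dir, rel_path))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.copyfile(candidate, target)
            return source
    return None
```

Putting the cheapest source first (for example a local cache ahead of a network remote) is what would give the performance benefit requested in the issue; how the order gets configured is exactly the open design question.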

Gotcha

@fabiosantoscode What @jorgeorpinel means is a remote cascade of sorts...

In the meantime, a quick fix could be a flag for dvc pull which tells it to ignore remotes in the source directory and only pull from local caches. Most of the time when you'd need this feature, your files would already be in your local caches, and if not, dvc could just throw the normal "not found in cache" error.

In my workflows, for example, I've got raw data being stored in data registries which are then imported into another repository for processing. Currently this import goes out to the remote, which wastes a lot of time since all the files are already available locally.

@charlesbaynham import won't try to download things if it already has them in cache. Have you tried running it?

@efiop sure, but they're not currently in the cache of the project that's calling import; they're in the cache (and remote) of its target, data registry. The work-around I've used to save on storage space as well is to have a shared cache that both projects use. I figure that, as long as I avoid garbage collection, that shouldn't cause issues.

@charlesbaynham So you have a local repo that you dvc import from, which has a default remote set up, and you would like dvc to try to fetch from its local cache instead of the default remote, right?

If so, I don't think a flag for dvc pull is a good idea: it would mess with non-imports too, and it is a rather hacky approach to something we could do automatically, so I would not be willing to introduce a public flag for it. Your shared cache approach does the trick in a much nicer way, in my opinion :slightly_smiling_face:

Yes, that's exactly right. Sorry, not sure why I was talking about pull, import is obviously the relevant command here. import would be the place for the hack-flag, but I agree that it is a hack. A proper prioritised search of caches and remotes would definitely be better.

Hi! Another thing I noticed here is that if the source repo has tracked data (e.g. .dvc file exists) but it hasn't been committed to Git, the import of that data fails. Again, this has to do with cloning the repo instead of just using the existing files already in the file system. Not sure if that would change with the hypothetical implementations discussed above.

It's probably not a big deal anyway, just a little confusing because the error message is The path '...' does not exist in the target repository, but it does exist in the source repo (untracked), just not in DVC's internal /tmp/ repo clone. It could be a bigger problem if you import something that is already committed to Git but in an outdated version (newer version exists in source but is uncommitted): no error would happen and you would assume you have imported the latest version, while it's the previous one.
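
Until something like that is supported, one way to guard against silently importing an outdated version is to check the source repo for uncommitted changes before importing. The sketch below is an assumption-laden illustration, not DVC behaviour: both function names are hypothetical, and it parses the text produced by git status --porcelain (v1 format, where each line is two status characters, a space, and a path).

```python
from typing import List


def uncommitted_paths(porcelain_output: str) -> List[str]:
    """Extract paths from `git status --porcelain` (v1) output."""
    paths = []
    for line in porcelain_output.splitlines():
        if len(line) < 4:
            continue
        # Two status characters, one space, then the path.
        path = line[3:]
        # Renames are reported as "old -> new"; keep the new path.
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        paths.append(path)
    return paths


def has_uncommitted_changes(target: str, porcelain_output: str) -> bool:
    """True if `target`, its .dvc file, or anything under it is uncommitted."""
    target = target.rstrip("/")
    for path in uncommitted_paths(porcelain_output):
        path = path.rstrip("/")
        if path in (target, target + ".dvc") or path.startswith(target + "/"):
            return True
    return False
```

The input could come from running git status --porcelain inside the source repo. Paths containing unusual characters are quoted by Git in this format, which the sketch does not handle.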

DVC 1.7.2 (exe) on Win

BTW I'm still unable to import DVC-tracked data from a local file system path that (is cached but) hasn't been pushed to remote storage.

WARNING: Some of the cache files do not exist neither locally nor on remote...
ERROR: unexpected error - [Errno 2] No such file or directory...

DVC 1.7.2 (exe) on Win

@jorgeorpinel Please don't forget to post a command you are running and a verbose log. Otherwise it is too hard to reproduce when we get to it.

unable to import DVC-tracked data from a local file system path that (is cached but) hasn't been pushed

It works now (updated to 1.7.9). Will remember to post the command going fwd, sorry!

if the source repo has tracked data (e.g. .dvc file exists) but it hasn't been committed to Git, the import of that data fails

This is still the case, but again I suppose it's expected? Full log below:

λ dvc import ../srcrepo/ data -v
2020-09-28 19:31:15,359 DEBUG: Check for update is enabled.
2020-09-28 19:31:15,398 DEBUG: Trying to spawn '['daemon', '-q', 'updater']'
2020-09-28 19:31:18,413 DEBUG: Spawned '['daemon', '-q', 'updater']'
2020-09-28 19:31:18,418 DEBUG: fetched: [(3,)]
2020-09-28 19:31:18,690 DEBUG: Removing output 'data' of stage: 'data.dvc'.
Importing 'data (../srcrepo/)' -> 'data'
2020-09-28 19:31:18,693 DEBUG: Computed stage: 'data.dvc' md5: '20df2cf5665233da0bb1b1fd26fc4a1b'
2020-09-28 19:31:18,694 DEBUG: 'md5' of stage: 'data.dvc' changed.
2020-09-28 19:31:18,695 DEBUG: Creating external repo ../srcrepo/@None
2020-09-28 19:31:18,695 DEBUG: erepo: git clone '../srcrepo/' to a temporary dir
2020-09-28 19:31:19,606 DEBUG: fetched: [(4,)]
2020-09-28 19:31:19,623 ERROR: failed to import 'data' from '../srcrepo/'. - The path '..\..\AppData\Local\Temp\tmp6eg6aiy2dvc-clone\data' does not exist in the target repository '../srcrepo/' neither as a DVC output nor as a Git-tracked file.:
------------------------------------------------------------
Traceback (most recent call last):
  File "site-packages\funcy\flow.py", line 109, in reraise
  File "dvc\external_repo.py", line 147, in fetch_external
  File "dvc\tree\repo.py", line 439, in metadata
FileNotFoundError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dvc\command\imp.py", line 18, in run
  File "dvc\repo\imp.py", line 6, in imp
  File "dvc\repo\__init__.py", line 51, in wrapper
  File "dvc\repo\scm_context.py", line 4, in run
  File "dvc\repo\imp_url.py", line 54, in imp_url
  File "site-packages\funcy\decorators.py", line 39, in wrapper
  File "dvc\stage\decorators.py", line 36, in rwlocked
  File "site-packages\funcy\decorators.py", line 60, in __call__
  File "dvc\stage\__init__.py", line 429, in run
  File "dvc\stage\imports.py", line 30, in sync_import
  File "dvc\dependency\repo.py", line 78, in download
  File "dvc\external_repo.py", line 147, in fetch_external
  File "contextlib.py", line 130, in __exit__
  File "site-packages\funcy\flow.py", line 111, in reraise
  File "<string>", line 3, in raise_from
dvc.exceptions.PathMissingError: The path '..\..\AppData\Local\Temp\tmp6eg6aiy2dvc-clone\data' does not exist in the target repository '../srcrepo/' neither as a DVC output nor as a Git-tracked file.
------------------------------------------------------------
2020-09-28 19:31:19,632 DEBUG: Analytics is disabled.

@jorgeorpinel that is a different error and in non-verbose log it will show this part of the verbose log:

The path '..\..\AppData\Local\Temp\tmp6eg6aiy2dvc-clone\data' does not exist in the target repository '../srcrepo/' neither as a DVC output nor as a Git-tracked file.

Which looks correct except for the ugly relpath. :slightly_frowning_face:

Well, it's not exactly correct. The file is tracked by Git already, just not committed. But since it's a local repo, DVC could find it in the working tree. I just don't know if we want to support this... WDYT @efiop? Up to you (I can create a separate issue BTW).

The message could at least be even more correct if it ended in something like "... nor as a file committed to Git."

@jorgeorpinel No need for a separate ticket. Could consider it within this one.
