Dvc: import/get: handle chained import data

Created on 19 Aug 2020 · 14 Comments · Source: iterative/dvc

UPDATE: Jump to https://github.com/iterative/dvc/issues/4423#issuecomment-676527748

Bug Report

λ dvc import https://github.com/iterative/example-get-started data/data.xml
Importing 'data/data.xml (https://github.com/iterative/example-get-started)' -> 'data.xml'
ERROR: unexpected error - [Errno 2] No such file or directory: 'C:\\Users\\poj12\\DVC-repos\\tests\\.dvc\\cache\\a3\\04afb96060aad90176268345e10355'

Full --verbose output:

λ dvc import https://github.com/iterative/example-get-started data/data.xml -v
2020-08-18 23:22:41,569 DEBUG: Check for update is enabled.
2020-08-18 23:22:41,752 ERROR: interrupted by the user
------------------------------------------------------------
Traceback (most recent call last):
  File "c:\users\poj12\dvc\dvc\main.py", line 53, in main
    cmd = args.func(args)
  File "c:\users\poj12\dvc\dvc\command\base.py", line 40, in __init__
    updater.check()
  File "c:\users\poj12\dvc\dvc\updater.py", line 58, in check
    self._with_lock(self._check, "checking")
  File "c:\users\poj12\dvc\dvc\updater.py", line 44, in _with_lock
    func()
  File "c:\users\poj12\dvc\dvc\updater.py", line 62, in _check
    self.fetch()
  File "c:\users\poj12\dvc\dvc\updater.py", line 84, in fetch
    daemon(["updater"])
  File "c:\users\poj12\dvc\dvc\daemon.py", line 105, in daemon
    file_path = os.path.abspath(inspect.stack()[0][1])
  File "C:\Users\poj12\AppData\Local\Programs\Python\Python38\lib\inspect.py", line 1514, in stack
    return getouterframes(sys._getframe(1), context)
  File "C:\Users\poj12\AppData\Local\Programs\Python\Python38\lib\inspect.py", line 1491, in getouterframes
    frameinfo = (frame,) + getframeinfo(frame, context)
  File "C:\Users\poj12\AppData\Local\Programs\Python\Python38\lib\inspect.py", line 1465, in getframeinfo
    lines, lnum = findsource(frame)
  File "C:\Users\poj12\AppData\Local\Programs\Python\Python38\lib\inspect.py", line 792, in findsource
    module = getmodule(object, file)
  File "C:\Users\poj12\AppData\Local\Programs\Python\Python38\lib\inspect.py", line 754, in getmodule
    os.path.realpath(f)] = module.__name__
  File "C:\Users\poj12\AppData\Local\Programs\Python\Python38\lib\ntpath.py", line 647, in realpath
    path = _getfinalpathname(path)
KeyboardInterrupt
------------------------------------------------------------

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2020-08-18 23:22:41,927 DEBUG: Analytics is disabled.
(.venv) poj12@AP-QDVJ7BLR ~/DVC-repos/tests (master)
λ
(.venv) poj12@AP-QDVJ7BLR ~/DVC-repos/tests (master)
λ
(.venv) poj12@AP-QDVJ7BLR ~/DVC-repos/tests (master)
λ dvc import https://github.com/iterative/example-get-started data/data.xml -v
2020-08-18 23:22:51,005 DEBUG: Check for update is enabled.
2020-08-18 23:22:51,516 DEBUG: Trying to spawn '['daemon', '-q', 'updater']'
2020-08-18 23:22:52,092 DEBUG: Spawned '['daemon', '-q', 'updater']'
2020-08-18 23:22:52,117 DEBUG: fetched: [(3,)]
2020-08-18 23:22:53,407 DEBUG: Removing output 'data.xml' of stage: 'data.xml.dvc'.
Importing 'data/data.xml (https://github.com/iterative/example-get-started)' -> 'data.xml'
2020-08-18 23:22:53,416 DEBUG: Computed stage: 'data.xml.dvc' md5: 'e7514d625f896d082cc0ca259453b732'
2020-08-18 23:22:53,421 DEBUG: 'md5' of stage: 'data.xml.dvc' changed.
2020-08-18 23:22:53,425 DEBUG: Creating external repo https://github.com/iterative/example-get-started@None
2020-08-18 23:22:53,430 DEBUG: erepo: git clone 'https://github.com/iterative/example-get-started' to a temporary dir
2020-08-18 23:22:56,420 DEBUG: Saving '..\..\AppData\Local\Temp\tmppx2petbedvc-clone\data\data.xml' to '.dvc\cache\a3\04afb96060aad90176268345e10355'.
2020-08-18 23:22:56,428 DEBUG: cache 'C:\Users\poj12\DVC-repos\tests\.dvc\cache\a3\04afb96060aad90176268345e10355' expected 'a304afb96060aad90176268345e10355' actual 'None'
2020-08-18 23:22:56,439 DEBUG: cache 'C:\Users\poj12\DVC-repos\tests\.dvc\cache\a3\04afb96060aad90176268345e10355' expected 'a304afb96060aad90176268345e10355' actual 'None'
2020-08-18 23:22:56,508 DEBUG: Preparing to download data from 'https://remote.dvc.org/get-started'
2020-08-18 23:22:56,512 DEBUG: Preparing to collect status from https://remote.dvc.org/get-started
2020-08-18 23:22:56,517 DEBUG: Collecting information from local cache...
2020-08-18 23:22:56,638 DEBUG: fetched: [(45,)]
2020-08-18 23:22:56,697 ERROR: unexpected error - [Errno 2] No such file or directory: 'C:\\Users\\poj12\\DVC-repos\\tests\\.dvc\\cache\\a3\\04afb96060aad90176268345e10355'
------------------------------------------------------------
Traceback (most recent call last):
  File "c:\users\poj12\dvc\dvc\main.py", line 54, in main
    ret = cmd.run()
  File "c:\users\poj12\dvc\dvc\command\imp.py", line 14, in run
    self.repo.imp(
  File "c:\users\poj12\dvc\dvc\repo\imp.py", line 6, in imp
    return self.imp_url(path, out=out, erepo=erepo, frozen=True)
  File "c:\users\poj12\dvc\dvc\repo\__init__.py", line 34, in wrapper
    ret = f(repo, *args, **kwargs)
  File "c:\users\poj12\dvc\dvc\repo\scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "c:\users\poj12\dvc\dvc\repo\imp_url.py", line 54, in imp_url
    stage.run()
  File "c:\users\poj12\dvc\.venv\lib\site-packages\funcy\decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "c:\users\poj12\dvc\dvc\stage\decorators.py", line 36, in rwlocked
    return call()
  File "c:\users\poj12\dvc\.venv\lib\site-packages\funcy\decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "c:\users\poj12\dvc\dvc\stage\__init__.py", line 429, in run
    sync_import(self, dry, force)
  File "c:\users\poj12\dvc\dvc\stage\imports.py", line 30, in sync_import
    stage.deps[0].download(stage.outs[0])
  File "c:\users\poj12\dvc\dvc\dependency\repo.py", line 97, in download
    _, _, cache_infos = repo.fetch_external([self.def_path])
  File "c:\users\poj12\dvc\dvc\external_repo.py", line 147, in fetch_external
    self.local_cache.save(
  File "c:\users\poj12\dvc\dvc\cache\base.py", line 282, in save
    return self._save(path_info, tree, hash_, save_link, **kwargs)
  File "c:\users\poj12\dvc\dvc\cache\base.py", line 290, in _save
    return self._save_file(path_info, tree, hash_, save_link, **kwargs)
  File "c:\users\poj12\dvc\dvc\cache\base.py", line 218, in _save_file
    with tree.open(path_info, mode="rb") as fobj:
  File "c:\users\poj12\dvc\dvc\repo\tree.py", line 372, in open
    return dvc_tree.open(path, mode=mode, encoding=encoding, **kwargs)
  File "c:\users\poj12\dvc\dvc\repo\tree.py", line 113, in open
    return open(cache_path, mode=mode, encoding=encoding)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\poj12\\DVC-repos\\tests\\.dvc\\cache\\a3\\04afb96060aad90176268345e10355'
------------------------------------------------------------

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2020-08-18 23:22:56,877 DEBUG: Analytics is disabled.
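
The missing path in the error lines up with the hash in the log: the "expected" value 'a304afb96060aad90176268345e10355' is split into a two-character directory and a file name under the cache directory. A minimal sketch of that mapping, inferred from the log lines above (the helper name `cache_path` is hypothetical, not a DVC function):

```python
from pathlib import PurePosixPath


def cache_path(cache_dir: str, md5: str) -> PurePosixPath:
    """Map an md5 hex digest to the cache location seen in the log above.

    Hypothetical helper for illustration only; it is not part of DVC.
    """
    # The first two hex characters become a subdirectory,
    # the remaining characters become the file name.
    return PurePosixPath(cache_dir) / md5[:2] / md5[2:]


print(cache_path(".dvc/cache", "a304afb96060aad90176268345e10355"))
```

So the error says that the cache entry for that hash was never written, not that the path was computed incorrectly.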

Similar problem for get:

λ dvc get https://github.com/iterative/example-get-started data/data.xml
ERROR: unexpected error - [Errno 2] No such file or directory: 'C:\\Users\\poj12\\DVC-repos\\.kWasCRdYJMf8x9qBkgneqt\\a3\\04afb96060aad90176268345e10355'

I tried from different locations, inside DVC repos or not.

Please provide information about your setup

Output of dvc version:

DVC version: 1.5.1
---------------------------------
Platform: Python 3.8.2 on Windows-10-10.0.18362-SP0
Supports: All remotes
Workspace directory: NTFS on C:\
Repo: dvc, git
bug p2-medium

All 14 comments

@jorgeorpinel, does this also happen in 1.5.0? Can you check out 64f038fda83c2400263b335fca6bb4f63f6ecf0f and try?

Yeah, still happens on the latest master commit.

No, I just wanted to know if it's a recent issue or not. It'd be helpful if you could check on 1.5.0 as well.

Ah OK I got you. Yes, this also happens for me in 64f038fda83c2400263b335fca6bb4f63f6ecf0f (checked it out and ran pip install -e ".[all,tests]").

Can reproduce on linux. This is bad...

Yeah, I can too. Looks like granular import is broken. model.pkl is working, and so is data.

Directory is also not working.

Hm, old dvc versions also don't work. ~It is actually because example-get-started is not dvc pull-able at all, probably some issues with the remote there.~ Need to check (and, well, need a better error on dvc side :smile: )

~example-get-started is currently using a private s3 url, instead of public http, probably someone did that by accident.~

~@shcheklein @jorgeorpinel Hm, looks like https://github.com/iterative/example-get-started/commits/master/.dvc/config didn't even have public remotes. Did someone force-push there? I do remember using that repo for get/import previously.~

For the record: there were indeed some force pushes, but current master has a publicly accessible remote. I can dvc pull from it but can't dvc get for some reason, with any dvc version, which is strange. Investigating...

Ok, so the issue is that https://github.com/iterative/example-get-started/blob/master/data/data.xml.dvc#L4 is a dvc-import-ed file by itself (i don't think it used to be that way). So dvc is not able to pull it (we didn't officially support chaining imports at all yet) and gives that strange error. Definitely need to at least error-out nicely.
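
A rough sketch of what "erroring out nicely" could look like, assuming the parsed .dvc file is a dict whose import dependencies carry a `repo` entry. The dict shape, the example values, and the names `find_repo_deps`, `check_not_chained`, and `ChainedImportError` are all assumptions made for illustration, not DVC's actual code:

```python
class ChainedImportError(Exception):
    """Hypothetical error type for illustration; not part of DVC."""


def find_repo_deps(stage: dict) -> list:
    """Return the deps of a parsed .dvc file that point at another repo."""
    return [dep for dep in stage.get("deps", []) if "repo" in dep]


def check_not_chained(stage: dict, target: str) -> None:
    """Raise a readable error if the requested target is itself an import."""
    repo_deps = find_repo_deps(stage)
    if repo_deps:
        url = repo_deps[0]["repo"].get("url", "<unknown>")
        raise ChainedImportError(
            f"'{target}' is itself imported from '{url}'; "
            "chained imports are not supported"
        )


# Assumed shape of a parsed data/data.xml.dvc in the source repo.
stage = {
    "deps": [
        {
            "path": "get-started/data.xml",
            "repo": {"url": "https://github.com/iterative/dataset-registry"},
        }
    ],
    "outs": [{"md5": "a304afb96060aad90176268345e10355", "path": "data.xml"}],
}

try:
    check_not_chained(stage, "data/data.xml")
except ChainedImportError as exc:
    print(f"ERROR: {exc}")
```

The point is only that the check can happen before anything touches the cache, so the user sees the cause instead of a FileNotFoundError.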

More precisely: https://github.com/iterative/dvc/blob/14745039a4bfd35c34dc342bd5e6d324ecd52640/dvc/repo/tree.py#L105 collects only cache_info.external cache, which repo.cloud.pull is not able to process and just finishes as if there was nothing it needed to do. Def need to handle it in DvcTree as well.

Closing in favor of https://github.com/iterative/dvc/issues/3305 , since detecting the import chaining is very close to just adding the support for it (and error-ing out on cycles).
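
To illustrate why detection and support are close: following a chain of imports and stopping on a cycle is a short loop. The sketch below works on a plain in-memory mapping from each (repo URL, path) to its upstream source; that mapping, the second repo URLs in the usage example, and the names `resolve_import_chain` and `ImportCycleError` are hypothetical and not taken from DVC:

```python
from typing import Dict, List, Optional, Tuple

# (repo URL, path inside that repo)
Source = Tuple[str, str]


class ImportCycleError(Exception):
    """Hypothetical error type for illustration; not part of DVC."""


def resolve_import_chain(
    start: Source, upstream: Dict[Source, Optional[Source]]
) -> List[Source]:
    """Follow imports from `start` to the original data, failing on cycles.

    `upstream` maps a source to the source it was imported from,
    or to None (or no entry) when it is not an import.
    """
    chain = [start]
    seen = {start}
    current = start
    while upstream.get(current) is not None:
        current = upstream[current]
        if current in seen:
            raise ImportCycleError(f"import cycle detected at {current}")
        seen.add(current)
        chain.append(current)
    return chain


# Usage: the situation from this issue, expressed as a mapping.
get_started = ("https://github.com/iterative/example-get-started", "data/data.xml")
registry = ("https://github.com/iterative/dataset-registry", "get-started/data.xml")
chain = resolve_import_chain(get_started, {get_started: registry, registry: None})
print(chain[-1])  # the source that actually holds the data
```

The last element of the chain is the source whose remote holds the data, which is where a fetch would have to go.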

Ah, glad it's an issue with the repo only, and not DVC 😅

> data.xml.dvc#L4 is a dvc-import-ed file by itself (i don't think it used to be that way). So dvc is not able to pull it

True! This is a change from our new Get Started (https://dvc.org/doc/start/data-access#import-file-or-directory)

I can confirm that both import and get work with https://github.com/iterative/dataset-registry get-started/data.xml.

> gives that strange error. Definitely need to at least error-out nicely

Agree.
