I get an "unable to find DVC-file" error when trying to download a file from a private repository.
This is the command:
dvc get https://github.com/private_repo/my_repo/ model.pkl --rev branchname
This is the error:
ERROR: failed to get 'model.pkl' from 'https://github.com/private_repo/my_repo/' - unable to find DVC-file with output 'model.pkl'
This is the content of the .dvc file with the model.pkl output:
cmd: python train.py
outs:
- path: resources/model.pkl
  metric: false
  cache: true
  persist: false
  md5: a933f14a1466d27382a8a265eacd3034
- path: resources/evaluation.json
  metric: true
  cache: false
  persist: false
  md5: 286ea699ee07c4de3a689d48fd5e677b
deps:
- path: resources/X_train.pkl
  md5: e8f9fecdab4411d6a5ad4b09cc6b7821
- path: resources/y_train.pkl
  md5: 107fb604950f2099f892dc3a5898f50b
- path: resources/X_test.pkl
  md5: 3d00a8e4aab1670398757e2c76850f8d
- path: resources/y_test.pkl
  md5: b249173c50595c1f9663f40a057309e8
md5: 91f945a0209e673487e61637089aaab4
The path of the above .dvc file in the repo is: my_repo/my_service/service_train.dvc
System info (dvc installed with conda):
DVC version: 0.66.3
Python version: 2.7.15
Platform: Darwin-18.7.0-x86_64-i386-64bit
Binary: False
Any ideas regarding the possible reasons behind the error would be highly appreciated.
@ValdarT thanks for the report! Have you tried dvc get repo resources/model.pkl instead?
Thanks for the prompt reply. I did try that, it gave the same result.
kk, could you please run it with -v? Could you also try the same w/o specifying --rev, if that file exists in the master branch of course?
@ValdarT I've tried to reproduce it with a simple scenario where I have two branches with an output and I use --rev to specify one of those branches. It worked well for me. It would be great to see more info (debug log with -v) like I mentioned, or maybe a simple scenario showing how we can reproduce this.
dvc get https://github.com/my_private_repo/my_repo/ model.pkl --rev branchname -v
DEBUG: Writing '/private/var/folders/g5/0pk5g7hj1dx3f_95b04x1tph0000gn/T/tmpG4QztZdvc-erepo/.dvc/config.local'.
DEBUG: Writing '/private/var/folders/g5/0pk5g7hj1dx3f_95b04x1tph0000gn/T/tmpG4QztZdvc-erepo/.dvc/config'.
DEBUG: Removing '.JSaFog4tzwoREktJXKwF6j'
ERROR: failed to get 'resources/model.pkl' from 'https://github.com/my_private_repo/my_repo/' - unable to find DVC-file with output 'model.pkl'
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/username/anaconda3/envs/py2/lib/python2.7/site-packages/dvc/command/get.py", line 22, in run
    rev=self.args.rev,
  File "/Users/username/anaconda3/envs/py2/lib/python2.7/site-packages/dvc/repo/get.py", line 67, in get
    raise OutputNotFoundError(path)
OutputNotFoundError: unable to find DVC-file with output 'model.pkl'
The same happens when I try without specifying the branch.
Could it be related to the repository structure (i.e., to where .dvc files are located in the folders) or is it possible to somehow corrupt the state in a way that could result in something like this?
I'll try to find time tomorrow to come up with a way to reproduce this.
Ok, I think I know the reason for this - you need to specify the full path of the file relative to the project root, i.e. what it would look like after you do pull/checkout. In this case:
my_service/resources/model.pkl
@ValdarT please try that and let me know if that works.
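To spell out how that path is put together: it is the directory of the DVC-file (relative to the project root) joined with the output path as written inside that DVC-file. Below is a minimal sketch in Python, standard library only. The helper name output_path_from_root is made up for illustration and is not part of DVC, and it assumes the DVC-file sets no custom wdir, which matches the file shown above.

```python
import posixpath


def output_path_from_root(dvcfile_path, out_path):
    """Join the DVC-file's directory with an output path stored in it.

    dvcfile_path is relative to the project root; both arguments use
    forward slashes. Hypothetical helper, not a DVC API.
    """
    return posixpath.normpath(
        posixpath.join(posixpath.dirname(dvcfile_path), out_path)
    )


# The DVC-file from this issue lives at my_service/service_train.dvc
print(output_path_from_root("my_service/service_train.dvc", "resources/model.pkl"))
# my_service/resources/model.pkl
```

The printed value is what should be passed to dvc get in place of model.pkl.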
I was able to reproduce a different issue with a file that is not cached though:
$ dvc import https://github.com/shcheklein/example-get-started.git auc.metric
Importing 'auc.metric (https://github.com/shcheklein/example-get-started.git)' -> 'auc.metric'
WARNING: Cache '0eaa29dc9b8c89bb9ba1348b3c3cc772' not found. File 'auc.metric' won't be created.
ERROR: failed to import 'auc.metric' from 'https://github.com/shcheklein/example-get-started.git'. - output 'auc.metric' does not exist
cc @efiop do you know if we have already implemented the logic to import/get non-cached files? Does it look like a bug or am I missing something?
@shcheklein We haven't yet, only the API supports non-cached files right now: https://github.com/iterative/dvc/issues/2515
This solves it, indeed. I find it a bit surprising because in general dvc doesn't care about paths like this, so changing this behaviour could be a possible UX improvement.
In any case, everything works. Thank you very much for helping me out here, @shcheklein!
@ValdarT yes, the reason DVC relies on the actual path rather than on DVC-file outputs is that the output path in a DVC-file is relative, and it's possible that two or more DVC-files have the same output path value (while the files are located in different subdirectories).
It might be a good addition though to show a hint when such a situation is detected, i.e. there is a DVC-file (or several) with an output path value that matches the one provided to dvc get in the CLI. Feel free to open a ticket for that - it might be a good first one!
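A rough sketch of what such a hint could look like, in Python. Everything here is hypothetical and only meant to illustrate the idea: the function name suggest_full_paths, the input mapping of DVC-file paths to their output paths, the second DVC-file other_service/train.dvc, and the matching rule (compare the requested path against each output path as written, or against its file name) are assumptions, not DVC's implementation.

```python
import posixpath


def suggest_full_paths(requested, outs_by_dvcfile):
    """Return root-relative paths whose DVC-file output matches `requested`.

    outs_by_dvcfile maps a DVC-file path (relative to the project root)
    to the list of output paths written inside that DVC-file.
    """
    requested = posixpath.normpath(requested)
    suggestions = []
    for dvcfile, outs in sorted(outs_by_dvcfile.items()):
        base = posixpath.dirname(dvcfile)
        for out in outs:
            out = posixpath.normpath(out)
            # Match either the whole stored path or just its file name.
            if requested in (out, posixpath.basename(out)):
                suggestions.append(posixpath.join(base, out))
    return suggestions


# Two DVC-files in different subdirectories with the same output path value.
outs = {
    "my_service/service_train.dvc": ["resources/model.pkl", "resources/evaluation.json"],
    "other_service/train.dvc": ["resources/model.pkl"],  # made-up second file
}
for path in suggest_full_paths("model.pkl", outs):
    print("did you mean:", path)
```

With two DVC-files sharing the same output path value, the sketch returns both candidates, which is exactly the ambiguity that prevents dvc get from resolving a bare output path on its own.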