Hello,
I would like to use DVC to push data to my personal server. For security reasons, I decided to use key-based authentication and I changed the default SSH port. Currently (from my understanding), it doesn't seem possible to use DVC with such configuration. Would it be possible/hard to support this?
Hi @gcoter !
You are correct, the ssh port is not configurable right now. I've created https://github.com/iterative/dvc/issues/1060 to track progress on that. I will send a patch very soon.
About the keys, did you mean an ssh key at a custom location? DVC currently supports both password auth and the default key location for ssh. Just making sure I understand you correctly.
Thanks,
Ruslan
Hi @efiop,
Thank you for your quick answer! Yes, I mean an ssh key at a custom location.
Ah, got it. Thank you for clarifying! I'll repurpose this issue to track progress on 'configurable ssh key location'. ETA for both patches is today's evening :slightly_smiling_face:
Thanks,
Ruslan
Hi @gcoter !
Both patches are merged and released in 0.18.4. Here is a quick run-through:
$ dvc remote add -d myssh ssh://example.com:/path/to/dir
$ dvc remote modify myssh user gcoter
$ dvc remote modify myssh port 2222
$ dvc remote modify myssh keyfile /path/to/key
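For reference, after running the commands above, the remote section in .dvc/config should end up looking roughly like this (the exact layout may vary between DVC versions; the values simply mirror the example commands):

```ini
['remote "myssh"']
url = ssh://example.com:/path/to/dir
user = gcoter
port = 2222
keyfile = /path/to/key
[core]
remote = myssh
```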
Please feel free to give it a try.
Thanks,
Ruslan
Hi @efiop !
Awesome! I'll give it a try soon.
Thank you :slightly_smiling_face:
Hi @efiop ,
I managed to push a small cache successfully to my server. However, I tried with an older project (the local cache takes about 200 MB) and it's been running for more than one hour with these logs:
Preparing to push data to ssh://example.com:/path/to/dir
[### ] 10% Collecting information
It seems stuck at this point. On the server, the cache is still empty. Is this normal?
Hi @gcoter !
Hm, looks like it is stuck on cache verification, which runs locally. Looks like a bug. Could you please kill it with CTRL + C and show the stacktrace that it outputs? And after that, could you try running push once again just to see if the problem persists?
Thanks,
Ruslan
Ok, so here are the error logs when I kill it:
[### ] 10% Collecting information
forrtl: error (200): program aborting due to control-C event
Image PC Routine Line Source
libifcoremd.dll 00007FFD859294C4 Unknown Unknown Unknown
KERNELBASE.dll 00007FFDBA6C56FD Unknown Unknown Unknown
KERNEL32.DLL 00007FFDBBA83034 Unknown Unknown Unknown
ntdll.dll 00007FFDBE031431 Unknown Unknown Unknown
The problem persists when I run push again, and I have some additional logs at the beginning:
λ dvc push -v
Error: Traceback (most recent call last):
File "c:\anaconda3\lib\site-packages\dvc\state.py", line 83, in load
return json.load(fd)
File "c:\anaconda3\lib\json\__init__.py", line 299, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "c:\anaconda3\lib\json\__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "c:\anaconda3\lib\json\decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "c:\anaconda3\lib\json\decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 475600 (char 475599)
Error: Failed to load 'C:\Users\Guillaume COTER\projects\kaggle\kaggle-tgs-salt-identification-challenge\.dvc\state': Expecting ',' delimiter: line 1 column 475600 (char 475599)
Preparing to push data to ssh://example.com:/path/to/dir
[### ] 10% Collecting information
forrtl: error (200): program aborting due to control-C event
Image PC Routine Line Source
libifcoremd.dll 00007FFD859294C4 Unknown Unknown Unknown
KERNELBASE.dll 00007FFDBA6C56FD Unknown Unknown Unknown
KERNEL32.DLL 00007FFDBBA83034 Unknown Unknown Unknown
ntdll.dll 00007FFDBE031431 Unknown Unknown Unknown
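For anyone hitting the same JSONDecodeError: here is a minimal sketch (illustrative data, not the real state file) of how an interrupted write produces exactly this class of parse failure:

```python
import json

# Toy stand-in for a .dvc/state file: a JSON map of file keys to records.
intact = '{"123": {"md5": "abc", "mtime": "1"}, "456": {"md5": "def", "mtime": "2"}}'
truncated = intact[:50]  # simulate a write that was cut off by CTRL + C

json.loads(intact)  # the complete document parses fine

error = None
try:
    json.loads(truncated)  # the cut-off document fails partway through
except json.JSONDecodeError as exc:
    error = exc
print("state file unreadable:", error.msg)
```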
Looks like a deadlock somewhere in our state logic. I've created https://github.com/iterative/dvc/issues/1081 to track the investigation. Looking into it right now.
What was the method of installation in this case? Our binary package for windows, right?
Btw, could you please also try running dvc push again, but now with the --jobs 1 option (i.e. dvc push -j 1)? If it is indeed a deadlock, limiting concurrency to 1 thread might be a viable workaround.
I still have the same problem with the --jobs 1 option. I installed DVC with pip. Should I try with the binary package for Windows?
Oh :slightly_frowning_face: No, you don't need to try with the binary package. Thank you for the feedback! I'm investigating it right now.
Ok :) Thank you very much for your help!
Hi @gcoter !
I have released a 0.18.11 with a possible fix for this issue. Could you please upgrade and give it a try?
Thanks,
Ruslan
Hi @efiop !
I still have the same issue with the new version, and still this message about the state:
Error: Failed to load 'C:\Users\Guillaume COTER\projects\kaggle\kaggle-tgs-salt-identification-challenge\.dvc\state': Expecting ',' delimiter: line 1 column 770800 (char 770799)
So, I had a look at the state file and the end is a bit weird:
...: "1531761974000000000"}, "6291176889292512117": {"md5": "e18a47167b2c6ad8a6f1813dbb6a0038", "mtime": "1531761606000000000"}, "6291176837987249304": {"md5": "90b1379e1106c8c931fdcbd837754a6c", "mtime": "1531761818000000000"}, "6291176887684744497": {"md5": "215f86b9726fde7a5fb39676a02fdcc4", "mtime": "1531761758000000000"
It doesn't end with curly brackets, which is problematic I guess? I didn't have this issue before running dvc push for the first time.
Hi @gcoter !
Hm, looks like the problem I've fixed was not the original one.
The error about the state file is expected since the previous dvc process was interrupted while writing to the state file, hence the unfinished braces.
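The truncation Ruslan describes is the classic hazard of writing a file in place. A common mitigation, sketched below under the assumption of a JSON state map (this is illustration only, not DVC's actual code), is to write to a temporary file and atomically rename it over the target, so an interrupted process never leaves a half-written file behind:

```python
import json
import os
import tempfile


def atomic_json_dump(obj, path):
    """Write JSON to `path` so readers never observe a half-written file.

    Sketch of the usual fix for interrupted writes: dump to a temporary
    file in the same directory, then atomically rename it over the target.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fobj:
            json.dump(obj, fobj)
        os.replace(tmp, path)  # atomic on both POSIX and Windows (Python 3.3+)
    except BaseException:
        os.remove(tmp)  # clean up the temp file if anything went wrong
        raise
```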
To make it easier, would you mind packing the whole dir into an archive and sending it to me so I could hopefully reproduce the issue? It is totally understandable if you do mind doing that though, e.g. for privacy reasons.
Which directory do you need exactly?
It would be ideal if you could just pack the whole project. I.e. let's say you have a directory myrepo that has your *.dvc files, the .dvc directory, etc; it would be great if you could pack myrepo into an archive and send it to me for debugging.
Well, since it is a Kaggle-related project, I do not mind sharing my code, except the connection parameters to my server (but I could mask them). However, the data cannot be shared outside the competition.
I could try with another project and see whether it works.
No worries, totally understandable. :slightly_smiling_face: Let's go through a few more questions then to eliminate the variables:
1) Which python version are you using?
2) Could you try a different python version and see if the problem persists (e.g. python2 and python3)?
3) Could you try installing our binary windows package and give it another try?
4) Could you try moving the project to a mac or linux machine and check if the issue persists?
Thanks,
Ruslan
> I could try with another project and see whether it works.
That would be great too!
@gcoter Could you please tell how big the .dvc/state file is (i.e. file size) after you interrupt dvc push (leave it running for ~10 minutes or longer though)? Also, does your data (i.e. files that are cached by dvc, so everything you've dvc add-ed and everything you've specified with -o for dvc run) consist of directories with lots and lots of files? I'm asking these last two questions because I have a suspicion that the state file gets way too big, and on top of that json.dump() slows everything down. Looks like we need to introduce a state file limit, along with a limit on the size of files that get cached in the state file. Working on preparing a patch for that.
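The kind of limit mentioned above could look something like this sketch (prune_state and the cap value are hypothetical, for illustration only, not DVC's implementation): once the state map exceeds a cap, keep only the most recently modified entries.

```python
def prune_state(state, max_entries=10000):
    """Keep only the `max_entries` most recently modified entries.

    Hypothetical sketch of a state-file size cap: `state` maps keys to
    {"md5": ..., "mtime": ...} records, as in the file excerpt shown above.
    """
    if len(state) <= max_entries:
        return state
    # Sort entries newest-first by mtime and drop the tail.
    newest = sorted(state.items(), key=lambda kv: int(kv[1]["mtime"]), reverse=True)
    return dict(newest[:max_entries])
```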
@gcoter I've released 0.18.12 with the fix for state file. Please feel free to try it out.
Thanks,
Ruslan
Hi @efiop !
Unfortunately, I don't have enough time to try everything you asked today. However, I can answer several of your questions:
1) Python 3.6.5 :: Anaconda custom (64-bit)
2) I dvc add-ed several directories with quite a lot of files. For instance, the training set contains 8000 images (~45 MB in total).
3) .dvc/state takes 377 KB

I'll try the new version as soon as possible. Thank you!
No worries, thank you for the info!
I've tried to reproduce your environment earlier today but unfortunately was unable to reproduce the issue. Will see if the 0.18.12 makes any difference for you and will go from there.
Thanks,
Ruslan
I tried with the latest version and the majority of the files were transferred! But it crashed for some reason:
(17174/23901): [##############################] 100% data/test/images\2675a2f9b6.png
(17175/23901): [##############################] 100% data/test/images\267d02b0fd.png
(17176/23901): [##############################] 100% data/test/images\267e4768a6.png
(17177/23901): [##############################] 100% data/test/images\267efbb1f8.png
(17178/23901): [##############################] 100% data/test/images\26831e1535.png
Error: Failed to push data to the cloud: Error reading SSH protocol banner[WinError 10054] An existing connection was forcibly closed by the remote host
And if I try again, it resumes and then crashes again before finishing. I think it is related to my server, not to DVC. Maybe it is configured to limit the amount of data that can be transferred. Given the number of files, it might explain the problem.
Hi @gcoter !
Looks like the original issue has been resolved with the last patch, yay :tada: The issue you are experiencing now is indeed caused by the server not being able to serve so many jobs at the same time. This issue was accidentally brought up today by another user with another type of remote server, and it turned out that 8*NCPU threads was way too much for his server to handle. I've just released 0.18.13 with a lower default number of threads (4*NCPU), which should be easier on the server. Still, it might be too much for your particular case, and if so, please try setting a lower number of simultaneous jobs with the --jobs N option, e.g. dvc push -j 8 or even dvc push -j 4.
Thanks,
Ruslan
@gcoter Also, looking at your logs I've noticed that paths are improperly joined (e.g. data/test/images\26831e1535.png should be data\test\images\26831e1535.png on Windows). We might have a bug (unrelated to your issue) here, looking into it right now.
Ah, looks like it is just how the name got resolved, which is purely cosmetic and doesn't affect anything. I'll prepare a patch soon to fix that. Created https://github.com/iterative/dvc/issues/1095 for that.
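The mixed separators in the log come from joining a forward-slash path prefix with a Windows-style join. Python's ntpath module reproduces the effect (a small illustration, not DVC's code):

```python
import ntpath

prefix = "data/test/images"  # a prefix stored with forward slashes
name = "26831e1535.png"

# A Windows join on a POSIX-style prefix yields mixed separators,
# exactly like the log line above.
mixed = ntpath.join(prefix, name)

# Normalizing the prefix first gives a consistent Windows path.
consistent = ntpath.join(prefix.replace("/", "\\"), name)

print(mixed)       # data/test/images\26831e1535.png
print(consistent)  # data\test\images\26831e1535.png
```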
Hi @efiop !
I'm having the same issue again: it hangs forever. Using the debugger, I managed to find which method is hanging: https://github.com/iterative/dvc/blob/0.18.13/dvc/remote/ssh.py#L108
Everything goes fine until this line. I can see the find command running on my server. It takes a few seconds. Then, nothing happens. It seems that recv_exit_status waits forever. I read that Paramiko can have trouble with large output. In my case, the output of the find command contains 21791 lines. I think that might be the cause of the issue. What do you think?
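For context, the usual workaround when Paramiko hangs in recv_exit_status on a command with large output is to drain the channel to EOF before asking for the exit status, so the remote process is never blocked writing into a full buffer. A sketch (drain_and_get_status is a hypothetical helper; recv and recv_exit_status are Paramiko's Channel methods, stubbed here for testing):

```python
def drain_and_get_status(channel, bufsize=65536):
    """Read all remote output before fetching the exit status.

    Waiting on recv_exit_status() without consuming stdout can fill the
    transport window when a command (like `find` over ~20k files) produces
    large output, hanging both sides forever. Draining first avoids that.
    """
    chunks = []
    while True:
        data = channel.recv(bufsize)
        if not data:  # b"" signals EOF: the remote side closed the stream
            break
        chunks.append(data)
    return b"".join(chunks), channel.recv_exit_status()
```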
Hi @gcoter !
Thank you for the detailed analysis, this is extremely helpful! Indeed, looks like using find to check if files exist is not ideal in this case. Created https://github.com/iterative/dvc/issues/1102 to track progress on this issue. Will look into it ASAP.
Thanks,
Ruslan
Hi @gcoter !
I've merged the patch for that and released it under 0.18.14. Please feel free to give it a try. I've tried it myself with directories with >50K files and it works fine for me now.
Thanks,
Ruslan
Hi @efiop !
I still have the same issue with the new version, however I think I know how to reproduce the problem! I will post an explanation in a new issue.
Looking forward to it! May I ask how long (approximately) you left it running before killing it?
Thanks,
Ruslan
Please have a look at #1104 for more details :)