Dvc: Secsh channel 10 open FAILED: open failed: Connect failed

Created on 16 Mar 2020 · 20 Comments · Source: iterative/dvc

Please provide information about your setup
DVC version (i.e. dvc --version), platform, and method of installation (pip, homebrew, pkg (Mac), exe (Windows), DEB (Linux), RPM (Linux))

version 0.88.0
installed via pip on Linux and Mac.
Syncing to an Ubuntu host via SSH.

Everything seems to be working but I get a cryptic warning message every time I do anything. For example:

dvc push
0% Querying cache in ssh://[email protected]/media/sda2/dvc/first_day| |0/42 Secsh channel 10 open FAILED: open failed: Connect failed
Secsh channel 10 open FAILED: open failed: Connect failed
Secsh channel 10 open FAILED: open failed: Connect failed
2% /media/sda2/dvc/first_day/4e/4b31f0c5784a2e185d88a3120cac19| |1/42 [00:02<0Secsh channel 10 open FAILED: open failed: Connect failed
Everything is up to date.

This is probably an edge case due to my setup but I'm not sure how to quiet the message or resolve the issue.

enhancement p2-medium

Most helpful comment

I'll look into this, although out of the recent patches, I don't think any of the relevant changes should have affected ssh remotes, since RemoteSSH still has its own overridden cache_exists implementation.

All 20 comments

@brorfred, can you paste the dvc push -v verbose output? Also, can you try with a single job and check?

dvc push -j 1

Thanks for looking into this!

Running the following command:

dvc push -v -j 1 > verbose_push_one_job.log

Generates the following output:

  0% Querying cache in ssh://[email protected]/media/sda2/dvc/first_day|                                                                                                                                   |0/51 [00:00<?,     ?file/s]
Secsh channel 10 open FAILED: open failed: Connect failed

And the log file:
verbose_push_one_job.log

@brorfred, My google-fu returned this answer from Unix StackExchange: https://unix.stackexchange.com/a/459051

It basically says that this happens when you exceed a resource limit, e.g. the number of open file descriptors.
Please try the suggestion in that link, and also try increasing the ulimit.

And, please respond if it solved the issue as it will help other users as well.
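For reference, checking and raising the open-file limit for the current shell session can be sketched as follows (generic shell commands, not DVC-specific; the value 65536 is just an example):

```shell
# Show the current soft and hard limits on open file descriptors
ulimit -Sn
ulimit -Hn
# To raise the soft limit for this shell session (cannot exceed the hard limit):
# ulimit -n 65536
```

Note that this only affects the current shell and its children; a permanent change usually goes in /etc/security/limits.conf or the systemd unit for the service.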

I'm starting to realize that this error is probably a catch-all message that can indicate almost anything. Most of my resource limits are already generous:

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256902
max locked memory       (kbytes, -l) 16384
max memory size         (kbytes, -m) unlimited
open files                      (-n) 20000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 256902
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

The strange thing is that scp works fine:

touch test
scp test [email protected]:/media/sda2/dvc/first_day/
test                                100%    0     0.0KB/s   00:00 

@brorfred, looks like the issue is on server side. Closing it for now, feel free to re-open if you find that the issue is on dvc side.

It seems like the problem is in paramiko, particularly in paramiko.ssh_exception.ChannelException. The SSHConnection class in dvc.remote.ssh.connection assumes that the exception fails quietly in open_max_sftp_channels, but paramiko logs a warning to standard error. Not sure if this is specific to our setup, but I get the same behavior on my Mac laptop...

from dvc.remote.ssh import connection

ssh = connection.SSHConnection(host="zotac.bror.us", username="bror", port=22)
ssh.open_max_sftp_channels()                                          
Secsh channel 10 open FAILED: open failed: Connect failed
Out[122]: 
[<paramiko.sftp_client.SFTPClient at 0x7fe1321fac50>,
 <paramiko.sftp_client.SFTPClient at 0x7fe1321be150>,
 <paramiko.sftp_client.SFTPClient at 0x7fe132239990>,
 <paramiko.sftp_client.SFTPClient at 0x7fe132202950>,
 <paramiko.sftp_client.SFTPClient at 0x7fe1321fa050>,
 <paramiko.sftp_client.SFTPClient at 0x7fe13219ed50>,
 <paramiko.sftp_client.SFTPClient at 0x7fe13219df90>,
 <paramiko.sftp_client.SFTPClient at 0x7fe137584850>,
 <paramiko.sftp_client.SFTPClient at 0x7fe13218f990>,
 <paramiko.sftp_client.SFTPClient at 0x7fe13220a4d0>]

#Redo the command 

ssh.open_max_sftp_channels()                                          
Out[123]: 
[<paramiko.sftp_client.SFTPClient at 0x7fe1321fac50>,
 <paramiko.sftp_client.SFTPClient at 0x7fe1321be150>,
 <paramiko.sftp_client.SFTPClient at 0x7fe132239990>,
 <paramiko.sftp_client.SFTPClient at 0x7fe132202950>,
 <paramiko.sftp_client.SFTPClient at 0x7fe1321fa050>,
 <paramiko.sftp_client.SFTPClient at 0x7fe13219ed50>,
 <paramiko.sftp_client.SFTPClient at 0x7fe13219df90>,
 <paramiko.sftp_client.SFTPClient at 0x7fe137584850>,
 <paramiko.sftp_client.SFTPClient at 0x7fe13218f990>,
 <paramiko.sftp_client.SFTPClient at 0x7fe13220a4d0>]

Yes, the text is sent to a logger object in the paramiko instance in transport.py line 2700. Would it be possible for dvc to replace that logger with your own?

Pretty strange that that log message even comes up like that. Dvc disables all other loggers in CLI https://github.com/iterative/dvc/blob/0.88.0/dvc/main.py#L37 , so that error shouldn't really show up at all. Maybe paramiko sets the logger back up during the runtime, causing its errors to show up again. :thinking:

We load paramiko lazily, so I guess the logger doesn't exist yet at the point when we try to disable it. Probably we need to bring back the logger dict configs that we previously had for paramiko.
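One possible way to suppress the message after the lazy import (a sketch of the idea, not necessarily the fix DVC shipped) is to configure paramiko's loggers up front; logging.getLogger creates the logger on demand, so this works even before paramiko itself is imported:

```python
import logging

# Silence paramiko's loggers, including the transport logger that emits
# "Secsh channel ... open FAILED". Safe to run before importing paramiko.
for name in ("paramiko", "paramiko.transport"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.CRITICAL)
    logger.addHandler(logging.NullHandler())
```

Because child loggers inherit configuration by name, setting this early in CLI startup would survive paramiko being imported lazily later.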

I found a few discussions regarding this specific log message when it was implemented:

https://github.com/iterative/dvc/pull/2131#discussion_r293829238
https://github.com/iterative/dvc/pull/2131#discussion_r294023869

Of course we need to silence the log, but we also need to understand why the message is there at all.

The message itself is benign, I think. You poll paramiko for the maximum number of channels by opening them one by one until the limit is reached and paramiko raises an exception. I still don't understand why I'm the only one who gets the warning message, though. I've tried two different setups (my office Linux machine and my Mac laptop) and both generate it. Could it be because I've installed dvc in a conda environment? I tried creating a minimal environment with only pip, paramiko, and dvc installed and got the same warnings...
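The probing logic described above can be sketched generically (hypothetical names; the real implementation lives in dvc.remote.ssh.connection and uses paramiko's actual ChannelException):

```python
class ChannelLimitReached(Exception):
    """Stand-in for paramiko.ssh_exception.ChannelException."""

def open_max_channels(open_channel, hard_cap=64):
    """Open channels one by one until the server refuses or hard_cap is hit."""
    channels = []
    try:
        while len(channels) < hard_cap:
            channels.append(open_channel())
    except ChannelLimitReached:
        pass  # the refusal is expected: it marks the server's session limit
    return channels

# A fake server allowing 10 concurrent channels (like MaxSessions 10 in sshd).
opened = []
def fake_open():
    if len(opened) >= 10:
        raise ChannelLimitReached()
    opened.append(object())
    return opened[-1]

print(len(open_max_channels(fake_open)))  # → 10
```

This makes it clear why the exception always fires on a healthy server: the refusal is the termination condition, so the server-side "Connect failed" log line is expected rather than an error.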

@brorfred Not sure, maybe it is a server-side thing. Still, the fact that paramiko messages slip through the cracks is not great; we need to disable its logger when we import it, or adjust the logging configs somehow.

Our MaxSessions was set to a large number, and I found that hundreds of sessions were opened and failed for dvc push -j 1.

After reverting to v0.41.3, which predates the merge of #2131, we can use dvc push again.

@hg-zt dvc push -j 1 affects the number of workers, but we still try to open as many SFTP channels as we can. Did it fail without uploading anything at all? Usually you should be able to run dvc push again to upload the rest.

When doing dvc pull or dvc push, I get an error, same as above :
0% Querying cache in ssh://user@dvc_machine:path/to//dvc_remote| |0/1 [00:00<?, [837/1980] Secsh channel 30 open FAILED: open failed: Connect failed

  • There is enough space on the machine and I am able to ssh and scp files to the dvc remote.
  • Please let me know if any further details are needed.
  • Couldn't find much info when using -v flag.
  • I tried setting the ulimit open files to a higher number, but in vain.
  • Setting MaxSessions to higher limits didn't work.
  • I also tried the same with another machine as dvc remote. But again I get the same error.

Details:
DVC version: 0.93.0
Python version: 3.7.1
Platform: Linux-5.3.0-46-generic-x86_64-with-debian-buster-sid
Binary: True
Package: deb
Supported remotes: azure, gdrive, gs, hdfs, http, https, s3, ssh, oss
Cache: reflink - not supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('ext4', '/dev/sda1')
Repo: dvc, git
Filesystem type (workspace): ('ext4', '/dev/sda1')

@gurunath-p Thanks for the report! A workaround is to run dvc push a few more times; it will upload what is missing.

But since we are getting more reports it makes me think whether some of our recent patches have made dvc go over the reasonable limit. CC @pmrowla

I'll look into this, although out of the recent patches, I don't think any of the relevant changes should have affected ssh remotes, since RemoteSSH still has its own overridden cache_exists implementation.

@efiop

Retried a few more times and the rest was uploaded. I also found that when running dvc push without any actual changes (everything up to date), I still get these errors.

I think the issue is just that we are not suppressing the paramiko logger. We hit the channel limit when determining the maximum number of channels, which triggers the message even though it's expected and not an error for DVC.
