Edit: I found a solution to this. There is a bug in Paramiko. Skip to the last post for the important details.
I wrap all of my ssh connections inside the classes shown below. The classes are invoked like this:
s = SshTask()
s.timeout = s.run_timeout = 14
s.to_host = form.get_store('vmh01')
s.key = settings.KEY
s.command = " "
packages = s.run()
This should leave nothing behind. I scripted a page to repeatedly make the same request to the same server over and over again. I installed Dozer on my Django app. Debug mode is turned off. After a few hours of running my memory went form 100mb to 272mb. Paramiko seems to be the only thing that isn't cleaning up after itself.
Now that the app is idle this is what it is showing for counts:
paramiko.agent.Agent Min: 345 Cur: 345 Max: 345
paramiko.auth_handler.AuthHandler Min: 345 Cur: 345 Max: 345
paramiko.buffered_pipe.BufferedPipe Min: 690 Cur: 690 Max: 690
paramiko.channel.Channel Min: 345 Cur: 345 Max: 345
paramiko.client.AutoAddPolicy Min: 351 Cur: 351 Max: 351
paramiko.client.SSHClient Min: 351 Cur: 351 Max: 351
paramiko.ecdsakey._ECDSACurve Min: 3 Cur: 3 Max: 3
paramiko.ecdsakey._ECDSACurveSet Min: 1 Cur: 1 Max: 1
paramiko.hostkeys.HostKeyEntry Min: 345 Cur: 345 Max: 345
paramiko.hostkeys.HostKeys Min: 702 Cur: 702 Max: 702
paramiko.message.Message Min: 345 Cur: 345 Max: 345
paramiko.packet.Packetizer Min: 351 Cur: 351 Max: 351
paramiko.py3compat.long Min: 3 Cur: 3 Max: 3
paramiko.resource.ResourceManager Min: 1 Cur: 1 Max: 1
paramiko.rsakey.RSAKey Min: 690 Cur: 690 Max: 690
paramiko.ssh_exception.SSHException Min: 6 Cur: 6 Max: 6
paramiko.transport.ChannelMap Min: 351 Cur: 351 Max: 351
paramiko.transport.Transport Min: 351 Cur: 351 Max: 351
paramiko.util.PFilter Min: 1 Cur: 1 Max: 1
Edit: Narrowed this down more, see below post. Removed my implementation. Not relevant anymore.
Here is a screenshot of all related objects from a transport object that isn't being cleaned up.

Edit: Seems like this might only happen when the following exception is thrown:
paramiko.ssh_exception.SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer
Clean exiting connections seem to be cleaned up fine.
Edit: More research, I seem to have gotten to the bottom of this. I'm seeing that sometimes things will loop in read_all in the packetizer until a rekey exception is thrown. Everything seems to exit normally. For some reason the weak references in the resource manager don't seem to work as intended. You can see that the only thing holding these transport objects in memory is a weak reference. Even calling gc.collect() doesn't make them go away.

All these transports sit in memory, as well as the client connected to them, as well as their last message, and an EOF exception from the last_exception key, and a stack trace on that exception. - A bunch of crap that doesn't seem to go away after any amount of time.
The whole resource manager object seems to be needlessly complex, all to get around the common idiom that you shouldn't rely on the __del__ method. Well the ResourceManager does exactly what the __del__ method would do, except the __del__ method actually works.
If you get rid of the ResourceManager and just throw this on the bottom of Transport:
def __del__(self):
self.close()
this whole problem goes away and everything is cleaned up as soon as the rekey exception is thrown. This is on vanilla Python 3.6.
These monkey patches resolve the problem for me:
paramiko.transport.Transport.__del__ = lambda self: self.close()
paramiko.resource.ResourceManager.register = lambda *args: None
I found out why this is happening. It all has to do with the ResourceManager. The resource manager says when Client is deleted close the transport. The Client has a reference to transport. Transport has a reference to Client - thats ok. They cancel out.
But when an the transport is registered this method is created which references the Transport as resource:
(Transport == resource)
def callback(ref):
try:
resource.close()
except:
pass
del self._table[id(resource)]
It assigns this as a callback to be called when the Client is deleted. Now here is the issue. This function references Transport bumping Transport's reference counter up by 1.
As long as this callback exists is memory the Transport's reference counter will never be 0. This callback will exist in memory as long as the Client exists in memory. The Client will exist as long as self._sshclient is set to the Client on the transport.
The only place that self._sshclient is set to None on the Transport is in the close method. So if close() is never called on the transport (because of an exception) you have a cyclical reference that never goes away.
I can think of a few ways to fix this. I still think the best way is to get rid of the ResourceManager entirely. Simple is better than complex. A __del__ method on the Transport does the same thing as this.
It would be nice to hear back from someone. Multiple posts on this with no replies, I feel like I'm talking to myself.
This change was recently merged (cherry-picked): #891
That fixed some problems, but I think might have inadvertently caused this leak.
Thanks for the thorough analysis of the problem!
So this ResourceManager callback system is what caused the problem that #891 tried to fix: if a reference to the client was not being saved by the program using paramiko, but it was still using the transport, the client would be garbage collected, and then it would close the transport while the program was still using it. But the fix made the ResourceManager useless for this relationship. (And the ResourceManager isn't used anywhere else.)
It seems like the fix is to get rid of the ResourceManager, and figure out the GC situation anew. It's prudent to read why it was introduced in the first place: 029b8989dbdde898f0ee4e09e17153f4efa2e416
EDIT: fixed link
... I now notice your simple "monkey patch" fix, to just get rid of the ResourceManager and add a simple Transport destructor. That seems to make sense.
The ResourceManager does seem pointless. It seems to be nothing more than a SSHClient destructor, but with a complication which "tricks" the GC cycle detector. I hope you don't mind if I describe what happens here in my own way:
ResourceManagerResourceManager has reference to _table member_table has (normal/strong) reference to weakref objectweakref has a "weak" reference to SSHClient, does not contribute to SSHClient's reference countweakref has a (normal/strong) reference to the callback function!TransportSo the GC cycle detector doesn't see that the ResourceManager reference to the Transport is _really_ due to the SSHClient object, it looks like an unrelated reference from the ResourceManager singleton, it looks independent.
Back to what makes sense as the fix. Assuming we remove the ResourceManager: If the program using the SSHClient calls .close() on it, it will also .close() the transport - so correct programs will work correctly, right? We could leave the back-reference from Transport to SSHClient, either .close() or the cycle detector will clean up, but is the back-reference still needed for anything, can we actually remove that too? I think we can...
Yeah, that's exactly whats happening. It took me a day just thinking about how I got it to work right. When I was walking back to my car the second day it finally clicked why it was all happening. It was a real eureka moment.
So issue #44 was that originally the Client getting garbage collected caused close() to be called on the transport. If we just let transport close itself and don't do anything when the Client is garbage collected, we can get remove the ResourceManager and let the transport clean up on gc? Then everything is fine including #44.
There's another aspect to this:
The back-reference Transport._sshclient was cleared in the body of Transport.close(), but that would not run if the "SSHException: Error reading SSH protocol banner" was thrown, because that causes Transport.run() to clean up the Transport as it ends, and it clears Transport.active to prevent close() from having effect. But the .run() cleanup neglected to clear ._sshclient.
So if that exception was thrown, even if you called .close(), the Transport / SSHClient would still leak due to the ResourceManager. But if you called .close() before .run() ends, it un-sets ._sshclient and breaks the cycle.
Anyway, ResourceManager and ._sshclient should still be removed.
Yeah, thats what probably made it a hard bug to track down since normal execution wouldn't cause a leak.
Putting thoughts here since it's the issue and not the PRs, discussion going across PRs can be hard to track 馃檭
First, I haven't crawled the discussion and the patchsets in detail yet, apologies. Enough to get the high level gist of "reorganize classes/references so that GC can actually function as intended". Seems reasonable.
Second, @ploxiln's earlier PR mentioned porting the work back to 1.17+, which I agree with conceptually (memory leaks suck, such a bugfix should be made widely available). Unfortunately, these PRs seem to involve very large diffs - as per the first point - which is at odds with merging to stable, bugfix oriented branches. (Implicit: because the more code you move around the more instability may result.)
So I'm very of two minds on that point. Not sure how to square it...perhaps apply to a 2.2 release, wait for any unintended side effects to fall out, _then_ backport? Or suck it up and say no, you don't get this on old versions, please modernize and roll forwards. (Which is increasingly how we need to do things re: cryptography support, possibly dropping python 2.6, etc.)
Opinions welcome on that.
Third, thanks to @ploxiln and @agronick for doing all of this heavy lifting, I super appreciate it! I would definitely like to get this merged in the shorter term even if it means making unpopular decisions re: which branches get it 馃榿
I think #952 should be merged first, and backported. It's the smaller fix, and rather simple. Without it, it's possible to have these leaks even if .close() is called on the Transport or SSHClient. With it, if you make sure .close() is always eventually called, you can avoid leaks.
Then, I think #964 should be validated, and then merged to master. That is the fix that makes GC work to close Transport automatically when it is truly no longer used. We could wait for people who need to use 1.18 to report problems before backporting this one, IMHO.
Right on, I'd forgotten about #952 and that is definitely much easier to swallow as a backport. Thanks!
Looking at this issues and many others like it.. I'm wondering if there is a test in that can be run to asses if there are still memory leaks and where the memory leaks are at.
Is there a test that already exists? Or should I potentially look at writing one?
Is there any workaround to avoid this failure. we are hitting this issue very frequently.
Thanks,
Nixit
If you ensure Transport.close() or SSHClient.close() is always eventually called, e.g. using try/finally, then you can avoid leaks.
The original issue still stands. If you never call .close() then it is likely that there will be connections and memory that the python reference counting and garbage collector will never be able to clean up.
@ploxiln
Is the memory leak related to only Transport? For those users who are using only a single client which spawns processes, should those users be creating multiple clients, one for each process, instead of keeping a single client from which many processes are spawned?
Do the memory leaks affect long running processes at all?
An SSHClient uses a Transport - usually just one. There should not be a problem using a client to do multiple things, or long-running things.
It looks like this bug is totally exacerbated by the following python bug:
https://bugs.python.org/issue39074
IE. Not only does paramiko leak the thread, but the leaked threads themselves will also leak memory.
ALWAYS CALL CLOSE on YOUR CLIENTS
Except calling .close only leads to race conditions and infinite blocking due to cyclic references and other issues in the library.
Most helpful comment
I found out why this is happening. It all has to do with the ResourceManager. The resource manager says when Client is deleted close the transport. The Client has a reference to transport. Transport has a reference to Client - thats ok. They cancel out.
But when an the transport is registered this method is created which references the Transport as resource:
(Transport == resource)
It assigns this as a callback to be called when the Client is deleted. Now here is the issue. This function references Transport bumping Transport's reference counter up by 1.
As long as this callback exists is memory the Transport's reference counter will never be 0. This callback will exist in memory as long as the Client exists in memory. The Client will exist as long as self._sshclient is set to the Client on the transport.
The only place that self._sshclient is set to None on the Transport is in the close method. So if close() is never called on the transport (because of an exception) you have a cyclical reference that never goes away.
I can think of a few ways to fix this. I still think the best way is to get rid of the ResourceManager entirely. Simple is better than complex. A
__del__method on the Transport does the same thing as this.It would be nice to hear back from someone. Multiple posts on this with no replies, I feel like I'm talking to myself.