Describe the bug
On execution of even a very basic code such as
import torch as th
import syft as sy
hook = sy.TorchHook(th)
alice = sy.VirtualWorker(hook, id="alice")
bob = sy.VirtualWorker(hook, id="bob")
crypto_provider = sy.VirtualWorker(hook, id="james")
a = torch.ones(1, 5)
a = a.encrypt(workers=[alice, bob], crypto_provider=crypto_provider)
a error is thrown when python shuts down.
Exception ignored in: <bound method ObjectPointer.__del__ of [PointerTensor | me:51874166013 -> alice:36055571191]>
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/syft/generic/pointers/object_pointer.py", line 347, in __del__
File "/usr/local/lib/python3.6/dist-packages/syft/workers/base.py", line 267, in send_msg
File "/usr/local/lib/python3.6/dist-packages/syft/serde/serde.py", line 40, in serialize
ImportError: sys.meta_path is None, Python is likely shutting down
Exception ignored in: <bound method ObjectPointer.__del__ of [PointerTensor | me:83736240700 -> bob:18435096342]>
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/syft/generic/pointers/object_pointer.py", line 347, in __del__
File "/usr/local/lib/python3.6/dist-packages/syft/workers/base.py", line 267, in send_msg
File "/usr/local/lib/python3.6/dist-packages/syft/serde/serde.py", line 40, in serialize
ImportError: sys.meta_path is None, Python is likely shutting down
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Python must terminate without throwing an error
@karlhigley not able to replicate the error, both in google colab and in local terminal.
I have seen this happen before, though it may not be consistent. I think the cause is PointerTensors or ObjectPointers going out of scope, having their __del__ method called (which attempts to send a message to the owner of the referenced object), and having that collide with Python shut down activities. There's not likely to be a quick fix for this though, because our GC implementation fundamentally relies on Python's GC, which offers no guarantees for when or whether the __del__ method will be executed.
Long term, the solution is probably to come up with a better way to do distributed GC.
@tudorcebere I've been thinking a lot about refactoring workers in order to make Protocols work, and I think we might want to consider having separate threads for message sending/receiving, message processing, and Plan/Protocol execution. If we did that, we'd need to be able to pass messages between threads, so we'd probably use thread-safe queues.
And if we had _that_, then maybe the __del__ hook that gets called for garbage collection when objects go out of scope could add a delete message to the queue instead of directly trying to serialize and send it. That would turn this issue into "Do we want to make sure outgoing delete messages are processed when Python shuts down? If so, how?"
Instead of sending outgoing delete messages when we shut down, it might make more sense for workers to GC remaining objects that came from workers they're no longer in contact with? Not sure, but seems possible.
Thinking maybe we should create a milestone for async/multi-threaded workers and assign this issue to that milestone. Anyone else have thoughts on that? I can't see a good way to address this without some form of concurrency, but that doesn't mean there isn't one. 馃
@karlhigley I think this might be an awesome idea, workers really need some love, I like the idea of making send and receive on separate threads (could this help async workers as well?). This could be a step forward the actor model as well and the stack forwarding project. (maybe we would like to stick with some custom actor model?). I am not familiar with the GC behavior, but in my mind, the idea of adding a del message when an object goes out of scope could work really nice, (should make everything more transparent as well).
The current GC behavior does send a delete message, but since our comms methods are currently synchronous and blocking, that means that garbage collection is a blocking operation. 馃檨
This issue has been marked stale because it has been open 30 days with no activity. Leave a comment or remove the stale label to unmark it. Otherwise, this will be closed in 7 days.
Most helpful comment
I have seen this happen before, though it may not be consistent. I think the cause is
PointerTensorsorObjectPointersgoing out of scope, having their__del__method called (which attempts to send a message to the owner of the referenced object), and having that collide with Python shut down activities. There's not likely to be a quick fix for this though, because our GC implementation fundamentally relies on Python's GC, which offers no guarantees for when or whether the__del__method will be executed.Long term, the solution is probably to come up with a better way to do distributed GC.