PySyft: model.get() causes RuntimeError when the code runs on GPU with a ResNet model

Created on 18 Sep 2020 · 10 Comments · Source: OpenMined/PySyft

Description

After training the model locally on a worker, I call model.get() to retrieve it and get the following runtime error: "Expected object of device type cuda but got device type cpu for argument #1 'self' in call to _th_set_".
I am training on a GPU (the same code runs perfectly on CPU) and I am using the resnet50 model.

How to Reproduce

import torch.nn as nn
import torch.optim as optim

# model, worker, batches, device and lr are defined elsewhere in the script
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

model.train()
model.send(worker)
for batch_idx, (data, target) in enumerate(batches):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    loss = loss.get()
    model.get()  # <-- This get causes the error

System Information

syft: 0.2.9

0.2.x Type hacktoberfest

Source

Hjeljeli

All 10 comments

Hey, thanks for reporting!
Can you also provide your stack trace, please? :)

LaRiffle on 18 Sep 2020

Hi @Hjeljeli @LaRiffle, I encountered the same problem a few months ago, please refer to the comments in #3848 for further explanation.

KCC13 on 19 Sep 2020

👍1

@LaRiffle
Hi Theo, please find my stack trace below. Thanks for your help!

<ipython-input-16-1a450dc3c5d4> in <module>
      3 logging.basicConfig(format=FORMAT, level=LOG_LEVEL)
      4 
----> 5 main()

<ipython-input-15-443eb06bbc7c> in main()
    208     for epoch in range(1, epochs + 1):
    209         logger.warning("Starting epoch %s/%s", epoch, epochs)
--> 210         model = train(model, device, federated_train_loader, test_loader, lr, federate_after_n_batches)
    211         test(model, device, test_loader)

<ipython-input-15-443eb06bbc7c> in train(model, device, federated_train_loader, test_loader, lr, federate_after_n_batches, abort_after_one)
    112             curr_batches = batches[worker]
    113             if curr_batches:
--> 114                 local_models[worker] = train_on_batches(worker, curr_batches, model, device, test_loader, lr)
    115 
    116             else:

<ipython-input-15-443eb06bbc7c> in train_on_batches(worker, batches, model_in, device, test_loader, lr)
     42             t1 = time.time()
     43             # We measure accurancy of worker's model
---> 44             model.get()
     45             accuracy = test(model, device, test_loader)
     46             accuracies[worker].append(accuracy)

/usr/local/lib/python3.7/dist-packages/syft-0.2.7-py3.7.egg/syft/frameworks/torch/hook/hook.py in module_get_(nn_self)
    669             for element_iter in tensor_iterator(nn_self):
    670                 for p in element_iter():
--> 671                     p.get_()
    672 
    673             if isinstance(nn_self.forward, Plan):

/usr/local/lib/python3.7/dist-packages/syft-0.2.7-py3.7.egg/syft/frameworks/torch/tensors/interpreters/native.py in get_(self, *args, **kwargs)
    685         Calls get() with inplace option set to True
    686         """
--> 687         return self.get(*args, inplace=True, **kwargs)
    688 
    689     def allow(self, user=None) -> bool:

/usr/local/lib/python3.7/dist-packages/syft-0.2.7-py3.7.egg/syft/frameworks/torch/tensors/interpreters/native.py in get(self, inplace, user, reason, *args, **kwargs)
    672 
    673         if inplace:
--> 674             self.set_(tensor)
    675             if hasattr(tensor, "child"):
    676                 self.child = tensor.child

RuntimeError: Expected object of device type cuda but got device type cpu for argument #1 'self' in call to _th_set_

Hjeljeli on 21 Sep 2020

+1, I also ran into this issue with code very similar to the above, except using mobilenet instead of resnet. The RuntimeError message is the same. The issue disappears when I use a neural net I define myself, so I think it could be an interop problem with the torch model zoo models?

(For reference, I had this same issue back in ~May on syft ~0.2.4 but didn't report it. Unfortunately some other projects pulled me away from this one.)

roblewis1237 on 30 Sep 2020

👍1
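One way to look into the model-zoo hypothesis above is to list where every tensor of a model lives before sending it. The sketch below uses plain PyTorch only; `tensor_devices` is a hypothetical helper name, not part of PySyft, and the small `toy` model merely stands in for resnet50 or mobilenet. It reports buffers (such as BatchNorm running statistics) as well as parameters, since zoo models contain both while a small hand-written net may not. Whether buffers are actually involved in this error is not established in this thread.

```python
import torch
from torch import nn


def tensor_devices(module: nn.Module) -> dict:
    """Map every parameter and buffer name of `module` to its device (hypothetical helper)."""
    devices = {}
    for name, param in module.named_parameters():
        devices[f"param:{name}"] = str(param.device)
    for name, buf in module.named_buffers():
        devices[f"buffer:{name}"] = str(buf.device)
    return devices


# Toy stand-in for a model-zoo network: a conv layer followed by BatchNorm,
# so the model has both parameters and buffers.
toy = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3), nn.BatchNorm2d(4))
report = tensor_devices(toy)

for name, device in report.items():
    print(f"{name}: {device}")
```

Running the same helper on the real model right before model.send(worker) would show whether any tensor sits on a different device than the rest.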

Thank you for reporting it! The problem might be that the tensors need to be moved to the device (in this case the GPU) before set_ is called.

gmuraru on 1 Oct 2020

👍2
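The suggestion above can be illustrated in plain PyTorch, outside PySyft. Tensor.set_ requires the destination and the source to be on the same device, and the error message in the stack trace says that 'self' (the destination) was on the CPU while a CUDA tensor was expected. The sketch below is not the PySyft fix: `set_on_matching_device` is a hypothetical helper, and moving the source to the destination's device is only one possible choice (moving the destination instead would be the other). On a machine without a GPU both tensors are on the CPU, so the code runs but no mismatch occurs.

```python
import torch


def set_on_matching_device(dest: torch.Tensor, src: torch.Tensor) -> torch.Tensor:
    """Align devices, then make `dest` take over `src`'s data in place (hypothetical helper)."""
    if src.device != dest.device:
        # Without this move, set_ raises a device-type RuntimeError.
        src = src.to(dest.device)
    # In-place ops on leaf tensors that require grad are rejected outside no_grad.
    with torch.no_grad():
        dest.set_(src)
    return dest


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

local_param = torch.zeros(3)            # stands in for the local tensor ('self'), on CPU
fetched = torch.ones(3, device=device)  # stands in for the tensor retrieved from the worker

set_on_matching_device(local_param, fetched)
print(local_param, local_param.device)
```

With this choice the retrieved values end up on the CPU, so a model rebuilt this way would need an explicit model.to(device) afterwards to continue training on the GPU.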

This issue has been marked stale because it has been open 30 days with no activity. Leave a comment or remove the stale label to unmark it. Otherwise, this will be closed in 7 days.

github-actions[bot] on 13 Nov 2020

Updating so this issue is active.

Hjeljeli on 13 Nov 2020

Hello! Just letting you know that we are no longer planning on supporting anything on the 0.2.x product line and that all work should be ported over to 0.3.x, which is considered a complete rebuild of PySyft. Because of that, I'll be closing this issue. If you feel this is a mistake, or if the issue actually applies to 0.3.x as well, please feel free to ping me on Slack and I'll reopen the issue.

cereallarceny on 19 Nov 2020

Is this issue fixed in 0.3.x, @Hjeljeli? I'm currently stuck with this bug in 0.2.9 :(

naveenggmu on 27 Dec 2020

@naveenggmu Hi Naveen, I could not test on 0.3.x. As you know, the 0.3.x releases do not support all the privacy-preserving techniques that 0.2.x supported, and I am using PySyft for federated learning.

Hjeljeli on 28 Dec 2020
