PySyft: model.get() causes RuntimeError when the code runs on GPU with a ResNet model

Created on 18 Sep 2020 · 10 Comments · Source: OpenMined/PySyft

Description

After a worker trains the model locally, I call model.get() to retrieve it and get the following runtime error: "Expected object of device type cuda but got device type cpu for argument #1 'self' in call to _th_set_".
I am training on GPU (the same code runs perfectly on CPU) and I am using a resnet50 model.

How to Reproduce

import torch.nn as nn
import torch.optim as optim

# model (a resnet50), worker, batches, device and lr are set up earlier (not shown in the report)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

model.train()
model.send(worker)  # move the model to the remote worker
for batch_idx, (data, target) in enumerate(batches):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    loss = loss.get()
    model.get()  # <-- This get causes the error

System Information

  • syft: 0.2.9

Labels: 0.2.x, Type, hacktoberfest

All 10 comments

Hey, thanks for reporting!
Can you also provide your stack trace, please? :)

Hi @Hjeljeli @LaRiffle, I encountered the same problem a few months ago. Please refer to the comments in #3848 for further explanation.

@LaRiffle
Hi Theo, please find my stack trace below. Thanks for your help!

<ipython-input-16-1a450dc3c5d4> in <module>
      3 logging.basicConfig(format=FORMAT, level=LOG_LEVEL)
      4 
----> 5 main()

<ipython-input-15-443eb06bbc7c> in main()
    208     for epoch in range(1, epochs + 1):
    209         logger.warning("Starting epoch %s/%s", epoch, epochs)
--> 210         model = train(model, device, federated_train_loader, test_loader, lr, federate_after_n_batches)
    211         test(model, device, test_loader)

<ipython-input-15-443eb06bbc7c> in train(model, device, federated_train_loader, test_loader, lr, federate_after_n_batches, abort_after_one)
    112             curr_batches = batches[worker]
    113             if curr_batches:
--> 114                 local_models[worker] = train_on_batches(worker, curr_batches, model, device, test_loader, lr)
    115 
    116             else:

<ipython-input-15-443eb06bbc7c> in train_on_batches(worker, batches, model_in, device, test_loader, lr)
     42             t1 = time.time()
     43             # We measure accurancy of worker's model
---> 44             model.get()
     45             accuracy = test(model, device, test_loader)
     46             accuracies[worker].append(accuracy)

/usr/local/lib/python3.7/dist-packages/syft-0.2.7-py3.7.egg/syft/frameworks/torch/hook/hook.py in module_get_(nn_self)
    669             for element_iter in tensor_iterator(nn_self):
    670                 for p in element_iter():
--> 671                     p.get_()
    672 
    673             if isinstance(nn_self.forward, Plan):

/usr/local/lib/python3.7/dist-packages/syft-0.2.7-py3.7.egg/syft/frameworks/torch/tensors/interpreters/native.py in get_(self, *args, **kwargs)
    685         Calls get() with inplace option set to True
    686         """
--> 687         return self.get(*args, inplace=True, **kwargs)
    688 
    689     def allow(self, user=None) -> bool:

/usr/local/lib/python3.7/dist-packages/syft-0.2.7-py3.7.egg/syft/frameworks/torch/tensors/interpreters/native.py in get(self, inplace, user, reason, *args, **kwargs)
    672 
    673         if inplace:
--> 674             self.set_(tensor)
    675             if hasattr(tensor, "child"):
    676                 self.child = tensor.child

RuntimeError: Expected object of device type cuda but got device type cpu for argument #1 'self' in call to _th_set_

+1, I also ran into this issue with code very similar to the above, except using mobilenet instead of resnet. The RuntimeError message is the same. The issue disappears when I use a neural net that I define myself, so I think it could be an interop problem with the torch model zoo models?

(For reference, I had this same issue back in ~May on syft ~0.2.4 but didn't report it. Unfortunately some other projects pulled me away from this one.)
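
To compare a hand-written net with a model zoo net, one option is to list which devices a module's parameters and buffers live on. The sketch below is plain PyTorch, not PySyft code; `device_summary` and the `toy` model are hypothetical names introduced for illustration. ResNet and MobileNet both contain BatchNorm layers, which carry buffers in addition to parameters, but nothing in this thread establishes that this is what triggers the error.

```python
from collections import Counter

import torch
from torch import nn


def device_summary(module: nn.Module) -> dict:
    """Count the parameters and buffers of `module` per device."""
    summary = {"parameters": Counter(), "buffers": Counter()}
    for _, param in module.named_parameters():
        summary["parameters"][str(param.device)] += 1
    for _, buf in module.named_buffers():
        summary["buffers"][str(buf.device)] += 1
    return summary


# Hypothetical stand-in for a model zoo network: one conv layer plus BatchNorm.
toy = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.BatchNorm2d(8))

# Use the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
toy.to(device)

print(device_summary(toy))
```

If a summary ever shows more than one device for the same model, that is a mixed-device state of the kind the error message complains about.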

Thank you for reporting it! The problem might be that the tensors need to be moved to the same device (in this case the GPU) before set_ is called.
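
The general idea of aligning devices before an in-place `set_` can be sketched in plain PyTorch as follows. This is not PySyft's implementation, and `set_from` is a hypothetical helper. It assumes the incoming tensor is the one that gets moved; which side should move inside PySyft is not settled in this thread.

```python
import torch


def set_from(dst: torch.Tensor, src: torch.Tensor) -> torch.Tensor:
    """Replace the contents of `dst` in place with `src`.

    `src` is first moved to the device of `dst`, so that set_ never
    sees tensors on two different devices.
    """
    dst.set_(src.to(dst.device))
    return dst


# The "remote" tensor lives on the GPU when one is available.
src_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

local = torch.zeros(3)  # placeholder on the CPU
remote = torch.tensor([1.0, 2.0, 3.0], device=src_device)

set_from(local, remote)
print(local, local.device)
```

On a CPU-only machine both tensors are already on the same device, so the `.to()` call is a no-op and the example still runs.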

This issue has been marked stale because it has been open 30 days with no activity. Leave a comment or remove the stale label to unmark it. Otherwise, this will be closed in 7 days.

Updating so this issue is active.

Hello! Just letting you know that we are no longer planning on supporting anything on the 0.2.x product line and that all work should be ported over to 0.3.x, which is considered a complete rebuild of PySyft. Because of that, I'll be closing this issue. If you feel this is a mistake, or if the issue actually applies to 0.3.x as well, please feel free to ping me on Slack and I'll reopen the issue.

@Hjeljeli Is this issue fixed in 0.3.x? I'm currently stuck with this bug in 0.2.9 :(

@naveenggmu Hi Naveen, I could not test on 0.3.x. As you know, the 0.3.x releases do not support all the privacy-preserving techniques that 0.2.x used to support, and I am using PySyft for Federated Learning.
