Pysyft: Adam causes errors in training loop

Created on 15 Apr 2019 · 21Comments · Source: OpenMined/PySyft

[Related to #1909]

I tracked down the error to
'optimizer = optim.Adam(model.parameters(), lr=1e-3)' in my code, 'optimizer = optim.SGD(model.parameters(), lr=1e-3)' works.

My guess is that this is related to me using PyTorch 1.0 .

The code can be found in https://github.com/2fasc/Distributed_Malaria/blob/master/src/federated_training.py

Traceback (most recent call last):
  File "/home/fasc/Documents/Distributed_Malaria/src/federated_training.py", line 143, in <module>
    simple_federated_model()
  File "/home/fasc/Documents/Distributed_Malaria/src/federated_training.py", line 66, in simple_federated_model
    federated=True,
  File "/home/fasc/Documents/Distributed_Malaria/src/auxiliaries.py", line 111, in train
    optimizer.step()
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/torch/optim/adam.py", line 94, in step
    exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/frameworks/torch/hook.py", line 650, in overloaded_native_method
    response = method(*new_args, **new_kwargs)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/frameworks/torch/hook.py", line 486, in overloaded_pointer_method
    response = owner.send_command(location, command)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/workers/base.py", line 364, in send_command
    _ = self.send_msg(MSGTYPE.CMD, message, location=recipient)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/workers/base.py", line 198, in send_msg
    bin_response = self._send_msg(bin_message, location)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/workers/virtual.py", line 6, in _send_msg
    return location._recv_msg(message)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/workers/virtual.py", line 9, in _recv_msg
    return self.recv_msg(message)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/workers/base.py", line 229, in recv_msg
    response = self._message_router[msg_type](contents)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/workers/base.py", line 316, in execute_command
    getattr(_self, command_name)(*args, **kwargs)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/frameworks/torch/hook.py", line 636, in overloaded_native_method
    raise route_method_exception(e, self, args, kwargs)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/frameworks/torch/hook.py", line 630, in overloaded_native_method
    response = method(*args, **kwargs)
TypeError: addcmul_() takes 2 positional arguments but 3 were given

Type

Source

schimpffabian

Most helpful comment

@LaRiffle I've experienced the increasing loss issues with my Adam as well.

But, since it wasn't raising errors, I tried to see where things went numerically wrong.

Somehow, in Adam's step() method, changing this line

exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

to this line

exp_avg_sq.mul_(beta2).add_(grad.pow(2).mul_(1 - beta2))

and this line

p.data.addcdiv_(-step_size, exp_avg, denom)

to this line

p.data.add_(exp_avg.div(denom).mul_(-step_size))

made it work and the loss decreased with no problem.

So, it seems like addcmul_() and addcdiv_() functions have issues working with PySyft, but doing what those functions do step by step somehow fixes them.

wytyang00 on 6 Jul 2019

👍4

All 21 comments

Hmm - it's not totally clear to me why this would be the case. Will require a deeper look. I don't think we have any unit tests around Adam so this might just require new functionality to get working.

iamtrask on 17 Apr 2019

I'm looking at it right now
First observation: Adam optim works in the setting of https://github.com/OpenMined/PySyft/blob/dev/examples/tutorials/Part%208%20-%20Federated%20Learning%20on%20MNIST%20using%20a%20CNN.ipynb if you replace the SGD optim with Adam.

LaRiffle on 17 Apr 2019

Oh actually it fails, but after a certain number of batch iterations

Train Epoch: 1 [0/60032 (0%)]   Loss: 2.303720
Train Epoch: 1 [1920/60032 (3%)]    Loss: 3.148396
Train Epoch: 1 [3840/60032 (6%)]    Loss: 4.953410
Train Epoch: 1 [5760/60032 (10%)]   Loss: 6.552952
Train Epoch: 1 [7680/60032 (13%)]   Loss: 8.274794
Train Epoch: 1 [9600/60032 (16%)]   Loss: 9.585523
Train Epoch: 1 [11520/60032 (19%)]  Loss: 11.709550
Train Epoch: 1 [13440/60032 (22%)]  Loss: 14.805451
Train Epoch: 1 [15360/60032 (26%)]  Loss: 14.508088
Train Epoch: 1 [17280/60032 (29%)]  Loss: 18.298759
Train Epoch: 1 [19200/60032 (32%)]  Loss: 16.543806
Train Epoch: 1 [21120/60032 (35%)]  Loss: 19.365683
Train Epoch: 1 [23040/60032 (38%)]  Loss: 23.546900
Train Epoch: 1 [24960/60032 (42%)]  Loss: 26.782951
Train Epoch: 1 [26880/60032 (45%)]  Loss: 28.923700
Train Epoch: 1 [28800/60032 (48%)]  Loss: 28.413818
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed exec> in <module>

<ipython-input-6-72ddef6ce5b6> in train(args, model, device, federated_train_loader, optimizer, epoch)
      8         loss = F.nll_loss(output, target)
      9         loss.backward()
---> 10         optimizer.step()
     11         model.get() # <-- NEW: get the model back
     12         if batch_idx % args.log_interval == 0:

~/code/env/pysyft/lib/python3.7/site-packages/torch/optim/adam.py in step(self, closure)
     92                 # Decay the first and second moment running average coefficient
     93                 exp_avg.mul_(beta1).add_(1 - beta1, grad)
---> 94                 exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
     95                 if amsgrad:
     96                     # Maintains the maximum of all 2nd moment running avg. till now

~/code/PySyft/syft/frameworks/torch/hook.py in overloaded_native_method(self, *args, **kwargs)
    650                 # Send the new command to the appropriate class and get the response
    651                 method = getattr(new_self, method_name)

LaRiffle on 17 Apr 2019

I suspect there is a bug in the hooking as the loss is increasing

LaRiffle on 17 Apr 2019

Ok here is the trouble: Adam uses momentum (actually: second moments of the gradients), which means it stores the gradients in a list and for each batch produces a correction of the current gradient based on the old ones. When changing of batch owner (so in Part 8 of tutorial at the middle of the epoch), you have now gradients from alice which you want to correct with moments of old gradients owned by bob: this is not possible as the data needs to be at the same location and it raises an error, which is here a bit tricky to find.
This is also why momentum is not supported so far on SGD.

The fix for this would probably imply to rewrite the optimizers. This is an important project and maybe could be correlated to the notions of aggregator needed for Federated or Secure Averaging.

Thank you for reporting the error!

LaRiffle on 17 Apr 2019

👍1

Is there a way for us to perhaps clear out the momentum as needed? (reset to None)

iamtrask on 17 Apr 2019

Hey @2fasc - seems like the solution here is to have different optimizers for different machines (one for bob, another for alice, etc.) Eventually we'll write custom Federated optimizers but for now this is the solution :)

iamtrask on 17 Apr 2019

👍4

Fastest way to do this is probably to have a dict of optimizers, the key being the data.location, this would keep the code simple

LaRiffle on 17 Apr 2019

Just wondering, is this related to #1909 ?

mari-linhares on 17 Apr 2019

@mari-linhares exactly!

LaRiffle on 25 Apr 2019

Yes, according @iamtrask , you can write a optimizer list like Part4 or Par10, then you can use Adam, Adadelta or other optimizer with momentum. @2fasc

ProfXGiter on 2 Jul 2019

@LaRiffle I've experienced the increasing loss issues with my Adam as well.

But, since it wasn't raising errors, I tried to see where things went numerically wrong.

Somehow, in Adam's step() method, changing this line

exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

to this line

exp_avg_sq.mul_(beta2).add_(grad.pow(2).mul_(1 - beta2))

and this line

p.data.addcdiv_(-step_size, exp_avg, denom)

to this line

p.data.add_(exp_avg.div(denom).mul_(-step_size))

made it work and the loss decreased with no problem.

So, it seems like addcmul_() and addcdiv_() functions have issues working with PySyft, but doing what those functions do step by step somehow fixes them.

wytyang00 on 6 Jul 2019

👍4

@LaRiffle I've experienced the increasing loss issues with my Adam as well.

But, since it wasn't raising errors, I tried to see where things went numerically wrong.

Somehow, in Adam's step() method, changing this line
exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
to this line
exp_avg_sq.mul_(beta2).add_(grad.pow(2).mul_(1 - beta2))
and this line
p.data.addcdiv_(-step_size, exp_avg, denom)
to this line
p.data.add_(exp_avg.div(denom).mul_(-step_size))
made it work and the loss decreased with no problem.

So, it seems like addcmul_() and addcdiv_() functions have issues working with PySyft, but doing what those functions do step by step somehow fixes them.

cool

ProfXGiter on 9 Jul 2019

When this is fixed remember to remove the TODO at https://github.com/OpenMined/PySyft/blob/319d59c9aa3ab9b8dfda95c42f7dd5978b3bc48a/examples/tutorials/advanced/websockets-example-MNIST/run_websocket_client.py#L55

https://github.com/OpenMined/PySyft/blob/ac3d6172c268e26bf0a7100c6ee0a7f4cc4d8890/examples/tutorials/advanced/Federated%20CIFAR10.ipynb#L465

https://github.com/OpenMined/PySyft/blob/2776cffc619dacf9486584f6e455104d80901928/examples/tutorials/Part%2006%20-%20Federated%20Learning%20on%20MNIST%20using%20a%20CNN.ipynb#L378

mari-linhares on 8 Aug 2019

Just a small report, I'm running into a similar error, but it appears a few lines earlier:

exp_avg.mul_(beta1).add_(1 - beta1, grad)

The following error appears:

TypeError: add_() takes 1 positional argument but 2 were given

So it's to be a more general problem than only specific functions like addcmul_() and addcdiv_()

fsonntag on 28 Aug 2019

👍2

Just a small report, I'm running into a similar error, but it appears a few lines earlier:
exp_avg.mul_(beta1).add_(1 - beta1, grad)
The following error appears:
TypeError: add_() takes 1 positional argument but 2 were given
So it's to be a more general problem than only specific functions like addcmul_() and addcdiv_()

I encountered an error in the same line, but it raises

syft.exceptions.PureFrameworkTensorFoundError

Syft 0.1.26a1
Torch 1.1.0

JpMCarrilho on 19 Nov 2019

Any solutions yet?

aditya-malte on 17 Mar 2020

Yes - the solution was the FL Optimizer project. We can close this issue.

On Tue, Mar 17, 2020 at 12:49 PM Aditya Malte notifications@github.com
wrote:

Any solutions yet?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/OpenMined/PySyft/issues/2070#issuecomment-600052348,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/ABBAZEQSAB62MQJMRJT2MQTRH5WW5ANCNFSM4HF52S2Q
.

iamtrask on 18 Mar 2020

Could you give a link please?

aditya-malte on 18 Mar 2020

@aditya-malte
https://github.com/OpenMined/PySyft/issues/3141
https://github.com/OpenMined/PySyft/pull/3179/files

fsonntag on 19 Mar 2020

Yes - the solution was the FL Optimizer project. We can close this issue.
…
On Tue, Mar 17, 2020 at 12:49 PM Aditya Malte @.*> wrote: Any solutions yet? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2070 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABBAZEQSAB62MQJMRJT2MQTRH5WW5ANCNFSM4HF52S2Q .

Alright, I'll close the issue then

schimpffabian on 24 Mar 2020

Was this page helpful?

0 / 5 - 0 ratings