Pysyft: Adam causes errors in training loop

Created on 15 Apr 2019  Â·  21Comments  Â·  Source: OpenMined/PySyft

[Related to #1909]

I tracked down the error to
'optimizer = optim.Adam(model.parameters(), lr=1e-3)' in my code, 'optimizer = optim.SGD(model.parameters(), lr=1e-3)' works.

My guess is that this is related to me using PyTorch 1.0 .

The code can be found in https://github.com/2fasc/Distributed_Malaria/blob/master/src/federated_training.py

Traceback (most recent call last):
  File "/home/fasc/Documents/Distributed_Malaria/src/federated_training.py", line 143, in <module>
    simple_federated_model()
  File "/home/fasc/Documents/Distributed_Malaria/src/federated_training.py", line 66, in simple_federated_model
    federated=True,
  File "/home/fasc/Documents/Distributed_Malaria/src/auxiliaries.py", line 111, in train
    optimizer.step()
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/torch/optim/adam.py", line 94, in step
    exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/frameworks/torch/hook.py", line 650, in overloaded_native_method
    response = method(*new_args, **new_kwargs)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/frameworks/torch/hook.py", line 486, in overloaded_pointer_method
    response = owner.send_command(location, command)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/workers/base.py", line 364, in send_command
    _ = self.send_msg(MSGTYPE.CMD, message, location=recipient)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/workers/base.py", line 198, in send_msg
    bin_response = self._send_msg(bin_message, location)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/workers/virtual.py", line 6, in _send_msg
    return location._recv_msg(message)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/workers/virtual.py", line 9, in _recv_msg
    return self.recv_msg(message)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/workers/base.py", line 229, in recv_msg
    response = self._message_router[msg_type](contents)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/workers/base.py", line 316, in execute_command
    getattr(_self, command_name)(*args, **kwargs)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/frameworks/torch/hook.py", line 636, in overloaded_native_method
    raise route_method_exception(e, self, args, kwargs)
  File "/home/fasc/miniconda3/envs/malaria/lib/python3.6/site-packages/syft/frameworks/torch/hook.py", line 630, in overloaded_native_method
    response = method(*args, **kwargs)
TypeError: addcmul_() takes 2 positional arguments but 3 were given
Type

Most helpful comment

@LaRiffle I've experienced the increasing loss issues with my Adam as well.

But, since it wasn't raising errors, I tried to see where things went numerically wrong.

Somehow, in Adam's step() method, changing this line

exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

to this line

exp_avg_sq.mul_(beta2).add_(grad.pow(2).mul_(1 - beta2))

and this line

p.data.addcdiv_(-step_size, exp_avg, denom)

to this line

p.data.add_(exp_avg.div(denom).mul_(-step_size))

made it work and the loss decreased with no problem.

So, it seems like addcmul_() and addcdiv_() functions have issues working with PySyft, but doing what those functions do step by step somehow fixes them.

All 21 comments

Hmm - it's not totally clear to me why this would be the case. Will require a deeper look. I don't think we have any unit tests around Adam so this might just require new functionality to get working.

I'm looking at it right now
First observation: Adam optim works in the setting of https://github.com/OpenMined/PySyft/blob/dev/examples/tutorials/Part%208%20-%20Federated%20Learning%20on%20MNIST%20using%20a%20CNN.ipynb if you replace the SGD optim with Adam.

Oh actually it fails, but after a certain number of batch iterations

Train Epoch: 1 [0/60032 (0%)]   Loss: 2.303720
Train Epoch: 1 [1920/60032 (3%)]    Loss: 3.148396
Train Epoch: 1 [3840/60032 (6%)]    Loss: 4.953410
Train Epoch: 1 [5760/60032 (10%)]   Loss: 6.552952
Train Epoch: 1 [7680/60032 (13%)]   Loss: 8.274794
Train Epoch: 1 [9600/60032 (16%)]   Loss: 9.585523
Train Epoch: 1 [11520/60032 (19%)]  Loss: 11.709550
Train Epoch: 1 [13440/60032 (22%)]  Loss: 14.805451
Train Epoch: 1 [15360/60032 (26%)]  Loss: 14.508088
Train Epoch: 1 [17280/60032 (29%)]  Loss: 18.298759
Train Epoch: 1 [19200/60032 (32%)]  Loss: 16.543806
Train Epoch: 1 [21120/60032 (35%)]  Loss: 19.365683
Train Epoch: 1 [23040/60032 (38%)]  Loss: 23.546900
Train Epoch: 1 [24960/60032 (42%)]  Loss: 26.782951
Train Epoch: 1 [26880/60032 (45%)]  Loss: 28.923700
Train Epoch: 1 [28800/60032 (48%)]  Loss: 28.413818
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed exec> in <module>

<ipython-input-6-72ddef6ce5b6> in train(args, model, device, federated_train_loader, optimizer, epoch)
      8         loss = F.nll_loss(output, target)
      9         loss.backward()
---> 10         optimizer.step()
     11         model.get() # <-- NEW: get the model back
     12         if batch_idx % args.log_interval == 0:

~/code/env/pysyft/lib/python3.7/site-packages/torch/optim/adam.py in step(self, closure)
     92                 # Decay the first and second moment running average coefficient
     93                 exp_avg.mul_(beta1).add_(1 - beta1, grad)
---> 94                 exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
     95                 if amsgrad:
     96                     # Maintains the maximum of all 2nd moment running avg. till now

~/code/PySyft/syft/frameworks/torch/hook.py in overloaded_native_method(self, *args, **kwargs)
    650                 # Send the new command to the appropriate class and get the response
    651                 method = getattr(new_self, method_name)

I suspect there is a bug in the hooking as the loss is increasing

Ok here is the trouble: Adam uses momentum (actually: second moments of the gradients), which means it stores the gradients in a list and for each batch produces a correction of the current gradient based on the old ones. When changing of batch owner (so in Part 8 of tutorial at the middle of the epoch), you have now gradients from alice which you want to correct with moments of old gradients owned by bob: this is not possible as the data needs to be at the same location and it raises an error, which is here a bit tricky to find.
This is also why momentum is not supported so far on SGD.

The fix for this would probably imply to rewrite the optimizers. This is an important project and maybe could be correlated to the notions of aggregator needed for Federated or Secure Averaging.

Thank you for reporting the error!

Is there a way for us to perhaps clear out the momentum as needed? (reset to None)

Hey @2fasc - seems like the solution here is to have different optimizers for different machines (one for bob, another for alice, etc.) Eventually we'll write custom Federated optimizers but for now this is the solution :)

Fastest way to do this is probably to have a dict of optimizers, the key being the data.location, this would keep the code simple

Just wondering, is this related to #1909 ?

@mari-linhares exactly!

Yes, according @iamtrask , you can write a optimizer list like Part4 or Par10, then you can use Adam, Adadelta or other optimizer with momentum. @2fasc

@LaRiffle I've experienced the increasing loss issues with my Adam as well.

But, since it wasn't raising errors, I tried to see where things went numerically wrong.

Somehow, in Adam's step() method, changing this line

exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

to this line

exp_avg_sq.mul_(beta2).add_(grad.pow(2).mul_(1 - beta2))

and this line

p.data.addcdiv_(-step_size, exp_avg, denom)

to this line

p.data.add_(exp_avg.div(denom).mul_(-step_size))

made it work and the loss decreased with no problem.

So, it seems like addcmul_() and addcdiv_() functions have issues working with PySyft, but doing what those functions do step by step somehow fixes them.

@LaRiffle I've experienced the increasing loss issues with my Adam as well.

But, since it wasn't raising errors, I tried to see where things went numerically wrong.

Somehow, in Adam's step() method, changing this line

exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

to this line

exp_avg_sq.mul_(beta2).add_(grad.pow(2).mul_(1 - beta2))

and this line

p.data.addcdiv_(-step_size, exp_avg, denom)

to this line

p.data.add_(exp_avg.div(denom).mul_(-step_size))

made it work and the loss decreased with no problem.

So, it seems like addcmul_() and addcdiv_() functions have issues working with PySyft, but doing what those functions do step by step somehow fixes them.

cool

Just a small report, I'm running into a similar error, but it appears a few lines earlier:

exp_avg.mul_(beta1).add_(1 - beta1, grad)

The following error appears:

TypeError: add_() takes 1 positional argument but 2 were given

So it's to be a more general problem than only specific functions like addcmul_() and addcdiv_()

Just a small report, I'm running into a similar error, but it appears a few lines earlier:

exp_avg.mul_(beta1).add_(1 - beta1, grad)

The following error appears:

TypeError: add_() takes 1 positional argument but 2 were given

So it's to be a more general problem than only specific functions like addcmul_() and addcdiv_()

I encountered an error in the same line, but it raises

syft.exceptions.PureFrameworkTensorFoundError

Syft 0.1.26a1
Torch 1.1.0

Any solutions yet?

Yes - the solution was the FL Optimizer project. We can close this issue.

On Tue, Mar 17, 2020 at 12:49 PM Aditya Malte notifications@github.com
wrote:

Any solutions yet?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/OpenMined/PySyft/issues/2070#issuecomment-600052348,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/ABBAZEQSAB62MQJMRJT2MQTRH5WW5ANCNFSM4HF52S2Q
.

Could you give a link please?

Yes - the solution was the FL Optimizer project. We can close this issue.
…
On Tue, Mar 17, 2020 at 12:49 PM Aditya Malte @.*> wrote: Any solutions yet? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2070 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABBAZEQSAB62MQJMRJT2MQTRH5WW5ANCNFSM4HF52S2Q .

Alright, I'll close the issue then

Was this page helpful?
0 / 5 - 0 ratings

Related issues

jvmncs picture jvmncs  Â·  3Comments

deevashwer picture deevashwer  Â·  4Comments

gmuraru picture gmuraru  Â·  4Comments

s-marta picture s-marta  Â·  4Comments

iamtrask picture iamtrask  Â·  3Comments