Hi, I am trying to use two different optimizers in one model. For example, in an encoder-decoder model, the decoder uses the Adam optimizer while the encoder uses SGD. How do I use Apex to backpropagate the loss?
I was about to open the exact same issue. I think this feature would be super useful, especially when combined with distributed training, because distributed will not let you perform updates on a subset of the parameters (i.e. only the encoder parameters), which currently forces you to have a different optimizer for the encoder and for the decoder. But Apex won't work with two optimizers.
This looks relevant though: https://github.com/NVIDIA/apex/tree/master/apex/amp#multiple-optimizers-or-backward-passes
Yes, https://github.com/NVIDIA/apex/issues/163#issuecomment-465586715 is the correct way for now with Amp. With FP16_Optimizer, you can also wrap each optimizer instance individually.
Also, stay tuned for a new API that unifies the current Amp and FP16_Optimizer. My new API is tracked in branch api_refactor, although I don't have examples and documentation yet. I'll merge it into master by the end of February, and one of the examples will be a GAN with multiple losses and optimizers.
Thank you for your quick reply! @glample @mcarilli I will try Amp first, and I am looking forward to your new API.
How would I implement something like this:
optimizer1.zero_grad()
optimizer2.zero_grad()
y_hat = model2(model1(x))
#...
loss1 = loss_fn1(y_hat, y)
loss1.backward(retain_graph=True)
optimizer1.step() #has gradient contributions from loss1
# ...
loss2 = ComputeLoss2(model2)
loss2.backward()
# ...
optimizer2.step() # has gradient contributions from both loss1 and loss2, but only applied to model2
@gregjohnso Does optimizer1.step() act on model2 or only on model1? Also, why do you need retain_graph=True for loss1.backward? In that minimal sample I don't see where you backpropagate through loss1 again (or any of its subgraphs).
What I'm trying to do is have loss1 apply to model1 and model2, and loss2 apply only to model2.
optimizer1 acts only on model1
optimizer2 acts only on model2
I need retain_graph=True because otherwise the first backward/step would release buffers from the model2 subgraph, and loss2.backward would throw an error.
Hope that clarifies what I'm doing.
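For readers following along, here is a minimal full-precision sketch of that control flow in plain PyTorch (no Apex). The layer sizes, the MSE loss, and the reading of ComputeLoss2 as a weight penalty on model2 are all assumptions made for illustration, not something stated in the thread.

```python
import torch
from torch import nn

torch.manual_seed(0)

model1 = nn.Linear(4, 8)   # hypothetical stand-in for the first model
model2 = nn.Linear(8, 2)   # hypothetical stand-in for the second model
optimizer1 = torch.optim.SGD(model1.parameters(), lr=0.1)
optimizer2 = torch.optim.SGD(model2.parameters(), lr=0.1)

x, y = torch.randn(16, 4), torch.randn(16, 2)

optimizer1.zero_grad()
optimizer2.zero_grad()

y_hat = model2(model1(x))
loss1 = nn.functional.mse_loss(y_hat, y)
loss1.backward()                      # fills .grad on model1 AND model2
grad_from_loss1 = model2.weight.grad.clone()
optimizer1.step()                     # updates model1 only

# Assumption: loss2 is a penalty on model2's weights, so it builds its own graph.
loss2 = 0.01 * sum(p.pow(2).sum() for p in model2.parameters())
loss2.backward()                      # accumulates into model2's existing .grad
grad_total = model2.weight.grad.clone()
optimizer2.step()                     # sees contributions from loss1 and loss2
```

Because loss2 builds its own graph in this sketch, retain_graph=True is not needed here. It becomes necessary when loss2 reuses tensors from the first forward pass, which is the situation described above.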
I still don't see how the line loss2 = ComputeLoss2(model2) makes sense, because model2 is a model, not an output or anything. Maybe that's a typo. But I understand your control flow. That is tricky to handle with what I have exposed currently (at least in a way that will also allow more general cases like yours). I'll let you know when I have a solution.
Hi, I noticed that you have updated your new API. When will the GAN example be updated?
@mcarilli I'm facing a situation where I have a single model, with one embedding layer which is sparse. I have 2 optimizers: one for the sparse embeddings (optimizer_sparse), and one for every other parameter (optimizer_dense).
What I would usually do is:
loss.backward()
optimizer_dense.step()
optimizer_sparse.step()
Now I would like to do this, but in float16. Could the current API handle this situation?
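For concreteness, the float32 version of this setup might look like the sketch below. The Tagger module, its sizes, and the choice of SparseAdam and Adam are hypothetical; the thread only says there is one sparse embedding layer and two optimizers.

```python
import torch
from torch import nn

torch.manual_seed(0)

class Tagger(nn.Module):  # hypothetical model, for illustration only
    def __init__(self, vocab=100, dim=16, classes=3):
        super().__init__()
        # sparse=True makes the embedding produce sparse gradients
        self.embed = nn.Embedding(vocab, dim, sparse=True)
        self.out = nn.Linear(dim, classes)

    def forward(self, tokens):
        return self.out(self.embed(tokens).mean(dim=1))

model = Tagger()
# One optimizer for the sparse embedding, one for every other parameter.
optimizer_sparse = torch.optim.SparseAdam(list(model.embed.parameters()), lr=1e-3)
optimizer_dense = torch.optim.Adam(model.out.parameters(), lr=1e-3)

tokens = torch.randint(0, 100, (8, 5))
labels = torch.randint(0, 3, (8,))

optimizer_dense.zero_grad()
optimizer_sparse.zero_grad()
loss = nn.functional.cross_entropy(model(tokens), labels)
loss.backward()            # a single backward pass feeds both optimizers
optimizer_dense.step()
optimizer_sparse.step()
```

The point to notice is that one backward pass produces gradients that belong to two different optimizers, which is exactly what a per-optimizer loss-scaling API has trouble expressing.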
Not quite, as it assumes a backward pass is associated with a particular optimizer. The old API did as well. I know how I can support this in the future, as well as @gregjohnso 's case, but I need a few days to implement it.
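To illustrate why tying a backward pass to a single optimizer is limiting, here is a rough sketch of static loss scaling done by hand when one backward pass feeds two optimizers. This is not how Apex implements it. It runs in float32 on CPU purely to show the bookkeeping, and the encoder/decoder modules, the scale value, and the params_of helper are assumptions made for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)

encoder, decoder = nn.Linear(4, 8), nn.Linear(8, 2)   # hypothetical modules
opt_enc = torch.optim.SGD(encoder.parameters(), lr=0.1)
opt_dec = torch.optim.Adam(decoder.parameters(), lr=1e-3)
optimizers = [opt_enc, opt_dec]
loss_scale = 1024.0   # static scale, chosen arbitrarily for this sketch

def params_of(optimizer):
    """Hypothetical helper: parameters of an optimizer that received a gradient."""
    return [p for group in optimizer.param_groups
            for p in group["params"] if p.grad is not None]

x, y = torch.randn(16, 4), torch.randn(16, 2)
for opt in optimizers:
    opt.zero_grad()

loss = nn.functional.mse_loss(decoder(encoder(x)), y)
(loss * loss_scale).backward()   # one scaled backward pass for both optimizers

# Unscale the gradients owned by every optimizer, then check them together.
all_params = [p for opt in optimizers for p in params_of(opt)]
for p in all_params:
    p.grad.div_(loss_scale)
grads_finite = all(torch.isfinite(p.grad).all().item() for p in all_params)

if grads_finite:
    for opt in optimizers:
        opt.step()
# Otherwise the step would be skipped for every optimizer, and a dynamic
# scaler would typically lower the scale before the next iteration.
```

The unscale and overflow check have to cover the parameters of both optimizers before either one steps, so the scaling logic cannot be owned by just one of them.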
@mcarilli Thank you! I am looking forward to this, since it is important in my case.
Deduplicating to #179