I am using FP16_Optimizer to train my Generative Adversarial Network. The model converges, but training is slowed down because in each iteration I get "Overflow! Skipping step. Attempted loss scale: 1, reducing to 1".
Sometimes it happens, sometimes it doesn't. I tried a lot of things to prevent it, including changing the architecture and the loss function. Each small change made it better or worse, but I don't see why this is happening.
Using static_loss_scale gives me NaNs for every value. Without FP16 training it works fine.
It would be helpful to know what the cause could be, or how I can find out. Thank you.
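For intuition about where such overflows can come from, it may help to look at the FP16 range itself. The sketch below is not apex code and does not diagnose this particular model; it only uses Python's standard struct module (format "e" is IEEE half precision) and a hypothetical helper, to_fp16, to show that a value above 65504 becomes inf in FP16 even at loss scale 1, while a very small gradient flushes to zero unless it is scaled up first.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE half-precision value


def to_fp16(x: float) -> float:
    """Round a Python float to FP16 and back (hypothetical helper)."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values; FP16 arithmetic gives inf.
        return float("inf") if x > 0 else float("-inf")


# A tiny gradient underflows to zero when stored in FP16 ...
tiny_grad = 1e-8
unscaled = to_fp16(tiny_grad)

# ... but survives if it is multiplied by a loss scale before the cast.
loss_scale = 1024.0
recovered = to_fp16(tiny_grad * loss_scale) / loss_scale

# A value that already exceeds FP16_MAX overflows even at loss scale 1,
# so lowering the loss scale further cannot rescue it.
overflowed = to_fp16(70000.0)

print(unscaled, recovered, overflowed)
```

If overflows persist at loss scale 1, one possibility (an assumption, not something confirmed in this thread) is that the inf/NaN is produced independently of the scale, for example by an FP16 activation in the forward pass.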
Update
I tried out the api_refactor branch. The dynamic loss scaler keeps reducing the loss scale until it reaches zero, and then I get a float division by zero.
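The "reduce until zero, then divide by zero" behaviour can be reproduced with a toy model of dynamic loss scaling. This is a simplified illustration, not apex's actual implementation: the class ToyDynamicScaler, its halving/doubling policy, and the growth_interval and min_scale parameters are all assumptions made for the sketch.

```python
class ToyDynamicScaler:
    """Simplified dynamic loss scaler (an illustration, not apex's code)."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000, min_scale=None):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.min_scale = min_scale
        self.clean_steps = 0

    def update(self, overflow: bool) -> bool:
        """Adjust the scale; return True if the optimizer step should be skipped."""
        if overflow:
            self.scale /= 2.0
            if self.min_scale is not None:
                self.scale = max(self.scale, self.min_scale)
            self.clean_steps = 0
            return True
        self.clean_steps += 1
        if self.clean_steps == self.growth_interval:
            self.scale *= 2.0
            self.clean_steps = 0
        return False


# If every iteration overflows, an unbounded scaler halves its way down to 0.0 ...
unbounded = ToyDynamicScaler()
for _ in range(1100):
    unbounded.update(overflow=True)

# ... and unscaling a gradient (grad / scale) then raises ZeroDivisionError.
try:
    1.0 / unbounded.scale
    hit_zero_division = False
except ZeroDivisionError:
    hit_zero_division = True

# A floor keeps the scale usable, but does not remove the cause of the overflows.
bounded = ToyDynamicScaler(min_scale=1.0)
for _ in range(1100):
    bounded.update(overflow=True)
```

In this toy model a scale that keeps shrinking is a symptom rather than the problem: something produces inf/NaN on every iteration regardless of the scale.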
GANs are especially tricky because both optimizers need to skip their step and reduce the loss scale if either one encounters an overflow, so there needs to be some communication between them. I will rig this under the hood but it may require moving the calls to optimizerX.step() on the Python side. Without this cross-optimizer communication, I'm not surprised you're seeing infs/nans, but it's not cause for alarm...yet.
I'm preparing a comprehensive example based on upstream DCGAN, but it may take me a couple days. I'm sorry the API refactor is dragging out longer than expected, I'm doing all I can.
New unified automatic mixed precision API is merged into master, and people are strongly encouraged to switch.
https://nvidia.github.io/apex/amp.html#
https://nvidia.github.io/apex/amp.html#transition-guide-for-old-api-users
It's still not quite possible to implement a GAN properly with the information that's posted, but I'm working on a comprehensive example, which I plan to finish in the next couple days.
Thank you
opt_level O2 still overflows until it hits a float division by zero, but opt_level O1 seems to work nicely.
I will check it for different models.
That's great to hear, but also a little surprising. If you wrap each backward pass naively with the new API, using the ordinary
with amp.scale_loss(loss1, optimizer1) as scaled_loss:
    scaled_loss.backward()
it won't error, but it's possible that during any iteration, one optimizer might detect an overflow and skip its step while the other optimizer doesn't. I have no idea what numerical effect that imbalance might have. It might train fine, or it might create unrecoverable NaNs/worse converged accuracy.
I am going to post an example showing how to use the API to ensure that both optimizers skip the step together, if either one detects an overflow. I'm glad your script is working, but I think having both optimizers decide whether or not to skip the step together is going to be more robust for GANs in general.
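The idea of both optimizers deciding together can be sketched without any framework. The code below is only an illustration of the control flow, not the apex mechanism: RecordingOptimizer, has_overflow, and joint_step are hypothetical names, and gradients are plain lists of floats.

```python
import math


class RecordingOptimizer:
    """Stand-in for a real optimizer; it only counts calls to step()."""

    def __init__(self):
        self.steps = 0

    def step(self):
        self.steps += 1


def has_overflow(grads):
    """True if any gradient value is inf or NaN."""
    return any(not math.isfinite(g) for g in grads)


def joint_step(optimizers_and_grads):
    """Step every optimizer, or none of them if any gradient overflowed."""
    if any(has_overflow(grads) for _, grads in optimizers_and_grads):
        return False  # skip the step for all optimizers together
    for optimizer, _ in optimizers_and_grads:
        optimizer.step()
    return True


opt_d, opt_g = RecordingOptimizer(), RecordingOptimizer()

# Iteration 1: all gradients are finite, so both optimizers step.
first = joint_step([(opt_d, [0.1, -0.2]), (opt_g, [0.3, 0.4])])

# Iteration 2: the generator overflowed, so the discriminator skips as well.
second = joint_step([(opt_d, [0.1, -0.2]), (opt_g, [float("inf"), 0.4])])
```

After these two iterations each optimizer has stepped exactly once, so neither model gets an update the other one missed.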
It really depends on the architecture I am using. I had luck with the first one.
On the second architecture, opt_level O1 gives me only NaNs and infs. opt_level O2 works, but I need to cast my input with half(), otherwise I get the error "Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same".
Since I am using a progressively growing GAN, I need to cast my model back to float() after growing, otherwise I get the "you dont need to call .half() on your model" error. This happens because amp.initialize is called every time the model grows, but the model has already been set to half().
Edit: After a small bugfix, opt_level O1 works on this architecture as well.
Still working on a GAN example, I've had some other fires come up, and I have to give a talk at GTC next Wednesday. This is important to me, but I probably won't get a chance to work on it until after my talk.
Hi @Beinabih, I also tried a GAN example and it runs into 'Gradient overflow. Skipping step, reducing loss scale to xxx' until I get a float division by zero.
Does skipping steps affect the accuracy of the model?
Do you have any suggestions for the model to avoid this problem?
Thanks a lot!
Making this the master thread to track various multiple optimizer/model/loss scenarios, including GANs.
I'm working on enabling more flexible loss scaling that should accommodate all the cases I've heard so far https://github.com/NVIDIA/apex/commit/5e55200404f721d54a1ac1f82877addfe425f31a. This week is GTC and I have a talk to give, so my aim is to have the Python side done by next week. Sorry this keeps dragging out, it seems like there's always some fire or other cropping up.
Hey @1900zyh, for me it works when I follow the amp guide, use Conservative Mixed Precision, and initialize it this way:
[model1, model2], [optim1, optim2] = amp.initialize([model1, model2], [optim1, optim2],...)
Fast Mixed Precision still overflows until float division.
Thanks a lot @Beinabih. I initialized the models and optimizers in the suggested way; however, I'm still suffering from gradient overflows using conservative mixed precision (opt_level=O2). Do you have any other suggestions?
Thanks again!
@1900zyh conservative mixed precision is O1, not O2.
I've got a branch with optimizer handling rewritten to treat arbitrary combination of models, optimizers and losses: https://github.com/NVIDIA/apex/pull/232. The user-facing changes are negligible (an additional optional num_losses argument to amp.initialize, and an additional optional loss_id argument to amp.scale_loss).
I've got some tests for it now that are passing, but I don't think it's ready to merge yet. @gregjohnso your control flow from https://github.com/NVIDIA/apex/issues/163#issuecomment-468086500 would also be a good test. Do you have a minimal working sample, or can you explain in more detail what ComputeLoss2 is doing? Specifically, looking back at it, I'm still not sure why loss2.backward() requires the model2 section of loss1's subgraph to be retained.
Merged. As I said earlier, the user-facing changes are negligible:
num_losses argument to amp.initialize
loss_id argument to amp.scale_loss
delay_unscale argument to amp.scale_loss should never be necessary anymore. In fact I don't recommend it because it can result in weird gotchas.
I'm still on the hook to post an actual GAN example, but in the meantime, here is the updated guidance:
https://nvidia.github.io/apex/advanced.html#multiple-models-optimizers-losses
The new backend should support arbitrary combinations of models/optimizers/losses (including GANs) as long as you adhere to this guidance. Let me know if anyone remains blocked.
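As a rough illustration of how num_losses and loss_id might fit into a GAN loop, here is a minimal sketch. It is not the promised official example. The function names setup_gan and gan_iteration, the two-loss layout, and the loss callables are assumptions; amp is passed in as an argument only so the sketch can be read without apex installed (in a real script it would come from `from apex import amp`).

```python
def setup_gan(amp, net_d, net_g, opt_d, opt_g, opt_level="O1"):
    """Initialize both models and both optimizers in a single amp.initialize call."""
    (net_d, net_g), (opt_d, opt_g) = amp.initialize(
        [net_d, net_g],
        [opt_d, opt_g],
        opt_level=opt_level,
        num_losses=2,  # assumption: one discriminator loss, one generator loss
    )
    return net_d, net_g, opt_d, opt_g


def gan_iteration(amp, opt_d, opt_g, compute_d_loss, compute_g_loss):
    """One discriminator update followed by one generator update."""
    # Discriminator: scale its loss under loss_id=0.
    opt_d.zero_grad()
    loss_d = compute_d_loss()
    with amp.scale_loss(loss_d, opt_d, loss_id=0) as scaled_loss:
        scaled_loss.backward()
    opt_d.step()

    # Generator: scale its loss under loss_id=1.
    opt_g.zero_grad()
    loss_g = compute_g_loss()
    with amp.scale_loss(loss_g, opt_g, loss_id=1) as scaled_loss:
        scaled_loss.backward()
    opt_g.step()

    return loss_d, loss_g
```

Here compute_d_loss and compute_g_loss stand for whatever forward passes produce the two losses. The linked guidance remains the authoritative description of how the losses and optimizers should be paired.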
Not sure if referencing the scattered related issues successfully tagged everyone so @Lausannen @MichaelDylan77 @TDeVries @ando-khachatryan @toemm @jshanna100 @LightToYang
@gregjohnso can you confirm the new changes enable your use case? Calling scaled_loss.backward(retain_graph=True) within one or more of the backward context managers should be permissible/not affect Amp's operation.
@mcarilli Sorry for my late reply, I will check the new update when I finish my recent training. Thanks for your awesome work !
@mcarilli I will look into it in the next few days. Thanks for the update!
Hi @mcarilli,
I'm observing similar problems as described by others in this thread; O1 works just fine, but O2 leads to gradient overflows. The loss_scale is reduced continuously towards 0. Limiting the loss_scale to min_loss_scale=1.0 does not help, amp keeps on reporting gradient overflows continuously.
Some information about my system:
Ubuntu 16.04
3x 1080Ti
CUDA 10.0
CuDNN 7.4.1
Apex 0.1
In this case I'm training a reference chatbot implementation, derived from the PyTorch chatbot tutorial. To train this sequence to sequence model, two optimisers are used, one for the encoder model and one for the decoder model.
Although, at this moment, it is a bit hard to share all my code, below you will find the relevant code snippets.
If you have any ideas on why I get the gradient overflows in my case I'm interested to know.
Thanks in advance,
-- Freddy
The training_model below calculates the loss when evaluated.
# Initialize word embeddings
embedding = nn.Embedding(voc.num_words, embedding_size)
# Initialize encoder & decoder models
encoder = EncoderRNN(encoder_state_size, embedding, encoder_n_layers, dropout)
decoder = LuongAttnDecoderRNN(attn_model, embedding, encoder_state_size, voc.num_words, decoder_n_layers, dropout)
training_model = Seq2SeqTrainModel(encoder, decoder)
logger.info(f"Transfer training model to device {device} ...")
# Use appropriate device
training_model = training_model.to(device)
# Initialize optimizers
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate*decoder_learning_ratio)
if use_mixed_precision:
    opt_level = args.mixed_precision_opt_level
    training_model, [encoder_optimizer, decoder_optimizer] = \
        amp.initialize(training_model,
                       [encoder_optimizer, decoder_optimizer],
                       opt_level=opt_level,
                       loss_scale="dynamic",
                       min_loss_scale=1.0)
if single_process_data_parallel:
    training_model = nn.DataParallel(training_model, dim=1)
elif distributed:
    if not use_mixed_precision:
        training_model = PT_DDP(training_model,
                                device_ids=[args.local_rank],
                                output_device=args.local_rank,
                                dim=1)
    else:
        training_model = APEX_DDP(training_model)
if not use_mixed_precision:
    super()._back_propagate_from(loss)
else:
    with amp.scale_loss(loss, self.optimizers.values()) as scaled_loss:
        scaled_loss.backward()
if not use_mixed_precision:
    _ = nn.utils.clip_grad_norm_(self.training_model.parameters(), clip)
else:
    for optimizer in self.optimizers.values():
        _ = nn.utils.clip_grad_norm_(amp.master_params(optimizer), clip)
Hi guys, I am trying to use two losses with a single model and a single optimizer. The second loss shares a subgraph with the first loss, so I use retain_graph=True, but that didn't work. Can you help me figure out what's wrong with my code?
@mcarilli
optimizer = torch.optim.SGD(...)
optimizer = LARC(optimizer=optimizer, trust_coefficient=0.001, clip=False)
model, optimizer = amp.initialize(model, optimizer, num_losses=2, opt_level='O1')
model = apex.parallel.DistributedDataParallel(model)
optimizer.zero_grad()
with amp.scale_loss(loss1, optimizer, loss_id=0) as scaled_loss:
    scaled_loss.backward(retain_graph=True)
with amp.scale_loss(loss2, optimizer, loss_id=1) as scaled_loss:
    scaled_loss.backward()  # <-- line 396 in the traceback
optimizer.step()
and the error log
Traceback (most recent call last):
  File "/home/xx/venv/py35torch1.4apex/lib/python3.5/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/xx/proto/train_pplv4.py", line 297, in main_worker
    train(train_loader, model, optimizer, epoch, args, criterion_list)
  File "/home/xx/proto/train_pplv4.py", line 396, in train
    scaled_loss.backward()
  File "/home/xx/venv/py35torch1.4apex/lib/python3.5/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/xx/venv/py35torch1.4apex/lib/python3.5/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/xx/venv/py35torch1.4apex/lib/python3.5/site-packages/apex/parallel/distributed.py", line 400, in allreduce_hook
    self.comm_ready_buckets(param)
  File "/home/xx/venv/py35torch1.4apex/lib/python3.5/site-packages/apex/parallel/distributed.py", line 519, in comm_ready_buckets
    bucket_idx, bucket_loc = self.param_id_to_bucket[id(param)]
KeyError: 140642903453560