Apex: how to prevent overflow

Created on 5 Mar 2019 · 19 Comments · Source: NVIDIA/apex

I am using the FP16_Optimizer to train my Generative Adversarial Network. My model converges, but training is slowed down because in each iteration I get "Overflow! Skipping step. Attempted loss scale: 1, reducing to 1".

Sometimes it happens, sometimes it doesn't. I tried a lot of things to prevent it, such as changing the architecture or the loss function. Each small change made it better or worse, but I don't see why this is happening.

Using static_loss_scale gives me NaNs for every value. Without FP16 training, everything works fine.

It would be helpful to know what the cause could be or how I can find out. Thank you.

Beinabih
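
As general background for the question above (not a diagnosis of this particular model), the following minimal sketch shows why FP16 training uses a loss scale at all and why a scaled value can overflow. It uses only Python's standard struct module to round values to half precision. The helper to_fp16 and the example numbers are hypothetical and chosen purely for illustration.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(value: float) -> float:
    """Round a Python float to half precision and back (illustrative helper).

    Returns +/-inf when the magnitude is too large for fp16.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range; treat as overflow.
        return float("inf") if value > 0 else float("-inf")


tiny_grad = 1e-8     # hypothetical small gradient value
large_grad = 100.0   # hypothetical large gradient value
loss_scale = 1024.0  # hypothetical loss scale

print(to_fp16(tiny_grad))                # underflows to 0.0 without scaling
print(to_fp16(tiny_grad * loss_scale))   # nonzero once scaled
print(to_fp16(large_grad * loss_scale))  # 102400 exceeds FP16_MAX, so inf
```

Scaling keeps small gradients from underflowing to zero, but the same factor can push large gradients past the FP16 range. A dynamic loss scaler reacts to that by skipping the step and lowering the scale, which matches the kind of message quoted in the question.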

Most helpful comment

Merged. As I said earlier, the user-facing changes are negligible:

a new optional num_losses argument to amp.initialize
a new optional loss_id argument to amp.scale_loss
the delay_unscale argument to amp.scale_loss should never be necessary anymore. In fact I don't recommend it because it can result in weird gotchas.

I'm still on the hook to post an actual GAN example, but in the meantime, here is the updated guidance:
https://nvidia.github.io/apex/advanced.html#multiple-models-optimizers-losses
The new backend should support arbitrary combinations of models/optimizers/losses (including GANs) as long as you adhere to this guidance. Let me know if anyone remains blocked.

mcarilli on 4 Apr 2019
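
A rough sketch of how the num_losses and loss_id arguments mentioned above might be used in one GAN training step, with one loss_id per backward pass. The function name gan_train_step and all parameter names are hypothetical. The amp module is passed in as a parameter so the sketch itself does not import apex or torch; in real use it would be apex's amp, after the models and optimizers have gone through amp.initialize.

```python
# Setup (not executed here), assuming apex is installed:
#   from apex import amp
#   [netD, netG], [optD, optG] = amp.initialize(
#       [netD, netG], [optD, optG], opt_level="O1", num_losses=3)


def gan_train_step(amp, netD, netG, optD, optG, criterion,
                   real, noise, real_label, fake_label):
    """One GAN iteration with a separate loss_id for each of the three losses."""
    # Discriminator update: real batch, then detached fake batch.
    optD.zero_grad()
    loss_real = criterion(netD(real), real_label)
    with amp.scale_loss(loss_real, optD, loss_id=0) as scaled_loss:
        scaled_loss.backward()

    fake = netG(noise)
    loss_fake = criterion(netD(fake.detach()), fake_label)
    with amp.scale_loss(loss_fake, optD, loss_id=1) as scaled_loss:
        scaled_loss.backward()
    optD.step()

    # Generator update: try to make the discriminator label fakes as real.
    optG.zero_grad()
    loss_gen = criterion(netD(fake), real_label)
    with amp.scale_loss(loss_gen, optG, loss_id=2) as scaled_loss:
        scaled_loss.backward()
    optG.step()

    return loss_real, loss_fake, loss_gen
```

Giving each loss its own loss_id lets Amp track a separate loss scale per backward pass. The linked guidance remains the authoritative description of the intended usage.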

All 19 comments

Update
I tried out the api_refactor branch. The dynamic loss scaler reduces the loss scale until it reaches zero, and then I get a float division by zero.

Beinabih on 6 Mar 2019

GANs are especially tricky because both optimizers need to skip their step and reduce the loss scale if either one encounters an overflow, so there needs to be some communication between them. I will rig this under the hood but it may require moving the calls to optimizerX.step() on the Python side. Without this cross-optimizer communication, I'm not surprised you're seeing infs/nans, but it's not cause for alarm...yet.

I'm preparing a comprehensive example based on upstream DCGAN, but it may take me a couple days. I'm sorry the API refactor is dragging out longer than expected, I'm doing all I can.

mcarilli on 6 Mar 2019
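
The cross-optimizer coordination described in the comment above can be sketched in plain Python. This is a toy illustration only, not Apex's implementation: the class SharedDynamicLossScaler, its parameters, and the use of plain float lists as stand-ins for gradients are all assumptions made for the example.

```python
import math


class SharedDynamicLossScaler:
    """Toy dynamic loss scaler shared by several optimizers (illustrative only)."""

    def __init__(self, init_scale=2.0 ** 16, min_scale=1.0, growth_interval=2000):
        self.scale = init_scale
        self.min_scale = min_scale              # floor so the scale never reaches 0
        self.growth_interval = growth_interval  # clean steps before growing again
        self._clean_steps = 0

    def should_step(self, *grad_groups):
        """Return True only if every gradient in every group is finite.

        On overflow, halve the scale (never below min_scale) and return False,
        so that all optimizers skip the step together.
        """
        overflow = any(
            not math.isfinite(g) for grads in grad_groups for g in grads
        )
        if overflow:
            self.scale = max(self.scale / 2.0, self.min_scale)
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            self.scale *= 2.0
            self._clean_steps = 0
        return True


scaler = SharedDynamicLossScaler(init_scale=4.0)
d_grads = [0.1, -0.3]          # hypothetical discriminator gradients
g_grads = [float("inf"), 0.2]  # hypothetical generator gradients with an overflow

if scaler.should_step(d_grads, g_grads):
    print("step both optimizers")
else:
    print("skip both steps, new scale:", scaler.scale)
```

The point of the design is that a single decision covers both optimizers: an overflow in either gradient group means neither optimizer steps, and the shared scale is reduced once.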

New unified automatic mixed precision API is merged into master, and people are strongly encouraged to switch.
https://nvidia.github.io/apex/amp.html#
https://nvidia.github.io/apex/amp.html#transition-guide-for-old-api-users
It's still not quite possible to implement a GAN properly with the information that's posted, but I'm working on a comprehensive example, which I plan to finish in the next couple days.

mcarilli on 7 Mar 2019

Thank you.
With opt_level O2 it still overflows until a float division by zero occurs, but opt_level O1 seems to work nicely.
I will check it for different models.

Beinabih on 8 Mar 2019

That's great to hear, but also a little surprising. If you wrap each backward pass naively with the new API, using the ordinary

with amp.scale_loss(loss1, optimizer1) as scaled_loss:
    scaled_loss.backward()

it won't error, but it's possible that during any iteration, one optimizer might detect an overflow and skip its step while the other optimizer doesn't. I have no idea what numerical effect that imbalance might have. It might train fine, or it might create unrecoverable NaNs/worse converged accuracy.

I am going to post an example showing how to use the API to ensure that both optimizers skip the step together, if either one detects an overflow. I'm glad your script is working, but I think having both optimizers decide whether or not to skip the step together is going to be more robust for GANs in general.

mcarilli on 8 Mar 2019

It really depends on the architecture I am using. I had luck with the first one.

On the second architecture, opt_level O1 gives me only NaNs and infs. opt_level O2 works, but I need to cast my input with half(); otherwise I get the error "Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same".

Since I am using a progressively growing GAN, I need to cast my model back to float() after growing; otherwise I get the "you dont need to call .half() on your model" error. This happens because amp.initialize is called every time the model grows, but the model is already set to half().

Edit: After a small bugfix, opt_level O1 works on this architecture as well.

Beinabih on 9 Mar 2019

Still working on a GAN example, I've had some other fires come up, and I have to give a talk at GTC next Wednesday. This is important to me, but I probably won't get a chance to work on it until after my talk.

mcarilli on 14 Mar 2019

Hi @Beinabih, I also tried a GAN example and it runs into 'Gradient overflow. Skipping step, reducing loss scale to xxx' until I get a float division by zero.

Does skipping steps affect the accuracy of the model?
Do you have any suggestions on how to adjust the model to avoid this problem?

Thanks a lot!

1900zyh on 18 Mar 2019

Making this the master thread to track various multiple optimizer/model/loss scenarios, including GANs.

I'm working on enabling more flexible loss scaling that should accommodate all the cases I've heard so far (https://github.com/NVIDIA/apex/commit/5e55200404f721d54a1ac1f82877addfe425f31a). This week is GTC and I have a talk to give, so my aim is to have the Python side done by next week. Sorry this keeps dragging out, it seems like there's always some fire or other cropping up.

mcarilli on 19 Mar 2019

Hey @1900zyh, for me it works when I follow the amp guide, use Conservative Mixed Precision, and initialize it this way:
[model1, model2], [optim1, optim2] = amp.initialize([model1, model2], [optim1, optim2],...)

Fast Mixed Precision still overflows until the float division by zero.

Beinabih on 19 Mar 2019

Thanks a lot @Beinabih. I initialized the models and optimizers in the suggested way; however, I'm still suffering from gradient overflows using conservative mixed precision (opt_level=O2). Do you have any other suggestions?

Thanks again!

1900zyh on 20 Mar 2019

@1900zyh conservative mixed precision is O1, not O2.

zplizzi on 21 Mar 2019

I've got a branch with optimizer handling rewritten to treat arbitrary combinations of models, optimizers and losses: https://github.com/NVIDIA/apex/pull/232. The user-facing changes are negligible (an additional optional num_losses argument to amp.initialize, and an additional optional loss_id argument to amp.scale_loss).

I've got some tests for it now that are passing, but I don't think it's ready to merge yet. @gregjohnso your control flow from https://github.com/NVIDIA/apex/issues/163#issuecomment-468086500 would also be a good test. Do you have a minimal working sample, or can you explain in more detail what ComputeLoss2 is doing? Specifically, looking back at it, I'm still not sure why loss2.backward() requires the model2 section of loss1's subgraph to be retained.

mcarilli on 31 Mar 2019

Not sure if referencing the scattered related issues successfully tagged everyone, so: @Lausannen @MichaelDylan77 @TDeVries @ando-khachatryan @toemm @jshanna100 @LightToYang

@gregjohnso can you confirm the new changes enable your use case? Calling scaled_loss.backward(retain_graph=True) within one or more of the backward context managers should be permissible/not affect Amp's operation.

mcarilli on 5 Apr 2019

@mcarilli Sorry for my late reply, I will check the new update when I finish my current training run. Thanks for your awesome work!

Lausannen on 6 Apr 2019

@mcarilli I will look into it in the next few days. Thanks for the update!

gregjohnso on 11 Apr 2019

Hi @mcarilli,

I'm observing problems similar to those described by others in this thread: O1 works just fine, but O2 leads to gradient overflows. The loss_scale is reduced continuously towards 0. Limiting the loss_scale with min_loss_scale=1.0 does not help; amp keeps reporting gradient overflows.

Some information about my system:

 Ubuntu 16.04
 3x 1080Ti
 CUDA 10.0
 CuDNN 7.4.1
 Apex 0.1

In this case I'm training a reference chatbot implementation, derived from the PyTorch chatbot tutorial. To train this sequence to sequence model, two optimisers are used, one for the encoder model and one for the decoder model.

Although, at this moment, it is a bit hard to share all my code, below you will find the relevant code snippets.

If you have any ideas on why I get the gradient overflows in my case I'm interested to know.

Thanks in advance,

-- Freddy

The training_model below calculates the loss when evaluated.

# Initialize word embeddings
embedding = nn.Embedding(voc.num_words, embedding_size)

# Initialize encoder & decoder models
encoder = EncoderRNN(encoder_state_size, embedding, encoder_n_layers, dropout)
decoder = LuongAttnDecoderRNN(attn_model, embedding, encoder_state_size, voc.num_words, decoder_n_layers, dropout)

training_model = Seq2SeqTrainModel(encoder, decoder)

logger.info(f"Transfer training model to device {device} ...")
# Use appropriate device
training_model = training_model.to(device)

# Initialize optimizers
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate*decoder_learning_ratio)




if use_mixed_precision:
    opt_level = args.mixed_precision_opt_level
    training_model, [encoder_optimizer, decoder_optimizer] = \
        amp.initialize(training_model,
                       [encoder_optimizer, decoder_optimizer],
                       opt_level=opt_level,
                       loss_scale="dynamic",
                       min_loss_scale=1.0)




if single_process_data_parallel:
    training_model = nn.DataParallel(training_model, dim=1)
elif distributed:
    if not use_mixed_precision:
        training_model = PT_DDP(training_model,
                                device_ids=[args.local_rank],
                                output_device=args.local_rank,
                                dim=1)
    else:
        training_model = APEX_DDP(training_model)




        if not use_mixed_precision:
            super()._back_propagate_from(loss)
        else:
            with amp.scale_loss(loss, self.optimizers.values()) as scaled_loss:
                scaled_loss.backward()




        if not use_mixed_precision:
            _ = nn.utils.clip_grad_norm_(self.training_model.parameters(), clip)
        else:
            for optimizer in self.optimizers.values():
                _ = nn.utils.clip_grad_norm_(amp.master_params(optimizer), clip)

visionscaper on 16 Sep 2019

Hi guys, I am trying to use two losses with a single model and a single optimizer. The second loss shares a subgraph with the first loss, so I use retain_graph=True, but that didn't work. Can you help me figure out what's wrong with my code?
@mcarilli

optimizer = torch.optim.SGD(...)
optimizer = LARC(optimizer=optimizer, trust_coefficient=0.001, clip=False)
model, optimizer = amp.initialize(model, optimizer, num_losses=2, opt_level='O1')
model = apex.parallel.DistributedDataParallel(model)

optimizer.zero_grad()
with amp.scale_loss(loss1, optimizer, loss_id=0) as scaled_loss:
    scaled_loss.backward(retain_graph=True)
with amp.scale_loss(loss2, optimizer, loss_id=1) as scaled_loss:
    scaled_loss.backward()  # <-- line 396 in the traceback below
optimizer.step()

And the error log:

Traceback (most recent call last):
  File "/home/xx/venv/py35torch1.4apex/lib/python3.5/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/xx/proto/train_pplv4.py", line 297, in main_worker
    train(train_loader, model, optimizer, epoch, args, criterion_list)
  File "/home/xx/proto/train_pplv4.py", line 396, in train
    scaled_loss.backward()
  File "/home/xx/venv/py35torch1.4apex/lib/python3.5/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/xx/venv/py35torch1.4apex/lib/python3.5/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/xx/venv/py35torch1.4apex/lib/python3.5/site-packages/apex/parallel/distributed.py", line 400, in allreduce_hook
    self.comm_ready_buckets(param)
  File "/home/xx/venv/py35torch1.4apex/lib/python3.5/site-packages/apex/parallel/distributed.py", line 519, in comm_ready_buckets
    bucket_idx, bucket_loc = self.param_id_to_bucket[id(param)]
KeyError: 140642903453560