Apex: Gradient Overflow Until ZeroFloatDivision on WGAN-GP

Created on 15 Dec 2019 · 4 Comments · Source: NVIDIA/apex

Hello everyone,
I am trying to train a WGAN-GP model with the O1 and O2 opt levels, but during the gradient penalty phase I get repeated gradient overflow errors until a division-by-zero exception is raised. This happens with both O1 and O2. I searched for WGAN-GP code that uses apex and found this:

https://github.com/hukkelas/progan-pytorch/blob/master/src/models/loss.py
It uses a separate scaled loss and scaling factor for the GP. However, this approach also gives the same error.

My code for the GP calculation is as follows:

# Imports assumed by this snippet (torch_grad is taken to be torch.autograd.grad)
import torch
from torch.autograd import grad as torch_grad

def _gradient_penalty(self, real_data, generated_data, gp_weight):

    batch_size = real_data.size(0)

    # Calculate interpolation; alpha must live on the same device and dtype as the data
    alpha = torch.rand(batch_size, 1, 1, device=real_data.device)
    alpha = alpha.expand_as(real_data)
    alpha = alpha.to(real_data.dtype)  # .to() is not in-place, so assign the result

    # Detach both inputs so the interpolation becomes a fresh leaf tensor
    interpolated = alpha * real_data.detach() + (1 - alpha) * generated_data.detach()
    interpolated.requires_grad_(True)

    # Calculate probability of interpolated examples
    prob_interpolated = self.discriminator(interpolated)

    # Calculate gradients of probabilities with respect to examples
    gradients = torch_grad(outputs=prob_interpolated, inputs=interpolated,
                           grad_outputs=torch.ones_like(prob_interpolated),
                           create_graph=True, retain_graph=True)[0]

    gradients = gradients.view(gradients.size(0), -1)
    gradient_norm = gradients.norm(2, dim=1)
    gradient_penalty = ((gradient_norm - 1) ** 2).mean()
    return gp_weight * gradient_penalty

gradient_penalty = self._gradient_penalty(x, generated_data, gp_weight)
d_gen = self.discriminator(generated_data)
d_real = self.discriminator(x)
d_loss = d_gen.mean() - d_real.mean() + gradient_penalty
with amp.scale_loss(d_loss, self.disc_opt, loss_id=1) as scaled_loss:
    scaled_loss.backward()

Could you please help me with this?
cc @mcarilli
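
The path from repeated "gradient overflow" messages to a division by zero can be illustrated with a toy dynamic loss scaler. The sketch below is a simplified model, not apex's actual implementation: the class name `ToyLossScaler`, the initial scale of 2**16, and the halve-on-overflow rule are assumptions chosen to mirror how dynamic loss scaling generally behaves. If every step overflows, the scale keeps halving until it underflows to 0.0, and unscaling the gradients then divides by zero.

```python
class ToyLossScaler:
    """Toy dynamic loss scaler (hypothetical, for illustration only)."""

    def __init__(self, init_scale=2.0 ** 16):
        self.scale = init_scale

    def update(self, overflow):
        # Halve the scale whenever the scaled gradients contained inf/NaN.
        if overflow:
            self.scale /= 2.0

    def unscale(self, scaled_grad):
        # Raises ZeroDivisionError once the scale has underflowed to 0.0.
        return scaled_grad / self.scale


def steps_until_zero_division(scaler, max_steps=5000):
    """Count consecutive overflowing steps before unscaling fails."""
    for step in range(1, max_steps + 1):
        scaler.update(overflow=True)  # assume every single step overflows
        try:
            scaler.unscale(1.0)
        except ZeroDivisionError:
            return step
    return None


steps = steps_until_zero_division(ToyLossScaler())
print(f"unscaling failed after {steps} consecutive overflows")
```

Under this model the exception is only a symptom: the underlying question is why the gradient penalty overflows on every iteration regardless of how small the scale becomes, which suggests inf or NaN values that do not depend on the loss scale.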

All 4 comments

I have encountered the same problem.
Have you managed to solve your problem?

Unfortunately, I did not manage to solve the problem.

I ran into much the same problem, unfortunately T_T. Any suggestions?

Hello,

I had an issue with the same symptoms in one of my projects recently. Although I didn't find the root cause, I found a hacky way around it.

The overflow occurred for me when I used 4 optimizers for 4 networks initialized as follows:

def init_amp(self):
    # mixed precision training
    models = [self.netG, self.netDec, self.netEnc, self.netD]
    optims = [self.optimizer_G, self.optimizer_Dec, self.optimizer_Enc, self.optimizer_D]
    models, optims = amp.initialize(models, optims, opt_level="O1", num_losses=4)
    self.netG, self.netDec, self.netEnc, self.netD = models
    self.optimizer_G, self.optimizer_Dec, self.optimizer_Enc, self.optimizer_D = optims

And the backward passes were done as follows (Generator side was done identically):

if self.opt.amp == 1:
    with amp.scale_loss(self.loss_D, self.optimizer_D, loss_id=0) as scaled_loss:
        scaled_loss.backward(retain_graph=True)
    with amp.scale_loss(self.loss_D, self.optimizer_Enc, loss_id=1) as scaled_loss:
        scaled_loss.backward()
else:
    self.loss_D.backward()
self.optimizer_D.step()
self.optimizer_Enc.step()

In my code, some of the networks shared the same losses. The problem went away when I put the networks that share a loss under the same optimizer, so that I now have 4 networks and 2 optimizers, initialized as follows:

def init_amp(self):
    # mixed precision training
    models = [nn.Sequential(self.netG, self.netDec), nn.Sequential(self.netEnc, self.netD)]
    optims = [self.optimizer_G, self.optimizer_D]
    models, optims = amp.initialize(models, optims, opt_level="O1", num_losses=2)
    self.netG, self.netDec = models[0].children()
    self.netEnc, self.netD = models[1].children()
    self.optimizer_G, self.optimizer_D = optims

After doing this and also changing the backward passes correspondingly, the overflow no longer occurred. Changing back to 4 optimizers immediately reproduces the issue for me.

I hope this helps someone with a similar issue. Sorry for the ugly code; I couldn't figure out a cleaner way, and I didn't find any documentation for this kind of arrangement with multiple networks under the same optimizer.
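
One possibly cleaner way to express the grouping described above is to build each optimizer over the chained parameters of the networks that share a loss. The following is a minimal sketch in plain PyTorch only: the tiny `nn.Linear` stand-ins, the Adam optimizer, and the learning rate are assumptions, and whether this arrangement also avoids the overflow under apex has not been verified here.

```python
import itertools

import torch
from torch import nn

# Hypothetical stand-ins for the four networks mentioned in the comment.
netG, netDec = nn.Linear(4, 4), nn.Linear(4, 4)
netEnc, netD = nn.Linear(4, 4), nn.Linear(4, 1)

# One optimizer per loss, each covering both networks that share that loss.
optimizer_G = torch.optim.Adam(
    itertools.chain(netG.parameters(), netDec.parameters()), lr=1e-4
)
optimizer_D = torch.optim.Adam(
    itertools.chain(netEnc.parameters(), netD.parameters()), lr=1e-4
)

# A single backward pass and a single step now update both netEnc and netD.
x = torch.randn(8, 4)
loss_D = netD(netEnc(x)).mean()
optimizer_D.zero_grad()
loss_D.backward()
optimizer_D.step()
```

With this layout the networks remain separate modules, so no `nn.Sequential` wrapping or `children()` unpacking is needed just to group their parameters.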
