Apex: ZeroDivisionError: float division by zero

Created on 13 Jun 2019 · 8Comments · Source: NVIDIA/apex

Hi I am having problem when I tried implementing GPT2. After some iterations I am getting an float division by zero error. I don't know why it is so .

optimizer = OpenAIAdam(optimizer_grouped_parameters, 
                       lr=lr,
                       warmup=0.05,
                       t_total=num_train_optimization_steps)

model, optimizer = amp.initialize(model, optimizer, opt_level="O1",verbosity=0)

for epoch in range(EPOCHS):
    optimizer.zero_grad()
    for batch,(X_train,y_train,weights) in tqdm(enumerate(train_loader),total=len(train_loader),leave=False):
        X_train = X_train.cuda()
        y_train = y_train.cuda()
        weights = weights.cuda()
        y_pred = model.forward(X_train)
        loss = loss_fn(y_train,y_pred,weights)
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        if (batch+1) % accumulation_steps == 0:             # Wait for several backward steps
            optimizer.step()                            # Now we can do an optimizer step
            optimizer.zero_grad()

The error screen shot I am attaching :
one

As you can see after 1091 iterations I am getting this error.

Source

suchithtuple

Most helpful comment

So I got to know where I am getting the problem.
In my loss function

class BinaryFocalLoss(nn.Module):
    def __init__(self,gamma=0,eps=1e-8):
        super().__init__()
        self.gamma=gamma
        self.eps=eps
        self.sigmoid = nn.Sigmoid()

    def forward(self,y_true,y_pred,weights=None):

        y_pred = self.sigmoid(y_pred)

        if weights is None:
            weights = torch.ones((y_true.shape[0],)).type('torch.FloatTensor').cuda()

        y_pred = torch.clamp(y_pred,self.eps,1-self.eps)
        m = y_pred.shape[0]

        if len(y_true.shape)==1:
            # changing the shape of y_pred and y_true
            y_true = torch.unsqueeze(y_true,1)
            y_pred = torch.unsqueeze(y_pred,1)

        loss = torch.sum(y_true*torch.pow(1-y_pred,self.gamma)*torch.log(y_pred) + \
               (1-y_true)*torch.pow(y_pred,self.gamma)*torch.log(1-y_pred),dim=1)

        loss = -torch.sum(weights*loss)/m

        return loss

I am using clamp of 1e-8 for predictions. these are my predictions for 1st iteration :

(Pdb) x
tensor([9.9707e-01, 9.7266e-01, 9.6777e-01, 1.0000e+00, 8.7280e-02, 7.7393e-01,
        1.7957e-01, 9.6582e-01, 0.0000e+00, 1.0000e+00, 4.3579e-01, 4.5929e-02,
        5.9082e-01, 9.8193e-01, 6.1230e-01, 9.4385e-01, 9.9609e-01, 1.0000e+00,
        9.9805e-01, 1.2817e-02, 1.0000e+00, 9.8828e-01, 7.9932e-01, 9.6973e-01,
        6.1377e-01, 9.6338e-01, 1.4770e-04, 8.1738e-01, 8.1738e-01, 1.0000e+00,
        9.4971e-01, 7.8247e-02], device='cuda:0', dtype=torch.float16,
       grad_fn=<SelectBackward>)

Notice the 1 over there and even after clamping I get this

(Pdb) torch.clamp(x,1e-8,1-1e-8)
tensor([9.9707e-01, 9.7266e-01, 9.6777e-01, 1.0000e+00, 8.7280e-02, 7.7393e-01,
        1.7957e-01, 9.6582e-01, 0.0000e+00, 1.0000e+00, 4.3579e-01, 4.5929e-02,
        5.9082e-01, 9.8193e-01, 6.1230e-01, 9.4385e-01, 9.9609e-01, 1.0000e+00,
        9.9805e-01, 1.2817e-02, 1.0000e+00, 9.8828e-01, 7.9932e-01, 9.6973e-01,
        6.1377e-01, 9.6338e-01, 1.4770e-04, 8.1738e-01, 8.1738e-01, 1.0000e+00,
        9.4971e-01, 7.8247e-02], device='cuda:0', dtype=torch.float16,
       grad_fn=<ClampBackward>)

Notice 1 is present over there and that's the reason I was getting nan . I think it's due to the fact that 1e-8 is not correct one for "O1". I tried with various powers and 1e-3 works well.

(Pdb) torch.clamp(x,1e-3,1-1e-3)
tensor([0.9971, 0.9727, 0.9678, 0.9990, 0.0873, 0.7739, 0.1796, 0.9658, 0.0010,
        0.9990, 0.4358, 0.0459, 0.5908, 0.9819, 0.6123, 0.9438, 0.9961, 0.9990,
        0.9980, 0.0128, 0.9990, 0.9883, 0.7993, 0.9697, 0.6138, 0.9634, 0.0010,
        0.8174, 0.8174, 0.9990, 0.9497, 0.0782], device='cuda:0',
       dtype=torch.float16, grad_fn=<ClampBackward>)

Thanks a lot @ptrblck for your prompt and fast reply. I am replacing it with 1e-3

suchithtuple on 13 Jun 2019

👍2

All 8 comments

Hi @suchithtuple,

thanks for reporting this issue!
Could you use the default verbosity option and check, if the loss scaling is going down consistently before the ZeroDivisionError occurs?
Also, could you print out the loss value and check for NaN values?

Are you using any public repo so that we could try to reproduce this issue?

ptrblck on 13 Jun 2019

I am getting this :
error_iop

suchithtuple on 13 Jun 2019

Do you see these reductions until the loss scaling gets nearly zero?
Also, did you check for NaN values in the output/loss?

ptrblck on 13 Jun 2019

lopi
This is the last log after this I get a error
and also to my surprise I am getting all nan in my loss function. I don't know why it is happening , i have to fix it.

suchithtuple on 13 Jun 2019

If you have a script to reproduce this issue or can post a link to a repo (with arguments how to trigger this issue), we could also look into it.

It's hard to tell what's going wrong, but this issue might be triggered, e.g. if your training is blowing up (activations/parameters taking some very high or low values).
Are you using any normalization layers (e.g. BatchNorm) or are you normalizing your input samples?

ptrblck on 13 Jun 2019

So I got to know where I am getting the problem.
In my loss function

class BinaryFocalLoss(nn.Module):
    def __init__(self,gamma=0,eps=1e-8):
        super().__init__()
        self.gamma=gamma
        self.eps=eps
        self.sigmoid = nn.Sigmoid()

    def forward(self,y_true,y_pred,weights=None):

        y_pred = self.sigmoid(y_pred)

        if weights is None:
            weights = torch.ones((y_true.shape[0],)).type('torch.FloatTensor').cuda()

        y_pred = torch.clamp(y_pred,self.eps,1-self.eps)
        m = y_pred.shape[0]

        if len(y_true.shape)==1:
            # changing the shape of y_pred and y_true
            y_true = torch.unsqueeze(y_true,1)
            y_pred = torch.unsqueeze(y_pred,1)

        loss = torch.sum(y_true*torch.pow(1-y_pred,self.gamma)*torch.log(y_pred) + \
               (1-y_true)*torch.pow(y_pred,self.gamma)*torch.log(1-y_pred),dim=1)

        loss = -torch.sum(weights*loss)/m

        return loss

I am using clamp of 1e-8 for predictions. these are my predictions for 1st iteration :

(Pdb) x
tensor([9.9707e-01, 9.7266e-01, 9.6777e-01, 1.0000e+00, 8.7280e-02, 7.7393e-01,
        1.7957e-01, 9.6582e-01, 0.0000e+00, 1.0000e+00, 4.3579e-01, 4.5929e-02,
        5.9082e-01, 9.8193e-01, 6.1230e-01, 9.4385e-01, 9.9609e-01, 1.0000e+00,
        9.9805e-01, 1.2817e-02, 1.0000e+00, 9.8828e-01, 7.9932e-01, 9.6973e-01,
        6.1377e-01, 9.6338e-01, 1.4770e-04, 8.1738e-01, 8.1738e-01, 1.0000e+00,
        9.4971e-01, 7.8247e-02], device='cuda:0', dtype=torch.float16,
       grad_fn=<SelectBackward>)

Notice the 1 over there and even after clamping I get this

(Pdb) torch.clamp(x,1e-8,1-1e-8)
tensor([9.9707e-01, 9.7266e-01, 9.6777e-01, 1.0000e+00, 8.7280e-02, 7.7393e-01,
        1.7957e-01, 9.6582e-01, 0.0000e+00, 1.0000e+00, 4.3579e-01, 4.5929e-02,
        5.9082e-01, 9.8193e-01, 6.1230e-01, 9.4385e-01, 9.9609e-01, 1.0000e+00,
        9.9805e-01, 1.2817e-02, 1.0000e+00, 9.8828e-01, 7.9932e-01, 9.6973e-01,
        6.1377e-01, 9.6338e-01, 1.4770e-04, 8.1738e-01, 8.1738e-01, 1.0000e+00,
        9.4971e-01, 7.8247e-02], device='cuda:0', dtype=torch.float16,
       grad_fn=<ClampBackward>)

Notice 1 is present over there and that's the reason I was getting nan . I think it's due to the fact that 1e-8 is not correct one for "O1". I tried with various powers and 1e-3 works well.

(Pdb) torch.clamp(x,1e-3,1-1e-3)
tensor([0.9971, 0.9727, 0.9678, 0.9990, 0.0873, 0.7739, 0.1796, 0.9658, 0.0010,
        0.9990, 0.4358, 0.0459, 0.5908, 0.9819, 0.6123, 0.9438, 0.9961, 0.9990,
        0.9980, 0.0128, 0.9990, 0.9883, 0.7993, 0.9697, 0.6138, 0.9634, 0.0010,
        0.8174, 0.8174, 0.9990, 0.9497, 0.0782], device='cuda:0',
       dtype=torch.float16, grad_fn=<ClampBackward>)

Thanks a lot @ptrblck for your prompt and fast reply. I am replacing it with 1e-3

suchithtuple on 13 Jun 2019

👍2

Thanks for the debugging!
Note that even in FP32 your loss function might create nan values.
I've created a small dummy example using your input values:

y_pred = torch.tensor([9.9707e-01, 9.7266e-01, 9.6777e-01, 1.0000e+00, 8.7280e-02, 7.7393e-01,
        1.7957e-01, 9.6582e-01, 0.0000e+00, 1.0000e+00, 4.3579e-01, 4.5929e-02,
        5.9082e-01, 9.8193e-01, 6.1230e-01, 9.4385e-01, 9.9609e-01, 1.0000e+00,
        9.9805e-01, 1.2817e-02, 1.0000e+00, 9.8828e-01, 7.9932e-01, 9.6973e-01,
        6.1377e-01, 9.6338e-01, 1.4770e-04, 8.1738e-01, 8.1738e-01, 1.0000e+00,
        9.4971e-01, 7.8247e-02], device=device)

y_true = torch.randint(0, 2, y_pred.size(), device=device).float()
y_true[3] = 1.

eps = 1e-8
y_pred = torch.clamp(y_pred,eps,1-eps)

gamma = 0.
y_true = y_true.unsqueeze(1)
y_pred = y_pred.unsqueeze(1)
loss = torch.sum(y_true*torch.pow(1-y_pred,gamma)*torch.log(y_pred) + \
               (1-y_true)*torch.pow(y_pred,gamma)*torch.log(1-y_pred),dim=1)
loss = -1.0 * loss.sum() / y_pred.size(0)
print(loss)
> tensor(nan, device='cuda:0')

To fix this, you might want to add eps to the argument in torch.log:

loss = torch.sum(y_true*torch.pow(1-y_pred,gamma)*torch.log(y_pred+eps) + \
               (1-y_true)*torch.pow(y_pred,gamma)*torch.log(1-y_pred+eps),dim=1)

Does this make sense or did I misunderstand your criterion?

ptrblck on 14 Jun 2019

❤1

I think doing this will work. I have to try it though but it makes sense. Thank you for the reply !!
I am closing the issue

suchithtuple on 14 Jun 2019

Was this page helpful?

0 / 5 - 0 ratings