Hi I am having problem when I tried implementing GPT2. After some iterations I am getting an float division by zero error. I don't know why it is so .
optimizer = OpenAIAdam(optimizer_grouped_parameters,
lr=lr,
warmup=0.05,
t_total=num_train_optimization_steps)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1",verbosity=0)
for epoch in range(EPOCHS):
optimizer.zero_grad()
for batch,(X_train,y_train,weights) in tqdm(enumerate(train_loader),total=len(train_loader),leave=False):
X_train = X_train.cuda()
y_train = y_train.cuda()
weights = weights.cuda()
y_pred = model.forward(X_train)
loss = loss_fn(y_train,y_pred,weights)
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
if (batch+1) % accumulation_steps == 0: # Wait for several backward steps
optimizer.step() # Now we can do an optimizer step
optimizer.zero_grad()
The error screen shot I am attaching :


As you can see after 1091 iterations I am getting this error.
Hi @suchithtuple,
thanks for reporting this issue!
Could you use the default verbosity option and check, if the loss scaling is going down consistently before the ZeroDivisionError occurs?
Also, could you print out the loss value and check for NaN values?
Are you using any public repo so that we could try to reproduce this issue?
I am getting this :

Do you see these reductions until the loss scaling gets nearly zero?
Also, did you check for NaN values in the output/loss?

This is the last log after this I get a error
and also to my surprise I am getting all nan in my loss function. I don't know why it is happening , i have to fix it.
If you have a script to reproduce this issue or can post a link to a repo (with arguments how to trigger this issue), we could also look into it.
It's hard to tell what's going wrong, but this issue might be triggered, e.g. if your training is blowing up (activations/parameters taking some very high or low values).
Are you using any normalization layers (e.g. BatchNorm) or are you normalizing your input samples?
So I got to know where I am getting the problem.
In my loss function
class BinaryFocalLoss(nn.Module):
def __init__(self,gamma=0,eps=1e-8):
super().__init__()
self.gamma=gamma
self.eps=eps
self.sigmoid = nn.Sigmoid()
def forward(self,y_true,y_pred,weights=None):
y_pred = self.sigmoid(y_pred)
if weights is None:
weights = torch.ones((y_true.shape[0],)).type('torch.FloatTensor').cuda()
y_pred = torch.clamp(y_pred,self.eps,1-self.eps)
m = y_pred.shape[0]
if len(y_true.shape)==1:
# changing the shape of y_pred and y_true
y_true = torch.unsqueeze(y_true,1)
y_pred = torch.unsqueeze(y_pred,1)
loss = torch.sum(y_true*torch.pow(1-y_pred,self.gamma)*torch.log(y_pred) + \
(1-y_true)*torch.pow(y_pred,self.gamma)*torch.log(1-y_pred),dim=1)
loss = -torch.sum(weights*loss)/m
return loss
I am using clamp of 1e-8 for predictions. these are my predictions for 1st iteration :
(Pdb) x
tensor([9.9707e-01, 9.7266e-01, 9.6777e-01, 1.0000e+00, 8.7280e-02, 7.7393e-01,
1.7957e-01, 9.6582e-01, 0.0000e+00, 1.0000e+00, 4.3579e-01, 4.5929e-02,
5.9082e-01, 9.8193e-01, 6.1230e-01, 9.4385e-01, 9.9609e-01, 1.0000e+00,
9.9805e-01, 1.2817e-02, 1.0000e+00, 9.8828e-01, 7.9932e-01, 9.6973e-01,
6.1377e-01, 9.6338e-01, 1.4770e-04, 8.1738e-01, 8.1738e-01, 1.0000e+00,
9.4971e-01, 7.8247e-02], device='cuda:0', dtype=torch.float16,
grad_fn=<SelectBackward>)
Notice the 1 over there and even after clamping I get this
(Pdb) torch.clamp(x,1e-8,1-1e-8)
tensor([9.9707e-01, 9.7266e-01, 9.6777e-01, 1.0000e+00, 8.7280e-02, 7.7393e-01,
1.7957e-01, 9.6582e-01, 0.0000e+00, 1.0000e+00, 4.3579e-01, 4.5929e-02,
5.9082e-01, 9.8193e-01, 6.1230e-01, 9.4385e-01, 9.9609e-01, 1.0000e+00,
9.9805e-01, 1.2817e-02, 1.0000e+00, 9.8828e-01, 7.9932e-01, 9.6973e-01,
6.1377e-01, 9.6338e-01, 1.4770e-04, 8.1738e-01, 8.1738e-01, 1.0000e+00,
9.4971e-01, 7.8247e-02], device='cuda:0', dtype=torch.float16,
grad_fn=<ClampBackward>)
Notice 1 is present over there and that's the reason I was getting nan . I think it's due to the fact that 1e-8 is not correct one for "O1". I tried with various powers and 1e-3 works well.
(Pdb) torch.clamp(x,1e-3,1-1e-3)
tensor([0.9971, 0.9727, 0.9678, 0.9990, 0.0873, 0.7739, 0.1796, 0.9658, 0.0010,
0.9990, 0.4358, 0.0459, 0.5908, 0.9819, 0.6123, 0.9438, 0.9961, 0.9990,
0.9980, 0.0128, 0.9990, 0.9883, 0.7993, 0.9697, 0.6138, 0.9634, 0.0010,
0.8174, 0.8174, 0.9990, 0.9497, 0.0782], device='cuda:0',
dtype=torch.float16, grad_fn=<ClampBackward>)
Thanks a lot @ptrblck for your prompt and fast reply. I am replacing it with 1e-3
Thanks for the debugging!
Note that even in FP32 your loss function might create nan values.
I've created a small dummy example using your input values:
y_pred = torch.tensor([9.9707e-01, 9.7266e-01, 9.6777e-01, 1.0000e+00, 8.7280e-02, 7.7393e-01,
1.7957e-01, 9.6582e-01, 0.0000e+00, 1.0000e+00, 4.3579e-01, 4.5929e-02,
5.9082e-01, 9.8193e-01, 6.1230e-01, 9.4385e-01, 9.9609e-01, 1.0000e+00,
9.9805e-01, 1.2817e-02, 1.0000e+00, 9.8828e-01, 7.9932e-01, 9.6973e-01,
6.1377e-01, 9.6338e-01, 1.4770e-04, 8.1738e-01, 8.1738e-01, 1.0000e+00,
9.4971e-01, 7.8247e-02], device=device)
y_true = torch.randint(0, 2, y_pred.size(), device=device).float()
y_true[3] = 1.
eps = 1e-8
y_pred = torch.clamp(y_pred,eps,1-eps)
gamma = 0.
y_true = y_true.unsqueeze(1)
y_pred = y_pred.unsqueeze(1)
loss = torch.sum(y_true*torch.pow(1-y_pred,gamma)*torch.log(y_pred) + \
(1-y_true)*torch.pow(y_pred,gamma)*torch.log(1-y_pred),dim=1)
loss = -1.0 * loss.sum() / y_pred.size(0)
print(loss)
> tensor(nan, device='cuda:0')
To fix this, you might want to add eps to the argument in torch.log:
loss = torch.sum(y_true*torch.pow(1-y_pred,gamma)*torch.log(y_pred+eps) + \
(1-y_true)*torch.pow(y_pred,gamma)*torch.log(1-y_pred+eps),dim=1)
Does this make sense or did I misunderstand your criterion?
I think doing this will work. I have to try it though but it makes sense. Thank you for the reply !!
I am closing the issue
Most helpful comment
So I got to know where I am getting the problem.
In my loss function
I am using clamp of 1e-8 for predictions. these are my predictions for 1st iteration :
Notice the 1 over there and even after clamping I get this
Notice 1 is present over there and that's the reason I was getting
nan. I think it's due to the fact that 1e-8 is not correct one for "O1". I tried with various powers and 1e-3 works well.Thanks a lot @ptrblck for your prompt and fast reply. I am replacing it with 1e-3