Vision: getting NaN loss

Created on 19 Mar 2020 · 8 Comments · Source: pytorch/vision

If I run with the PennFudanDataset dataset, training works, but if I change the dataset I get this error:

Epoch: [0] [ 0/155] eta: 0:01:05 lr: 0.000037 loss: 3.9229 (3.9229) loss_classifier: 0.9148 (0.9148) loss_box_reg: 0.1397 (0.1397) loss_mask: 2.8494 (2.8494) loss_objectness: 0.0084 (0.0084) loss_rpn_box_reg: 0.0107 (0.0107) time: 0.4194 data: 0.0865 max mem: 2032
Loss is nan, stopping training
{'loss_classifier': tensor(0.9080, device='cuda:1', grad_fn=), 'loss_box_reg': tensor(nan, device='cuda:1', grad_fn=), 'loss_mask': tensor(3.0519, device='cuda:1', grad_fn=), 'loss_objectness': tensor(5.4485, device='cuda:1', grad_fn=), 'loss_rpn_box_reg': tensor(5.6734, device='cuda:1', grad_fn=)}

An exception has occurred, use %tb to see the full traceback.

SystemExit: 1
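
The loss dictionary printed above already hints at where the problem starts: only one component is NaN. A minimal sketch for picking out the non-finite components, assuming the values can be converted with `float()` (which works for Python floats and for zero-dimensional tensors); the helper name `non_finite_losses` is hypothetical and not part of the reference scripts:

```python
import math


def non_finite_losses(loss_dict):
    """Return the names of loss components whose value is NaN or infinite."""
    return [
        name
        for name, value in loss_dict.items()
        if not math.isfinite(float(value))
    ]


# Values copied from the log above, written as plain floats for illustration.
losses = {
    "loss_classifier": 0.9080,
    "loss_box_reg": float("nan"),
    "loss_mask": 3.0519,
    "loss_objectness": 5.4485,
    "loss_rpn_box_reg": 5.6734,
}

print(non_finite_losses(losses))
```

Here the box regression loss is the only NaN entry, which points toward the bounding boxes rather than the masks or class labels.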

Can anyone help?

models reference scripts object detection

Most helpful comment

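I've never worked with detection datasets, but from what I understand, the following snippet should run without an AssertionError.

```python
# PennFudanDataset is assumed to be the class defined in the torchvision
# detection tutorial; define or import it before running this check.
dataset = PennFudanDataset("/path/to/PennFudan/root", transforms=None)

# img is assumed to be a PIL image here, so img.size is (width, height).
for img, target in dataset:
    width, height = img.size

    for box in target["boxes"]:
        xmin, ymin, xmax, ymax = box.tolist()

        # Horizontal extent must lie inside the image and not be inverted.
        assert xmin >= 0
        assert xmax <= width
        assert xmin <= xmax

        # Vertical extent must lie inside the image and not be inverted.
        assert ymin >= 0
        assert ymax <= height
        assert ymin <= ymax
```

Try this with your custom dataset and report back if it runs through and you still get NaN loss.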

All 8 comments

Hi @vivekdeepquanty

> Can anyone help?

From the console output alone, it is impossible to say what is going wrong.

> If I run with the PennFudanDataset dataset, training works, but if I change the dataset I get this error:

That suggests that there is something wrong with the dataset you switched to. Have you verified that the dataset is behaving like you want it to?

Next steps:

1. Strip down every complexity in your code that is not needed to reproduce the error.

2. If you think that the behavior is a bug in `torch` or `torchvision`, open an issue using the bug report template and follow the steps listed there.

3. If you don't think this is a bug, please post your minimal example with an accompanying question in our [discussion forum](https://discuss.pytorch.org/), which is our primary means of support.

Thanks @pmeier

I have not modified anything in the code; even the number of classes is still 2. The only change is that I am using my custom dataset.
When I use PennFudanDataset, the code works fine.

> I have not modified anything in the code; even the number of classes is still 2.

What code are you using?

> The only change is that I am using my custom dataset.
> When I use PennFudanDataset, the code works fine.

This implies that your custom dataset is not working as intended. Please check this first.

@vivekdeepquanty I think you might have invalid boxes in your dataset (for example, boxes with negative size).

See my comment in https://github.com/pytorch/vision/issues/997#issuecomment-499429297

As I believe this is the same issue, I'm closing this one, but let us know if this isn't the case.
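
To illustrate why a box with negative size can end in NaN: Faster R-CNN-style box regression uses log-ratio targets for width and height, log(gt_size / anchor_size). The sketch below is plain Python and not torchvision's own code; `log_size_target` and `box_size` are hypothetical helpers, and the anchor size of 32 is an arbitrary assumption.

```python
import math


def log_size_target(gt_size, anchor_size):
    """Log-ratio regression target for one box dimension: log(gt / anchor)."""
    ratio = gt_size / anchor_size
    if ratio > 0:
        return math.log(ratio)
    # math.log would raise ValueError here. Floating-point tensor math instead
    # yields -inf for a zero ratio and NaN for a negative one, so mimic that.
    return float("-inf") if ratio == 0 else float("nan")


def box_size(box):
    """Width and height of a box given as [xmin, ymin, xmax, ymax]."""
    xmin, ymin, xmax, ymax = box
    return xmax - xmin, ymax - ymin


good = [10.0, 20.0, 50.0, 80.0]
flipped = [50.0, 20.0, 10.0, 80.0]  # xmin > xmax, so the width is -40

for name, box in [("good", good), ("flipped", flipped)]:
    width, height = box_size(box)
    print(name, width, log_size_target(width, anchor_size=32.0))
```

A single NaN target is enough to turn the averaged box regression loss into NaN, which would match the `loss_box_reg` entry in the log.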

Is a txt file required for training?
I am using only the mask and img folders.
Problem: loss is NaN.

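I've never worked with detection datasets, but from what I understand, the following snippet should run without an AssertionError.

```python
# PennFudanDataset is assumed to be the class defined in the torchvision
# detection tutorial; define or import it before running this check.
dataset = PennFudanDataset("/path/to/PennFudan/root", transforms=None)

# img is assumed to be a PIL image here, so img.size is (width, height).
for img, target in dataset:
    width, height = img.size

    for box in target["boxes"]:
        xmin, ymin, xmax, ymax = box.tolist()

        # Horizontal extent must lie inside the image and not be inverted.
        assert xmin >= 0
        assert xmax <= width
        assert xmin <= xmax

        # Vertical extent must lie inside the image and not be inverted.
        assert ymin >= 0
        assert ymax <= height
        assert ymin <= ymax
```

Try this with your custom dataset and report back if it runs through and you still get NaN loss.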
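
Since a bare `assert` stops at the first bad box, a variant that collects every suspicious box can make a custom dataset easier to debug. This is only a sketch: `find_invalid_boxes` is a hypothetical helper, it works on plain lists so that it runs without the dataset, and flagging zero-size boxes by default is an assumption that goes slightly beyond the `<=` checks above.

```python
import math


def find_invalid_boxes(samples, allow_zero_size=False):
    """Collect (sample_index, box_index, reason) for every suspicious box.

    `samples` yields ((width, height), boxes), where each box is
    [xmin, ymin, xmax, ymax] in pixel coordinates.
    """
    problems = []
    for i, ((width, height), boxes) in enumerate(samples):
        for j, (xmin, ymin, xmax, ymax) in enumerate(boxes):
            # NaN compares False against everything, so test it explicitly.
            if not all(math.isfinite(v) for v in (xmin, ymin, xmax, ymax)):
                problems.append((i, j, "non-finite coordinate"))
                continue
            if xmin < 0 or ymin < 0:
                problems.append((i, j, "negative coordinate"))
            if xmax > width or ymax > height:
                problems.append((i, j, "outside image"))
            box_w, box_h = xmax - xmin, ymax - ymin
            if box_w < 0 or box_h < 0:
                problems.append((i, j, "negative size"))
            elif not allow_zero_size and (box_w == 0 or box_h == 0):
                problems.append((i, j, "zero size"))
    return problems


# Hand-written sample data; with a real dataset one could instead pass
# ((img.size, target["boxes"].tolist()) for img, target in dataset).
samples = [
    ((100, 80), [[10, 10, 50, 40]]),
    ((100, 80), [[60, 20, 30, 40], [5, 5, 5, 30]]),
]

print(find_invalid_boxes(samples))
```

The sample index in each reported tuple shows which image's annotation to inspect.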

I am still getting the same error.

At this point there is nothing we can do without seeing your code. Please strip down every complexity in your code that is not needed to reproduce the error and post your code here afterwards.
