Turicreate: Object Detector training loss = NaN

Created on 26 Apr 2018 · 13 Comments · Source: apple/turicreate

Hello, can you please suggest what could cause the training loss to become NaN?

>>> object_detector = tc.object_detector.create(test_data_392, annotations='annotations',feature='image',max_iterations=500,verbose=True)
2018-04-26 00:50:15  Training   1/500  Loss  6.347
2018-04-26 00:50:30  Training   2/500  Loss  6.344
2018-04-26 00:50:44  Training   3/500  Loss  6.265
2018-04-26 00:50:56  Training   4/500  Loss    nan
2018-04-26 00:51:08  Training   5/500  Loss    nan
2018-04-26 00:51:19  Training   6/500  Loss    nan
2018-04-26 00:51:35  Training   7/500  Loss    nan
2018-04-26 00:51:46  Training   8/500  Loss    nan
2018-04-26 00:51:58  Training   9/500  Loss    nan
2018-04-26 00:52:09  Training  10/500  Loss    nan
2018-04-26 00:52:20  Training  11/500  Loss    nan
2018-04-26 00:52:31  Training  12/500  Loss    nan

bug

andremontenegrof

All 13 comments

@gustavla I'm guessing the loss should never be nan. Seems like a bug?

znation on 26 Apr 2018

I capped the same dataset to 43 images, and NaN does not appear:

>>> object_detector = tc.object_detector.create(test_data_43)
Using 'image' as feature column
Using 'annotations' as annotations column
2018-04-26 10:45:50  Training    1/1000  Loss  6.893
2018-04-26 10:46:01  Training    2/1000  Loss  6.828
2018-04-26 10:46:13  Training    3/1000  Loss  6.749
2018-04-26 10:46:24  Training    4/1000  Loss  6.648
2018-04-26 10:46:36  Training    5/1000  Loss  6.589
2018-04-26 10:46:48  Training    6/1000  Loss  6.472
2018-04-26 10:47:00  Training    7/1000  Loss  6.441
2018-04-26 10:47:11  Training    8/1000  Loss  6.374
2018-04-26 10:47:23  Training    9/1000  Loss  6.326
2018-04-26 10:47:34  Training   10/1000  Loss  6.244
2018-04-26 10:47:46  Training   11/1000  Loss  6.221
2018-04-26 10:47:57  Training   12/1000  Loss  6.206
2018-04-26 10:48:09  Training   13/1000  Loss  6.229

I have tried with 206 images and some warnings appeared:

>>> object_detector = tc.object_detector.create(test_data_206)
Using 'image' as feature column
Using 'annotations' as annotations column
2018-04-26 10:52:09  Training    1/3000  Loss  6.182
2018-04-26 10:52:21  Training    2/3000  Loss  6.223
/Library/Python/2.7/site-packages/mxnet/image/detection.py:264: RuntimeWarning: invalid value encountered in divide
  coverage = self._calculate_areas(out[:, 1:]) * w * h / self._calculate_areas(label[:, 1:])
/Library/Python/2.7/site-packages/mxnet/image/detection.py:266: RuntimeWarning: invalid value encountered in greater
  valid = np.logical_and(valid, coverage > self.min_eject_coverage)
2018-04-26 10:52:31  Training    3/3000  Loss  6.162
2018-04-26 10:52:42  Training    4/3000  Loss    nan
2018-04-26 10:52:54  Training    5/3000  Loss    nan
2018-04-26 10:53:05  Training    6/3000  Loss    nan

andremontenegrof on 26 Apr 2018
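
The two RuntimeWarning lines in the log above are the kind NumPy emits when a floating-point operation has an invalid result such as 0/0. A minimal sketch, assuming a degenerate box whose area evaluates to zero (the arrays are made up for illustration and this is not mxnet's actual code), shows how a single such value becomes NaN and then spreads to any aggregate computed from it:

```python
import numpy as np

# Hypothetical areas for three boxes; the second box is degenerate (zero area).
cropped_area = np.array([0.20, 0.0, 0.05])
label_area = np.array([0.40, 0.0, 0.10])

# 0.0 / 0.0 is NaN. NumPy would normally report an "invalid value"
# RuntimeWarning here; errstate silences it for this demonstration.
with np.errstate(invalid="ignore"):
    coverage = cropped_area / label_area
    # Any comparison against NaN is False, so the bad box is silently rejected.
    valid = coverage > 0.3

# A sum (or mean) that includes NaN is itself NaN.
total = coverage.sum()

print(coverage, valid, total)
```

Once a NaN reaches the loss, every later update is computed from it, which would be consistent with the loss staying at nan for all remaining iterations.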

@andremontenegrof Can you share your dataset by any chance? It seems like the divergence is a property of your dataset, and we'd like to be able to track it down.

srikris on 1 May 2018

@srikris can you please download it from: https://www.dropbox.com/sh/v8659d7e2unwa35/AADg_4hr5hMM8fxdvu6MTDoMa?dl=0

andremontenegrof on 2 May 2018

Hey @srikris please let me know if you have trouble downloading the dataset. Thanks!

andremontenegrof on 7 May 2018

@gustavla @srikris Sorry, I just found out that some of my annotations had negative heights and widths.
After fixing that, the gradient stopped exploding.
I had made an assumption about the behaviour of the tool I used for annotation, and this was the result :S
I also have some negative x and y values. I don't know whether those will be a problem for YOLO as well.

andremontenegrof on 16 May 2018
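
A check for the problem described above can be sketched in plain Python. This assumes each box is a dict with 'x', 'y', 'width' and 'height' keys, as in the coordinates of a Turi Create annotation; `box_problems` is a hypothetical helper, not part of the library:

```python
import math

def box_problems(coords):
    """Return a list of human-readable problems found in one bounding box.

    `coords` is assumed to be a dict with 'x', 'y', 'width' and 'height'.
    An empty list means no problem was found.
    """
    problems = []
    for key in ("x", "y", "width", "height"):
        value = coords.get(key)
        if not isinstance(value, (int, float)) or not math.isfinite(value):
            problems.append("%s is missing or not a finite number" % key)
    if problems:
        # The size checks below need all four numeric fields.
        return problems
    if coords["width"] <= 0:
        problems.append("width must be positive, got %r" % coords["width"])
    if coords["height"] <= 0:
        problems.append("height must be positive, got %r" % coords["height"])
    return problems

print(box_problems({"x": 50, "y": 40, "width": -20, "height": 10}))
```

The sketch deliberately does not flag negative x or y values, since the thread leaves open whether those are a problem.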

@gustavla We should consider skipping data with bad bounding boxes.

srikris on 16 May 2018

Yes, we should definitely handle this better (I thought we did, but I don't see it anywhere in the code). Thanks for reporting this @andremontenegrof! I'll work on a fix.

gustavla on 17 May 2018

Hey! I believe raising an exception is better than silently skipping. For example, it is always helpful when the program tells us that the data in row 142 is invalid.

andremontenegrof on 17 May 2018

@andremontenegrof Perhaps an eye-catching warning would be the way to go? This is a recoverable issue, and the user has potentially been training for hundreds of iterations (or even more), so an exception caused by a single bad sample could be really frustrating.

For this to be effective though, we should probably have a mechanism that tracks warnings and then re-reports them, or at least tells the user to scroll up and read them, once training completes. It doesn't matter how eye-catching a warning is if the user leaves the computer running and enough valid output follows the warning that they miss it completely.

gustavla on 17 May 2018
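
The "track warnings and re-report them" idea can be sketched with the standard `warnings` module. `WarningLog` and the message text are hypothetical and only illustrate the mechanism; this is not how Turi Create handles it:

```python
import warnings

class WarningLog:
    """Collects warnings during a long run so they can be shown again at the end."""

    def __init__(self):
        self.messages = []

    def warn(self, message):
        # Remember the message, and also surface it immediately.
        self.messages.append(message)
        warnings.warn(message, RuntimeWarning, stacklevel=2)

    def summary(self):
        if not self.messages:
            return "Training finished with no warnings."
        lines = ["Training finished with %d warning(s):" % len(self.messages)]
        lines += ["  - " + message for message in self.messages]
        return "\n".join(lines)

log = WarningLog()
for iteration in range(1, 4):
    if iteration == 2:
        log.warn("row 142: skipped a box with negative width")

# Printed last, so it is still visible after hundreds of progress lines.
print(log.summary())
```

Because the summary is printed after the final iteration, the user does not need to scroll back through the training log to find out that something was wrong.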

@gustavla maybe we could do a quick single-pass over the images to check correctness of bounding boxes before training starts (and error out if the data is bad)?

znation on 17 May 2018

👍2
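
A single validation pass before training might look like the sketch below. It assumes one list of annotation dicts per image, each shaped like {'label': ..., 'coordinates': {'x', 'y', 'width', 'height'}}; `validate_annotations` is a hypothetical name, not an existing Turi Create function:

```python
def validate_annotations(rows):
    """Check every box once and raise before any training time is spent.

    `rows` is assumed to hold one list of annotation dicts per image.
    """
    bad = []
    for row_index, annotations in enumerate(rows):
        for box_index, annotation in enumerate(annotations):
            coords = annotation["coordinates"]
            if coords["width"] <= 0 or coords["height"] <= 0:
                bad.append((row_index, box_index, coords["width"], coords["height"]))
    if bad:
        # Report every offending row at once rather than only the first.
        details = "; ".join(
            "row %d, box %d (width=%s, height=%s)" % item for item in bad
        )
        raise ValueError("Invalid bounding boxes: " + details)

rows = [
    [{"label": "cat", "coordinates": {"x": 50, "y": 40, "width": 20, "height": 10}}],
    [{"label": "dog", "coordinates": {"x": 30, "y": 30, "width": -5, "height": 12}}],
]

try:
    validate_annotations(rows)
except ValueError as err:
    message = str(err)
    print(message)
```

The pass only reads the annotations, never the image pixels, so it should be cheap compared with a single training iteration. With an SFrame, passing something like `list(data['annotations'])` before calling `tc.object_detector.create` should work, though that is untested here.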

@znation I like that idea!

gustavla on 17 May 2018

Indeed, an exception would only make sense at the beginning. Presenting a set of warnings at the end is also an elegant solution. Thank you!

andremontenegrof on 19 May 2018
