Turicreate: Object Detector training loss = NaN

Created on 26 Apr 2018 · 13 Comments · Source: apple/turicreate

Hello, can you please suggest what could cause the loss to become NaN?

>>> object_detector = tc.object_detector.create(test_data_392, annotations='annotations',feature='image',max_iterations=500,verbose=True)
2018-04-26 00:50:15  Training   1/500  Loss  6.347
2018-04-26 00:50:30  Training   2/500  Loss  6.344
2018-04-26 00:50:44  Training   3/500  Loss  6.265
2018-04-26 00:50:56  Training   4/500  Loss    nan
2018-04-26 00:51:08  Training   5/500  Loss    nan
2018-04-26 00:51:19  Training   6/500  Loss    nan
2018-04-26 00:51:35  Training   7/500  Loss    nan
2018-04-26 00:51:46  Training   8/500  Loss    nan
2018-04-26 00:51:58  Training   9/500  Loss    nan
2018-04-26 00:52:09  Training  10/500  Loss    nan
2018-04-26 00:52:20  Training  11/500  Loss    nan
2018-04-26 00:52:31  Training  12/500  Loss    nan

All 13 comments

@gustavla I'm guessing the loss should never be nan. Seems like a bug?

I capped the same annotations at 43, and NaN no longer appears:

>>> object_detector = tc.object_detector.create(test_data_43)
Using 'image' as feature column
Using 'annotations' as annotations column
2018-04-26 10:45:50  Training    1/1000  Loss  6.893
2018-04-26 10:46:01  Training    2/1000  Loss  6.828
2018-04-26 10:46:13  Training    3/1000  Loss  6.749
2018-04-26 10:46:24  Training    4/1000  Loss  6.648
2018-04-26 10:46:36  Training    5/1000  Loss  6.589
2018-04-26 10:46:48  Training    6/1000  Loss  6.472
2018-04-26 10:47:00  Training    7/1000  Loss  6.441
2018-04-26 10:47:11  Training    8/1000  Loss  6.374
2018-04-26 10:47:23  Training    9/1000  Loss  6.326
2018-04-26 10:47:34  Training   10/1000  Loss  6.244
2018-04-26 10:47:46  Training   11/1000  Loss  6.221
2018-04-26 10:47:57  Training   12/1000  Loss  6.206
2018-04-26 10:48:09  Training   13/1000  Loss  6.229

I then tried with 206 images, and some warnings appeared before the loss became NaN:

>>> object_detector = tc.object_detector.create(test_data_206)
Using 'image' as feature column
Using 'annotations' as annotations column
2018-04-26 10:52:09  Training    1/3000  Loss  6.182
2018-04-26 10:52:21  Training    2/3000  Loss  6.223
/Library/Python/2.7/site-packages/mxnet/image/detection.py:264: RuntimeWarning: invalid value encountered in divide
  coverage = self._calculate_areas(out[:, 1:]) * w * h / self._calculate_areas(label[:, 1:])
/Library/Python/2.7/site-packages/mxnet/image/detection.py:266: RuntimeWarning: invalid value encountered in greater
  valid = np.logical_and(valid, coverage > self.min_eject_coverage)
2018-04-26 10:52:31  Training    3/3000  Loss  6.162
2018-04-26 10:52:42  Training    4/3000  Loss    nan
2018-04-26 10:52:54  Training    5/3000  Loss    nan
2018-04-26 10:53:05  Training    6/3000  Loss    nan

@andremontenegrof Can you share your dataset by any chance? It seems like the divergence is a property of your dataset, and we'd like to be able to track it down.

Hey @srikris please let me know if you have trouble downloading the dataset. Thanks!

@gustavla @srikris Sorry, I just found out I had some negative heights and widths.
After fixing that, the gradient stopped exploding.
I had made an assumption about how my annotation tool behaves, and this was the result :S
I also have some negative x and y values. I don't know if that will be a problem for YOLO as well.
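
For anyone hitting the same thing, here is a minimal sketch of how the offending boxes could be located. It assumes the usual Turi Create annotation layout (each row is a list of dicts with a `label` and a `coordinates` dict holding `x`, `y`, `width`, `height`). The helper name `find_bad_boxes` is hypothetical and is not part of Turi Create. It only checks sizes; negative `x` and `y` values are left alone, since it is not clear from this thread whether they cause trouble.

```python
import math


def find_bad_boxes(annotations_column):
    """Return (row_index, box_index, reason) for every box with an unusable size.

    Hypothetical helper, not a Turi Create API. `annotations_column` is any
    iterable whose items are lists of annotation dicts.
    """
    problems = []
    for row_index, boxes in enumerate(annotations_column):
        for box_index, box in enumerate(boxes):
            coords = box["coordinates"]
            for key in ("width", "height"):
                value = coords[key]
                if not math.isfinite(value):
                    # Catches NaN and +/- infinity.
                    problems.append((row_index, box_index, key + " is not finite"))
                elif value <= 0:
                    # Zero-area and negative-size boxes both end up here.
                    problems.append((row_index, box_index, key + " is not positive"))
    return problems


rows = [
    [{"label": "cat", "coordinates": {"x": 50, "y": 40, "width": 20, "height": 10}}],
    [{"label": "dog", "coordinates": {"x": 30, "y": 30, "width": -15, "height": 12}},
     {"label": "dog", "coordinates": {"x": 10, "y": 10, "width": 5, "height": 0}}],
]
for row_index, box_index, reason in find_bad_boxes(rows):
    print("row %d, box %d: %s" % (row_index, box_index, reason))
```

The function only iterates and indexes, so passing the annotations column of an SFrame instead of a plain list should work as long as each row comes back as a list of dicts.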

@gustavla We should consider skipping data with bad bounding boxes.

Yes, we should definitely handle this better (I thought we did, but I don't see it anywhere in the code). Thanks for reporting this @andremontenegrof! I'll work on a fix.

Hey! I believe raising an exception is better than silently skipping. For example, it is always helpful when the program tells us that the data in row 142 is invalid.

@andremontenegrof Perhaps an eye-catching warning would be the way to go? This is a recoverable issue, and the user may already have been training for hundreds of iterations (or even more), so an exception caused by a single bad sample could be really frustrating.

For this to be effective though, we should probably have a mechanism that tracks warnings and then re-reports them, or at least tells the user to scroll up and read them, once training completes. It doesn't matter how eye-catching a warning is if the user leaves the computer running and enough valid output follows the warning that they miss it completely.
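
To make the idea concrete, here is a rough sketch of such a mechanism using Python's standard `warnings` module. It is only an illustration of the re-reporting pattern, not how Turi Create handles warnings. The function `run_with_warning_summary` and the `step` callback are hypothetical names standing in for the training loop.

```python
import warnings


def run_with_warning_summary(step, num_iterations):
    """Call step(i) for each iteration, then re-report all warnings at the end.

    Hypothetical helper: `step` stands in for one training iteration.
    Returns a list of (iteration, category_name, message) tuples.
    """
    collected = []
    for i in range(num_iterations):
        # record=True captures warnings in a list instead of printing them.
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            step(i)
        for w in caught:
            collected.append((i, w.category.__name__, str(w.message)))

    if collected:
        print("Finished with %d warning(s):" % len(collected))
        for i, category, message in collected:
            print("  iteration %d: %s: %s" % (i, category, message))
    return collected


def fake_step(i):
    # Stand-in for a training step that hits one bad sample.
    if i == 2:
        warnings.warn("skipped a bounding box with non-positive area", RuntimeWarning)


summary = run_with_warning_summary(fake_step, 5)
```

Note that `record=True` suppresses the warning at the moment it occurs, so a real implementation would probably also want to print it immediately and keep the summary as a second reminder.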

@gustavla maybe we could do a quick single-pass over the images to check correctness of bounding boxes before training starts (and error out if the data is bad)?
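
A pre-training pass along those lines could look roughly like the sketch below. It assumes annotations are lists of dicts with a `coordinates` dict containing `width` and `height`. `check_boxes_before_training` is a hypothetical name, not an existing Turi Create function, and the eventual fix may well look different.

```python
def check_boxes_before_training(annotations_column):
    """Raise ValueError naming the first row that has a box with a bad size.

    Hypothetical helper, not a Turi Create API.
    """
    for row_index, boxes in enumerate(annotations_column):
        for box in boxes:
            coords = box["coordinates"]
            width, height = coords["width"], coords["height"]
            # Written as "not (a > 0 and b > 0)" so that NaN sizes fail too.
            if not (width > 0 and height > 0):
                raise ValueError(
                    "Row %d: bounding box for label %r has width=%r, height=%r"
                    % (row_index, box.get("label"), width, height)
                )


good = [[{"label": "cat", "coordinates": {"x": 50, "y": 40, "width": 20, "height": 10}}]]
bad = good + [[{"label": "dog", "coordinates": {"x": 30, "y": 30, "width": -15, "height": 12}}]]

check_boxes_before_training(good)  # passes silently
try:
    check_boxes_before_training(bad)
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

Since this runs once over the annotations only and never decodes an image, it should be cheap compared to a single training iteration, and failing here costs the user no training time.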

@znation I like that idea!

Indeed, an exception would only make sense at the beginning. Presenting a set of warnings at the end is also an elegant solution. Thank you!
