Hello, can you please suggest what could make the loss value start becoming NaN?
>>> object_detector = tc.object_detector.create(test_data_392, annotations='annotations', feature='image', max_iterations=500, verbose=True)
2018-04-26 00:50:15 Training 1/500 Loss 6.347
2018-04-26 00:50:30 Training 2/500 Loss 6.344
2018-04-26 00:50:44 Training 3/500 Loss 6.265
2018-04-26 00:50:56 Training 4/500 Loss nan
2018-04-26 00:51:08 Training 5/500 Loss nan
2018-04-26 00:51:19 Training 6/500 Loss nan
2018-04-26 00:51:35 Training 7/500 Loss nan
2018-04-26 00:51:46 Training 8/500 Loss nan
2018-04-26 00:51:58 Training 9/500 Loss nan
2018-04-26 00:52:09 Training 10/500 Loss nan
2018-04-26 00:52:20 Training 11/500 Loss nan
2018-04-26 00:52:31 Training 12/500 Loss nan
@gustavla I'm guessing the loss should never be NaN. Is this a bug?
When I cap the same annotated data at 43 images, the NaN does not appear:
>>> object_detector = tc.object_detector.create(test_data_43)
Using 'image' as feature column
Using 'annotations' as annotations column
2018-04-26 10:45:50 Training 1/1000 Loss 6.893
2018-04-26 10:46:01 Training 2/1000 Loss 6.828
2018-04-26 10:46:13 Training 3/1000 Loss 6.749
2018-04-26 10:46:24 Training 4/1000 Loss 6.648
2018-04-26 10:46:36 Training 5/1000 Loss 6.589
2018-04-26 10:46:48 Training 6/1000 Loss 6.472
2018-04-26 10:47:00 Training 7/1000 Loss 6.441
2018-04-26 10:47:11 Training 8/1000 Loss 6.374
2018-04-26 10:47:23 Training 9/1000 Loss 6.326
2018-04-26 10:47:34 Training 10/1000 Loss 6.244
2018-04-26 10:47:46 Training 11/1000 Loss 6.221
2018-04-26 10:47:57 Training 12/1000 Loss 6.206
2018-04-26 10:48:09 Training 13/1000 Loss 6.229
I also tried 206 images, and some warnings appeared before the loss turned NaN:
>>> object_detector = tc.object_detector.create(test_data_206)
Using 'image' as feature column
Using 'annotations' as annotations column
2018-04-26 10:52:09 Training 1/3000 Loss 6.182
2018-04-26 10:52:21 Training 2/3000 Loss 6.223
/Library/Python/2.7/site-packages/mxnet/image/detection.py:264: RuntimeWarning: invalid value encountered in divide
coverage = self._calculate_areas(out[:, 1:]) * w * h / self._calculate_areas(label[:, 1:])
/Library/Python/2.7/site-packages/mxnet/image/detection.py:266: RuntimeWarning: invalid value encountered in greater
valid = np.logical_and(valid, coverage > self.min_eject_coverage)
2018-04-26 10:52:31 Training 3/3000 Loss 6.162
2018-04-26 10:52:42 Training 4/3000 Loss nan
2018-04-26 10:52:54 Training 5/3000 Loss nan
2018-04-26 10:53:05 Training 6/3000 Loss nan
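The warnings above come from a coverage ratio (cropped box area divided by original box area) in MXNet's detection augmenter. The sketch below is an illustration, not MXNet's code: it assumes box areas are computed with widths and heights clipped at zero, and it drops the class-id column and the `w * h` scale factor from the real formula. Under that assumption a box with a negative width has zero area, so the ratio becomes 0/0, which NumPy evaluates to NaN with an "invalid value encountered in divide" warning.

```python
import numpy as np

def box_areas(boxes):
    # boxes: (N, 4) array of [xmin, ymin, xmax, ymax].
    # ASSUMPTION: negative extents are clipped to zero before multiplying.
    widths = np.maximum(0.0, boxes[:, 2] - boxes[:, 0])
    heights = np.maximum(0.0, boxes[:, 3] - boxes[:, 1])
    return widths * heights

# One well-formed box and one with xmax < xmin (a "negative width").
label = np.array([[0.1, 0.1, 0.5, 0.5],
                  [0.6, 0.2, 0.4, 0.7]])
out = label.copy()  # pretend the random crop left both boxes unchanged

# The degenerate box gives 0.0 / 0.0, i.e. NaN (NumPy emits a RuntimeWarning).
coverage = box_areas(out) / box_areas(label)
print(coverage)

# NaN never compares greater than a threshold, so that box is silently rejected.
print(coverage > 0.3)
```

Once a NaN like this leaks into a quantity the loss depends on, every later update is contaminated, which would match the loss staying at `nan` for all remaining iterations.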
@andremontenegrof Could you share your dataset, by any chance? It seems like the divergence is a property of your dataset, and we'd like to be able to trace it down.
@srikris you can download it from: https://www.dropbox.com/sh/v8659d7e2unwa35/AADg_4hr5hMM8fxdvu6MTDoMa?dl=0
Hey @srikris, please let me know if you have any trouble downloading the dataset. Thanks!
@gustavla @srikris Sorry, I just found out I had some negative heights and widths.
After fixing them, the gradient stopped exploding.
I had assumed a behaviour of the tool I used for annotation that turned out to be wrong, and this was the result :S
I also have some negative x and y values. I don't know whether those will be a problem for YOLO as well.
@gustavla We should consider skipping data with bad bounding boxes.
Yes, we should definitely handle this better (I thought we did, but I don't see it anywhere in the code). Thanks for reporting this @andremontenegrof! I'll work on a fix.
Hey! I believe raising an exception is better than silently skipping. For example, it is always helpful when the program tells us that the data in row 142 is invalid.
@andremontenegrof Perhaps an eye-catching warning would be the way to go? This is a recoverable issue, and the user may already have been training for hundreds of iterations (or even more), so an exception caused by a single bad sample could be really frustrating.
For this to be effective, though, we should probably have a mechanism that tracks warnings and re-reports them, or at least tells the user to scroll up and read them, once training completes. No matter how eye-catching a warning is, a user who leaves the computer to let training run can easily miss it if enough valid output follows it.
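A minimal sketch of the "warn now, re-report at the end" idea, written against plain Python rather than Turi Create internals. The function name and the `is_valid` and `train_step` callables are hypothetical placeholders; a real trainer would loop over many iterations, while this shows a single pass over the samples.

```python
import sys

def train_skipping_bad_samples(samples, is_valid, train_step, log=sys.stderr):
    """Skip invalid samples with a warning, then repeat all warnings at the end.

    `is_valid` and `train_step` are hypothetical callables supplied by the caller.
    """
    skipped = []
    for index, sample in enumerate(samples):
        if not is_valid(sample):
            message = 'WARNING: skipping invalid sample in row %d' % index
            print(message, file=log)  # report immediately...
            skipped.append(message)
            continue
        train_step(sample)
    if skipped:  # ...and again once training is done, where it cannot be missed
        print('Training finished; %d sample(s) were skipped:' % len(skipped), file=log)
        for message in skipped:
            print('  ' + message, file=log)
    return skipped
```

For example, `train_skipping_bad_samples([1, -2, 3], lambda s: s > 0, trained.append)` trains on `1` and `3` and prints the row-1 warning twice: once when it happens and once in the final summary.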
@gustavla maybe we could do a quick single-pass over the images to check correctness of bounding boxes before training starts (and error out if the data is bad)?
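A rough sketch of such a pre-training pass, assuming each row holds a list of annotation dicts shaped like `{'label': ..., 'coordinates': {'x': ..., 'y': ..., 'width': ...,'height': ...}}`. The helper names `box_problems` and `validate_annotations` are hypothetical. The sketch only rejects non-finite coordinates and non-positive widths and heights; whether negative `x`/`y` should also be rejected is left open, as in the thread.

```python
import math

def box_problems(annotation):
    """Return a list of problems in one annotation dict (empty if it looks valid)."""
    coords = annotation.get('coordinates')
    if not isinstance(coords, dict):
        return ['missing coordinates']
    problems = []
    for key in ('x', 'y', 'width', 'height'):
        value = coords.get(key)
        if not isinstance(value, (int, float)) or not math.isfinite(value):
            problems.append('%s is not a finite number' % key)
    for key in ('width', 'height'):
        value = coords.get(key)
        if isinstance(value, (int, float)) and math.isfinite(value) and value <= 0:
            problems.append('%s must be positive, got %r' % (key, value))
    return problems

def validate_annotations(rows):
    """Check every box once before training; raise listing every bad row."""
    errors = []
    for row_index, annotations in enumerate(rows):
        for box_index, annotation in enumerate(annotations):
            for problem in box_problems(annotation):
                errors.append('row %d, box %d: %s' % (row_index, box_index, problem))
    if errors:
        raise ValueError('Invalid bounding boxes found:\n' + '\n'.join(errors))
```

Collecting every error before raising means a single run reports all bad rows (such as "row 142") instead of stopping at the first one. With an SFrame, `rows` would presumably be the annotations column, though that is untested here.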
@znation I like that idea!
Indeed, an exception would only make sense at the beginning. Presenting a set of warnings at the end is also an elegant solution. Thank you!