Darkflow: NaN training loss *after* first iteration

Created on 2 Feb 2018  ·  6 Comments  ·  Source: thtrieu/darkflow

I am trying to further train the YOLO network on the COCO dataset as a starting point for "training my own data." I converted the COCO JSON into .xml annotations, but when I train, the loss becomes NaN starting at the second step. Most issues about NaN loss seem to center on incorrect annotations; however, I have checked mine multiple times for correctness. I have also copied coco.names and overwritten labels.txt with it.

I use the following command to train:
./flow --model cfg/yolo.cfg --load bin/yolo.weights --train --dataset /path/to/JPEGImages/ --annotation /path/to/Annotations/ --gpu 0.95

I get the following when training:

Training statistics: 
        Learning rate : 1e-05
        Batch size    : 16
        Epoch number  : 1000
        Backup every  : 2000
step 1 - loss 9.513481140136719 - moving ave loss 9.51348114013672
step 2 - loss nan - moving ave loss nan
step 3 - loss nan - moving ave loss nan

I've read in other issues that there is a parse history that can be deleted if the annotations previously contained a mistake, but I cannot find it. Any ideas about what could cause this would be appreciated.
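
Since incorrect annotations are the most commonly reported cause, an automated check can complement a manual review. The following is a minimal sketch, assuming Pascal VOC-style XML (`size`, `object`, `name`, `bndbox` elements); the function name `find_annotation_problems` is hypothetical and not part of darkflow.

```python
import xml.etree.ElementTree as ET


def find_annotation_problems(xml_text, labels):
    """Return a list of problems found in one VOC-style XML annotation."""
    problems = []
    root = ET.fromstring(xml_text)

    size = root.find("size")
    if size is None:
        return ["missing <size> element"]
    width = float(size.findtext("width", default="0"))
    height = float(size.findtext("height", default="0"))
    if width <= 0 or height <= 0:
        problems.append(f"non-positive image size {width} x {height}")

    for obj in root.iter("object"):
        name = obj.findtext("name")
        if name not in labels:
            problems.append(f"label {name!r} is not in the label list")
        box = obj.find("bndbox")
        if box is None:
            problems.append(f"object {name!r} has no <bndbox>")
            continue
        # A missing coordinate becomes NaN, which fails the ordering test below.
        xmin, ymin, xmax, ymax = (
            float(box.findtext(tag, default="nan"))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        if not (xmin < xmax and ymin < ymax):
            problems.append(f"object {name!r} has an empty or inverted box")
        elif xmin < 0 or ymin < 0 or xmax > width or ymax > height:
            problems.append(f"object {name!r} has a box outside the image")
    return problems


# Made-up annotation used only to exercise the checker.
sample = (
    "<annotation><size><width>100</width><height>80</height></size>"
    "<object><name>truck</name><bndbox><xmin>60</xmin><ymin>10</ymin>"
    "<xmax>50</xmax><ymax>40</ymax></bndbox></object></annotation>"
)
for problem in find_annotation_problems(sample, labels={"car", "person"}):
    print(problem)
```

Running this over every file in the annotation directory and printing only the files with a non-empty result would narrow down which annotations to inspect by hand.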

All 6 comments

I'm having the same issue as above, except that I reached step 46K before it started happening. No error is shown when I start training, and previous issues give almost no instructions on how to fix this problem.

I'd rather not open a new issue. I'm training my network on a custom dataset with 3 labels, using a copy of the tiny-yolo-voc weights.

My command to train is as follows:
./flow --model cfg/tiny-yolo-voc-3c.cfg --load -1 --train --annotation train/annotations --dataset train/images

Here's a preview of when it started happening (it never happened prior to these steps):

step 41734 - loss 1.8608825206756592 - moving ave loss 1.2506067808148458
step 41735 - loss 0.8213188052177429 - moving ave loss 1.2076779832551356
step 41736 - loss 0.8962286710739136 - moving ave loss 1.1765330520370134
step 41737 - loss 1.6367404460906982 - moving ave loss 1.222553791442382
step 41738 - loss 0.7614827156066895 - moving ave loss 1.1764466838588128
step 41739 - loss nan - moving ave loss nan
step 41740 - loss nan - moving ave loss nan
step 41741 - loss nan - moving ave loss nan
libpng warning: iCCP: known incorrect sRGB profile
step 41742 - loss nan - moving ave loss nan
step 41743 - loss nan - moving ave loss nan
step 41744 - loss nan - moving ave loss nan
step 41745 - loss nan - moving ave loss nan
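
The logged numbers are consistent with an exponential moving average that keeps 90% of the previous value and adds 10% of the new loss. The 0.9 factor is inferred from the log above, not taken from the darkflow source. Under that assumption, this sketch reproduces the logged values and shows why a single NaN loss makes every later moving average NaN as well:

```python
import math


def update_moving_average(previous, loss, decay=0.9):
    """Exponential moving average: decay * previous + (1 - decay) * loss."""
    return decay * previous + (1.0 - decay) * loss


# Values copied from the log above.
mva_41734 = 1.2506067808148458
mva_41735 = update_moving_average(mva_41734, 0.8213188052177429)
mva_41736 = update_moving_average(mva_41735, 0.8962286710739136)
print(mva_41735, mva_41736)  # close to the logged 1.2077 and 1.1765

# Once one loss is NaN, the average never recovers, even if later losses are finite.
poisoned = update_moving_average(mva_41736, float("nan"))
recovered = update_moving_average(poisoned, 0.75)
print(math.isnan(poisoned), math.isnan(recovered))
```

So the moving average column says nothing about whether training recovered; only the raw loss column does.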

I get the following when initiating training:

```
Parsing cfg/tiny-yolo-voc-3c.cfg
Loading None ...
Finished in 6.341934204101562e-05s

Building net ...
Source | Train? | Layer description | Output size
-------+--------+----------------------------------+---------------
| | input | (?, 416, 416, 3)
Init | Yep! | conv 3x3p1_1 +bnorm leaky | (?, 416, 416, 16)
Load | Yep! | maxp 2x2p0_2 | (?, 208, 208, 16)
Init | Yep! | conv 3x3p1_1 +bnorm leaky | (?, 208, 208, 32)
Load | Yep! | maxp 2x2p0_2 | (?, 104, 104, 32)
Init | Yep! | conv 3x3p1_1 +bnorm leaky | (?, 104, 104, 64)
Load | Yep! | maxp 2x2p0_2 | (?, 52, 52, 64)
Init | Yep! | conv 3x3p1_1 +bnorm leaky | (?, 52, 52, 128)
Load | Yep! | maxp 2x2p0_2 | (?, 26, 26, 128)
Init | Yep! | conv 3x3p1_1 +bnorm leaky | (?, 26, 26, 256)
Load | Yep! | maxp 2x2p0_2 | (?, 13, 13, 256)
Init | Yep! | conv 3x3p1_1 +bnorm leaky | (?, 13, 13, 512)
Load | Yep! | maxp 2x2p0_1 | (?, 13, 13, 512)
Init | Yep! | conv 3x3p1_1 +bnorm leaky | (?, 13, 13, 1024)
Init | Yep! | conv 3x3p1_1 +bnorm leaky | (?, 13, 13, 1024)
Init | Yep! | conv 1x1p0_1 linear | (?, 13, 13, 40)
-------+--------+----------------------------------+---------------
Running entirely on CPU
cfg/tiny-yolo-voc-3c.cfg loss hyper-parameters:
H = 13
W = 13
box = 5
classes = 3
scales = [1.0, 5.0, 1.0, 1.0]
Building cfg/tiny-yolo-voc-3c.cfg loss
Building cfg/tiny-yolo-voc-3c.cfg train op
2018-02-06 19:21:28.081406: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-02-06 19:21:28.081648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: GeForce GTX 950M major: 5 minor: 0 memoryClockRate(GHz): 0.928
pciBusID: 0000:01:00.0
totalMemory: 1.96GiB freeMemory: 1.78GiB
2018-02-06 19:21:28.081668: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 950M, pci bus id: 0000:01:00.0, compute capability: 5.0)
Loading from ./ckpt/tiny-yolo-voc-3c-45500
Finished in 4.2130677700042725s

Enter training ...

cfg/tiny-yolo-voc-3c.cfg parsing train/annotations
Parsing for ['car', 'person', 'traffic_light']
[====================>]100% car_00000575.xmlml
Statistics:
traffic_light: 36
person: 787
car: 679
Dataset size: 1025
Dataset of 1025 instance(s)
Training statistics:
Learning rate : 1e-05
Batch size : 16
Epoch number : 1000
Backup every : 2000
step 45501 - loss nan - moving ave loss nan
step 45502 - loss nan - moving ave loss nan
step 45503 - loss nan - moving ave loss nan
step 45504 - loss nan - moving ave loss nan
```
I don't think it's necessary to provide my config, as it's just a copy of tiny-yolo-voc.cfg with the recommended changes to make it work with my custom dataset.

Let me know if I can provide any more details - hopefully we can get this issue sorted.

Might also be related to #383
Looking into the code path used for yolo.cfg, it looks like the NaN comes straight out of the call to TensorFlow on line 56 of flow.py:
fetched = self.sess.run(fetches, feed_dict)
loss = fetched[1]
It also appears that yolo.cfg with COCO labels calls the YOLOv2 constructor. The loss function for the network is defined in "darkflow/net/yolov2/train.py". I don't see anything at first glance that looks like it could go NaN, but that is where we would have to start.
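
To find the exact step (and therefore the batch) where the loss first goes bad, a small guard on the fetched loss value can help. This is a sketch only: `first_non_finite_step` is a hypothetical helper, and wiring it into the training loop in flow.py is left as an assumption.

```python
import math


def first_non_finite_step(losses, start_step=1):
    """Return the step number of the first NaN or inf loss, or None if all are finite."""
    for offset, loss in enumerate(losses):
        if not math.isfinite(loss):
            return start_step + offset
    return None


# Loss values from the first log in this thread (steps 1 to 3).
logged = [9.513481140136719, float("nan"), float("nan")]
print(first_non_finite_step(logged))
```

Inside a training loop, the same `math.isfinite` test on each fetched loss could stop training and dump the current batch before the weights are overwritten by further NaN updates.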

A lot of NaN problems can actually be fixed by clipping the gradients here:
https://github.com/thtrieu/darkflow/blob/master/darkflow/net/help.py#L18
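
For readers unfamiliar with the term, here is what clipping does to a list of gradient values. This is a plain-Python illustration of two common variants, not darkflow or TensorFlow code; the function names are chosen for this example.

```python
import math


def clip_by_value(grads, low=-1.0, high=1.0):
    """Limit every gradient component to the range [low, high]."""
    return [min(max(g, low), high) for g in grads]


def clip_by_global_norm(grads, max_norm):
    """Rescale all components together so the overall L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


print(clip_by_value([0.5, -3.0, 250.0]))
print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
```

Clipping by value caps each component independently, which can change the direction of the update. Clipping by global norm keeps the direction and only shrinks the step size.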

Hi @thtrieu ,

I'm fairly new to darkflow and deep learning in general. How would you suggest that I, a novice, go about clipping the gradients?

Are there any resources I can read or examples I can follow? I'm not exactly sure where to start with clipping gradients in darkflow.

Thanks.

My problem at least has gone away. I was comparing yolo.cfg to yolo-voc.cfg, since I was able to successfully train on the voc dataset with yolo-voc. There is a convolutional layer in yolo.cfg that is not present in yolo-voc.cfg starting at line 211 in yolo.cfg. I removed this layer and edited the route layer (formerly at line 222, now at 214) to
[route]
layers=-1,-3
I also changed random to 0 at the end of the file.
I am now training just fine. I can see zero gradients becoming an issue further along in training, but I have doubts that gradient clipping would have solved the "second iteration" issue that I was having. I personally do not know why this issue would have been caused by that convolutional layer. Closing for now, hopefully anyone else who has the immediate nan problem like me can edit similarly, though I would welcome commentary on why that particular layer caused me such troubles.
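
Comparing two cfg files by eye is error-prone, so a quick structural diff of their section headers can help spot an extra or missing layer. The sketch below assumes the darknet cfg convention of one `[section]` header per layer. The two cfg strings are made-up toy fragments, not the contents of yolo.cfg or yolo-voc.cfg.

```python
import difflib


def section_names(cfg_text):
    """List the [section] headers of a darknet-style cfg, in order."""
    names = []
    for line in cfg_text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            names.append(line[1:-1])
    return names


def diff_sections(cfg_a, cfg_b):
    """Return the section headers that differ, prefixed with '- ' (only in a) or '+ ' (only in b)."""
    delta = difflib.ndiff(section_names(cfg_a), section_names(cfg_b))
    return [entry for entry in delta if entry[0] in "+-"]


# Toy fragments for illustration only.
cfg_a = "[net]\nbatch=64\n[convolutional]\nfilters=64\n[convolutional]\nfilters=64\n[route]\nlayers=-1,-4\n"
cfg_b = "[net]\nbatch=64\n[convolutional]\nfilters=64\n[route]\nlayers=-1,-3\n"
print(diff_sections(cfg_a, cfg_b))
```

This only compares layer types and order. Differences in the options inside a section, such as the `layers` value of a route, would need a key-by-key comparison.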

Hey @illQuo, if you are still unsure about how to clip gradients, here's how I did it:

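# `tf` is TensorFlow; `optimizer` is the optimizer darkflow has already built.
gradients = optimizer.compute_gradients(self.framework.loss)
# Clip each gradient to [-1, 1]; skip variables that have no gradient (grad is None).
clipped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
# Apply the clipped gradients; passing `gradients` here would leave the clipping unused.
self.train_op = optimizer.apply_gradients(clipped_gradients)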

Even after clipping the gradients, the problem still persists. Can anyone suggest things I can try?
