Models: Error when attempting to continue training of an object detection model

Created on 20 Jun 2017 · 29 comments · Source: tensorflow/models

I am trying to retrain one of the vision models on a new task. Sadly, it will sometimes run for 1-2 epochs (and sometimes it skips straight to the error) before failing with the following:

InvalidArgumentError (see above for traceback): Incompatible shapes: [1,63,4] vs. [1,64,4]
     [[Node: gradients/Loss/BoxClassifierLoss/Loss/sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape, gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape_1)]]

I'm not sure what I am doing wrong. The [1, 63, 4] is very strange, as the value changes between runs.

Most helpful comment

I got the same error after adding possible labels to the annotations and label map without changing num_classes in the config file. Changing it seems to have fixed it (it now trains much longer without error at least).

All 29 comments

Hi @conner-starsky - Can you say a bit about what your new task is? These incompatible shape errors often mean that the non max suppression op returns fewer boxes than the training code expects --- hard to say more without knowing more details though.

Sure @jch1, thank you for the quick response. I am retraining on the Udacity dataset. I have tried disabling the non-max suppression op, but it doesn't seem to fix the problem. Again, thank you so much for your help.

InvalidArgumentError (see above for traceback): Incompatible shapes: [1,61,4] vs. [1,64,4]
     [[Node: Loss/BoxClassifierLoss/Loss/sub = Sub[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](Loss/BoxClassifierLoss/Reshape_9, Loss/BoxClassifierLoss/stack_4)]]

That is the error I get after disabling the non-max suppression op; it seems about the same.

Well you will need the NMS (and actually there are two such ops in Faster RCNN, which it looks like you are using) - usually these problems point to poor initialization or malformed data. Are you initializing appropriately from a COCO checkpoint? (Maybe you would like to provide logs and a config file?) Are you properly normalizing the bounding boxes relative to image size when preparing training data?
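For reference, "normalizing" here means dividing pixel coordinates by the image width and height so every box lies in [0, 1]. A minimal sketch of that step (names are illustrative, not taken from the poster's processing script):

def normalize_box(xmin_px, ymin_px, xmax_px, ymax_px, width, height):
    # The Object Detection API expects box corners in [0, 1], relative to
    # the image size, with xmin < xmax and ymin < ymax.
    return (xmin_px / float(width), ymin_px / float(height),
            xmax_px / float(width), ymax_px / float(height))

# Example on a 1920x1200 Udacity frame:
# normalize_box(100, 200, 400, 600, 1920, 1200)
# -> (~0.052, ~0.167, ~0.208, 0.5)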

The logs can be found here: https://gist.github.com/conner-starsky/5ff773cdadda6e1653ad26c123cafa64 and the config file here: https://gist.github.com/conner-starsky/f2a2c2724dfd218232c289e5481de60b I am dividing the bounding boxes by the image size. I can also post the data processing script if you would like.

I'm not sure, to be honest. What's your label map? And how big is the dataset?

My inclination would be to try first training an SSD model which won't run into this issue.

The dataset has 65,000 labels across 9,423 frames; each image is 1920x1200. My label map is:

item {
  id: 0
  name: 'Car'
}

item {
  id: 1
  name: 'Truck'
}

item {
  id: 2
  name: 'Pedestrian'
}

I will try training an SSD model. Thank you again for your help and patience.

I get a different error on SSD...

Okay, so one assumption we make (and should document) is that label maps start at 1 --- anything less is considered to be background or "none of the above". In your case, it's ignoring all cars --- now I don't know if this is the only problem, but try adding 1 to each item id and running Faster RCNN again.
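Applied to the label map above, that shift looks like this (same classes, every id bumped by one so nothing falls into the background slot):

item {
  id: 1
  name: 'Car'
}

item {
  id: 2
  name: 'Truck'
}

item {
  id: 3
  name: 'Pedestrian'
}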

Perhaps it was coincidence, but it ran for the longest time yet and was able to get to epoch 5 without erroring. However, the original error returned. :cry: Anyway, nice catch with the label map change; it seems like that will matter very much if I ever get it working. :wink:

Do you feel that by epoch 5 the training loss has gone down and the test mAP has increased?

I will keep running it to try to get a "lucky" run (I didn't keep my terminal windows open :skull:), but for now:

INFO:tensorflow:global step 2: loss = 2.6126 (31.942 sec/step)
INFO:tensorflow:global step 3: loss = 2.0596 (19.264 sec/step)

Updated:

INFO:tensorflow:global step 2: loss = 2.6851 (26.775 sec/step)
INFO:tensorflow:global step 3: loss = 2.2413 (13.241 sec/step)
INFO:tensorflow:global step 4: loss = 2.7821 (13.341 sec/step)

Updated More:

INFO:tensorflow:global step 2: loss = 2.8193 (23.220 sec/step)
INFO:tensorflow:global step 3: loss = 2.7664 (15.656 sec/step)
INFO:tensorflow:global step 4: loss = 2.5771 (15.777 sec/step)

Last Update (got 5 steps! :+1:)

INFO:tensorflow:global step 2: loss = 2.8563 (22.455 sec/step)
INFO:tensorflow:global step 3: loss = 2.2297 (13.070 sec/step)
INFO:tensorflow:global step 4: loss = 2.2287 (13.068 sec/step)
INFO:tensorflow:global step 5: loss = 1.9266 (13.113 sec/step)

I will update after more attempts. It seems that the loss does (in general) decrease. Thanks for your help in trying to figure this all out.

I misunderstood when you said "epoch" (I usually use that to refer to a full pass through the dataset). BTW, I recommend training with a GPU as it will be painfully slow without one :)

One thing to try is to vary learning rates (e.g. decrease them by some factor and see if it helps).
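For anyone following along: the learning rate lives in the train_config block of the pipeline config. A sketch of where to lower it; the nesting follows the optimizer proto used by the sample Faster R-CNN configs, and the values are illustrative rather than taken from this thread:

train_config: {
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003   # try e.g. 0.0001 or lower
          schedule {
            step: 900000
            learning_rate: 0.00003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
}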

Yeah, I didn't use epoch correctly at all. I just wanted to point out that it does seem to get a few batches (I think this is correct?) in before breaking.

I do have a large GPU training rig that I normally train on :smile:, but I am at home currently and just using my laptop to try to debug the error.

I can't really vary the learning rates while the dimension-mismatch error keeps appearing before I can see whether anything is working (though I did try very small learning rates and got the same error).

Another idea (I have no clue if this could be it): I haven't defined:

'image/object/difficult':
'image/object/truncated':
'image/object/view':

As I didn't think I needed them for my task. Could this be a problem?

Also, I am running the model on four titan X GPUs, but finding I cannot increase the batch size past 5 or I get out of memory errors. Is this expected behavior? Seems very low.

@conner-starsky you do not need those additional fields. Depending on the config you run, you will get different batch size constraints. For SSD, we can often run at batch size 32. Faster R-CNN is typically trained at much higher resolutions, which means we can process far fewer images per batch. This is why we usually just use batch size 1 for Faster R-CNN.
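In pipeline-config terms that boils down to something like the following (a sketch, not the poster's actual file; the SSD value is just the typical one mentioned above):

# Faster R-CNN, high-resolution inputs:
train_config: {
  batch_size: 1
}

# SSD, lower-resolution inputs:
train_config: {
  batch_size: 32
}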

Thank you so much. It seems to be running now? I think the problem was the label issue combined with the need to delete all of the checkpoint files. One last question, and I promise to stop bugging you :wink:. When is evaluation run for a model being trained?

Congrats! Hope you get great results :) So you are running train.py only right now. What you will have to do is run eval.py - see here:
https://github.com/tensorflow/models/blob/master/object_detection/g3doc/running_locally.md

Or if you try out the cloud walkthrough:
https://github.com/tensorflow/models/blob/master/object_detection/g3doc/running_pets.md
both jobs get launched simultaneously.
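Roughly, the two local invocations described in running_locally.md look like this (flag names as documented there at the time; check the doc for the exact, current form):

python object_detection/train.py \
    --logtostderr \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --train_dir=${TRAIN_DIR}

python object_detection/eval.py \
    --logtostderr \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --checkpoint_dir=${TRAIN_DIR} \
    --eval_dir=${EVAL_DIR}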

Thank you so much for your help. You were both kind and helpful, and it really made my day to get this to work.

Hi @jch1, I seem to have the same issue when using an R-FCN configuration with my own dataset. The error occurs at random. I did the following checks / tests:

  • I checked the label ids: no issue (they start at 1, with the correct number of classes)
  • I raised the first-stage IoU threshold and changed max proposals
  • I raised the second-stage IoU threshold

Still "Incompatible shapes: [1,63,4] vs. [1,64,4]" where 63 can be 61, 62 but lower than 64

@ericj974 Something that worked for me was clearing all checkpoint and event files (including the initial checkpoint).

@conner-starsky Thanks for the advice. By aggregating some classes (the extreme case being a single traffic-sign class), the error seems to disappear, which leads me to believe that @jch1's comment is correct: at some point the number of boxes generated is fewer than what the training code expects when my number of classes is "too high".

I also have this issue when finetuning on top of the resnet v2 atrous coco checkpoint. Seems very random.

My current best guess is bad data. It will take me some time to hunt down possible corrupt samples. Anyone else found anything on this yet?

@evolu8 On my side, the error was thrown at the second_stage_loc_losses computation in faster_rcnn_meta_arch. Checking the shape of reshaped_refined_box_encodings against batch_reg_targets before computing the loss, and setting reshaped_refined_box_encodings = batch_reg_targets when the shapes differ, was a quick-and-dirty way to partially work around the issue.
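A rough sketch of that workaround, assuming TF1-style graph code inside faster_rcnn_meta_arch (the wrapper function and its name are mine, not the actual diff):

import tensorflow as tf

def safe_refined_box_encodings(reshaped_refined_box_encodings, batch_reg_targets):
    # Quick-and-dirty guard: if the predicted encodings come back with an
    # unexpected number of boxes, fall back to the regression targets so the
    # second-stage localization loss does not crash on a shape mismatch.
    shapes_match = tf.reduce_all(
        tf.equal(tf.shape(reshaped_refined_box_encodings),
                 tf.shape(batch_reg_targets)))
    return tf.cond(shapes_match,
                   lambda: reshaped_refined_box_encodings,
                   lambda: batch_reg_targets)

Note that this only hides the mismatch for the affected step (its localization loss effectively becomes zero); it does not fix whatever produced too few boxes in the first place.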

@ericj974 care to share your code? I'm dealing with the same issue and trying to get to the bottom of it.

I've been fighting this bug for the last 3 days. One thing I've learned is that as I move bounding boxes away from the image edges, the chance of a crash drops significantly. Next I might try increasing the image size, etc.
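Along those lines, one cheap sanity check is to clamp the normalized coordinates into [0, 1] and drop degenerate boxes before writing the TFRecord (a hypothetical helper, not part of the API):

import numpy as np

def sanitize_boxes(boxes):
    # boxes: (N, 4) array of [ymin, xmin, ymax, xmax] in normalized [0, 1] coords.
    boxes = np.clip(boxes, 0.0, 1.0)                        # clamp to the image
    keep = (boxes[:, 2] > boxes[:, 0]) & (boxes[:, 3] > boxes[:, 1])
    return boxes[keep]                                      # drop zero-area boxes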

I got the same error after adding possible labels to the annotations and label map without changing num_classes in the config file. Changing it seems to have fixed it (it now trains much longer without error at least).
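For reference, that field sits at the top of the model block of the pipeline config; with the earlier three-class label map it would read (a sketch, rest of the block omitted):

model {
  faster_rcnn {
    num_classes: 3
    # ... remainder of the model config unchanged
  }
}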

I ran into this issue when I had multiple bounding boxes per image but did not have a matching number of classes and classes_text entries in my TFRecord.

Per:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/using_your_own_dataset.md

classes_text = [] # List of string class name of bounding box (1 per box)
classes = [] # List of integer class id of bounding box (1 per box)
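In other words, every per-box list in the tf.train.Example has to stay parallel. A minimal sketch of just the class fields for a hypothetical two-box image, with feature keys as in that doc:

import tensorflow as tf

classes_text = [b'Car', b'Pedestrian']   # one entry per box, in box order
classes = [1, 3]                         # matching ids from the label map (1-based)

feature = {
    'image/object/class/text': tf.train.Feature(
        bytes_list=tf.train.BytesList(value=classes_text)),
    'image/object/class/label': tf.train.Feature(
        int64_list=tf.train.Int64List(value=classes)),
}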
