Models: Training Inception from scratch on Custom Data: InvalidArgumentError: indices[0] = [0,1271] is out of bounds: need 0 <= index < [32,4]

Created on 19 Oct 2016 · 10 Comments · Source: tensorflow/models

I'm trying to do a test run of the training process on a subset of my data before I attempt to train on the full set. I have 4 labels and 981 images in total. I generated the TFRecords (4 shards) with build_image_data.py with only minor problems (some images had a .jpg extension but were actually PNGs, so I wrote a check to convert those).

I ran bazel build inception/imagenet_train, and then I updated imagenet_data.py to set num_classes and num_examples to 4 and 981, respectively.

When I try to run bazel-bin/inception/imagenet_train --num_gpus=1 --batch_size=32 --train_dir=/tmp/imagenet_train --data_dir=/tmp/imagenet_data, I'm getting the error I posted in the title (and a more complete traceback follows). Googling seems to suggest that this error arises when num_classes or num_examples is not set, but I've definitely done that. Did I set them incorrectly?

W tensorflow/core/framework/op_kernel.cc:936] Invalid argument: indices[0] = [0,755] is out of bounds: need 0 <= index < [32,5]
E tensorflow/core/client/tensor_c_api.cc:485] indices[0] = [0,755] is out of bounds: need 0 <= index < [32,5]
         [[Node: tower_0/SparseToDense = SparseToDense[T=DT_FLOAT, Tindices=DT_INT32, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](tower_0/concat, tower_0/SparseToDense/output_shape, tower_0/SparseToDense/sparse_values, tower_0/SparseToDense/default_value)]]
Traceback (most recent call last):
  File "/Users/work/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 730, in _do_call
    return fn(*args)
  File "/Users/work/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 712, in _run_fn
    status, run_metadata)
  File "/Users/work/anaconda3/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/Users/work/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/errors.py", line 450, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors.InvalidArgumentError: indices[0] = [0,755] is out of bounds: need 0 <= index < [32,5]
         [[Node: tower_0/SparseToDense = SparseToDense[T=DT_FLOAT, Tindices=DT_INT32, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](tower_0/concat, tower_0/SparseToDense/output_shape, tower_0/SparseToDense/sparse_values, tower_0/SparseToDense/default_value)]]
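
The `[32,5]` bound in the traceback is consistent with a batch of 32 and a label space of `num_classes + 1` columns (the Inception training code appears to reserve label 0 for a background class; treat that as an assumption). A label of 755 would then mean the records being read contain labels far outside the range 0 to 4. The sketch below reproduces that bounds check in plain Python; the helper name and the sample labels are hypothetical and only illustrate the check, not the actual TensorFlow op.

```python
def find_out_of_range_labels(labels, num_classes, background_offset=1):
    """Return (position, label) pairs that a one-hot encoding would reject.

    Assumption: the dense label tensor has num_classes + background_offset
    columns, matching the [32, 5] bound reported for 4 classes.
    """
    upper = num_classes + background_offset  # exclusive upper bound
    return [(i, lab) for i, lab in enumerate(labels) if not 0 <= lab < upper]


# Hypothetical labels: 755 stands in for the value from the traceback.
batch_labels = [1, 4, 755, 2]
bad = find_out_of_range_labels(batch_labels, num_classes=4)
print(bad)  # [(2, 755)]
```

If labels read back from the shards fail this check, one possibility is that the files under `--data_dir` were not built from the 4-label file, for example because older shards are still in the same directory.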

awaiting response

samspills

All 10 comments

Did you try to clear all output and try again?

drpngx on 24 Oct 2016

I did try that, but it didn't solve my issue. Ultimately, I switched to TF-Slim and defined a custom dataset to train from, and that does work.

samspills on 24 Oct 2016

Just realized that I should close this, since it's not really relevant anymore.

samspills on 29 Oct 2016

@samspills Could you please share brief steps on how you used TF-Slim for this purpose? I am currently stuck with this same issue when training from scratch.

anuj2rock on 15 Dec 2016

@anuj2rock I just pushed a clone of the model repo with my custom dataset code: repo here

I put some simple instructions in the readme that should get something running. Note that the underlying model repo isn't up-to-date.

Hope this will help!

samspills on 15 Dec 2016

@samspills thanks for this post. Will try it today. It should help.

anuj2rock on 16 Dec 2016

@samspills Thanks a lot, that worked after a few API-related corrections (tf.summary.scalar, etc.).

Now, could you please help me make predictions on my test images using this newly trained model?
Problem: there is no .pb file in train_dir after training completes!
I just finished training Inception v3 from scratch on my custom dataset (1675 training images, 400 validation images, 2 classes).

1) I don't know how to make predictions on my test images using my newly trained model (which model file should label_image.py point to?).
2) Where did my newly trained model get saved?

Here are some details about my setup and run:

These files were generated in train_dir:
checkpoint (537 bytes)
events.out.tfevents.1481980070.airig-Inspiron-7559 (4.9 GB)
graph.pbtxt (18.5 MB)
and a bunch of model.ckpt-.meta and model.ckpt-.index files

After running the train script I got:

....
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

After running the eval script I got:

...
INFO:tensorflow:Evaluation [0/25]
INFO:tensorflow:Evaluation [1/25]
INFO:tensorflow:Evaluation [2/25]
INFO:tensorflow:Evaluation [3/25]
INFO:tensorflow:Evaluation [5/25]
INFO:tensorflow:Evaluation [5/25]
INFO:tensorflow:Evaluation [6/25]
INFO:tensorflow:Evaluation [7/25]
INFO:tensorflow:Evaluation [8/25]
INFO:tensorflow:Evaluation [9/25]
INFO:tensorflow:Evaluation [10/25]
INFO:tensorflow:Evaluation [11/25]
INFO:tensorflow:Evaluation [13/25]
INFO:tensorflow:Evaluation [13/25]
INFO:tensorflow:Evaluation [14/25]
INFO:tensorflow:Evaluation [15/25]
INFO:tensorflow:Evaluation [16/25]
INFO:tensorflow:Evaluation [17/25]
INFO:tensorflow:Evaluation [18/25]
INFO:tensorflow:Evaluation [19/25]
INFO:tensorflow:Evaluation [20/25]
INFO:tensorflow:Evaluation [21/25]
INFO:tensorflow:Evaluation [22/25]
INFO:tensorflow:Evaluation [23/25]
INFO:tensorflow:Evaluation [25/25]
I tensorflow/core/kernels/logging_ops.cc:79] eval/Recall@5[1]
I tensorflow/core/kernels/logging_ops.cc:79] eval/Accuracy[1]
INFO:tensorflow:Finished evaluation at 2016-12-19-03:59:04

anuj2rock on 19 Dec 2016

The way I use TF-Slim doesn't use a .pb file; it uses a checkpoint file instead. TF-Slim has an arg scope for Inception that you can use, and then you can restore from the checkpoint (I don't have an example of this I can give you at the moment). I think TensorFlow has a freeze_graph script that converts a checkpoint into a frozen graph, but I've never tried using it.
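
As a rough illustration of where the trained model lives: the small `checkpoint` file in train_dir is a text index naming the most recent checkpoint prefix, and the model.ckpt files next to it hold the weights. The sketch below reads that index with the standard library only. It assumes the file uses lines of the form `model_checkpoint_path: "..."`, and the function name and step number are made up for the demo. TensorFlow's own `tf.train.latest_checkpoint` serves the same purpose in real code.

```python
import os
import re
import tempfile


def latest_checkpoint_prefix(train_dir):
    """Read train_dir/checkpoint and return the newest checkpoint prefix, or None."""
    state_file = os.path.join(train_dir, "checkpoint")
    if not os.path.exists(state_file):
        return None
    with open(state_file) as f:
        for line in f:
            match = re.match(r'\s*model_checkpoint_path:\s*"(.*)"\s*$', line)
            if match:
                path = match.group(1)
                # Relative entries are resolved against train_dir.
                return path if os.path.isabs(path) else os.path.join(train_dir, path)
    return None


# Demo with a throwaway directory; "model.ckpt-1000" is a made-up step number.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "checkpoint"), "w") as f:
        f.write('model_checkpoint_path: "model.ckpt-1000"\n')
        f.write('all_model_checkpoint_paths: "model.ckpt-1000"\n')
    demo_prefix = latest_checkpoint_prefix(demo_dir)
    print(os.path.basename(demo_prefix))  # model.ckpt-1000
```

The returned prefix is what a saver's restore call would be pointed at; no .pb file is needed for that route.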

samspills on 20 Dec 2016

@anuj2rock Have you managed to make predictions on your test images?

gulll on 11 Jan 2017

@anuj2rock I also followed @samspills' guide, but my code fails on eval with this error:

InvalidArgumentError (see above for traceback): targets[0] is out of range
     [[Node: InTopK = InTopK[T=DT_INT64, k=5, _device="/job:localhost/replica:0/task:0/cpu:0"](InceptionV3/Logits/SpatialSqueeze/_1717, Squeeze/_1719)]]

I thought it was an issue with running top-5 with only 2 classes (which might be why you get 100% accuracy), but I am not so sure now. Did you change anything in the eval script?
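
Two separate effects may be in play here, and a small sketch can tell them apart. With only 2 classes, any in-range label is trivially within the top 5, which would explain a Recall@5 of 1. The `targets[0] is out of range` message, by contrast, suggests a label value that is not a valid logit index (for instance, labels starting at 1 checked against 2 logits). The function below is a hypothetical pure-Python stand-in for an in-top-k check, not TensorFlow's op, and the mapping of the error to a range check is an assumption.

```python
def in_top_k(scores, target, k):
    """Pure-Python approximation of an in-top-k check for one example.

    Raises ValueError when target is not a valid column index, mirroring the
    "targets[0] is out of range" failure; this mapping is an assumption.
    """
    if not 0 <= target < len(scores):
        raise ValueError(
            "target %d is out of range for %d classes" % (target, len(scores))
        )
    # Count classes scored strictly higher than the target class.
    higher = sum(1 for s in scores if s > scores[target])
    return higher < k


two_class_scores = [0.9, 0.1]
print(in_top_k(two_class_scores, target=1, k=5))  # True: k exceeds the class count
print(in_top_k(two_class_scores, target=1, k=1))  # False: class 0 scores higher
```

Under these assumptions, a label of 2 against two logits would raise, so checking the label range in the eval data (and whether the eval graph uses the same number of classes as training) is a reasonable first step.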