I'm trying to do a test run of the training process on a subset of my data before I attempt to train on the full set. I have 4 labels and 981 images in total. I generated the TFRecords (4 shards) with build_image_data.py with only minor problems (some images had a .jpg extension but were actually PNGs, so I wrote a check to convert those).
I ran bazel build inception/imagenet_train, and then updated imagenet_data.py to set num_classes and num_examples to 4 and 981 respectively.
When I try to run bazel-bin/inception/imagenet_train --num_gpus=1 --batch_size=32 --train_dir=/tmp/imagenet_train --data_dir=/tmp/imagenet_data, I get the error in the title (a more complete traceback follows). Googling suggests that this error arises when num_classes or num_examples is not set, but I've definitely done that. Did I set them incorrectly?
W tensorflow/core/framework/op_kernel.cc:936] Invalid argument: indices[0] = [0,755] is out of bounds: need 0 <= index < [32,5]
E tensorflow/core/client/tensor_c_api.cc:485] indices[0] = [0,755] is out of bounds: need 0 <= index < [32,5]
[[Node: tower_0/SparseToDense = SparseToDense[T=DT_FLOAT, Tindices=DT_INT32, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](tower_0/concat, tower_0/SparseToDense/output_shape, tower_0/SparseToDense/sparse_values, tower_0/SparseToDense/default_value)]]
Traceback (most recent call last):
File "/Users/work/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 730, in _do_call
return fn(*args)
File "/Users/work/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 712, in _run_fn
status, run_metadata)
File "/Users/work/anaconda3/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/Users/work/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/errors.py", line 450, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors.InvalidArgumentError: indices[0] = [0,755] is out of bounds: need 0 <= index < [32,5]
[[Node: tower_0/SparseToDense = SparseToDense[T=DT_FLOAT, Tindices=DT_INT32, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](tower_0/concat, tower_0/SparseToDense/output_shape, tower_0/SparseToDense/sparse_values, tower_0/SparseToDense/default_value)]]
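A note on reading the message above: `indices[0] = [0,755]` together with `need 0 <= index < [32,5]` says that the first example in the batch of 32 carried label 755, while the dense label matrix has only 5 columns (presumably the 4 labels plus a background class). A label that large cannot come from a 4-label dataset, so it may be worth checking which records data_dir actually contains; that is an assumption, not something the thread confirms. The sketch below mimics the bounds check in plain Python. `check_label_indices` and `one_hot` are hypothetical helpers for illustration, not functions from the Inception code.

```python
def check_label_indices(labels, num_classes):
    """Return (batch_position, label) pairs that fall outside [0, num_classes)."""
    return [(i, label) for i, label in enumerate(labels)
            if not 0 <= label < num_classes]


def one_hot(labels, num_classes):
    """Build a dense [batch_size, num_classes] one-hot matrix as nested lists.

    Raises ValueError for any label that has no column in the matrix,
    which is the situation the traceback above describes.
    """
    bad = check_label_indices(labels, num_classes)
    if bad:
        position, label = bad[0]
        raise ValueError(
            "label {} at batch position {} is out of bounds for {} classes"
            .format(label, position, num_classes))
    return [[1.0 if column == label else 0.0 for column in range(num_classes)]
            for label in labels]


# Hypothetical batch: one stray label (755) among otherwise valid ones.
print(check_label_indices([755, 1, 2, 4], num_classes=5))
```

Running a check like this over the labels decoded from the shards, before training, would show whether the out-of-range value is in the data or introduced later.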
Did you try to clear all output and try again?
I did try that, but it didn't solve my issue. Ultimately, I switched to TF-Slim and defined a custom dataset to train from, and that works.
Just realized that I should close this, since it's not really relevant anymore.
@samspills can you please share brief steps on how you used TF-Slim for this purpose? I am currently stuck on this same issue when training from scratch.
@anuj2rock I just pushed a clone of the model repo with my custom dataset code: repo here
I put some simple instructions in the readme that should get something running. Note that the underlying model repo isn't up-to-date.
Hope this will help!
@samspills thanks for this post. Will try it today. It should help.
@samspills Thanks a lot, that worked after a few API-related corrections (tf.summary.scalar, etc.).
Now can you please help me make predictions on my test images using this newly trained model?
Problem: No .pb file in train_dir after training completes!
I just finished training Inception v3 from scratch on my custom dataset (1675 train images, 400 validation images, 2 classes).
1) I don't know how to make predictions on my test images using my newly trained model (where should I point label_image.py for the model?).
2) Where did my newly trained model get saved?
Here is some metadata about my setup/run:
These files were generated in train_dir:
checkpoint (537 bytes)
events.out.tfevents.1481980070.airig-Inspiron-7559 (4.9 GB)
graph.pbtxt (18.5 MB)
and a bunch of model.ckpt-.meta and model.ckpt-.index files
After running the train script I got:
....
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
After running the eval script I got:
...
INFO:tensorflow:Evaluation [0/25]
INFO:tensorflow:Evaluation [1/25]
INFO:tensorflow:Evaluation [2/25]
INFO:tensorflow:Evaluation [3/25]
INFO:tensorflow:Evaluation [5/25]
INFO:tensorflow:Evaluation [5/25]
INFO:tensorflow:Evaluation [6/25]
INFO:tensorflow:Evaluation [7/25]
INFO:tensorflow:Evaluation [8/25]
INFO:tensorflow:Evaluation [9/25]
INFO:tensorflow:Evaluation [10/25]
INFO:tensorflow:Evaluation [11/25]
INFO:tensorflow:Evaluation [13/25]
INFO:tensorflow:Evaluation [13/25]
INFO:tensorflow:Evaluation [14/25]
INFO:tensorflow:Evaluation [15/25]
INFO:tensorflow:Evaluation [16/25]
INFO:tensorflow:Evaluation [17/25]
INFO:tensorflow:Evaluation [18/25]
INFO:tensorflow:Evaluation [19/25]
INFO:tensorflow:Evaluation [20/25]
INFO:tensorflow:Evaluation [21/25]
INFO:tensorflow:Evaluation [22/25]
INFO:tensorflow:Evaluation [23/25]
INFO:tensorflow:Evaluation [25/25]
I tensorflow/core/kernels/logging_ops.cc:79] eval/Recall@5[1]
I tensorflow/core/kernels/logging_ops.cc:79] eval/Accuracy[1]
INFO:tensorflow:Finished evaluation at 2016-12-19-03:59:04
The way I use TF-Slim doesn't use a .pb file, but instead uses a checkpoint file. TF-Slim has an arg-scope for Inception that you can use, and then you can restore from the checkpoint (I don't have an example of this I can give you at the moment). I think TensorFlow has a freeze_graph script that converts a checkpoint into a graph file, but I've never tried using it.
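On the question of where the model was saved: the files listed in train_dir are the saved model. The small checkpoint file is, as far as its usual format goes, a text file whose `model_checkpoint_path` entry names the most recent model.ckpt-... prefix, and that prefix is what a restore call is given. The sketch below reads that entry in plain Python; `latest_checkpoint_prefix` is a hypothetical helper, and the file layout it assumes is the standard one rather than something shown in this thread.

```python
import os
import re


def latest_checkpoint_prefix(train_dir):
    """Return the checkpoint prefix recorded in train_dir/checkpoint, or None.

    Assumes the usual layout of the state file, for example:
        model_checkpoint_path: "model.ckpt-1000"
        all_model_checkpoint_paths: "model.ckpt-500"
    """
    state_file = os.path.join(train_dir, "checkpoint")
    if not os.path.isfile(state_file):
        return None
    with open(state_file) as handle:
        for line in handle:
            # re.match anchors at the line start, so the
            # all_model_checkpoint_paths entries are skipped.
            match = re.match(r'\s*model_checkpoint_path:\s*"(.*)"', line)
            if match:
                path = match.group(1)
                # Relative entries are resolved against train_dir.
                if os.path.isabs(path):
                    return path
                return os.path.join(train_dir, path)
    return None
```

In TensorFlow itself, `tf.train.latest_checkpoint(train_dir)` serves the same purpose, so the helper is only useful for inspecting a directory without loading TensorFlow.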
@anuj2rock have you solved making predictions for test images?
@anuj2rock I also followed @samspills' guide, but my code fails on eval (it throws this error):
InvalidArgumentError (see above for traceback): targets[0] is out of range
[[Node: InTopK = InTopK[T=DT_INT64, k=5, _device="/job:localhost/replica:0/task:0/cpu:0"](InceptionV3/Logits/SpatialSqueeze/_1717, Squeeze/_1719)]]
I thought it was a problem with running top-5 with only 2 classes (that might be why you get 100% accuracy), but I am not so sure now. Did you change anything in the eval script?
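To separate the two effects mentioned here, the following is a plain-Python sketch of how a top-k membership test of this kind is commonly defined; it is an approximation written for illustration, not the TensorFlow kernel, and `in_top_k` below is a hypothetical stand-in. Under this definition, k=5 with only 2 classes makes every valid target count as a hit, which would explain Recall@5 being 1, whereas the "out of range" failure points at a label with no matching logit column rather than at k itself.

```python
def in_top_k(predictions, targets, k):
    """For each row, report whether the target class is among the k highest scores.

    predictions: list of per-class score lists, one list per example.
    targets: list of integer class labels, one per example.
    """
    results = []
    for i, (scores, target) in enumerate(zip(predictions, targets)):
        if not 0 <= target < len(scores):
            # The label has no corresponding column in the scores.
            raise ValueError("targets[{}] is out of range".format(i))
        # Count classes scored strictly higher than the target class.
        higher = sum(1 for score in scores if score > scores[target])
        results.append(higher < k)
    return results


scores = [[0.9, 0.1], [0.2, 0.8]]   # two examples, two classes
wrong_labels = [1, 0]               # both top-1 predictions are wrong
print(in_top_k(scores, wrong_labels, k=5))
print(in_top_k(scores, wrong_labels, k=1))
```

If the eval labels include a value equal to or larger than the number of logits, for instance because of a differing label offset between the dataset and the model, the range check is the part that fails.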