python object_detection\eval.py --logtostderr --pipeline_config_path=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\pipeline.config --checkpoint_dir=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\train\ --eval_dir=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\eval\
I am able to train with the Object Detection API on my own dataset, which I created using the create_pascal_tf_record.py script (I adjusted it a bit, but mainly the paths). I also checked the generated TFRecord files with the TensorFlow testing module and verified that the reconstructed images are similar to the original ones.
I use the existing faster_r-cnn_resnet101_voc07.config file and only adjusted the paths and num_classes. The training runs like a charm, but when I start the eval.py script, it hangs with the message "INFO:tensorflow:Restoring parameters from C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\train\model.ckpt-123805" (see full log below).
After this I have to hit CTRL+C. I can see some output in TensorBoard, but only one new mAP value after each CTRL+C, and nothing in the other diagrams etc.
As mentioned by others with the same issue, running the evaluation and training in parallel doesn't work for me, and I can't even imagine that it should be done this way. When I try it, CUDA crashes because the GPU runs out of memory.
Btw I also tried the whole evaluation process on the Oxford-IIIT Pet Dataset and am facing the same issue.
The whole log after I hit CTRL+C (it hangs at the "Restoring parameters" lines):
C:\Users\robin\models>python object_detection\eval.py --logtostderr --pipeline_config_path=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\pipeline.config --checkpoint_dir=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\train\ --eval_dir=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\eval\
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
2017-08-16 07:40:03.000943: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.001072: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.001933: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.002044: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.002153: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.002263: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.002357: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.002451: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.313527: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 1060 6GB
major: 6 minor: 1 memoryClockRate (GHz) 1.7085
pciBusID 0000:01:00.0
Total memory: 6.00GiB
Free memory: 5.01GiB
2017-08-16 07:40:03.313690: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:961] DMA: 0
2017-08-16 07:40:03.314894: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0: Y
2017-08-16 07:40:03.314995: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0)
INFO:tensorflow:Restoring parameters from C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\train\model.ckpt-123805
INFO:tensorflow:Restoring parameters from C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\train\model.ckpt-123805
Traceback (most recent call last):
File "object_detection\eval.py", line 161, in
tf.app.run()
File "C:\Users\robin\AppData\Local\Programs\Python\Python35lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection\eval.py", line 157, in main
FLAGS.checkpoint_dir, FLAGS.eval_dir)
File "C:\Users\robin\models\object_detection\evaluator.py", line 211, in evaluate
save_graph_dir=(eval_dir if eval_config.save_graph else ''))
File "C:\Users\robin\models\object_detection\eval_util.py", line 524, in repeated_checkpoint_run
time.sleep(time_to_next_eval)
KeyboardInterrupt
@jch1 @tombstone @derekjchow @jesu9 @dreamdragon
I had the same issue. How many images do you have in your validation set? I assume it is just too many. Try to reduce it to 10 or some other reasonable amount (in the config file) and check if it works. Do you actually see images in TensorBoard?
I have about 63 images in the validation set and can see the images in TensorBoard. But it still hangs.
I would say try with just a couple and see if it works. Or try to run only the eval on the GPU (this involves some magic to point it at a checkpoint other than the last one).
By default, the eval script runs forever. You'll need to define max_evals in the eval_config proto of your training config. Thus, it isn't hanging; it is just continually evaluating images, with sleeps in between. The sleep, by default, is 300 seconds.
You can set shuffle to true in your eval_input_reader if you'd like to see different images on the Images tab in TensorBoard.
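For reference, this is roughly what that looks like in the config (just a sketch with placeholder values; field names as in the Object Detection API's eval.proto and input_reader.proto):

eval_config: {
  num_examples: 63         # size of your validation set
  max_evals: 1             # run a single evaluation pass instead of looping forever
  eval_interval_secs: 300  # how long to sleep between checks for a new checkpoint
}

eval_input_reader: {
  shuffle: true            # show different images in the TensorBoard Images tab
  tf_record_input_reader {
    input_path: "/path/to/val.record"
  }
  label_map_path: "/path/to/label_map.pbtxt"
}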
@ncaadam so there is no output during the evaluation? The script ran during my lunch break (which was about 45 minutes) and there still was no output. And when I look at the code in eval_util.py, I can see logging statements that do not get displayed in my terminal.
Have you tried with just 5 images? You have to change the eval_config, and then you should see them in TensorBoard. Then wait until the next checkpoint appears. eval.py takes the last checkpoint from training and then, as described before, checks every 300 seconds whether a new model has been saved. If you run the evaluation after the training has finished, it will appear to hang because it is waiting for the next checkpoint.
Oh, I understand it now. So I have to run eval and train in parallel to get continuous evaluation. The problem I have with this is that CUDA crashes because it runs out of GPU memory. I have decreased the size of the test set from 20% to 5%, but it still crashes.
@failedmath what exactly does the num_examples parameter do? Does it randomly pick the specified number of images from the test.tfrecord? So can I set this parameter to 5 and leave my test set at 20% of the total dataset size, or do I have to reduce the overall size of my test set?
@RobinBaumann I run eval on my CPU, so it does not crash. For 10 images the config is:
eval_config: {
num_examples: 10
num_visualizations: 10
eval_interval_secs: 120
}
And the tfrecord file should then be created from just those 10 images.
Okay, thank you! This approach works for me.
I think it would be useful to add one or two sentences to the documentation, or some additional logging to the evaluation script, in case someone else struggles with the same problem. I have seen some prior issues that mentioned similar problems.
@RobinBaumann Hi, I have the same problem as you. I downloaded five pictures and saved them in the directory /models/valimg. I use the existing faster_r-cnn_resnet101_voc07.config file and only adjusted the paths. Finally, I run python object_detection/eval.py \
--logtostderr \
--pipeline_config_path=./configfile/faster_rcnn_resnet101_voc07.config \
--checkpoint_dir=./faster_rcnn_resnet101_coco_11_06_2017 \
--eval_dir=./valimg
However, I can see some output in the directory valimg:
"-rw-r--r-- 1 joseph joseph 211856 9ๆ 6 10:50 events.out.tfevents.1504666240.ubuntu
-rw-r--r-- 1 joseph joseph 297676 9ๆ 6 10:51 events.out.tfevents.1504666266.ubuntu
-rw-r--r-- 1 joseph joseph 185722 9ๆ 6 10:51 events.out.tfevents.1504666290.ubuntu
-rw-r--r-- 1 joseph joseph 374962 9ๆ 6 10:51 events.out.tfevents.1504666318.ubuntu
-rw-r--r-- 1 joseph joseph 233590 9ๆ 6 10:52 events.out.tfevents.1504666342.ubuntu
-rw-r--r-- 1 joseph joseph 234716 9ๆ 6 10:52 events.out.tfevents.1504666368.ubuntu
-rw-r--r-- 1 joseph joseph 312911 9ๆ 6 10:53 events.out.tfevents.1504666393.ubuntu
-rw-r--r-- 1 joseph joseph 399273 9ๆ 6 10:53 events.out.tfevents.1504666421.ubuntu
-rw-r--r-- 1 joseph joseph 373139 9ๆ 6 10:54 events.out.tfevents.1504666447.ubuntu
-rw-r--r-- 1 joseph joseph 1633 9ๆ 6 10:54 events.out.tfevents.1504666472.ubuntu"
However, there are only events.out.tfevents files, so I have to CTRL+C for mAP, and nothing else shows up in the other diagrams etc. I tried running tensorboard --logdir=./valimg, but it displays nothing. So can you tell me what is wrong with my operation? Thank you in advance.
Why doesn't the evaluation script output any information? I find lots of logging statements in the run_checkpoint_once function.
Are there any examples of how to calculate ROC or mAP?
@liuqi05 I don't know what is wrong with your configuration. Maybe try launching TensorBoard with the full path to your eval_dir?
Also, my directory structure looks like this:
+ project/
    + data/ (dataset location)
    + models/
        + faster_r-cnn_resnet101/
            + eval/
            + inference/
            + train/
            + pipeline.config
And I launch TensorBoard with the following command (from the project directory):
tensorboard --logdir=./models/faster_r-cnn_resnet101/
Maybe you have to set the parent directory of your valimg/ as the logdir for TensorBoard?
@liuqi05 Same problem as you: for train, TensorBoard displays output, but for eval it displays nothing.
@failedmath
"I run eval on my CPU, so it does not crash"
How do you restrict it to do that?
@cipri-tom At the beginning of the eval.py script you can set the CUDA_VISIBLE_DEVICES to empty in the following way:
import os
os.environ["CUDA_VISIBLE_DEVICES"]=""
Then the evaluation will fall back to CPU execution.
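Equivalently, if you don't want to edit eval.py, you can hide the GPUs from the shell when launching it (a sketch for a Linux shell; adapt the paths to your setup):

CUDA_VISIBLE_DEVICES="" python object_detection/eval.py --logtostderr \
    --pipeline_config_path=path/to/pipeline.config \
    --checkpoint_dir=path/to/train/ \
    --eval_dir=path/to/eval/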
@josifovski obviously! Why did I not think of that? >.< I am using CUDA_VISIBLE_DEVICES to set which GPU to use but I didn't think I could set it to null.
Thank you!
Same problem here. Although this could easily be fixed with one line of code, logging.basicConfig(level=logging.DEBUG), I suppose the TensorFlow people should fix this bug.
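In case it helps anyone, this is what I mean (a sketch; it just assumes you add it near the top of eval.py, before tf.app.run() is invoked):

import logging

# Attach a handler to Python's root logger so the INFO/DEBUG messages
# emitted inside eval_util.py actually show up in the terminal.
logging.basicConfig(level=logging.DEBUG)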
Can someone please explain how to run the evaluation from scratch? I want to find the mAP and IoU for a dataset which I have. The detections are happening fine, but I have no clue how to evaluate the model. For now, all I am trying to do is
python eval.py --logtostderr --checkpoint_dir=data/model.ckpt-5575 --eval_dir=eval_model/ --pipeline_config_path=data/object-detect.pbtxt
and the object-detect.pbtxt contains
item {
id: 1
name: 'ball'
}
Can someone please let me know how to do this?
@shreyas0906 the pipeline_config_path argument you provide is not correct: you shouldn't provide the pbtxt label map file but the configuration file that you used for training, e.g. faster_rcnn_inception_resnet_v2_atrous_coco.config from the provided tensorflow/models/research/object_detection/samples/configs, which you have adapted for your purposes. Inside the config file you should adapt the eval_input_reader element so that it points to the TF record created from the images you want to evaluate.
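So the call would look roughly like this (illustrative only; the config name is whatever you trained with, and checkpoint_dir must be the directory that holds the model.ckpt-* files, not a single checkpoint file):

python eval.py --logtostderr \
    --pipeline_config_path=data/faster_rcnn_inception_resnet_v2_atrous_coco.config \
    --checkpoint_dir=data/ \
    --eval_dir=eval_model/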
@josifovski Thank you for the information! But my next question: after running eval.py, I am unable to find the mAP and IoU graphs. I am able to view the images and the detections on them, but not the mAP and IoU graphs. Can you please let me know how to achieve that?
Greatly appreciate your help!
I am facing this problem. Everything works perfectly for ssd_mobilenet, even evaluation. For frcnn_resnet, the evaluation stops at this line:
WARNING:root:image 0 does not have groundtruth difficult flag specified
I changed max_evals to 1 and lowered the number of samples, but it still hangs and doesn't move forward.
Can anybody help me solve this issue?
@Fahimkh interesting, because eval stops after roughly 9-10 images using ssd_mobilenet for me.
Looks like I've fixed it for myself.
In ssd_mobilenet_v1_pets.config
eval_config: {
num_examples: 186
num_visualizations: 186
}
add the line "num_visualizations" and set it to the number of examples you have (in my case 186).
Hi guys, wonder if you can help me out here... what am I doing wrong?
This is a copy of my cmd prompt:
C:\Users\Sawal\Desktop\models\models-master\research\object_detection>eval.py --logtostderr --pipeline_config_path=training/ssd_mobilenet_v1_pets.config --checkpoint_dir=training --eval_dir=test
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
2018-02-21 20:53:06.664120: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
INFO:tensorflow:Restoring parameters from training\model.ckpt-2541
INFO:tensorflow:Restoring parameters from training\model.ckpt-2541
WARNING:root:image 0 does not have groundtruth difficult flag specified
WARNING:root:The following classes have no ground truth examples: [2 4]
C:\Users\Sawal\Desktop\models\models-master\research\object_detection\utils\metrics.py:144: RuntimeWarning: invalid value encountered in true_divide
num_images_correctly_detected_per_class / num_gt_imgs_per_class)
Then nothing comes out, just a continuous loop. Any help is appreciated.
@mohdsubhi nothing is supposed to come out, IIRC. The evaluation results are published in summaries, which you can see in TensorBoard
@cipri-tom so basically I have to run the evaluation at the same time as the training, right? Or does it not matter?
preferably, yes, so you can get continuous evaluation and see how your model progresses
@cipri-tom "The evaluation results are published in summaries, which you can see in TensorBoard"
Where? I'm not seeing anything useful in tensorboard. I see some images, and the graph. I want to see metrics. I'm running eval.py
Many thanks in advance.
@cipri-tom I had the same problem. Can you tell me what is supposed to be the output in the eval directory? I only have several events.out.tfevents.XXXX files and a pipeline.config; should there be an eval.pbtxt file as output there?
Thank you!!
No, there are only *tfevents files. These contain TensorBoard summaries. Fire up tensorboard --logdir /path/to/eval/dir
Hi,
I badly need help. I made the following changes to the sample config file:
train_config:
  initial_learning_rate: 0.001  # previously it was set to 0.004
eval_config:
  max_evals: 0  # previously it was set to 10
To start with, I ran the pretrained model with exactly the same configuration; both the training job and the evaluation job ran smoothly. But since max_evals was set to 10, the evaluation job stopped. I want to run it indefinitely so that I can get the mAP measure. So I changed it to 0 and am trying to run it again. train.py is running smoothly, but eval.py is not. Here is a copy of the prompt:
/workspace/models/research/object_detection/utils/visualization_utils.py:25: UserWarning:
This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called before pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.
The backend was originally set to 'TkAgg' by the following code:
File "eval.py", line 50, in
from object_detection import evaluator
File "/workspace/models/research/object_detection/evaluator.py", line 24, in
from object_detection import eval_util
File "/workspace/models/research/object_detection/eval_util.py", line 28, in
from object_detection.metrics import coco_evaluation
File "/workspace/models/research/object_detection/metrics/coco_evaluation.py", line 20, in
from object_detection.metrics import coco_tools
File "/workspace/models/research/object_detection/metrics/coco_tools.py", line 47, in
from pycocotools import coco
File "/workspace/models/research/pycocotools/coco.py", line 49, in
import matplotlib.pyplot as plt
File "/usr/local/lib/python2.7/dist-packages/matplotlib/pyplot.py", line 71, in
from matplotlib.backends import pylab_setup
File "/usr/local/lib/python2.7/dist-packages/matplotlib/backends/__init__.py", line 16, in
line for line in traceback.format_stack()
import matplotlib; matplotlib.use('Agg') # pylint: disable=multiple-statements
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
2018-05-18 10:31:34.033209: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled t
o use: AVX2 FMA
2018-05-18 10:31:34.744555: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but ther
e must be at least one NUMA node, so returning NUMA node zero
2018-05-18 10:31:34.744967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.78GiB freeMemory: 5.58GiB
2018-05-18 10:31:34.745000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-05-18 10:31:35.262651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-18 10:31:35.262739: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-05-18 10:31:35.262767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-05-18 10:31:35.263065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 w
ith 5338 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from /workspace/models/research/object_detection/my_ssd_mobilenet_v2_coco_2017/train_check_point_exp4/model.ckpt-6737
INFO:tensorflow:Restoring parameters from /workspace/models/research/object_detection/my_ssd_mobilenet_v2_coco_2017/train_check_point_exp4/model.ckpt-6737
WARNING:root:image 0 does not have groundtruth difficult flag specified
@priyakansal it is working, but you have warnings. Please use Stack Overflow for support and add details regarding the mAP that you see in TensorBoard in both cases. I'll try to follow up there.
I have a similar problem. I'm using the Faster-RCNN Inception V2 pre-trained model and trained it on my own dataset with 24 classes.
When I run the eval.py script with the correct parameters, the process eats up all available memory (I set limits of 12GB, 48GB, and 128GB; I do not have more), then just hangs at INFO:tensorflow:Restoring parameters from /data/train/model.ckpt-200000 for a few seconds and then quits without further notification.
The eval dir contains only the pipeline.cfg file.
My test.record only has 34 images that are relatively small, so I don't think the size should matter. Also, it doesn't matter whether I allow access to 0, 1, 2, or 3 CUDA cards.
Additionally, training worked after giving it access to 128GB, and it used all of it. I think there is something wrong with the memory consumption.
I'm running it inside the Docker container nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 using Python 3.5. I did not have any success with tensorflow/tensorflow:latest-gpu with either Python 2 or Python 3.
@bermeitinger-b If you run out of RAM (as opposed to GPU memory), see here. It's a problem of queues.
I don't think so; I reduced it to one image per batch, a queue size of one, and a prefetch size of one.
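For reference, these are roughly the fields I reduced (a sketch; the field names are taken from the train.proto and input_reader.proto of the version I'm using, so check them against your own proto definitions):

train_config: {
  batch_size: 1
  batch_queue_capacity: 1
  prefetch_queue_capacity: 1
}

eval_input_reader: {
  queue_capacity: 2
  min_after_dequeue: 1
  ...
}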
I have added some print statements to check where it crashes, and it is in the coco_evaluation.py file, so after the network has been loaded and the image has already been processed.
Although, it could be whatever was the last thing that didn't fit into memory. I have set a limit of 124GB.
I could train successfully and also run inference on many images; I do not know why the evaluation requires so much memory.
So did anyone find a solution for how to make TensorBoard show the Scalars tab, to see the evaluation metrics?
@priyakansal I also had the same problem and spent quite a while trying to find a solution; then it turned out that I just needed a little bit of patience.
@cipri-tom Thank you for your reminder; now I have my evaluation results. Do you have an idea how to explain that I get a classification_loss of about 8 while the localization_loss is close to 1?
Where can I find explanations of localization_loss, classification_loss, objectness, ...? Could I mix the training and evaluation image values in TensorBoard?
I can only see the Images and Graphs tabs. How do I check the mAP values in the Scalars tab?
Write the train output, for example, to /train and the eval output to /train/eval, then launch TensorBoard with the path /train; it will pick up both log events.
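For example (the paths and the elided flags are illustrative):

python object_detection/train.py ... --train_dir=/train
python object_detection/eval.py ... --checkpoint_dir=/train --eval_dir=/train/eval
tensorboard --logdir=/train   # picks up both the train and eval event files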
Has anyone run into this problem in eval.py:
Caused by op 'save/RestoreV2', defined at:
File "eval.py", line 147, in
tf.app.run()
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\util\deprecation.py", line 272, in new_func
return func(args, *kwargs)
File "eval.py", line 143, in main
graph_hook_fn=graph_rewriter_fn)
File "D:\tensorflow\tf.models\models\research\object_detection\legacyevaluator.py", line 251, in evaluate
saver = tf.train.Saver(variables_to_restore)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\training\saver.py", line 1281, in __init__
self.build()
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\training\saver.py", line 1293, in build
self._build(self._filename, build_save=True, build_restore=True)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\training\saver.py", line 1330, in _build
build_save=build_save, build_restore=build_restore)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\training\saver.py", line 778, in _build_internal
restore_sequentially, reshape)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\training\saver.py", line 397, in _AddRestoreOps
restore_sequentially)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\training\saver.py", line 829, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\ops\gen_io_ops.py", line 1546, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\util\deprecation.py", line 454, in new_func
return func(args, *kwargs)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\framework\ops.py", line 3155, in create_op
op_def=op_def)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\framework\ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()
NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from
the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Key lr not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64, DT_FLOAT],
_device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
Hi There,
We are checking to see if you still need help on this, as this seems to be considerably old issue. Please update this issue with the latest information, code snippet to reproduce your issue and error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.
I have a config file for faster_rcnn_resnet50_coco.config, and it contains a section like:
eval_config: {
  num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}
What actually happens if I remove max_evals, and what does "evaluate indefinitely" actually mean? Also, what about the number of steps?
Thanks and best regards
Regarding the number of steps: the config comment says "never decay). Remove the below line to train indefinitely. num_steps: 200000". When should the trainer stop if I remove the num_steps line, and what does "indefinitely" mean there?
If you want to evaluate your model on validation data you should use:
python models/research/object_detection/model_main_tf2.py --pipeline_config_path=/path/to/pipeline_file --model_dir=/path/to/output_results --checkpoint_dir=/path/to/directory_holding_checkpoint --run_once=True
If you want to evaluate your model on training data, you should set 'eval_training_data' as True, that is:
python models/research/object_detection/model_main_tf2.py --pipeline_config_path=/path/to/pipeline_file --model_dir=/path/to/output_results --eval_training_data=True --checkpoint_dir=/path/to/directory_holding_checkpoint --run_once=True
I also add comments to clarify some of the previous options:
--pipeline_config_path: path to the "pipeline.config" file used to train the detection model. This file should include paths to the TFRecord files (train and test files) that you want to evaluate, i.e.:
train_input_reader: {
tf_record_input_reader {
#path to the training TFRecord
input_path: "/path/to/train.record"
}
#path to the label map
label_map_path: "/path/to/label_map.pbtxt"
}
...
eval_input_reader: {
tf_record_input_reader {
#path to the testing TFRecord
input_path: "/path/to/test.record"
}
#path to the label map
label_map_path: "/path/to/label_map.pbtxt"
}
--model_dir: Output directory where resulting metrics will be written, particularly "events.*" files that can be read by tensorboard.
--checkpoint_dir: Directory holding a checkpoint. That is the model directory where checkpoint files ("model.ckpt.*") has been written, either during the training process or after export it by using "export_inference_graph.py".
--run_once: True to run just one round of evaluation.
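To look at the resulting metrics, point TensorBoard at the output directory (same hypothetical path as above):

tensorboard --logdir=/path/to/output_results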
I found the answer here and it works for me.
Hope this helps a few people who missed it, like myself! Cheers,