python object_detection\eval.py --logtostderr --pipeline_config_path=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\pipeline.config --checkpoint_dir=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\train\ --eval_dir=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\eval\
I am able to train with the Object Detection API on my own dataset, which I created using the create_pascal_tf_record.py script (I adjusted it a bit, but mainly the paths). I also checked the generated TFRecord files with the TensorFlow testing module and verified that the reconstructed images are similar to the original ones.
I use the existing faster_r-cnn_resnet101_voc07.config file and only adjusted the paths and num_classes. The training runs like a charm, but when I start the eval.py script, it hangs with the message "INFO:tensorflow:Restoring parameters from C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\train\model.ckpt-123805" (see full log below).
After this I have to hit CTRL+C. I can see some output in TensorBoard, but only one new mAP value after each CTRL+C, and nothing in the other diagrams etc.
As mentioned by others with the same issue, running the evaluation and training in parallel doesn't work for me, and I can't even imagine that it should be done this way. When I try it, CUDA crashes because the GPU runs out of memory.
Btw I also tried the whole evaluation process on the Oxford-IIIT Pet Dataset and am facing the same issue.
The whole log after I hit CTRL+C (it hangs at the "Restoring parameters" lines):
C:\Users\robin\models>python object_detection\eval.py --logtostderr --pipeline_config_path=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\pipeline.config --checkpoint_dir=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\train\ --eval_dir=C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\eval\
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
2017-08-16 07:40:03.000943: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.001072: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.001933: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.002044: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.002153: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.002263: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.002357: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.002451: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 07:40:03.313527: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 1060 6GB
major: 6 minor: 1 memoryClockRate (GHz) 1.7085
pciBusID 0000:01:00.0
Total memory: 6.00GiB
Free memory: 5.01GiB
2017-08-16 07:40:03.313690: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:961] DMA: 0
2017-08-16 07:40:03.314894: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0: Y
2017-08-16 07:40:03.314995: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0)
INFO:tensorflow:Restoring parameters from C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\train\model.ckpt-123805
INFO:tensorflow:Restoring parameters from C:\Users\robin\PycharmProjects\test\my_net\models\faster_r-cnn_resnet101\train\model.ckpt-123805
Traceback (most recent call last):
File "object_detection\eval.py", line 161, in
tf.app.run()
File "C:\Users\robin\AppData\Local\Programs\Python\Python35lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection\eval.py", line 157, in main
FLAGS.checkpoint_dir, FLAGS.eval_dir)
File "C:\Users\robin\models\object_detection\evaluator.py", line 211, in evaluate
save_graph_dir=(eval_dir if eval_config.save_graph else ''))
File "C:\Users\robin\models\object_detection\eval_util.py", line 524, in repeated_checkpoint_run
time.sleep(time_to_next_eval)
KeyboardInterrupt
@jch1 @tombstone @derekjchow @jesu9 @dreamdragon
I had the same issue. How many images do you have in your validation set? I assume it is just too many. Try to reduce it to 10 or some other reasonable amount (in the config file) and check if it works. Do you actually see images in TensorBoard?
I have about 63 images in the validation set and can see the images in TensorBoard. But it still hangs.
I would say try with just a couple and see if it works. Or try to run only the eval on the GPU (this involves some magic to point it at a checkpoint other than the last one).
By default, the eval script runs forever. You'll need to define max_evals in the eval_config proto of your training config. Thus, it isn't hanging; it is just continually evaluating images, with sleeps in between. The sleep, by default, is 300 seconds.
You can set shuffle to true in your eval_input_reader if you'd like to see different images on the Images tab in TensorBoard.
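For reference, this is roughly what that looks like in the config (just a sketch with placeholder values; field names as in the Object Detection API's eval.proto and input_reader.proto):

eval_config: {
  num_examples: 63         # size of your validation set
  max_evals: 1             # run a single evaluation pass instead of looping forever
  eval_interval_secs: 300  # how long to sleep between checks for a new checkpoint
}

eval_input_reader: {
  shuffle: true            # show different images in the TensorBoard Images tab
  tf_record_input_reader {
    input_path: "/path/to/val.record"
  }
  label_map_path: "/path/to/label_map.pbtxt"
}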
@ncaadam so there is no output during the evaluation? The script ran during my lunch break (which was about 45 minutes) and there still was no output. And when I look at the code in eval_util.py, I can see logging statements that do not get displayed in my terminal.
Have you tried with just 5 images? You have to change the eval_config, and then you should see them in TensorBoard. Then wait until the next checkpoint appears. eval.py takes the last checkpoint from training and then, as described before, checks every 300 seconds whether a new model has been saved. If you run the evaluation after the training has finished, it will appear to hang because it is waiting for the next checkpoint.
Oh, I understand it now. So I have to run eval and train in parallel to get continuous evaluation. The problem I have with this is that CUDA crashes because it runs out of GPU memory. I have decreased the size of the test set from 20% to 5%, but it still crashes.
@failedmath what exactly does the num_examples parameter do? Does it randomly pick the specified number of images from the test.tfrecord? So can I set this parameter to 5 and leave my test set at 20% of the total dataset size, or do I have to reduce the overall size of my test set?
@RobinBaumann I run eval on my CPU, so it does not crash. For 10 images the config is:
eval_config: {
num_examples: 10
num_visualizations: 10
eval_interval_secs: 120
}
And the tfrecord file should then be created from just those 10 images.
Okay, thank you! This approach works for me.
I think it would be useful to add one or two sentences to the documentation, or some additional logging to the evaluation script, in case someone else struggles with the same problem. I have seen some prior issues that mentioned similar problems.
@RobinBaumann Hi, I have the same problem as you. I downloaded five pictures and saved them in the directory /models/valimg. I use the existing faster_r-cnn_resnet101_voc07.config file and only adjusted the paths. Finally, I run python object_detection/eval.py \
--logtostderr \
--pipeline_config_path=./configfile/faster_rcnn_resnet101_voc07.config \
--checkpoint_dir=./faster_rcnn_resnet101_coco_11_06_2017 \
--eval_dir=./valimg
However, I can see some output in the directory valimg:
"-rw-r--r-- 1 joseph joseph 211856 9ๆ 6 10:50 events.out.tfevents.1504666240.ubuntu
-rw-r--r-- 1 joseph joseph 297676 9ๆ 6 10:51 events.out.tfevents.1504666266.ubuntu
-rw-r--r-- 1 joseph joseph 185722 9ๆ 6 10:51 events.out.tfevents.1504666290.ubuntu
-rw-r--r-- 1 joseph joseph 374962 9ๆ 6 10:51 events.out.tfevents.1504666318.ubuntu
-rw-r--r-- 1 joseph joseph 233590 9ๆ 6 10:52 events.out.tfevents.1504666342.ubuntu
-rw-r--r-- 1 joseph joseph 234716 9ๆ 6 10:52 events.out.tfevents.1504666368.ubuntu
-rw-r--r-- 1 joseph joseph 312911 9ๆ 6 10:53 events.out.tfevents.1504666393.ubuntu
-rw-r--r-- 1 joseph joseph 399273 9ๆ 6 10:53 events.out.tfevents.1504666421.ubuntu
-rw-r--r-- 1 joseph joseph 373139 9ๆ 6 10:54 events.out.tfevents.1504666447.ubuntu
-rw-r--r-- 1 joseph joseph 1633 9ๆ 6 10:54 events.out.tfevents.1504666472.ubuntu"
However, there are only events.out.tfevents files, so I have to CTRL+C for mAP, and nothing else shows up in the other diagrams etc. I tried running tensorboard --logdir=./valimg, but it displays nothing. So can you tell me what is wrong with my operation? Thank you in advance.
Why doesn't the evaluation script output any information? I find lots of logging statements in the run_checkpoint_once function.
Are there any examples of how to calculate ROC or mAP?
@liuqi05 I don't know what is wrong with your configuration. Maybe try launching TensorBoard with the full path to your eval_dir?
Also, my directory structure looks like this:
+ project/
    + data/ (dataset location)
    + models/
        + faster_r-cnn_resnet101/
            + eval/
            + inference/
            + train/
            + pipeline.config
And I launch TensorBoard with the following command (from the project directory):
tensorboard --logdir=./models/faster_r-cnn_resnet101/
Maybe you have to set the parent directory of your valimg/ as the logdir for TensorBoard?
@liuqi05 Same problem as you: for train, TensorBoard displays output, but for eval it displays nothing.
@failedmath
"I run eval on my CPU, so it does not crash"
How do you restrict it to do that?
@cipri-tom At the beginning of the eval.py script you can set the CUDA_VISIBLE_DEVICES to empty in the following way:
import os
os.environ["CUDA_VISIBLE_DEVICES"]=""
Then the evaluation will fall back to CPU execution.
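Equivalently, if you don't want to edit eval.py, you can hide the GPUs from the shell when launching it (a sketch for a Linux shell; adapt the paths to your setup):

CUDA_VISIBLE_DEVICES="" python object_detection/eval.py --logtostderr \
    --pipeline_config_path=path/to/pipeline.config \
    --checkpoint_dir=path/to/train/ \
    --eval_dir=path/to/eval/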
@josifovski obviously! Why did I not think of that? >.< I am using CUDA_VISIBLE_DEVICES to set which GPU to use but I didn't think I could set it to null.
Thank you!
Same problem here. Although this could easily be fixed with one line of code, logging.basicConfig(level=logging.DEBUG), I suppose the TensorFlow people should fix this bug.
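In case it helps anyone, this is what I mean (a sketch; it just assumes you add it near the top of eval.py, before tf.app.run() is invoked):

import logging

# Attach a handler to Python's root logger so the INFO/DEBUG messages
# emitted inside eval_util.py actually show up in the terminal.
logging.basicConfig(level=logging.DEBUG)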
Can someone please explain how to run the evaluation from scratch? I want to find the mAP and IoU for a dataset which I have. The detections are happening fine, but I have no clue how to evaluate the model. For now, all I am trying to do is
python eval.py --logtostderr --checkpoint_dir=data/model.ckpt-5575 --eval_dir=eval_model/ --pipeline_config_path=data/object-detect.pbtxt
and the object-detect.pbtxt contains
item {
id: 1
name: 'ball'
}
Can someone please let me know how to do this?
@shreyas0906 the pipeline_config_path argument you provide is not correct: you shouldn't provide the pbtxt label map file but the configuration file that you used for training, e.g. faster_rcnn_inception_resnet_v2_atrous_coco.config from the provided tensorflow/models/research/object_detection/samples/configs, which you have adapted for your purposes. Inside the config file you should adapt the eval_input_reader element so that it points to the TF record created from the images you want to evaluate.
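So the call would look roughly like this (illustrative only; the config name is whatever you trained with, and checkpoint_dir must be the directory that holds the model.ckpt-* files, not a single checkpoint file):

python eval.py --logtostderr \
    --pipeline_config_path=data/faster_rcnn_inception_resnet_v2_atrous_coco.config \
    --checkpoint_dir=data/ \
    --eval_dir=eval_model/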
@josifovski Thank you for the information! But my next question: after running eval.py, I am unable to find the mAP and IoU graphs. I am able to view the images and the detections on them, but not the mAP and IoU graphs. Can you please let me know how to achieve that?
Greatly appreciate your help!
I am facing this problem. Everything works perfectly for ssd_mobilenet, even evaluation. For frcnn_resnet, the evaluation stops at this line:
WARNING:root:image 0 does not have groundtruth difficult flag specified
I changed max_evals to 1 and lowered the number of samples, but it still hangs and doesn't move forward.
Can anybody help me solve this issue?
@Fahimkh interesting, because eval stops after roughly 9-10 images using ssd_mobilenet for me.
Looks like I've fixed it for myself.
In ssd_mobilenet_v1_pets.config
eval_config: {
num_examples: 186
num_visualizations: 186
}
add the line "num_visualizations" and set it to the number of examples you have (in my case 186).
Hi guys, wonder if you can help me out here... what am I doing wrong?
This is a copy of my cmd prompt:
C:\Users\Sawal\Desktop\models\models-master\research\object_detection>eval.py --logtostderr --pipeline_config_path=training/ssd_mobilenet_v1_pets.config --checkpoint_dir=training --eval_dir=test
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
2018-02-21 20:53:06.664120: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
INFO:tensorflow:Restoring parameters from training\model.ckpt-2541
INFO:tensorflow:Restoring parameters from training\model.ckpt-2541
WARNING:root:image 0 does not have groundtruth difficult flag specified
WARNING:root:The following classes have no ground truth examples: [2 4]
C:\Users\Sawal\Desktop\models\models-master\research\object_detection\utils\metrics.py:144: RuntimeWarning: invalid value encountered in true_divide
num_images_correctly_detected_per_class / num_gt_imgs_per_class)
Then nothing comes out, just a continuous loop. Any help is appreciated.
@mohdsubhi nothing is supposed to come out, IIRC. The evaluation results are published in summaries, which you can see in TensorBoard
@cipri-tom so basically I have to run the evaluation at the same time as the training, right? Or does it not matter?
preferably, yes, so you can get continuous evaluation and see how your model progresses
@cipri-tom "The evaluation results are published in summaries, which you can see in TensorBoard"
Where? I'm not seeing anything useful in tensorboard. I see some images, and the graph. I want to see metrics. I'm running eval.py
Many thanks in advance.
@cipri-tom I had the same problem. Can you tell me what is supposed to be the output in the eval directory? I only have several events.out.tfevents.XXXX files and a pipeline.config; should there be an eval.pbtxt file as output there?
Thank you!!
No, there are only *tfevents files. These contain TensorBoard summaries. Fire up tensorboard --logdir /path/to/eval/dir
Hi,
I badly need help. I made the following changes to the sample config file:
train_config:
  initial_learning_rate: 0.001  # previously it was set to 0.004
eval_config:
  max_evals: 0  # previously it was set to 10
To start with, I ran the pretrained model with exactly the same configuration; both the training job and the evaluation job ran smoothly. But since max_evals was set to 10, the evaluation job stopped. I want to run it indefinitely so that I can get the mAP measure. So I changed it to 0 and am trying to run it again. train.py is running smoothly, but eval.py is not. Here is a copy of the prompt:
/workspace/models/research/object_detection/utils/visualization_utils.py:25: UserWarning:
This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called before pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.
The backend was originally set to 'TkAgg' by the following code:
File "eval.py", line 50, in
from object_detection import evaluator
File "/workspace/models/research/object_detection/evaluator.py", line 24, in
from object_detection import eval_util
File "/workspace/models/research/object_detection/eval_util.py", line 28, in
from object_detection.metrics import coco_evaluation
File "/workspace/models/research/object_detection/metrics/coco_evaluation.py", line 20, in
from object_detection.metrics import coco_tools
File "/workspace/models/research/object_detection/metrics/coco_tools.py", line 47, in
from pycocotools import coco
File "/workspace/models/research/pycocotools/coco.py", line 49, in
import matplotlib.pyplot as plt
File "/usr/local/lib/python2.7/dist-packages/matplotlib/pyplot.py", line 71, in
from matplotlib.backends import pylab_setup
File "/usr/local/lib/python2.7/dist-packages/matplotlib/backends/__init__.py", line 16, in
line for line in traceback.format_stack()
import matplotlib; matplotlib.use('Agg') # pylint: disable=multiple-statements
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
2018-05-18 10:31:34.033209: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled t
o use: AVX2 FMA
2018-05-18 10:31:34.744555: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but ther
e must be at least one NUMA node, so returning NUMA node zero
2018-05-18 10:31:34.744967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.78GiB freeMemory: 5.58GiB
2018-05-18 10:31:34.745000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-05-18 10:31:35.262651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-18 10:31:35.262739: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-05-18 10:31:35.262767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-05-18 10:31:35.263065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 w
ith 5338 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from /workspace/models/research/object_detection/my_ssd_mobilenet_v2_coco_2017/train_check_point_exp4/model.ckpt-6737
INFO:tensorflow:Restoring parameters from /workspace/models/research/object_detection/my_ssd_mobilenet_v2_coco_2017/train_check_point_exp4/model.ckpt-6737
WARNING:root:image 0 does not have groundtruth difficult flag specified
@priyakansal it is working, but you have warnings. Please use Stack Overflow for support and add details regarding the mAP that you see in TensorBoard in both cases. I'll try to follow up there.
I have a similar problem. I'm using the Faster-RCNN Inception V2 pre-trained model and trained it on my own dataset with 24 classes.
When I run the eval.py script with the correct parameters, the process eats up all available memory (I set limits of 12GB, 48GB, and 128GB; I do not have more), then just hangs at INFO:tensorflow:Restoring parameters from /data/train/model.ckpt-200000 for a few seconds and then quits without further notification.
The eval dir contains only the pipeline.cfg file.
My test.record only has 34 images that are relatively small, so I don't think the size should matter. Also, it doesn't matter whether I allow access to 0, 1, 2, or 3 CUDA cards.
Additionally, training worked after giving it access to 128GB, and it used all of it. I think there is something wrong with the memory consumption.
I'm running it inside the Docker container nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 using Python 3.5. I did not have any success with tensorflow/tensorflow:latest-gpu with either Python 2 or Python 3.
@bermeitinger-b If you run out of RAM (as opposed to GPU memory), see here. It's a problem of queues.
I don't think so; I reduced it to one image per batch, a queue size of one, and a prefetch size of one.
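For reference, these are roughly the fields I reduced (a sketch; the field names are taken from the train.proto and input_reader.proto of the version I'm using, so check them against your own proto definitions):

train_config: {
  batch_size: 1
  batch_queue_capacity: 1
  prefetch_queue_capacity: 1
}

eval_input_reader: {
  queue_capacity: 2
  min_after_dequeue: 1
  ...
}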
I have added some print statements to check where it crashes, and it is in the coco_evaluation.py file, so after the network has been loaded and the image has already been processed.
Although, it could be whatever was the last thing that didn't fit into memory. I have set a limit of 124GB.
I could train successfully and also run inference on many images; I do not know why the evaluation requires so much memory.
So did anyone find a solution for how to make TensorBoard show the Scalars tab, to see the evaluation metrics?
@priyakansal I also had the same problem and spent quite a while trying to find a solution; then it turned out that I just needed a little bit of patience.
@cipri-tom Thank you for your reminder; now I have my evaluation results. Do you have an idea how to explain that I get a classification_loss of about 8 while the localization_loss is close to 1?
Where can I find explanations of localization_loss, classification_loss, objectness, ...? Could I mix the training and evaluation image values in TensorBoard?
I can only see the Images and Graphs tabs. How do I check the mAP values in the Scalars tab?
Write the train output, for example, to /train and the eval output to /train/eval, then launch TensorBoard with the path /train; it will pick up both log events.
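For example (the paths and the elided flags are illustrative):

python object_detection/train.py ... --train_dir=/train
python object_detection/eval.py ... --checkpoint_dir=/train --eval_dir=/train/eval
tensorboard --logdir=/train   # picks up both the train and eval event files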
Has anyone run into this problem in eval.py:
Caused by op 'save/RestoreV2', defined at:
File "eval.py", line 147, in
tf.app.run()
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\util\deprecation.py", line 272, in new_func
return func(args, *kwargs)
File "eval.py", line 143, in main
graph_hook_fn=graph_rewriter_fn)
File "D:\tensorflow\tf.models\models\research\object_detection\legacyevaluator.py", line 251, in evaluate
saver = tf.train.Saver(variables_to_restore)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\training\saver.py", line 1281, in __init__
self.build()
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\training\saver.py", line 1293, in build
self._build(self._filename, build_save=True, build_restore=True)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\training\saver.py", line 1330, in _build
build_save=build_save, build_restore=build_restore)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\training\saver.py", line 778, in _build_internal
restore_sequentially, reshape)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\training\saver.py", line 397, in _AddRestoreOps
restore_sequentially)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\training\saver.py", line 829, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\ops\gen_io_ops.py", line 1546, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\util\deprecation.py", line 454, in new_func
return func(args, *kwargs)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\framework\ops.py", line 3155, in create_op
op_def=op_def)
File "D:\tensorflow\venvlib\site-packages\tensorflow\python\framework\ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()
NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from
the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Key lr not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64, DT_FLOAT],
_device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
Hi There,
We are checking to see if you still need help on this, as this seems to be considerably old issue. Please update this issue with the latest information, code snippet to reproduce your issue and error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.
I have a config file for faster_rcnn_resnet50_coco.config, and it contains a section like:
eval_config: {
  num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}
What actually happens if I remove max_evals, and what does "evaluate indefinitely" actually mean? Also, what about the number of steps?
Thanks and best regards
Regarding the number of steps: the config comment says "never decay). Remove the below line to train indefinitely. num_steps: 200000". When should the trainer stop if I remove the num_steps line, and what does "indefinitely" mean there?
If you want to evaluate your model on validation data you should use:
python models/research/object_detection/model_main_tf2.py --pipeline_config_path=/path/to/pipeline_file --model_dir=/path/to/output_results --checkpoint_dir=/path/to/directory_holding_checkpoint --run_once=True
If you want to evaluate your model on training data, you should set 'eval_training_data' as True, that is:
python models/research/object_detection/model_main_tf2.py --pipeline_config_path=/path/to/pipeline_file --model_dir=/path/to/output_results --eval_training_data=True --checkpoint_dir=/path/to/directory_holding_checkpoint --run_once=True
I also add comments to clarify some of the previous options:
--pipeline_config_path: path to the "pipeline.config" file used to train the detection model. This file should include paths to the TFRecord files (train and test files) that you want to evaluate, i.e.:
train_input_reader: {
tf_record_input_reader {
#path to the training TFRecord
input_path: "/path/to/train.record"
}
#path to the label map
label_map_path: "/path/to/label_map.pbtxt"
}
...
eval_input_reader: {
tf_record_input_reader {
#path to the testing TFRecord
input_path: "/path/to/test.record"
}
#path to the label map
label_map_path: "/path/to/label_map.pbtxt"
}
--model_dir: Output directory where resulting metrics will be written, particularly "events.*" files that can be read by tensorboard.
--checkpoint_dir: Directory holding a checkpoint. That is the model directory where checkpoint files ("model.ckpt.*") has been written, either during the training process or after export it by using "export_inference_graph.py".
--run_once: True to run just one round of evaluation.
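To look at the resulting metrics, point TensorBoard at the output directory (same hypothetical path as above):

tensorboard --logdir=/path/to/output_results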
I found the answer here and it works for me.
Hope this helps a few people who missed it, like myself! Cheers,