Hi,
is there a way to run model evaluation during training? It seems that I currently need to stop the training process and run evaluation on the checkpoints in the train directory to get any metrics. It would be more elegant to run the evaluation automatically after every X training steps.
I just ran the eval on a separate terminal window without having to cancel the training
Thanks for the tip, I am aware that it is possible to run evaluation in a different process if there is enough memory available. My point is that periodic model evaluation during training is common practice and it would be great to see how it can be implemented in this pipeline.
Is there a particular model you have in mind? And is there a particular set of TF libraries that you're using to help you with training?
In general, might this be a better discussion for StackOverflow? For example: https://stackoverflow.com/questions/41217953/tensorflow-evaluate-while-training-with-queues
https://stackoverflow.com/questions/42407483/tensorflow-supervisor-for-both-training-and-evaluating-operations
For training I'm using the standard MobileNet v1 SSD configuration for object detection on a custom training and validation dataset.
As for evaluation during training, I was thinking of something along the lines of the training procedure at https://github.com/davidsandberg/facenet/blob/master/src/train_softmax.py
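For reference, the pattern in that facenet script boils down to calling an evaluation routine from inside the training loop every N steps. A minimal framework-agnostic sketch of that idea (the `train_step` and `evaluate` functions here are placeholders, not part of any TF pipeline):

```python
# Hypothetical sketch of periodic in-loop evaluation.
# `train_step` and `evaluate` are stand-ins for the real training
# and metric code in a given pipeline.

def train_step(step):
    # Stand-in for one optimization step; returns a dummy loss.
    return 1.0 / (step + 1)

def evaluate(step):
    # Stand-in for computing eval metrics at the current checkpoint.
    return {"step": step, "accuracy": min(0.99, step / 1000.0)}

def train(total_steps, eval_every):
    # Interleave evaluation with training every `eval_every` steps.
    history = []
    for step in range(1, total_steps + 1):
        train_step(step)
        if step % eval_every == 0:
            history.append(evaluate(step))
    return history
```

The trade-off, as noted elsewhere in this thread, is that each in-loop evaluation pauses training, so total wall-clock training time goes up.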
Thanks for the links, it seems that the preferred approach is to run evaluation in a separate process.
@antoniosimunovic have you solved the memory problem then? Or found another solution?
Currently I'm running evaluation on a separate device in a different process, as recommended.
You'd need to modify the code to do evaluation. Many people do evaluation as separate processes because it reduces the total training time by not putting an evaluation step in the middle of the training loop, but it requires another device or enough memory to run on the same device. You might try stackoverflow to see if anyone has a patch to do this, but if you extend the model to allow this option, we'd be happy to accept it.
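To illustrate the separate-process approach, here is a small sketch of launching evaluation without blocking training. The command below is a placeholder; in practice it would be something like `["python", "eval.py", "--checkpoint_dir", ...]` pointed at the training directory (those flags are an assumption, not taken from this repo):

```python
# Sketch: run evaluation as a non-blocking child process so the
# training process keeps going. The command here is a placeholder
# that just prints a message, standing in for a real eval.py invocation.
import subprocess
import sys

def launch_eval(cmd):
    # Popen returns immediately; training can continue while eval runs.
    return subprocess.Popen(cmd, stdout=subprocess.PIPE)

proc = launch_eval([sys.executable, "-c", "print('eval finished')"])
# ... training would continue here ...
out, _ = proc.communicate()  # collect eval output when convenient
```

The caveat from above still applies: the second process needs its own device, or enough free memory to share one.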
you can run eval.py on the CPU by adding

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
How can I train the model using the GPU and evaluate it using the CPU, when train.py and eval.py are in the same PyCharm project? I have tried "with tf.device('/gpu:0')" in train.py and "with tf.device('/cpu:0')" in eval.py, but it does not work.
It seems that my environment does not support TensorFlow on CPU and GPU at the same time.
Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.