In TF1.x we were able to run evaluation automatically during a training on a single GPU. Is it possible to achieve the same?
@turowicz
Can you please elaborate on the issue and the context? Thanks!
@ravikyram object_detection/model_main.py runs both training and evaluation as a single command, so you can leave it running over the weekend and come back on Monday to see the results. In contrast, model_main_tf2.py requires us to stop the training, run the evaluation manually and then restart the training, which makes it impossible to leave the process unattended for long periods of time.
@turowicz As a temporary workaround, you can run a second evaluation-only process in parallel via (in a separate shell):
CUDA_VISIBLE_DEVICES=-1 python object_detection/model_main_tf2.py --checkpoint_dir <same path as model_dir> --model_dir <the model_dir you passed in the training process> --pipeline_config_path <path to the pipeline.config file you're training with>
This will run on the CPU only, monitor checkpoint_dir (by default for up to 1 hour after the last checkpoint; have a look at all the flags of model_main_tf2.py) and run an evaluation every time a new checkpoint is generated. The benefit of this over the V1 approach is that training never stops. Of course, if you need to run the evaluation on the GPU, this will require further tweaking (especially if you only have one GPU and/or don't want to permanently dedicate a device to evaluation).
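For reference, a minimal sketch of the two parallel invocations; the paths are placeholders you would replace with your own, and --eval_timeout is only shown to make the default waiting time explicit:

# Shell 1: training (uses the GPU)
python object_detection/model_main_tf2.py \
  --model_dir /path/to/model_dir \
  --pipeline_config_path /path/to/pipeline.config

# Shell 2: continuous evaluation on CPU, watching the same model_dir
CUDA_VISIBLE_DEVICES=-1 python object_detection/model_main_tf2.py \
  --model_dir /path/to/model_dir \
  --checkpoint_dir /path/to/model_dir \
  --pipeline_config_path /path/to/pipeline.config \
  --eval_timeout 3600   # seconds to keep waiting for a new checkpoint (the default)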
Hey @GPhilo !
I've been searching for some information on this topic for some time.
Is this the only possible way to evaluate my model during training?
If so, can I change how often a checkpoint is generated so my model is evaluated more often?
@ItsMeTheBee
Is this the only possible way to evaluate my model during training?
Definitely not (you could implement your own evaluation loop, for example), but it's by far the most practical.
If so, can I change how often a checkpoint is generated so my model is evaluated more often?
Yep, there are flags you can pass to model_main_tf2 to configure how often checkpoints are created. IIRC there's also a setting on the evaluation side for the minimum time between evaluations, so make sure you don't generate checkpoints more often than that (or remember to lower the minimum time; I seem to remember the default is 10 minutes, which is already quite low for all the object detection models I can think of at the moment).
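As an illustration, something along these lines should make checkpoints (and therefore evaluations) more frequent; the step count is just an example value and the paths are placeholders:

# Training process: write a checkpoint every 500 steps instead of the default 1000
python object_detection/model_main_tf2.py \
  --model_dir /path/to/model_dir \
  --pipeline_config_path /path/to/pipeline.config \
  --checkpoint_every_n 500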
Thanks @GPhilo this should do it.