Models: Controlling number and frequency of checkpointing for Object Detection API

Created on 21 Jul 2020 · 2 comments · Source: tensorflow/models

Prerequisites

Please answer the following question for yourself before submitting an issue.

  • [x] I checked to make sure that this feature has not been requested already.

Well, there are a couple of other issues (#4636, #5139) discussing the same feature, but I can't find any clear instructions or solutions in them.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf1.md

2. Describe the feature you request

Two questions:

  1. Can we control the frequency of checkpointing?
  2. Can we control how many checkpoints are kept? Currently it seems that only the latest few checkpoints are saved during training, while earlier ones are deleted automatically. This defeats the purpose of checkpointing to some extent: users often want to train for many epochs and then find and restore the checkpoint from just before the model started overfitting.

Is this already possible? If so, could clear instructions for these be added to the documentation?
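For TF1, both knobs are exposed through `tf.estimator.RunConfig` rather than through the Object Detection API's own config. A minimal sketch of how `model_main.py` could be modified, assuming it builds a `RunConfig` for the estimator (the exact construction in your checkout may differ, so treat this as an illustration, not the canonical code):

```python
# Sketch of a TF1 (tf.estimator-based) modification inside
# research/object_detection/model_main.py. Assumes FLAGS is the
# script's existing flag object; not runnable outside that script.
import tensorflow as tf  # TF 1.x

config = tf.estimator.RunConfig(
    model_dir=FLAGS.model_dir,
    save_checkpoints_steps=1000,  # write a checkpoint every 1000 steps
    keep_checkpoint_max=None,     # keep ALL checkpoints (default keeps only 5)
)
```

`keep_checkpoint_max=None` (or `0`) disables the deletion of older checkpoints entirely; any positive integer keeps that many most-recent checkpoints.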

3. Additional context

I'm using TF1 and am particularly interested in instructions for TF1, but adding documentation for both TF1 and TF2 would be ideal.

4. Are you willing to contribute it? (Yes or No)

Maybe.

Labels: research, feature

All 2 comments

Right now I use TF2. To control checkpoint saving, I modified "model_main_tf2.py" by adding `checkpoint_every_n=FLAGS.num_train_steps` to the `model_lib_v2.train_loop` call; this way a checkpoint is saved once training has completed the number of steps I selected. Also, in "model_lib_v2.py" the `train_loop` function has a `checkpoint_max_to_keep` parameter with a default value of 7 that you can increase. I hope this helps.
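The modification described above can be sketched as follows. The flag names and the `checkpoint_every_n` / `checkpoint_max_to_keep` keyword arguments of `model_lib_v2.train_loop` are taken from the comment and from recent versions of the TF2 Object Detection API; verify them against your checkout before applying:

```python
# Sketch: inside models/research/object_detection/model_main_tf2.py,
# extend the existing model_lib_v2.train_loop call. Not runnable
# outside the Object Detection API source tree.
model_lib_v2.train_loop(
    pipeline_config_path=FLAGS.pipeline_config_path,
    model_dir=FLAGS.model_dir,
    train_steps=FLAGS.num_train_steps,
    use_tpu=FLAGS.use_tpu,
    checkpoint_every_n=1000,      # save a checkpoint every 1000 steps
    checkpoint_max_to_keep=None,  # keep all checkpoints instead of the default 7
)
```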

+1
Also, I noticed that the saved checkpoint numbers do not correspond to the actual training steps; they simply increment 1, 2, 3, ... each time the checkpointing criterion is met. For the purpose of exporting the right model at the end of training, it would be better if the checkpoint number matched the training step.
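If you drive checkpointing yourself in TF2, `tf.train.CheckpointManager.save` accepts an explicit `checkpoint_number`, which is one way to make the file names track the global step. A minimal standalone sketch (the Object Detection API's `model_lib_v2.py` would need a similar patch to pass its step counter through):

```python
import tempfile
import tensorflow as tf  # TF 2.x

step = tf.Variable(0, dtype=tf.int64)
ckpt = tf.train.Checkpoint(step=step)
manager = tf.train.CheckpointManager(
    ckpt, directory=tempfile.mkdtemp(), max_to_keep=None)

# Passing checkpoint_number yields files named ckpt-100, ckpt-200, ...
# instead of the default ckpt-1, ckpt-2, ...
for _ in range(3):
    step.assign_add(100)
    manager.save(checkpoint_number=int(step.numpy()))
```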
