Models: [object_detection] Feature: Resume training from last checkpoint

Created on 28 Apr 2018 · 13 comments · Source: tensorflow/models

This is about the object_detection repo:

As far as I know, when training breaks because of any kind of error and you want to continue training from the last saved checkpoint, you need to manually point your model's config file at the checkpoint you want to resume from.

So now my question: is there a way to let the model find the last saved checkpoint and continue from it automatically? If not, I think it would be a nice feature to add to the train protos, and it should not be too hard to implement.

Maybe something like this (taken from matterport):

def find_last(self):
        """Finds the last checkpoint file of the last trained model in the
        model directory.
        Returns:
            log_dir: The directory where events and weights are saved
            checkpoint_path: the path to the last checkpoint file
        """
        # Get directory names. Each directory corresponds to a model
        dir_names = next(os.walk(self.model_dir))[1]
        key = self.config.NAME.lower()
        dir_names = filter(lambda f: f.startswith(key), dir_names)
        dir_names = sorted(dir_names)
        if not dir_names:
            return None, None
        # Pick last directory
        dir_name = os.path.join(self.model_dir, dir_names[-1])
        # Find the last checkpoint
        checkpoints = next(os.walk(dir_name))[2]
        checkpoints = filter(lambda f: f.startswith("mask_rcnn"), checkpoints)
        checkpoints = sorted(checkpoints)
        if not checkpoints:
            return dir_name, None
        checkpoint = os.path.join(dir_name, checkpoints[-1])
        return dir_name, checkpoint
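
In matterport's code this is then used roughly like the following (just a sketch, based on the version quoted above where find_last returns a (dir, checkpoint) tuple):

log_dir, checkpoint = model.find_last()
if checkpoint:
    model.load_weights(checkpoint, by_name=True)  # resume from the newest checkpoint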
Labels: awaiting response, feature

Most helpful comment

@GustavZ I believe restarting the train.py job with the same command line arguments should pick up the last saved checkpoint in the checkpoint directory. This is a feature built into Supervisor, which the TF Object Detection API uses.

Have you noticed a situation where a killed and restarted training job doesn't load the last checkpoint?

All 13 comments

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
What is the top-level directory of the model you are using
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

CC @derekjchow for your thoughts on this feature request.

As a workaround I wrote a shell script that automatically updates the config with the last saved checkpoint in the addressed directory and restarts training if it breaks due to any error. I don't know if this is of interest for you...
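
Roughly, the same idea in Python looks like this (a sketch of the approach, not the actual script; TRAIN_DIR, CONFIG_PATH and the relaunch command are placeholders, and it assumes the TF 1.x object_detection train.py plus a pipeline config containing a fine_tune_checkpoint field):

import re
import subprocess

import tensorflow as tf

TRAIN_DIR = "training/"                    # directory where checkpoints are written
CONFIG_PATH = "training/pipeline.config"   # pipeline config passed to train.py

def update_config_with_latest_checkpoint(train_dir, config_path):
    """Rewrite fine_tune_checkpoint to point at the newest checkpoint, if any."""
    latest = tf.train.latest_checkpoint(train_dir)  # e.g. training/model.ckpt-12345
    if latest is None:
        return None
    with open(config_path) as f:
        config_text = f.read()
    config_text = re.sub(r'fine_tune_checkpoint: ".*?"',
                         'fine_tune_checkpoint: "%s"' % latest,
                         config_text)
    with open(config_path, "w") as f:
        f.write(config_text)
    return latest

while True:
    update_config_with_latest_checkpoint(TRAIN_DIR, CONFIG_PATH)
    # Relaunch training; if it crashes (e.g. OOM), the loop updates the config
    # with the newest checkpoint and starts it again in a fresh process.
    ret = subprocess.call(["python", "object_detection/train.py",
                           "--train_dir=" + TRAIN_DIR,
                           "--pipeline_config_path=" + CONFIG_PATH])
    if ret == 0:
        break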

@GustavZ, hi, many training jobs break because of "OOM". Can your script release the GPU memory first when it restarts training automatically?

@liangxiao05 yes it does, as it restarts all Python processes which allocate the GPU memory.

That's cool, and I think you don't need to write the checkpoint into the config file when training breaks; just restart 'python object_detection/train.py'. I support you opening this as a PR, it will be useful, thanks!

I don't think just restarting train.py is enough, as it always starts from the checkpoint provided in the config, and if that does not get updated it always restarts from the same point. So basically that's the whole point of the small script I wrote: reading the most recent checkpoint number and updating the config with it.

@GustavZ I believe restarting the train.py job with the same command line arguments should pick up the last saved checkpoint in the checkpoint directory. This is a feature built into Supervisor, which the TF Object Detection API uses.

Have you noticed a situation where a killed and restarted training job doesn't load the last checkpoint?
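
For reference, you can check which checkpoint a restarted job would pick up with something like this (a small sketch assuming TF 1.x; "training/" stands in for whatever --train_dir you pass to train.py):

import tensorflow as tf

# The Supervisor restores from the newest checkpoint recorded in the train_dir's
# "checkpoint" file; tf.train.latest_checkpoint reads the same file.
latest = tf.train.latest_checkpoint("training/")
print(latest)  # e.g. training/model.ckpt-12345, or None if nothing has been saved yet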

Training resumes from the latest checkpoint it has saved if 'from_detection_checkpoint' is set to True in the config file. You can see this being used when creating a model in lines 250-256 of trainer.py.

I had the same issue. You can just set NUM_TRAIN_STEPS to None and also point the fine_tune directory to the same directory you want to load. Then it should work.

@GustavZ Can you help me with restarting training from the last checkpoint?

Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.
