Tensor2tensor: Is it possible to prevent restoring parameters from checkpoint?

Created on 5 Mar 2018 · 4 Comments · Source: tensorflow/tensor2tensor

The framework restores the parameters from disk at every checkpoint, which takes a long time. The parameters do not change in the meantime, so there should be a way to skip the restore.

question

draplater

All 4 comments

Restoring parameters is the whole point of using checkpoints. It is not clear what your issue is about.

Maybe you are not satisfied with the continuous_train_and_eval schedule (see related issue #556). If you don't need or want the internal evaluation, you can use --schedule=train. You can also use --schedule=train_and_evaluate, though it still needs checkpoints (just one per evaluation) and takes more memory than continuous_train_and_eval (and train).

It is a pity that tf.contrib.learn.Experiment cannot do the simplest schedule: train for x steps or y minutes, evaluate the current model on the dev set without storing or retrieving any checkpoint, continue training for x steps or y minutes from exactly where we stopped last time, and so on. Checkpoint storing could then be configured completely independently of the internal evaluation period (or turned off, storing only the final checkpoint). Of course, this schedule is not suitable for distributed training (though it is suitable for single-machine multi-GPU training), but the same holds for continuous_train_and_eval.
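To make the schedule described above concrete, here is a minimal sketch in plain Python. It is not T2T or tf.contrib.learn code: the toy one-parameter model and the names `make_data`, `train_steps`, `interleaved_train_and_eval`, `eval_every`, and `save_every` are all hypothetical, chosen only to show evaluation running on the in-memory parameters while "checkpoints" (here just a dict) are written on their own, independent interval.

```python
import random

def make_data(n, seed=0):
    """Toy regression data y = 3*x + noise (stand-in for a real dataset)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 3.0 * x + rng.gauss(0.0, 0.01)) for x in xs]

def mse(w, data):
    """Mean squared error of the one-parameter model y = w*x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train_steps(w, data, start_step, num_steps, lr=0.1):
    """Run num_steps SGD updates starting at start_step; return the new weight."""
    for step in range(start_step, start_step + num_steps):
        x, y = data[step % len(data)]
        w -= lr * 2.0 * (w * x - y) * x
    return w

def interleaved_train_and_eval(total_steps, eval_every, save_every, saved):
    """Train, evaluate in place, and keep training from the same state.

    `saved` is a dict standing in for checkpoint files on disk. Nothing is
    ever read back from it: evaluation uses the parameters already in memory.
    """
    train, dev = make_data(200, seed=1), make_data(50, seed=2)
    w, step, history = 0.0, 0, []
    while step < total_steps:
        n = min(eval_every, total_steps - step)
        w = train_steps(w, train, step, n)
        step += n
        history.append((step, mse(w, dev)))  # no restore before evaluating
        if step % save_every == 0 or step == total_steps:
            saved[step] = w  # checkpoint interval is independent of eval_every
    return w, history
```

Calling `interleaved_train_and_eval(1000, 100, 500, {})` evaluates ten times but writes only two checkpoints, which is the decoupling the comment above asks for. Like that comment says, this only makes sense when training and evaluation share one process.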

martinpopel on 5 Mar 2018

Thank you for explaining tf.contrib.learn.Experiment. What confuses me most is the "Restoring parameters from xxxxx" message that appears after each evaluation period. The parameters in memory do not change during evaluation, so why restore them from disk? The restore seems unnecessary, because the parameters in memory and on disk are identical.

draplater on 5 Mar 2018

Same problem here: restoring the parameters causes an OOM error.

littleDing on 30 May 2018

@littleDing: to keep issues focused: this issue is not about OOM, it is just about the inability to turn off checkpoint writing and reading (or rather, to set its frequency independently of the internal evaluation frequency). There are several other OOM-related issues in T2T, e.g. #581.

martinpopel on 30 May 2018
