This framework restores the parameters at every checkpoint, which takes a long time. But the parameters don't change during this period, so there should be some way to avoid it.
Restoring parameters is the whole point of using checkpoints. It is not clear what your issue is about.
Maybe you are not satisfied with the continuous_train_and_eval schedule (see related issue #556). If you don't need or want the internal evaluation, you can use --schedule=train. You can also use --schedule=train_and_evaluate, though it still needs checkpoints (just one per evaluation) and it takes more memory than continuous_train_and_eval (and train).
It is a pity tf.contrib.learn.Experiment cannot do the simplest schedule: train for x steps or y minutes, evaluate the current model on the dev set without storing or retrieving any checkpoint, continue training for x steps or y minutes exactly from the position where we stopped last time, and so on. The checkpoint-saving frequency could then be set completely independently of the internal evaluation period (or saving could be turned off, storing only the final checkpoint). Of course, this schedule is not suitable for distributed training (though it is suitable for single-machine multi-GPU training), but the same holds for continuous_train_and_eval.
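To make the schedule described above concrete, here is a minimal sketch in plain Python. It is not T2T or tf.contrib.learn code: the one-parameter model, the function names (`dev_loss`, `train_with_inline_eval`) and the `eval_every` / `save_every` arguments are all hypothetical and only illustrate the control flow. Evaluation reads the parameters that are already in memory, and checkpoint saving runs on its own, independent frequency.

```python
def dev_loss(w, dev):
    """Mean squared error of the toy model y = w * x on the dev set."""
    return sum((w * x - y) ** 2 for x, y in dev) / len(dev)


def train_with_inline_eval(train, dev, total_steps, eval_every,
                           save_every=None, lr=0.01):
    """Train, evaluating in memory; checkpointing is decoupled from evaluation.

    Hypothetical illustration only: a "checkpoint" here is just a copy of
    the single parameter stored in a dict, keyed by the step number.
    """
    w = 0.0
    eval_log = []       # (step, dev loss) pairs
    checkpoints = {}    # step -> saved parameter value
    for step in range(1, total_steps + 1):
        # One SGD update; training continues from where it stopped.
        x, y = train[(step - 1) % len(train)]
        w -= lr * 2 * (w * x - y) * x

        # Evaluate the in-memory parameters: nothing is written or restored.
        if step % eval_every == 0:
            eval_log.append((step, dev_loss(w, dev)))

        # Save on a separate schedule (or never, if save_every is None).
        if save_every and step % save_every == 0:
            checkpoints[step] = w

    # Always keep the final parameters.
    checkpoints[total_steps] = w
    return w, eval_log, checkpoints


if __name__ == "__main__":
    train = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0)]
    dev = [(4.0, 12.0), (5.0, 15.0)]
    w, log, ckpts = train_with_inline_eval(train, dev, total_steps=200,
                                           eval_every=50, save_every=80)
    for step, loss in log:
        print(step, loss)
    print("checkpoints saved at steps:", sorted(ckpts))
```

With `save_every=None` only the final parameters are stored, which corresponds to the "storing only the final checkpoint" variant.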
Thank you for explaining tf.contrib.learn.Experiment. The most confusing thing to me is the message "Restoring parameters from xxxxx" after the evaluation period. The parameters in memory remain unchanged during the evaluation period, so why do we restore them from disk? Restoring is unnecessary because the parameters in memory and on disk are the same.
I have the same problem: restoring the parameters causes an OOM.
@littleDing: to keep issues focused: this issue is not about OOM, it is just about the inability to turn off checkpoint writing and reading (or rather to set its frequency independently of the internal evaluation frequency). There are several other OOM-related issues in T2T, e.g. #581.