BERT: how to see loss per step or epoch during training?

Created on 7 Nov 2018  ·  33 Comments  ·  Source: google-research/bert

I successfully ran run_squad.py, but tf.logging only shows the output below, and I want to see whether the loss is going down. How can I log the loss or accuracy during training?

INFO:tensorflow:global_step/sec: 1.73561
INFO:tensorflow:examples/sec: 20.8273
INFO:tensorflow:global_step/sec: 1.73487
INFO:tensorflow:examples/sec: 20.8184
INFO:tensorflow:global_step/sec: 1.73578
INFO:tensorflow:examples/sec: 20.8294
INFO:tensorflow:global_step/sec: 1.73657
INFO:tensorflow:examples/sec: 20.8389
INFO:tensorflow:global_step/sec: 1.73621
INFO:tensorflow:examples/sec: 20.8345
INFO:tensorflow:global_step/sec: 1.73602
INFO:tensorflow:examples/sec: 20.8322
INFO:tensorflow:global_step/sec: 1.73591
INFO:tensorflow:examples/sec: 20.831
INFO:tensorflow:global_step/sec: 1.7353
INFO:tensorflow:examples/sec: 20.8236
INFO:tensorflow:global_step/sec: 1.73526
INFO:tensorflow:examples/sec: 20.8231

All 33 comments

Unfortunately this is a bug in TPUEstimator: it doesn't print the loss when training on CPU/GPU. What you can do is run eval mode in a separate terminal against the training directory. It's basically the same command but with --do_train=False and --do_predict=True, although that will give you predictions rather than the loss. If you only have one GPU you can run the eval on the CPU (it will just be slow).

The hackiest way to print the loss is to add loss = tf.Print(loss, [loss]) before returning the loss.

@minsuk-heo You can add a training hook to TPUEstimatorSpec:

  logging_hook = tf.train.LoggingTensorHook({"loss": total_loss}, every_n_iter=10)
  output_spec = tf.contrib.tpu.TPUEstimatorSpec(
      mode=mode,
      loss=total_loss,
      train_op=train_op,
      training_hooks=[logging_hook],
      scaffold_fn=scaffold_fn)

thanks @avostryakov,
your solution works perfectly!

Did you solve this problem? I cannot see any loss logs after adding the training hook.

@FuYanzhe2 me too.

@avostryakov I tried it, but when I ran it, this error occurred:
TypeError: __new__() got an unexpected keyword argument 'training_hooks'

Please help; my TensorFlow version is 1.10.

use tf 1.11

INFO:tensorflow:loss = 232.23836
It prints only the loss, with no step number!

@FuYanzhe2 me too.

You should call tf.logging.set_verbosity(tf.logging.INFO) first.

@avostryakov Did you succeed with LoggingTensorHook? I tried but got an error saying "loss/Mean" marked as unfetchable. It seems like a TPU issue.

Is there any way to add an early stopping hook? How do I pass the estimator to the TPUEstimatorSpec?

@abhi060698 If you want to use an early stopping hook, please use the tf.estimator.train_and_evaluate API.

  train_hooks_list = [early_stopping_hook()]  # implemented by yourself
  train_spec = tf.estimator.TrainSpec(train_input_fn, hooks=train_hooks_list)
  eval_spec = tf.estimator.EvalSpec(eval_input_fn)  # eval_input_fn implemented by yourself
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
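
Since the hook itself is left to the reader, here is a minimal, framework-agnostic sketch of the decision logic such a hook could wrap. PatienceTracker, patience and min_delta are hypothetical names, and the loss values are made up for illustration. In a real tf.train.SessionRunHook you would presumably feed it the evaluation loss and call run_context.request_stop() when update() returns True; that wiring is not shown here and is untested on TPU.

```python
class PatienceTracker:
    """Tracks an evaluation loss and signals when it has stopped improving."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience    # evaluations to tolerate without improvement
        self.min_delta = min_delta  # smallest decrease that counts as improvement
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, eval_loss):
        """Record one evaluation result; return True when training should stop."""
        if eval_loss < self.best - self.min_delta:
            self.best = eval_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Made-up evaluation losses, one per evaluation round.
tracker = PatienceTracker(patience=2)
losses = [2.31, 1.90, 1.95, 1.93, 1.80]
stop_at = None
for i, loss in enumerate(losses):
    if tracker.update(loss):
        stop_at = i  # index of the evaluation at which we would stop
        break
print(stop_at)
```

With patience=2 the loop stops after two consecutive evaluations that fail to beat the best loss seen so far, so the later improvement to 1.80 is never reached. A larger patience trades longer training for less risk of stopping on noise.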

This isn't for TPU, is it? @stevewyl

Mark

@dh12306 It's only for the original Estimator API. It may not work on TPU.

I've been having all sorts of issues trying to get TPUEstimator to play nice with summaries and logging, so it would be really nice to get just this working.
Running @avostryakov's code in TF 1.14 on a TPUv3 throws the following error:

tensorflow[24580] ERROR Error recorded from outfeed: Attempted to use a closed Session.
tensorflow[24580] ERROR Error recorded from infeed: Attempted to use a closed Session.
tensorflow[24580] ERROR Error recorded from training_loop: Operation 'add_5' has been marked as not fetchable.

Can anybody tell me what global_step/sec and examples/sec mean? What's the difference between them, and how are they helpful?

Also, what can I do if I don't want to see them during training?

What if I want to get the training time for each epoch?
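
A rough way to read these two numbers, as a sketch rather than an authoritative answer: global_step/sec counts optimizer steps per second, and examples/sec should be that rate multiplied by the training batch size, so their ratio recovers the batch size and examples/sec gives an estimate of the time per epoch. The log lines below are copied from the original post; num_train_examples is a hypothetical placeholder, not the real size of any dataset.

```python
import re

# Log lines copied from the original post.
LOG = """\
INFO:tensorflow:global_step/sec: 1.73561
INFO:tensorflow:examples/sec: 20.8273
INFO:tensorflow:global_step/sec: 1.73487
INFO:tensorflow:examples/sec: 20.8184
"""


def parse_throughput(log_text):
    """Return (steps_per_sec, examples_per_sec) pairs found in the log text."""
    steps = [float(v) for v in re.findall(r"global_step/sec: ([\d.]+)", log_text)]
    examples = [float(v) for v in re.findall(r"examples/sec: ([\d.]+)", log_text)]
    return list(zip(steps, examples))


def implied_batch_size(steps_per_sec, examples_per_sec):
    """Examples processed per optimizer step."""
    return examples_per_sec / steps_per_sec


def seconds_per_epoch(num_train_examples, examples_per_sec):
    """Estimated wall-clock time for one pass over the training set."""
    return num_train_examples / examples_per_sec


pairs = parse_throughput(LOG)
for steps_per_sec, examples_per_sec in pairs:
    print("implied batch size:", round(implied_batch_size(steps_per_sec, examples_per_sec)))

num_train_examples = 88000  # hypothetical; use the number of training features you generated
minutes = seconds_per_epoch(num_train_examples, pairs[0][1]) / 60
print("estimated minutes per epoch:", round(minutes, 1))
```

For the log in the question the ratio comes out at about 12, which suggests a training batch size of 12. The estimate ignores checkpointing and startup overhead, so treat it as a lower bound.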

  train_hooks_list = [early_stopping_hook()]  # implemented by yourself
  train_spec = tf.estimator.TrainSpec(train_input_fn, hooks=train_hooks_list)
  eval_spec = tf.estimator.EvalSpec(eval_input_fn)  # eval_input_fn implemented by yourself
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

This does not work on TPU.

check https://github.com/tensorflow/tpu/issues/606

@Allong12 Did you solve this problem? I trained my model on a TPU v3-8 and hit the same error.

I tried the approach of adding a logging_hook, but I ran into this error:

tensorflow.python.framework.errors_impl.InaccessibleTensorError: Operation u'sparse_softmax_cross_entropy_loss/value' has been marked as not fetchable. Typically this happens when it is defined in another function or code block. Use return values,explicit Python locals or TensorFlow collections to access it.

@wmmxk I have the same error as you when I try adding logging_hook to print training accuracy during training

tensorflow.python.framework.errors_impl.InaccessibleTensorError: Operation u'my_acc' has been marked as not fetchable. Typically this happens when it is defined in another function or code block . Use return values,explicit Python locals or TensorFlow collections to access it.

Have you found a solution?

In the end I couldn't get the logging hook to work, so I chose to report accuracy through TensorBoard instead.

Experiencing the same problem running on TPU. Works fine on CPU for testing purposes.

Hi @Yasmine1991,
can you show me how to add that to TensorBoard? And if I want to add, say, precision and recall, how can I do that? Thanks

Hi @ronykalfarisi

  • First, open a second Cloud Shell session and run
    ctpu up --name=your tpu name --zone=your tpu zone
  • Then, in this second Cloud Shell, create environment variables for your Cloud Storage bucket and your model directory
    export STORAGE_BUCKET=gs://bucket_name
    export MODEL_DIR=${STORAGE_BUCKET}/output
  • After that, launch TensorBoard with this line
    tensorboard --logdir=${MODEL_DIR} &

Meanwhile, in your code, inside the tf.estimator.ModeKeys.TRAIN branch, create a host call function and pass it in the spec that this branch returns:
return tf.compat.v1.estimator.tpu.TPUEstimatorSpec(mode, loss=loss, train_op=train_op, host_call=host_call)

The host call function and its arguments look like this (tf2 is assumed here to be an alias for tensorflow.compat.v2):

  def host_call_fn(gs, acc):
      gs = gs[0]  # gs is a Tensor with shape [batch] for the global_step
      with tf2.summary.create_file_writer(FLAGS.model_dir).as_default():
          with tf2.summary.record_if(True):
              tf2.summary.scalar('Training accuracy', acc[0], step=gs)
              return tf.compat.v1.summary.all_v2_summary_ops()

  # Tensors passed to host_call need a leading batch dimension, hence the reshape to [1].
  gs_t = tf.reshape(tf.compat.v1.train.get_global_step(), [1])
  acc_t = tf.reshape(accuracy["accuracy"][1] * 100, [1])
  host_call = (host_call_fn, [gs_t, acc_t])

@Yasmine1991 this is for tensorflow 2.x right? Is it possible to solve this for tensorflow 1.x?

@Azrael1 Yes, it is for TensorFlow 2.x. I don't know about older versions; I suppose it would need some modifications.

@hantaozi
Modify optimization.create_optimizer so that it also returns new_global_step; you can then print the step in the hook:

  train_op, new_global_step = optimization.create_optimizer(
      total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
  tensors_to_log = {'train loss': total_loss, 'global_step': new_global_step}
  logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=1)
  output_spec = tf.contrib.tpu.TPUEstimatorSpec(
      mode=mode,
      loss=total_loss,
      train_op=train_op,
      training_hooks=[logging_hook],
      scaffold_fn=scaffold_fn)

Is TF 1.11 strictly required to fix the TypeError about 'training_hooks'?

@lbda1 I use tensorflow 1.15.2, which is quite stable

Adding training_hooks=[logging_hook] works! Thank you.

How can I stop logs such as "global_step/sec" and "examples/sec" from being printed? There are too many of them.
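
One possible approach, sketched with the Python standard library only: TF 1.x's tf.logging is, as far as I know, backed by the standard logger named "tensorflow", so a logging.Filter attached to that logger can drop these two message types while leaving the loss lines alone. DropThroughputLogs and silence_throughput_logs are hypothetical names introduced for this sketch, and the logger name is an assumption worth checking against your TensorFlow version.

```python
import logging


class DropThroughputLogs(logging.Filter):
    """Drop records whose formatted message starts with a throughput prefix."""

    PREFIXES = ("global_step/sec", "examples/sec")

    def filter(self, record):
        # Returning False tells the logging module to discard the record.
        return not record.getMessage().startswith(self.PREFIXES)


def silence_throughput_logs(logger_name="tensorflow"):
    """Attach the filter to the named logger and return it (so it can be removed later)."""
    log_filter = DropThroughputLogs()
    logging.getLogger(logger_name).addFilter(log_filter)
    return log_filter


# Call once, before estimator.train(...).
throughput_filter = silence_throughput_logs()
```

The filter can be detached again with logging.getLogger("tensorflow").removeFilter(throughput_filter). Lowering the verbosity to WARN would also hide these lines, but it would hide the loss output from LoggingTensorHook as well.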
