BERT: how to see the loss per step or epoch during training?

Created on 7 Nov 2018 · 33 Comments · Source: google-research/bert

I successfully ran run_squad.py, but tf.logging only shows the output below, and I want to see the loss going down. How can I log the loss or accuracy during training?

INFO:tensorflow:global_step/sec: 1.73561
INFO:tensorflow:examples/sec: 20.8273
INFO:tensorflow:global_step/sec: 1.73487
INFO:tensorflow:examples/sec: 20.8184
INFO:tensorflow:global_step/sec: 1.73578
INFO:tensorflow:examples/sec: 20.8294
INFO:tensorflow:global_step/sec: 1.73657
INFO:tensorflow:examples/sec: 20.8389
INFO:tensorflow:global_step/sec: 1.73621
INFO:tensorflow:examples/sec: 20.8345
INFO:tensorflow:global_step/sec: 1.73602
INFO:tensorflow:examples/sec: 20.8322
INFO:tensorflow:global_step/sec: 1.73591
INFO:tensorflow:examples/sec: 20.831
INFO:tensorflow:global_step/sec: 1.7353
INFO:tensorflow:examples/sec: 20.8236
INFO:tensorflow:global_step/sec: 1.73526
INFO:tensorflow:examples/sec: 20.8231

minsuk-heo

Most helpful comment

@minsuk-heo A training hook can be added in TPUEstimatorSpec:

  logging_hook = tf.train.LoggingTensorHook({"loss": total_loss}, every_n_iter=10)
  output_spec = tf.contrib.tpu.TPUEstimatorSpec(
      mode=mode,
      loss=total_loss,
      train_op=train_op,
      training_hooks=[logging_hook],
      scaffold_fn=scaffold_fn)

avostryakov on 7 Nov 2018

👍58

All 33 comments

Unfortunately, this is a bug in TPUEstimator: it doesn't print the loss when training on CPU/GPU. What you can do, however, is run eval mode in a separate terminal on the training directory. It's basically the same command but with --do_train=False and --do_predict=True, although that will give you predictions rather than the loss. If you only have one GPU, you can run the eval on the CPU (it will just be slow).

The hackiest way to print the loss is just to use loss = tf.Print(loss, [loss]) before returning the loss.

jacobdevlin-google on 7 Nov 2018

👍5

thanks @avostryakov,
your solution works perfectly!

minsuk-heo on 7 Nov 2018

Did you solve this problem? I cannot see the loss in the logs after adding the training hook.

FuYanzhe2 on 13 Dec 2018

@FuYanzhe2 me too.

1000sprites on 19 Dec 2018

@avostryakov I tried it, but when it ran, this error occurred:
TypeError: __new__() got an unexpected keyword argument 'training_hooks'

Please help me. My TensorFlow version is 1.10.

xiongma on 28 Dec 2018

use tf 1.11

zheolong on 30 Jan 2019

👍1
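
The two comments above suggest that TPUEstimatorSpec rejects training_hooks in TF 1.10 and accepts it in TF 1.11. A minimal sketch of a guard built on that observation follows. The helper name supports_training_hooks is hypothetical, and the 1.11 threshold is taken from this thread rather than from the TensorFlow release notes.

```python
def supports_training_hooks(tf_version):
    """Return True if the version string is at least 1.11.

    The 1.11 threshold is an assumption taken from this thread.
    """
    # Only the major and minor components matter for the comparison.
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    return (major, minor) >= (1, 11)


if __name__ == "__main__":
    for version in ("1.10.0", "1.11.0", "1.14.0"):
        print(version, supports_training_hooks(version))
```

In a real script the argument would be tf.__version__, checked before building the TPUEstimatorSpec.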

INFO:tensorflow:loss = 232.23836
Only the loss is printed, not the step!

hantaozi on 11 Mar 2019

👀5

If the loss still does not appear in the logs, call tf.logging.set_verbosity(tf.logging.INFO) first.

Dikea on 3 Jun 2019

@avostryakov Did you succeed with LoggingTensorHook? I tried but got an error saying "loss/Mean" marked as unfetchable. It seems like a TPU issue.

RuibinMa on 22 Jun 2019

👍7

Is there any way to add an early stopping hook? How do I pass the estimator to the TPUEstimatorSpec?

abhi060698 on 23 Jul 2019

👍1

@abhi060698 If you want to use an early_stopping_hook, please use the tf.estimator.train_and_evaluate API.

  train_hooks_list = [early_stopping_hook()]  # implemented by yourself
  train_spec = tf.estimator.TrainSpec(train_input_fn, hooks=train_hooks_list)
  eval_spec = ...  # implemented by yourself
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

stevewyl on 30 Jul 2019
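
The early_stopping_hook() in the snippet above is left to the reader. As a rough illustration of the decision logic such a hook might wrap, here is a minimal pure-Python sketch of patience-based early stopping. The class name EarlyStopper and its parameters are hypothetical and are not part of the TensorFlow API; wiring it into a SessionRunHook is not shown.

```python
class EarlyStopper:
    """Signals a stop once the eval loss has not improved for `patience` checks."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # consecutive non-improving evaluations tolerated
        self.min_delta = min_delta    # minimum decrease that counts as an improvement
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, eval_loss):
        if eval_loss < self.best - self.min_delta:
            # New best loss: remember it and reset the counter.
            self.best = eval_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Made-up evaluation losses, for illustration only.
stopper = EarlyStopper(patience=2)
losses = [0.9, 0.7, 0.71, 0.72, 0.6]
stopped_at = next(
    (i for i, loss in enumerate(losses) if stopper.should_stop(loss)), None)
print("stop requested at evaluation index:", stopped_at)
```

With a patience of 2, the stop is requested after the second evaluation in a row that fails to beat the best loss seen so far.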

This isn't for TPU, is it? @stevewyl

dh12306 on 23 Aug 2019

Mark

guotong1988 on 5 Sep 2019

@dh12306 It's only for the original Estimator API. It may not work on TPU.

stevewyl on 6 Sep 2019

I've been having all sorts of issues trying to get TPUEstimator to play nice with summaries and logging, so it would be really nice to get just this working.
Running @avostryakov's code in TF 1.14 on a TPUv3 throws the following error:

tensorflow[24580] ERROR Error recorded from outfeed: Attempted to use a closed Session.
tensorflow[24580] ERROR Error recorded from infeed: Attempted to use a closed Session.
tensorflow[24580] ERROR Error recorded from training_loop: Operation 'add_5' has been marked as not fetchable.

Allong12 on 5 Oct 2019

👍1

Can anybody tell me what global_step/sec and examples/sec mean? What is the difference between them, and how are they helpful?

Also, what can I do if I don't want to see them during training?

Ashbajawed on 15 Oct 2019
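
As a hedged reading of the two metrics asked about above: global_step/sec is the number of training steps (batches) completed per second, and examples/sec is the number of training examples processed per second, so their ratio should be close to the training batch size. The sketch below checks this against the first log lines from the original question; the function name parse_rates is hypothetical.

```python
import re

# First four log lines copied from the original question.
LOG = """\
INFO:tensorflow:global_step/sec: 1.73561
INFO:tensorflow:examples/sec: 20.8273
INFO:tensorflow:global_step/sec: 1.73487
INFO:tensorflow:examples/sec: 20.8184
"""


def parse_rates(log_text):
    """Pair each global_step/sec reading with the examples/sec reading that follows."""
    steps = [float(x) for x in re.findall(r"global_step/sec: ([\d.]+)", log_text)]
    examples = [float(x) for x in re.findall(r"examples/sec: ([\d.]+)", log_text)]
    return list(zip(steps, examples))


for steps_per_sec, examples_per_sec in parse_rates(LOG):
    implied_batch = examples_per_sec / steps_per_sec
    print("steps/sec=%.5f examples/sec=%.4f implied batch size=%.2f"
          % (steps_per_sec, examples_per_sec, implied_batch))
```

For these lines the ratio rounds to 12, which suggests a training batch size of 12. Since both metrics are logged at INFO level, raising the verbosity threshold with tf.logging.set_verbosity(tf.logging.WARN) hides them, along with all other INFO messages.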

What if I want to get the training time for each epoch?

elvinjgalarza on 10 Nov 2019
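
The thread does not answer the question above. One rough way to estimate it from the existing logs, assuming a steady throughput, is to divide the number of training examples by examples/sec. The sketch below does that arithmetic; the function names are hypothetical, and the dataset size of 88,000 is a made-up figure used only for illustration.

```python
import math


def seconds_per_epoch(num_train_examples, examples_per_sec):
    """Estimated wall-clock seconds for one pass over the training set."""
    return num_train_examples / examples_per_sec


def steps_per_epoch(num_train_examples, batch_size):
    """Number of training steps needed to see every example once."""
    return math.ceil(num_train_examples / batch_size)


# Hypothetical dataset size; examples/sec is taken from the log in the question.
NUM_EXAMPLES = 88000
EXAMPLES_PER_SEC = 20.8273

estimate = seconds_per_epoch(NUM_EXAMPLES, EXAMPLES_PER_SEC)
print("about %.1f minutes per epoch" % (estimate / 60))
print("steps per epoch at batch size 12:", steps_per_epoch(NUM_EXAMPLES, 12))
```

This is only an estimate: it ignores checkpointing and startup time, and the logs themselves do not mark epoch boundaries.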

@stevewyl The train_and_evaluate snippet with an early stopping hook does not work on TPU.

check https://github.com/tensorflow/tpu/issues/606

prokok on 12 Dec 2019

@Allong12 Did you solve this problem? I trained my model on a TPU v3-8 and the same problem appeared.

xiongma on 23 Dec 2019

I tried the approach of adding a logging_hook, but I ran into this error:

tensorflow.python.framework.errors_impl.InaccessibleTensorError: Operation u'sparse_softmax_cross_entropy_loss/value' has been marked as not fetchable. Typically this happens when it is defined in another function or code block. Use return values, explicit Python locals or TensorFlow collections to access it.

wmmxk on 25 Dec 2019

@wmmxk I get the same error as you when I try adding a logging_hook to print training accuracy during training:

tensorflow.python.framework.errors_impl.InaccessibleTensorError: Operation u'my_acc' has been marked as not fetchable. Typically this happens when it is defined in another function or code block. Use return values, explicit Python locals or TensorFlow collections to access it.

Have you found a solution?

Yasmine1991 on 27 Dec 2019

In the end, I couldn't use the logging hook, so I chose to report accuracy via TensorBoard.

Yasmine1991 on 9 Jan 2020

Experiencing the same problem running on TPU. Works fine on CPU for testing purposes.

steindor on 22 Feb 2020

Hi @Yasmine1991,
can you show me how to add that to TensorBoard? And if I want to add, say, precision and recall, how can I do that? Thanks

ronykalfarisi on 23 Mar 2020

Hi @ronykalfarisi

Well, you first have to open a second Cloud Shell session and run:
  ctpu up --name=your tpu name --zone=your tpu zone
Then, in this second Cloud Shell, create environment variables for your Cloud Storage bucket and for your model directory:
  export STORAGE_BUCKET=gs://bucket_name
  export MODEL_DIR=${STORAGE_BUCKET}/output
After that, start TensorBoard with this line:
  tensorboard --logdir=${MODEL_DIR} &

Meanwhile, in your code, within the "tf.estimator.ModeKeys.TRAIN" mode, create a host call function and pass it in the return of this mode:
  return tf.compat.v1.estimator.tpu.TPUEstimatorSpec(mode, loss=loss, train_op=train_op, host_call=host_call)

The host call function -->

  # Assumed imports: import tensorflow as tf; import tensorflow.compat.v2 as tf2
  def host_call_fn(gs, acc):
    gs = gs[0]  # gs is a Tensor with shape [batch] for the global_step
    with tf2.summary.create_file_writer(FLAGS.model_dir).as_default():
      with tf2.summary.record_if(True):
        tf2.summary.scalar('Training accuracy', acc[0], step=gs)
        return tf.compat.v1.summary.all_v2_summary_ops()

  # The [1] shape for gs_t is inferred from acc_t and from gs[0] above.
  gs_t = tf.reshape(tf.compat.v1.train.get_global_step(), [1])
  acc_t = tf.reshape(accuracy["accuracy"][1] * 100, [1])
  host_call = (host_call_fn, [gs_t, acc_t])

Yasmine1991 on 25 May 2020

@Yasmine1991 this is for tensorflow 2.x right? Is it possible to solve this for tensorflow 1.x?

Azrael1 on 6 Jun 2020

@Azrael1 Yes, it is for TensorFlow 2.x. For older versions, I don't know. I suppose some modifications will probably be needed.

Yasmine1991 on 8 Jun 2020
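
For TensorFlow 1.x, which the thread leaves open, the same host_call idea can probably be expressed with the tf.contrib.summary API. The following is an untested sketch under that assumption, not a confirmed fix for this issue. The helper name make_loss_host_call is hypothetical, model_dir is whatever directory TensorBoard reads from, and TensorFlow is imported inside the function so that the module can be loaded without it.

```python
def make_loss_host_call(total_loss, model_dir):
    """Build a (host_call_fn, tensors) pair that writes the training loss summary.

    Assumes TensorFlow 1.x with tf.contrib available.
    """
    import tensorflow as tf  # imported lazily; TF 1.x is an assumption

    def host_call_fn(gs, loss):
        # Both arguments arrive with a leading batch dimension; take element 0.
        gs = gs[0]
        with tf.contrib.summary.create_file_writer(model_dir).as_default():
            with tf.contrib.summary.always_record_summaries():
                tf.contrib.summary.scalar("train_loss", loss[0], step=gs)
                return tf.contrib.summary.all_summary_ops()

    # host_call tensors need at least one dimension, hence the reshape to [1].
    gs_t = tf.reshape(tf.train.get_or_create_global_step(), [1])
    loss_t = tf.reshape(total_loss, [1])
    return (host_call_fn, [gs_t, loss_t])
```

The result would be passed as host_call=make_loss_host_call(total_loss, model_dir) to tf.contrib.tpu.TPUEstimatorSpec in place of training_hooks, so the loss is written on the host rather than fetched from the TPU graph.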

@hantaozi
Modify optimization.create_optimizer so that it also returns new_global_step; you can then print the step in the hook:

  train_op, new_global_step = optimization.create_optimizer(
      total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
  tensors_to_log = {'train loss': total_loss, 'global_step': new_global_step}
  logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=1)
  output_spec = tf.contrib.tpu.TPUEstimatorSpec(
      mode=mode,
      loss=total_loss,
      train_op=train_op,
      training_hooks=[logging_hook],
      scaffold_fn=scaffold_fn)

NicolasCookie on 28 Jun 2020
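
An alternative that may avoid editing optimization.py is to fetch the global step inside the model function and log it next to the loss. This is a minimal sketch assuming TensorFlow 1.x on CPU/GPU (the TPU "not fetchable" errors reported above would likely still apply) and assuming the model uses the standard global step. The helper names format_step_loss and make_step_loss_hook are hypothetical.

```python
def format_step_loss(values):
    """Format the fetched values as a single log line containing step and loss."""
    return "step = %d, loss = %.5f" % (values["step"], values["loss"])


def make_step_loss_hook(total_loss, every_n_iter=10):
    """Build a LoggingTensorHook that prints the global step with the loss.

    Assumes TensorFlow 1.x; must be called inside model_fn so that both
    tensors belong to the training graph.
    """
    import tensorflow as tf  # imported lazily; TF 1.x is an assumption

    global_step = tf.train.get_or_create_global_step()
    return tf.train.LoggingTensorHook(
        tensors={"step": global_step, "loss": total_loss},
        every_n_iter=every_n_iter,
        formatter=format_step_loss)


# The formatter can be checked without TensorFlow:
print(format_step_loss({"step": 10, "loss": 2.5}))
```

The returned hook would go into training_hooks=[...] exactly like logging_hook in the snippet above. The logged step may be the value just before or just after the update for that iteration.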

Is TF 1.11 strictly required for training_hooks?