Models: BERT training on TPU with keras_compile_fit results in an error when starting the validation step

Created on 17 Oct 2019 · 6 comments · Source: tensorflow/models

I'm using the BERT model in Colab with the TPU runtime selected. I installed the latest nightly build of TensorFlow 2.0 because the current stable release is missing some functionality. The pretrained checkpoint and the training and validation datasets are stored in a Google Cloud Storage bucket, as recommended, and the output directory is a GCS folder as well (a rough sketch of the setup is shown below).
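Roughly, the TPU and GCS setup looks like this; the bucket paths are placeholders rather than my real ones, and the model itself is built and compiled inside strategy.scope() before calling model.fit with a validation dataset:

```python
import os
import tensorflow as tf  # tf-nightly 2.x

# Colab exposes the TPU worker address through COLAB_TPU_ADDR.
tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

# Everything lives on GCS, as recommended for TPU training (placeholder paths).
init_checkpoint = 'gs://my-bucket/bert/bert_model.ckpt'
train_tfrecord = 'gs://my-bucket/data/train.tf_record'
eval_tfrecord = 'gs://my-bucket/data/eval.tf_record'
model_dir = 'gs://my-bucket/output'
```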
Training for the first epoch works well, but it never seems to get past the first validation step. I tried both training methods implemented in the repo, the Keras compile/fit path and the custom training loop (_model_training_utils.run_customized_training_loop_). The following output is displayed on the run:

I1017 07:57:10.538644 140106235082624 remote.py:156] Entering into master device scope: /job:worker/replica:0/task:0
2019-10-17 07:57:10.542454: E tensorflow/stream_executor/cuda/cuda_driver.cc:334] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
INFO:tensorflow:Initializing the TPU system: 10.112.92.218:8470
I1017 07:57:10.572217 140106235082624 tpu_strategy_util.py:71] Initializing the TPU system: 10.112.92.218:8470
INFO:tensorflow:Clearing out eager caches
I1017 07:57:10.663481 140106235082624 tpu_strategy_util.py:95] Clearing out eager caches
INFO:tensorflow:Finished initializing TPU system.
I1017 07:57:18.129874 140106235082624 tpu_strategy_util.py:115] Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
I1017 07:57:18.135898 140106235082624 tpu_system_metadata.py:148] Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
I1017 07:57:18.136226 140106235082624 tpu_system_metadata.py:149] *** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
I1017 07:57:18.136900 140106235082624 tpu_system_metadata.py:150] *** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
I1017 07:57:18.136965 140106235082624 tpu_system_metadata.py:152] *** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I1017 07:57:18.137022 140106235082624 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I1017 07:57:18.137387 140106235082624 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I1017 07:57:18.137451 140106235082624 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
I1017 07:57:18.137512 140106235082624 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
I1017 07:57:18.137567 140106235082624 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
I1017 07:57:18.137620 140106235082624 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
I1017 07:57:18.137671 140106235082624 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
I1017 07:57:18.137726 140106235082624 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
I1017 07:57:18.137780 140106235082624 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
I1017 07:57:18.137832 140106235082624 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
I1017 07:57:18.137883 140106235082624 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
I1017 07:57:18.137935 140106235082624 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I1017 07:57:18.137987 140106235082624 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
INFO:tensorflow:Training using TF 2.0 Keras compile/fit API with distrubuted strategy.
I1017 07:57:18.306019 140106235082624 train.py:211] Training using TF 2.0 Keras compile/fit API with distrubuted strategy.
Train for 258 steps, validate for 65 steps
Epoch 1/3
  1/258 [..............................] - ETA: 6:56:39 - loss: 0.7556 - test_accuracy: 0.4688WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.674855). Check your callbacks.
W1017 08:00:08.274050 140106235082624 callbacks.py:246] Method (on_train_batch_end) is slow compared to the batch update (1.674855). Check your callbacks.
257/258 [============================>.] - ETA: 0s - loss: 7.2636 - test_accuracy: 0.42132019-10-17 08:01:11.056463: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:71] Unable to destroy remote tensor handles: Failed copying input tensor from /job:worker/replica:0/task:0/device:TPU:0 to /job:worker/replica:0/task:0/device:CPU:0 in order to run Identity: Driver not open.
    Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
2019-10-17 08:01:11.056718: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:71] Unable to destroy remote tensor handles: Failed copying input tensor from /job:worker/replica:0/task:0/device:TPU:0 to /job:worker/replica:0/task:0/device:CPU:0 in order to run Identity: Driver not open.
    Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
2019-10-17 08:01:11.056881: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:71] Unable to destroy remote tensor handles: Failed copying input tensor from /job:worker/replica:0/task:0/device:TPU:0 to /job:worker/replica:0/task:0/device:CPU:0 in order to run Identity: Driver not open.
    Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
2019-10-17 08:01:11.057070: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:71] Unable to destroy remote tensor handles: Failed copying input tensor from /job:worker/replica:0/task:0/device:TPU:0 to /job:worker/replica:0/task:0/device:CPU:0 in order to run Identity: Driver not open.
    Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
2019-10-17 08:01:11.057218: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:71] Unable to destroy remote tensor handles: Failed copying input tensor from /job:worker/replica:0/task:0/device:TPU:0 to /job:worker/replica:0/task:0/device:CPU:0 in order to run Identity: Driver not open.
    Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
2019-10-17 08:01:11.057321: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:71] Unable to destroy remote tensor handles: Failed copying input tensor from /job:worker/replica:0/task:0/device:TPU:0 to /job:worker/replica:0/task:0/device:CPU:0 in order to run Identity: Driver not open.
    Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
2019-10-17 08:01:11.057427: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:71] Unable to destroy remote tensor handles: Failed copying input tensor from /job:worker/replica:0/task:0/device:TPU:0 to /job:worker/replica:0/task:0/device:CPU:0 in order to run Identity: Driver not open.
    Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
2019-10-17 08:01:11.057527: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:71] Unable to destroy remote tensor handles: Failed copying input tensor from /job:worker/replica:0/task:0/device:TPU:0 to /job:worker/replica:0/task:0/device:CPU:0 in order to run Identity: Driver not open.

The output above is truncated, as the last error message just keeps repeating.

I would very much appreciate some help understanding why this might happen. Right now I don't know whether I'm doing something wrong or this is a bug, in which case I shouldn't keep debugging on my side.

All 6 comments

After several dozen identical error messages, I found another traceback which shows that the problem occurs in the evaluation step:

Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 329, in main
    run_bert(strategy, input_meta_data)
  File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 286, in run_bert
    use_keras_compile_fit=FLAGS.use_keras_compile_fit)
  File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 225, in run_bert_classifier
    custom_callbacks=None)
  File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 140, in run_keras_compile_fit
    callbacks=custom_callbacks)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 766, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 391, in fit
    validation_steps = eval_context.progbar.progbar.target
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 738, in on_epoch
    self.callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 300, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 974, in on_epoch_end
    self._save_model(epoch=epoch, logs=logs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 1019, in _save_model
    self.model.save_weights(filepath, overwrite=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/network.py", line 1101, in save_weights
    self._trackable_saver.save(filepath, session=session)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/tracking/util.py", line 1155, in save
    file_prefix=file_prefix_tensor, object_graph_tensor=object_graph_tensor)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/tracking/util.py", line 1103, in _save_cached_when_graph_building
    save_op = saver.save(file_prefix)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saving/functional_saver.py", line 230, in save
    sharded_saves.append(saver.save(shard_prefix))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saving/functional_saver.py", line 69, in save
    tensors.append(spec.tensor)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saving/saveable_object.py", line 52, in tensor
    return self._tensor() if callable(self._tensor) else self._tensor
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saving/saveable_object_util.py", line 94, in f
    return array_ops.identity(x)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/array_ops.py", line 267, in identity
    ret = gen_array_ops.identity(input, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_array_ops.py", line 3787, in identity
    _ops.raise_from_not_ok_status(e, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 6567, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:worker/replica:0/task:0/device:TPU:0 to /job:worker/replica:0/task:0/device:CPU:0 in order to run Identity: Driver not open.
    Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors. [Op:Identity]
2019-10-17 08:01:14.986031: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:48] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find a context_id matching the specified one (-4795759030291117013). Perhaps the worker was restarted, or the context was GC'd?
Additional GRPC error information:
{"created":"@1571299274.984309880","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find a context_id matching the specified one (-4795759030291117013). Perhaps the worker was restarted, or the context was GC'd?","grpc_status":3}
2019-10-17 08:01:14.986171: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:48] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find a context_id matching the specified one (-4795759030291117013). Perhaps the worker was restarted, or the context was GC'd?
Additional GRPC error information:
{"created":"@1571299274.984353623","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find a context_id matching the specified one (-4795759030291117013). Perhaps the worker was restarted, or the context was GC'd?","grpc_status":3}
2019-10-17 08:01:14.986207: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:48] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find a context_id matching the specified one (-4795759030291117013). Perhaps the worker was restarted, or the context was GC'd?
Additional GRPC error information:
{"created":"@1571299274.984364549","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find a context_id matching the specified one (-4795759030291117013). Perhaps the worker was restarted, or the context was GC'd?","grpc_status":3}
2019-10-17 08:01:14.986234: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:48] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find a context_id matching the specified one (-4795759030291117013). Perhaps the worker was restarted, or the context was GC'd?
Additional GRPC error information:
{"created":"@1571299274.984376841","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find a context_id matching the specified one (-4795759030291117013). Perhaps the worker was restarted, or the context was GC'd?","grpc_status":3}
2019-10-17 08:01:14.986498: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:48] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find a context_id matching the specified one (-4795759030291117013). Perhaps the worker was restarted, or the context was GC'd?
Additional GRPC error information:
{"created":"@1571299274.984601967","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find a context_id matching the specified one (-4795759030291117013). Perhaps the worker was restarted, or the context was GC'd?","grpc_status":3}

And the last message is repeated.
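For context, the traceback above bottoms out in model.save_weights, triggered by the Keras ModelCheckpoint callback at the end of the first epoch; saving has to copy the weights from the TPU devices back to the host CPU, which is exactly the Identity copy that fails. In isolation, the failing step corresponds to something like this (a simplified sketch with placeholder paths, not the exact code from train.py):

```python
import tensorflow as tf

# Placeholder output location on GCS.
model_dir = 'gs://my-bucket/output'

# Weight-only checkpointing at the end of each epoch, roughly what the
# compile/fit path does. model.save_weights() is the call that raises the
# "Driver not open" InternalError shown in the traceback above.
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=model_dir + '/checkpoint-{epoch:02d}',
    save_weights_only=True)

# model.fit(train_dataset, epochs=3, steps_per_epoch=258,
#           validation_data=eval_dataset, validation_steps=65,
#           callbacks=[checkpoint_callback])
```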

@hongjunChoi Please take a look.
Sorry about that.

Hello @georgealexandruvlad,

This is likely happening because the Colab TPU backend binary is not up to date with the most recent TF version. I'll reach out to the Colab team and see whether a Colab runtime with the most up-to-date binary is available.

@hongjunChoi thanks, I am looking forward to your reply! :)

@georgealexandruvlad
The Colab team is a bit slow to update the TF2 TPU backend right now. I would recommend using a Cloud TPU with tf-nightly, created via ctpu with the "nightly-2.x" version.
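For reference, once a Cloud TPU node exists (for example created with ctpu on the nightly-2.x version as suggested above), the script would connect to it by node name instead of the Colab worker address; a minimal sketch, assuming a hypothetical node named bert-tpu:

```python
import tensorflow as tf  # tf-nightly 2.x

# 'bert-tpu' is a placeholder for the Cloud TPU node name.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='bert-tpu')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)
```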

@saberkun I removed the decorators that were added for the latest nightly TensorFlow 2 (@tf.keras.utils.register_keras_serializable(package='Text')) from the gelu and swish functions, and training now works with TF 2.0 in Colab. Thanks!
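For anyone hitting the same error: the change amounts to dropping the serialization decorator from the activation functions. A rough sketch of what the undecorated functions look like (the bodies shown are the standard BERT-style approximations; the exact module layout in the repo may differ):

```python
import math
import tensorflow as tf

# Previously decorated with
#   @tf.keras.utils.register_keras_serializable(package='Text')
# which requires a recent nightly build; removing the decorator lets the
# model run on the stable TF 2.0 TPU backend that Colab currently ships.

def gelu(x):
  """Gaussian Error Linear Unit, tanh approximation used by BERT."""
  cdf = 0.5 * (1.0 + tf.tanh(
      math.sqrt(2.0 / math.pi) * (x + 0.044715 * tf.pow(x, 3))))
  return x * cdf


def swish(x):
  """Swish activation: x * sigmoid(x)."""
  return x * tf.sigmoid(x)
```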
