DeepSpeech: Restoring from checkpoint failed.

Created on 6 Dec 2019 · 28 Comments · Source: mozilla/DeepSpeech

virtualenv --python=python3.6 env

source env/bin/activate

git clone https://github.com/mozilla/DeepSpeech
cd DeepSpeech
git checkout v0.6.0

Downloaded the v0.6.0 pretrained checkpoint:
https://github.com/mozilla/DeepSpeech/releases/download/v0.6.0/deepspeech-0.6.0-checkpoint.tar.gz

pip install -r requirements.txt

pip install tensorflow-gpu==1.14.0

pip3 install $(python3 util/taskcluster.py --decoder)

Continuing training from a release model:
mkdir fine_tuning_checkpoints
python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir ./deepspeech-0.6.0-checkpoint --epochs 3 --train_files ./data/csv_files/train.csv --dev_files ./data/csv_files/dev.csv --test_files ./data/csv_files/test.csv --learning_rate 0.0001

Instructions for updating:
Use standard file APIs to check for files with this prefix.
W1206 06:45:41.998423 140389067556672 deprecation.py:323] From /media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from ./deepspeech-0.6.0-checkpoint/best_dev-233784
I1206 06:45:42.020016 140389067556672 saver.py:1280] Restoring parameters from ./deepspeech-0.6.0-checkpoint/best_dev-233784
Traceback (most recent call last):
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam not found in checkpoint
     [[{{node save_1/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1286, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam not found in checkpoint
     [[node save_1/RestoreV2 (defined at DeepSpeech.py:495) ]]

Original stack trace for 'save_1/RestoreV2':
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 938, in main
    train()
  File "DeepSpeech.py", line 495, in train
    best_dev_saver = tfv1.train.Saver(max_to_keep=1)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1296, in restore
    names_to_keys = object_graph_key_mapping(save_path)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1614, in object_graph_key_mapping
    object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 678, in get_tensor
    return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 938, in main
    train()
  File "DeepSpeech.py", line 554, in train
    loaded = try_loading(session, best_dev_saver, 'best_dev_checkpoint', 'best validation')
  File "DeepSpeech.py", line 403, in try_loading
    saver.restore(session, checkpoint_path)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1302, in restore
    err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam not found in checkpoint
     [[node save_1/RestoreV2 (defined at DeepSpeech.py:495) ]]

Original stack trace for 'save_1/RestoreV2':
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 938, in main
    train()
  File "DeepSpeech.py", line 495, in train
    best_dev_saver = tfv1.train.Saver(max_to_keep=1)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()
  • OS Platform and Distribution: Linux Ubuntu 16.04
  • TensorFlow version: 1.14.0
  • Python version: 3.6.5
  • CUDA/cuDNN version: 10.0
  • GPU model and memory: 24 GB x 4 GPUs

How can I resolve this issue? I followed the instructions correctly, so why is this happening?

Most helpful comment

You have to specify --use_cudnn_rnn, it's not enabled by default.

All 28 comments

@MuruganR96 Can you share the output of pip list?

Package              Version     
-------------------- ------------
absl-py              0.8.1       
astor                0.8.0       
attrdict             2.0.1       
audioread            2.1.8       
bcrypt               3.1.7       
beautifulsoup4       4.8.1       
bs4                  0.0.1       
certifi              2019.11.28  
cffi                 1.13.2      
chardet              3.0.4       
cryptography         2.8         
cycler               0.10.0      
decorator            4.4.1       
ds-ctcdecoder        0.6.0       
gast                 0.3.2       
google-pasta         0.1.8       
grpcio               1.25.0      
h5py                 2.10.0      
idna                 2.8         
joblib               0.14.0      
Keras-Applications   1.0.8       
Keras-Preprocessing  1.1.0       
kiwisolver           1.1.0       
librosa              0.7.1       
llvmlite             0.30.0      
Markdown             3.1.1       
matplotlib           3.1.2       
numba                0.46.0      
numpy                1.15.4      
pandas               0.25.3      
paramiko             2.7.0       
pip                  19.3.1      
pkg-resources        0.0.0       
progressbar2         3.47.0      
protobuf             3.11.1      
pycparser            2.19        
PyNaCl               1.3.0       
pyparsing            2.4.5       
python-dateutil      2.8.1       
python-utils         2.3.0       
pytz                 2019.3      
pyxdg                0.26        
requests             2.22.0      
resampy              0.2.2       
scikit-learn         0.22        
scipy                1.3.3       
setuptools           42.0.2      
six                  1.13.0      
SoundFile            0.10.3.post1
soupsieve            1.9.5       
sox                  1.3.7       
tensorboard          1.14.0      
tensorflow-estimator 1.14.0      
tensorflow-gpu       1.14.0      
termcolor            1.1.0       
urllib3              1.25.7      
webrtcvad            2.0.10      
Werkzeug             0.16.0      
wheel                0.33.6      
wrapt                1.11.2      

Weird. I remember this error when loading a CuDNN checkpoint on a non-CuDNN setup. Can you check that? In the release notes we also document the flag to use in that case. Can you test with it?

@lissyx sir, I think you mean this flag:

--cudnn_checkpoint: path to a checkpoint created using --use_cudnn_rnn.
    Specifying this flag allows one to convert a CuDNN RNN checkpoint to a
    checkpoint capable of running on a CPU graph.
    (default: '')

command:
CUDA_VISIBLE_DEVICES=2 python3 DeepSpeech.py --n_hidden 2048 --cudnn_checkpoint ./deepspeech-0.6.0-checkpoint --epochs 3 --train_files ./data/csv_files/train.csv --dev_files ./data/csv_files/dev.csv --test_files ./data/csv_files/test.csv --learning_rate 0.0001

I Converting CuDNN RNN checkpoint from ./deepspeech-0.6.0-checkpoint
Traceback (most recent call last):
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 938, in main
    train()
  File "DeepSpeech.py", line 525, in train
    ckpt = tfv1.train.load_checkpoint(FLAGS.cudnn_checkpoint)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 65, in load_checkpoint
    "given directory %s" % ckpt_dir_or_file)
ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory ./deepspeech-0.6.0-checkpoint
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1

NVIDIA TITAN RTX

@lissyx sir, what is the problem with the v0.6.0 pretrained checkpoint files?
Did I make a mistake on my side?

ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory ./deepspeech-0.6.0-checkpoint

Are you sure about the path? How about passing --checkpoint_dir as well?

@lissyx sir, I added both --checkpoint_dir and --cudnn_checkpoint.

CUDA_VISIBLE_DEVICES=2 python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir ./deepspeech-0.6.0-checkpoint --cudnn_checkpoint ./deepspeech-0.6.0-checkpoint --epochs 3 --train_files ./data/csv_files/train.csv --dev_files ./data/csv_files/dev.csv --test_files ./data/csv_files/test.csv --learning_rate 0.0001

same error :)

I Converting CuDNN RNN checkpoint from ./deepspeech-0.6.0-checkpoint
Traceback (most recent call last):
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 938, in main
    train()
  File "DeepSpeech.py", line 525, in train
    ckpt = tfv1.train.load_checkpoint(FLAGS.cudnn_checkpoint)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 65, in load_checkpoint
    "given directory %s" % ckpt_dir_or_file)
ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory ./deepspeech-0.6.0-checkpoint


ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory ./deepspeech-0.6.0-checkpoint

Well, have you checked what the error states?

Yes, everything is fine. The pretrained checkpoints are already present in the _./deepspeech-0.6.0-checkpoint_ folder.

But it still shows:

ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory ./deepspeech-0.6.0-checkpoint

You don't clearly answer. What is the content of ./deepspeech-0.6.0-checkpoint/ ?

@lissyx sir,

ls deepspeech-0.6.0-checkpoint
best_dev-233784.data-00000-of-00001  best_dev-233784.index  best_dev-233784.meta  best_dev_checkpoint  flags.txt
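
The listing above contains a best_dev_checkpoint state file but no file named checkpoint, which is what the ValueError complains about. A minimal sketch for checking this in any directory follows. The helper name describe_checkpoint_dir is hypothetical and not part of DeepSpeech or TensorFlow, and the assumption that the loader looks for a file literally named checkpoint is taken from the error message, not from the TensorFlow source.

```python
import os


def describe_checkpoint_dir(path):
    """Summarise which checkpoint-related files exist in `path`.

    Hypothetical helper for illustration; it only inspects file names.
    """
    names = set(os.listdir(path))
    return {
        # The file the ValueError in this thread says it could not find.
        "has_default_state_file": "checkpoint" in names,
        # The state file that the release archive actually ships.
        "has_best_dev_state_file": "best_dev_checkpoint" in names,
        # One .index file is expected per saved checkpoint prefix.
        "index_files": sorted(n for n in names if n.endswith(".index")),
    }


if __name__ == "__main__":
    # Path taken from the commands in this thread; adjust as needed.
    print(describe_checkpoint_dir("./deepspeech-0.6.0-checkpoint"))
```

For the directory listed above, this would report that best_dev_checkpoint is present and checkpoint is not.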

@lissyx I found that the problem is in this line:

ckpt = tfv1.train.load_checkpoint(FLAGS.cudnn_checkpoint)

It was not picking up the checkpoint from the directory, so nothing was loaded. I tested it like this:

ckpt = tfv1.train.load_checkpoint("/media/user1/storage-1/Murugan/DeepSpeech/deepspeech-0.6.0-checkpoint/best_dev-233784")

I Initializing missing Adam moment tensors.
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:17 | Steps: 202 | Loss: 16.245042
Traceback (most recent call last):
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Not enough time for target transition sequence (required: 89, available: 53)0You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
     [[{{node tower_0/CTCLoss}}]]
     [[Mean_8/_91]]
  (1) Invalid argument: Not enough time for target transition sequence (required: 89, available: 53)0You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
     [[{{node tower_0/CTCLoss}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 971, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 944, in main
    train()
  File "DeepSpeech.py", line 637, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "DeepSpeech.py", line 605, in run_set
    feed_dict=feed_dict)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Not enough time for target transition sequence (required: 89, available: 53)0You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
     [[node tower_0/CTCLoss (defined at DeepSpeech.py:231) ]]
     [[Mean_8/_91]]
  (1) Invalid argument: Not enough time for target transition sequence (required: 89, available: 53)0You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
     [[node tower_0/CTCLoss (defined at DeepSpeech.py:231) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node tower_0/CTCLoss:
 tower_0/raw_logits (defined at DeepSpeech.py:196)  
 tower_0/DeserializeSparse (defined at DeepSpeech.py:220)

Input Source operations connected to node tower_0/CTCLoss:
 tower_0/raw_logits (defined at DeepSpeech.py:196)  
 tower_0/DeserializeSparse (defined at DeepSpeech.py:220)

Original stack trace for 'tower_0/CTCLoss':
  File "DeepSpeech.py", line 971, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 944, in main
    train()
  File "DeepSpeech.py", line 474, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "DeepSpeech.py", line 301, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "DeepSpeech.py", line 231, in calculate_mean_edit_distance_and_loss
    total_loss = tfv1.nn.ctc_loss(labels=batch_y, inputs=logits, sequence_length=batch_seq_len)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/ops/ctc_ops.py", line 176, in ctc_loss
    ignore_longer_outputs_than_inputs=ignore_longer_outputs_than_inputs)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/ops/gen_ctc_ops.py", line 335, in ctc_loss
    name=name)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()
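
The "Not enough time for target transition sequence" error above means that a transcript needs more CTC time steps than its audio clip provides. A minimal sketch of the standard CTC length constraint follows: one step per label, plus one blank step between each pair of identical adjacent labels. The function names are hypothetical, and whether TensorFlow computes its "required" figure in exactly this way is an assumption.

```python
def min_ctc_time_steps(labels):
    """Smallest number of time steps CTC needs to emit `labels`.

    Each label needs one step, and two identical labels in a row need
    a blank step between them so they are not merged.
    """
    repeats = sum(1 for prev, cur in zip(labels, labels[1:]) if prev == cur)
    return len(labels) + repeats


def fits(labels, available_steps):
    """True if `labels` can be aligned within `available_steps` steps."""
    return min_ctc_time_steps(labels) <= available_steps


print(min_ctc_time_steps("hello"))  # 6: five letters plus one blank between the two l's
print(fits("hello", 5))             # False
```

A check like this could be used to find training samples whose transcript is too long for the clip before starting a run, instead of only silencing the error with ignore_longer_outputs_than_inputs.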

That's not the appropriate way. Please use --cudnn_checkpoint.

Try adding a checkpoint symlink that links to best_dev_checkpoint.
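
A minimal sketch of that symlink suggestion, using only the Python standard library. The helper name link_default_state_file is hypothetical. The link is created relative to the checkpoint directory, and an existing checkpoint file is left untouched.

```python
import os


def link_default_state_file(checkpoint_dir, source_name="best_dev_checkpoint"):
    """Create `checkpoint` in `checkpoint_dir` as a symlink to `source_name`.

    Hypothetical helper for illustration. Returns the path of the link.
    """
    link_path = os.path.join(checkpoint_dir, "checkpoint")
    if os.path.lexists(link_path):
        # Do not overwrite a state file (or link) that is already there.
        return link_path
    if not os.path.isfile(os.path.join(checkpoint_dir, source_name)):
        raise FileNotFoundError(f"{source_name} not found in {checkpoint_dir}")
    # A relative target keeps the link valid if the directory is moved.
    os.symlink(source_name, link_path)
    return link_path


if __name__ == "__main__":
    # Path taken from the commands in this thread; adjust as needed.
    print(link_default_state_file("./deepspeech-0.6.0-checkpoint"))
```

The shell equivalent would be running ln -s best_dev_checkpoint checkpoint inside the checkpoint directory.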

@lissyx sir, I have a question.

  1. The --cudnn_checkpoint flag is only needed when converting a CuDNN RNN checkpoint to a CPU-capable graph.

  2. If your system is capable of using CuDNN RNN, you can just specify the CuDNN RNN checkpoint normally with --checkpoint_dir.

I am in case 2: my system is capable of using CuDNN RNN, so --checkpoint_dir alone should be enough for me.

So why do I need --cudnn_checkpoint?

@lissyx I am a bit confused. Could you clarify? :)

This is what I asked you in the beginning, if your setup was properly done for CuDNN. The error obviously suggests it's not the case.

@lissyx sir, could my CuDNN setup be wrong?

@lissyx sir, how can I resolve this issue? What did I do wrong here? :)

tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam not found in checkpoint
     [[node save_1/RestoreV2 (defined at DeepSpeech.py:495) ]]

Ok, I can't keep repeating over and over the same things. I told you: the error is because it cannot resume using CuDNN. Check your setup if it is supposed to work.

You have to specify --use_cudnn_rnn, it's not enabled by default.

@reuben sir, I tried --use_cudnn_rnn true.

CUDA_VISIBLE_DEVICES=2,3 python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir ./deepspeech-0.6.0-checkpoint --epochs 3 --train_files ./data/csv_files/train.csv --dev_files ./data/csv_files/dev.csv --test_files ./data/csv_files/test.csv --learning_rate 0.0001 --use_cudnn_rnn true

It is not resolved yet. :(

tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
     [[{{node save_1/CudnnRNNCanonicalToParams}}]]
Traceback (most recent call last):
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
     [[{{node save_1/CudnnRNNCanonicalToParams}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 972, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 945, in main
    train()
  File "DeepSpeech.py", line 561, in train
    loaded = try_loading(session, best_dev_saver, 'best_dev_checkpoint', 'best validation')
  File "DeepSpeech.py", line 403, in try_loading
    saver.restore(session, checkpoint_path)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1286, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
     [[node save_1/CudnnRNNCanonicalToParams (defined at DeepSpeech.py:495) ]]

Original stack trace for 'save_1/CudnnRNNCanonicalToParams':
  File "DeepSpeech.py", line 972, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 945, in main
    train()
  File "DeepSpeech.py", line 495, in train
    best_dev_saver = tfv1.train.Saver(max_to_keep=1)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 350, in _AddRestoreOps
    assign_ops.append(saveable.restore(saveable_tensors, shapes))
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 744, in restore
    restored_tensors)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 221, in tf_canonical_to_opaque
    opaque_params = self._cu_canonical_to_opaque(cu_weights, cu_biases)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 271, in _cu_canonical_to_opaque
    direction=self._direction)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/ops/gen_cudnn_rnn_ops.py", line 917, in cudnn_rnn_canonical_to_params
    seed=seed, seed2=seed2, name=name)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

Resolved. I used this flag, following the link they suggested:

--use_allow_growth true

CUDA_VISIBLE_DEVICES=2,3 python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir deepspeech-0.6.0-checkpoint/ --epochs 3 --train_files data/train_18-11-2019.csv --dev_files data/dev_18-11-2019.csv --test_files data/test_18-11-2019.csv --learning_rate 0.0001 --use_cudnn_rnn true --use_allow_growth true

Thanks @lissyx @reuben
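
For later readers, the command above can also be assembled from a small script. This is only a minimal sketch: the helper names `build_finetune_command` and `build_env` are hypothetical and not part of DeepSpeech. The flags and values are copied from the command in this comment.

```python
import os
import shlex


def build_finetune_command(checkpoint_dir, train_csv, dev_csv, test_csv,
                           n_hidden=2048, epochs=3, learning_rate=0.0001):
    """Build the DeepSpeech.py argument list used in this thread (hypothetical helper)."""
    return [
        "python3", "DeepSpeech.py",
        "--n_hidden", str(n_hidden),
        "--checkpoint_dir", checkpoint_dir,
        "--epochs", str(epochs),
        "--train_files", train_csv,
        "--dev_files", dev_csv,
        "--test_files", test_csv,
        "--learning_rate", str(learning_rate),
        # The two flags that were added in the working command above.
        "--use_cudnn_rnn", "true",
        "--use_allow_growth", "true",
    ]


def build_env(visible_gpus):
    """Copy the current environment and restrict CUDA to the given GPU indices."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(gpu) for gpu in visible_gpus)
    return env


cmd = build_finetune_command(
    "deepspeech-0.6.0-checkpoint/",
    "data/train_18-11-2019.csv",
    "data/dev_18-11-2019.csv",
    "data/test_18-11-2019.csv",
)
env = build_env([2, 3])
# Print the command as a shell-quoted string for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually launch training, `cmd` and `env` could be passed to `subprocess.run(cmd, env=env)` from inside the DeepSpeech checkout.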

@MuruganR96 I'm having the same issue that you faced, even though I am following your solution. This is the error that I am getting. @lissyx @reuben sir, your attention is also needed. Thanks!

Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from deepspeech-0.6.0-checkpoint/best_dev-233784
I1218 12:42:33.675762 139888463619840 saver.py:1280] Restoring parameters from deepspeech-0.6.0-checkpoint/best_dev-233784
E Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
E
E No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at DeepSpeech.py:118) with these attrs: [seed=4568, dropout=0, num_params=8, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", seed2=240]
E Registered devices: [CPU, XLA_CPU]
E Registered kernels:
E <no registered kernels>
E
E [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]
E The checkpoint in deepspeech-0.6.0-checkpoint/best_dev-233784 does not match the shapes of the model. Did you change alphabet.txt or the --n_hidden parameter between train runs using the same checkpoint dir? Try moving or removing the contents of deepspeech-0.6.0-checkpoint/best_dev-233784.

Have you passed the cudnn flags?

Your error seems to suggest your cudnn installation is wrong.
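
One hint in the log is the `Registered devices: [CPU, XLA_CPU]` line: no GPU device appears there, which would be consistent with TensorFlow not seeing a GPU in that process. As a rough sketch (the helper names `registered_devices` and `has_gpu` are made up for illustration), the line can be pulled out of a saved log like this:

```python
import re


def registered_devices(log_text):
    """Return the device names from a 'Registered devices: [...]' line, or [] if absent."""
    match = re.search(r"Registered devices:\s*\[([^\]]*)\]", log_text)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def has_gpu(devices):
    """True if any registered device name mentions GPU (for example GPU or XLA_GPU)."""
    return any("GPU" in name for name in devices)


log = "E Registered devices: [CPU, XLA_CPU]"
devices = registered_devices(log)
print(devices, has_gpu(devices))
```

If `has_gpu` is false for your own log, the CuDNN op has no GPU to run on, whatever flags are passed to DeepSpeech.py.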

Yes sir, I have tried both methods: --checkpoint_dir and --cudnn_checkpoint.

@l192423 Please check your setup then, you lack the CUDNN kernels. Either your CUDA installation is broken / incomplete, or your TensorFlow is, or both.
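
A quick way to check the setup is to ask TensorFlow which devices it can see from the same virtualenv. The sketch below is an assumption-laden example, not an official DeepSpeech diagnostic: it relies on `tensorflow.python.client.device_lib.list_local_devices()`, and the wrapper name `gpu_device_names` is hypothetical. It returns `None` when TensorFlow cannot be imported at all.

```python
def gpu_device_names():
    """List the GPU devices TensorFlow can see, or None if TensorFlow is not importable."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    # Each entry has a device_type such as "CPU" or "GPU".
    return [d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"]


names = gpu_device_names()
if names is None:
    print("TensorFlow is not installed in this environment")
elif not names:
    print("TensorFlow sees no GPU; check tensorflow-gpu, CUDA and CuDNN")
else:
    print("GPUs visible to TensorFlow:", names)
```

An empty list here would match the `Registered devices: [CPU, XLA_CPU]` line in the error above.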

OK, I will check my setup. Another thing: when I start training without a checkpoint, everything is fine and training continues without any issue. If my TensorFlow or CUDNN installation is broken, then why does it work when I am not using a checkpoint?
This is my current training status without a checkpoint:

Epoch 0 | Training | Elapsed Time: 1 day, 21:40:02 | Steps: 3786 | Loss: 88.452532

Are you enabling cudnn in this case?

In this case I am just using the --checkpoint_dir flag and starting training without any existing checkpoint.
But when I use this command with the checkpoint:

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir /home/neeha/Tayyab/DeepSpeech/deepspeech-0.6.0-checkpoint/ --export_tflite --export_dir /home/neeha/Tayyab/export --epochs 1 --train_files /home/neeha/Tayyab/CV/clips/train.csv --dev_files /home/neeha/Tayyab/CV/clips/dev.csv --test_files /home/neeha/Tayyab/CV/clips/test.csv --learning_rate 0.0001 --use_cudnn_rnn true --use_allow_growth true

I receive this error:

E Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
E
E No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at DeepSpeech.py:118) with these attrs: [dropout=0, seed=4568, num_params=8, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", seed2=240]
E Registered devices: [CPU, XLA_CPU]
E Registered kernels:
E <no registered kernels>
E
E [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]
E The checkpoint in /home/neeha/Tayyab/DeepSpeech/deepspeech-0.6.0-checkpoint/best_dev-233784 does not match the shapes of the model. Did you change alphabet.txt or the --n_hidden parameter between train runs using the same checkpoint dir? Try moving or removing the contents of /home/neeha/Tayyab/DeepSpeech/deepspeech-0.6.0-checkpoint/best_dev-233784.

So we are back to square one: your TensorFlow / CUDNN setup is broken.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
