Deepspeech: Restoring from checkpoint failed.

Created on 6 Dec 2019 · 28 Comments · Source: mozilla/DeepSpeech

virtualenv --python=python3.6 env

source env/bin/activate

git clone https://github.com/mozilla/DeepSpeech
cd DeepSpeech
git checkout v0.6.0

Downloaded the v0.6.0 pretrained checkpoint:
https://github.com/mozilla/DeepSpeech/releases/download/v0.6.0/deepspeech-0.6.0-checkpoint.tar.gz

pip install -r requirements.txt

pip install tensorflow-gpu==1.14.0

pip3 install $(python3 util/taskcluster.py --decoder)

Continuing training from a release model:
mkdir fine_tuning_checkpoints
python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir ./deepspeech-0.6.0-checkpoint --epochs 3 --train_files ./data/csv_files/train.csv --dev_files ./data/csv_files/dev.csv --test_files ./data/csv_files/test.csv --learning_rate 0.0001

Instructions for updating:
Use standard file APIs to check for files with this prefix.
W1206 06:45:41.998423 140389067556672 deprecation.py:323] From /media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from ./deepspeech-0.6.0-checkpoint/best_dev-233784
I1206 06:45:42.020016 140389067556672 saver.py:1280] Restoring parameters from ./deepspeech-0.6.0-checkpoint/best_dev-233784
Traceback (most recent call last):
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam not found in checkpoint
     [[{{node save_1/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1286, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam not found in checkpoint
     [[node save_1/RestoreV2 (defined at DeepSpeech.py:495) ]]

Original stack trace for 'save_1/RestoreV2':
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 938, in main
    train()
  File "DeepSpeech.py", line 495, in train
    best_dev_saver = tfv1.train.Saver(max_to_keep=1)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1296, in restore
    names_to_keys = object_graph_key_mapping(save_path)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1614, in object_graph_key_mapping
    object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 678, in get_tensor
    return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 938, in main
    train()
  File "DeepSpeech.py", line 554, in train
    loaded = try_loading(session, best_dev_saver, 'best_dev_checkpoint', 'best validation')
  File "DeepSpeech.py", line 403, in try_loading
    saver.restore(session, checkpoint_path)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1302, in restore
    err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam not found in checkpoint
     [[node save_1/RestoreV2 (defined at DeepSpeech.py:495) ]]

Original stack trace for 'save_1/RestoreV2':
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 938, in main
    train()
  File "DeepSpeech.py", line 495, in train
    best_dev_saver = tfv1.train.Saver(max_to_keep=1)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

OS Platform and Distribution: Linux Ubuntu 16.04
TensorFlow version: 1.14.0
Python version: 3.6.5
CUDA/cuDNN version: 10.0
GPU model and memory: 24 GB x 4 GPUs

How can I resolve this issue? I followed the instructions, so why is this happening?

MuruganR96

Most helpful comment

You have to specify --use_cudnn_rnn, it's not enabled by default.

reuben on 6 Dec 2019

❤2

All 28 comments

@MuruganR96 Can you share the output of pip list?

lissyx on 6 Dec 2019

👍1

Package              Version     
-------------------- ------------
absl-py              0.8.1       
astor                0.8.0       
attrdict             2.0.1       
audioread            2.1.8       
bcrypt               3.1.7       
beautifulsoup4       4.8.1       
bs4                  0.0.1       
certifi              2019.11.28  
cffi                 1.13.2      
chardet              3.0.4       
cryptography         2.8         
cycler               0.10.0      
decorator            4.4.1       
ds-ctcdecoder        0.6.0       
gast                 0.3.2       
google-pasta         0.1.8       
grpcio               1.25.0      
h5py                 2.10.0      
idna                 2.8         
joblib               0.14.0      
Keras-Applications   1.0.8       
Keras-Preprocessing  1.1.0       
kiwisolver           1.1.0       
librosa              0.7.1       
llvmlite             0.30.0      
Markdown             3.1.1       
matplotlib           3.1.2       
numba                0.46.0      
numpy                1.15.4      
pandas               0.25.3      
paramiko             2.7.0       
pip                  19.3.1      
pkg-resources        0.0.0       
progressbar2         3.47.0      
protobuf             3.11.1      
pycparser            2.19        
PyNaCl               1.3.0       
pyparsing            2.4.5       
python-dateutil      2.8.1       
python-utils         2.3.0       
pytz                 2019.3      
pyxdg                0.26        
requests             2.22.0      
resampy              0.2.2       
scikit-learn         0.22        
scipy                1.3.3       
setuptools           42.0.2      
six                  1.13.0      
SoundFile            0.10.3.post1
soupsieve            1.9.5       
sox                  1.3.7       
tensorboard          1.14.0      
tensorflow-estimator 1.14.0      
tensorflow-gpu       1.14.0      
termcolor            1.1.0       
urllib3              1.25.7      
webrtcvad            2.0.10      
Werkzeug             0.16.0      
wheel                0.33.6      
wrapt                1.11.2

MuruganR96 on 6 Dec 2019

Weird. I remember seeing this error when loading a CuDNN checkpoint on a non-CuDNN setup. Can you check that? In the release notes we also document the flag to use in that case. Can you test with it?

lissyx on 6 Dec 2019

👍1

@lissyx sir, I think you mean this flag:

--cudnn_checkpoint: path to a checkpoint created using --use_cudnn_rnn.
    Specifying this flag allows one to convert a CuDNN RNN checkpoint to a
    checkpoint capable of running on a CPU graph.
    (default: '')

command:
CUDA_VISIBLE_DEVICES=2 python3 DeepSpeech.py --n_hidden 2048 --cudnn_checkpoint ./deepspeech-0.6.0-checkpoint --epochs 3 --train_files ./data/csv_files/train.csv --dev_files ./data/csv_files/dev.csv --test_files ./data/csv_files/test.csv --learning_rate 0.0001

I Converting CuDNN RNN checkpoint from ./deepspeech-0.6.0-checkpoint
Traceback (most recent call last):
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 938, in main
    train()
  File "DeepSpeech.py", line 525, in train
    ckpt = tfv1.train.load_checkpoint(FLAGS.cudnn_checkpoint)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 65, in load_checkpoint
    "given directory %s" % ckpt_dir_or_file)
ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory ./deepspeech-0.6.0-checkpoint

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1

NVIDIA TITAN RTX

@lissyx sir, what is the problem with the v0.6.0 pretrained checkpoint files?
Did I make a mistake on my side?

MuruganR96 on 6 Dec 2019

> ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory ./deepspeech-0.6.0-checkpoint

Are you sure about the path? How about --checkpoint_dir as well?

lissyx on 6 Dec 2019

@lissyx sir, I added both --checkpoint_dir and --cudnn_checkpoint.

CUDA_VISIBLE_DEVICES=2 python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir ./deepspeech-0.6.0-checkpoint --cudnn_checkpoint ./deepspeech-0.6.0-checkpoint --epochs 3 --train_files ./data/csv_files/train.csv --dev_files ./data/csv_files/dev.csv --test_files ./data/csv_files/test.csv --learning_rate 0.0001

same error :)

I Converting CuDNN RNN checkpoint from ./deepspeech-0.6.0-checkpoint
Traceback (most recent call last):
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 938, in main
    train()
  File "DeepSpeech.py", line 525, in train
    ckpt = tfv1.train.load_checkpoint(FLAGS.cudnn_checkpoint)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 65, in load_checkpoint
    "given directory %s" % ckpt_dir_or_file)
ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory ./deepspeech-0.6.0-checkpoint

MuruganR96 on 6 Dec 2019

> ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory ./deepspeech-0.6.0-checkpoint

Well, have you checked what the error states?

lissyx on 6 Dec 2019

Yes, everything is fine. The pretrained checkpoints are already present in the _./deepspeech-0.6.0-checkpoint_ folder.

But it still shows:

ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory ./deepspeech-0.6.0-checkpoint

MuruganR96 on 6 Dec 2019

> Yes, everything is fine. The pretrained checkpoints are already present in the _./deepspeech-0.6.0-checkpoint_ folder.

You didn't answer clearly. What is the content of ./deepspeech-0.6.0-checkpoint/?

lissyx on 6 Dec 2019

> You didn't answer clearly. What is the content of ./deepspeech-0.6.0-checkpoint/?

@lissyx sir,

ls deepspeech-0.6.0-checkpoint
best_dev-233784.data-00000-of-00001  best_dev-233784.index  best_dev-233784.meta  best_dev_checkpoint  flags.txt

@lissyx, I found the problem:

ckpt = tfv1.train.load_checkpoint(FLAGS.cudnn_checkpoint)

It was not picking up the checkpoint from the directory, so nothing was loaded. I tested it like this:

ckpt = tfv1.train.load_checkpoint("/media/user1/storage-1/Murugan/DeepSpeech/deepspeech-0.6.0-checkpoint/best_dev-233784")

I Initializing missing Adam moment tensors.
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:17 | Steps: 202 | Loss: 16.245042
Traceback (most recent call last):
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Not enough time for target transition sequence (required: 89, available: 53)0You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
     [[{{node tower_0/CTCLoss}}]]
     [[Mean_8/_91]]
  (1) Invalid argument: Not enough time for target transition sequence (required: 89, available: 53)0You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
     [[{{node tower_0/CTCLoss}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 971, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 944, in main
    train()
  File "DeepSpeech.py", line 637, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "DeepSpeech.py", line 605, in run_set
    feed_dict=feed_dict)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Not enough time for target transition sequence (required: 89, available: 53)0You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
     [[node tower_0/CTCLoss (defined at DeepSpeech.py:231) ]]
     [[Mean_8/_91]]
  (1) Invalid argument: Not enough time for target transition sequence (required: 89, available: 53)0You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
     [[node tower_0/CTCLoss (defined at DeepSpeech.py:231) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node tower_0/CTCLoss:
 tower_0/raw_logits (defined at DeepSpeech.py:196)  
 tower_0/DeserializeSparse (defined at DeepSpeech.py:220)

Input Source operations connected to node tower_0/CTCLoss:
 tower_0/raw_logits (defined at DeepSpeech.py:196)  
 tower_0/DeserializeSparse (defined at DeepSpeech.py:220)

Original stack trace for 'tower_0/CTCLoss':
  File "DeepSpeech.py", line 971, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 944, in main
    train()
  File "DeepSpeech.py", line 474, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "DeepSpeech.py", line 301, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "DeepSpeech.py", line 231, in calculate_mean_edit_distance_and_loss
    total_loss = tfv1.nn.ctc_loss(labels=batch_y, inputs=logits, sequence_length=batch_seq_len)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/ops/ctc_ops.py", line 176, in ctc_loss
    ignore_longer_outputs_than_inputs=ignore_longer_outputs_than_inputs)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/ops/gen_ctc_ops.py", line 335, in ctc_loss
    name=name)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

MuruganR96 on 6 Dec 2019
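The `Not enough time for target transition sequence (required: 89, available: 53)` error in the log above is a separate matter from the checkpoint restore: CTC loss needs at least as many input time steps as there are labels (plus one extra step for each pair of adjacent identical labels). A minimal sketch for spotting such samples before training could look like the following. It assumes training CSVs with `wav_filename`, `wav_filesize`, and `transcript` columns, 16 kHz 16-bit mono WAV files with a 44-byte header, and a 20 ms feature step; none of these are stated in this thread, and the helper names are hypothetical.

```python
import csv
import io


def ctc_required_steps(transcript):
    """Labels plus one extra step per pair of adjacent identical labels."""
    repeats = sum(1 for a, b in zip(transcript, transcript[1:]) if a == b)
    return len(transcript) + repeats


def estimated_steps(wav_filesize, sample_rate=16000, bytes_per_sample=2,
                    header_bytes=44, step_ms=20):
    """Rough number of feature frames, estimated from the file size alone.

    Assumption: uncompressed mono PCM with a 44-byte header and a 20 ms step.
    """
    audio_bytes = max(wav_filesize - header_bytes, 0)
    duration_ms = audio_bytes * 1000 // (sample_rate * bytes_per_sample)
    return duration_ms // step_ms


def too_short_rows(csv_text):
    """Return the file names whose transcript cannot fit into the audio."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["wav_filename"]
        for row in reader
        if ctc_required_steps(row["transcript"])
        > estimated_steps(int(row["wav_filesize"]))
    ]
```

Rows reported by `too_short_rows` are candidates for removal or for a transcript check; the frame estimate is approximate, so borderline cases deserve a closer look.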

> It was not picking up the checkpoint from the directory, so nothing was loaded. I tested it like this:
>
> ckpt = tfv1.train.load_checkpoint("/media/user1/storage-1/Murugan/DeepSpeech/deepspeech-0.6.0-checkpoint/best_dev-233784")

That's not the appropriate way. Please use --cudnn_checkpoint.

> ls deepspeech-0.6.0-checkpoint
> best_dev-233784.data-00000-of-00001 best_dev-233784.index best_dev-233784.meta best_dev_checkpoint flags.txt

Try adding a symlink named checkpoint that points to best_dev_checkpoint.

lissyx on 6 Dec 2019
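The symlink suggestion above can be sketched in Python as follows. The `ValueError` indicates that `load_checkpoint` looks for a state file literally named `checkpoint` when it is given a directory, while the release archive only ships `best_dev_checkpoint`. The helper name `link_checkpoint_state` is hypothetical, and this is only one way to create the link (a plain `ln -s` does the same).

```python
import os


def link_checkpoint_state(checkpoint_dir, state_file="best_dev_checkpoint"):
    """Create `checkpoint` -> `best_dev_checkpoint` inside checkpoint_dir.

    Returns the path of the link. Does nothing if `checkpoint` already exists.
    """
    source = os.path.join(checkpoint_dir, state_file)
    if not os.path.isfile(source):
        raise FileNotFoundError(source)

    link = os.path.join(checkpoint_dir, "checkpoint")
    if not os.path.lexists(link):
        # Relative target, so the directory can still be moved as a whole.
        os.symlink(state_file, link)
    return link
```

For the directory listed in this thread, the call would be `link_checkpoint_state("./deepspeech-0.6.0-checkpoint")`.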

@lissyx sir, I have a question.

The --cudnn_checkpoint flag is only needed when converting a CuDNN RNN checkpoint to a CPU-capable graph.
If your system is capable of using CuDNN RNN, you can just specify the CuDNN RNN checkpoint normally with --checkpoint_dir.

I am in case 2.
My system is capable of using CuDNN RNN, so --checkpoint_dir alone should be enough for me.

So why do I need --cudnn_checkpoint?

@lissyx I am a bit confused. Please clarify :)

MuruganR96 on 6 Dec 2019

> I am in case 2.
> My system is capable of using CuDNN RNN, so --checkpoint_dir alone should be enough for me.
>
> So why do I need --cudnn_checkpoint?

This is what I asked you in the beginning: whether your setup was properly done for CuDNN. The error obviously suggests that is not the case.

lissyx on 6 Dec 2019

👍1

@lissyx sir, could my CuDNN setup be wrong?

@lissyx sir, how can I resolve this issue? What did I do wrong here? :)

tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam not found in checkpoint
     [[node save_1/RestoreV2 (defined at DeepSpeech.py:495) ]]

MuruganR96 on 6 Dec 2019

> @lissyx sir, could my CuDNN setup be wrong?
>
> @lissyx sir, how can I resolve this issue? What did I do wrong here? :)
>
> tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
>
> Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam not found in checkpoint
>      [[node save_1/RestoreV2 (defined at DeepSpeech.py:495) ]]

OK, I can't keep repeating the same things over and over. I told you: the error occurs because it cannot resume using CuDNN. Check your setup if it is supposed to work.

lissyx on 6 Dec 2019

👍1

You have to specify --use_cudnn_rnn, it's not enabled by default.

reuben on 6 Dec 2019

❤2

@reuben sir, I tried --use_cudnn_rnn true.

CUDA_VISIBLE_DEVICES=2,3 python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir ./deepspeech-0.6.0-checkpoint --epochs 3 --train_files ./data/csv_files/train.csv --dev_files ./data/csv_files/dev.csv --test_files ./data/csv_files/test.csv --learning_rate 0.0001 --use_cudnn_rnn true

Not resolved yet. :(

tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
     [[{{node save_1/CudnnRNNCanonicalToParams}}]]

Traceback (most recent call last):
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
     [[{{node save_1/CudnnRNNCanonicalToParams}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 972, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 945, in main
    train()
  File "DeepSpeech.py", line 561, in train
    loaded = try_loading(session, best_dev_saver, 'best_dev_checkpoint', 'best validation')
  File "DeepSpeech.py", line 403, in try_loading
    saver.restore(session, checkpoint_path)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1286, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
     [[node save_1/CudnnRNNCanonicalToParams (defined at DeepSpeech.py:495) ]]

Original stack trace for 'save_1/CudnnRNNCanonicalToParams':
  File "DeepSpeech.py", line 972, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 945, in main
    train()
  File "DeepSpeech.py", line 495, in train
    best_dev_saver = tfv1.train.Saver(max_to_keep=1)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 350, in _AddRestoreOps
    assign_ops.append(saveable.restore(saveable_tensors, shapes))
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 744, in restore
    restored_tensors)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 221, in tf_canonical_to_opaque
    opaque_params = self._cu_canonical_to_opaque(cu_weights, cu_biases)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 271, in _cu_canonical_to_opaque
    direction=self._direction)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/ops/gen_cudnn_rnn_ops.py", line 917, in cudnn_rnn_canonical_to_params
    seed=seed, seed2=seed2, name=name)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

MuruganR96 on 7 Dec 2019

Resolved. I used this flag, as suggested in their link.

--use_allow_growth true

CUDA_VISIBLE_DEVICES=2,3 python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir deepspeech-0.6.0-checkpoint/ --epochs 3 --train_files data/train_18-11-2019.csv --dev_files data/dev_18-11-2019.csv --test_files data/test_18-11-2019.csv --learning_rate 0.0001 --use_cudnn_rnn true --use_allow_growth true

Thanks @lissyx @reuben

MuruganR96 on 7 Dec 2019

@MuruganR96 I'm facing the same issue you had, even after following your solution. This is the error I am getting. @lissyx @reuben, your attention is also needed. Thanks!

Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from deepspeech-0.6.0-checkpoint/best_dev-233784
I1218 12:42:33.675762 139888463619840 saver.py:1280] Restoring parameters from deepspeech-0.6.0-checkpoint/best_dev-233784
E Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
E
E No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at DeepSpeech.py:118) with these attrs: [seed=4568, dropout=0, num_params=8, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", seed2=240]
E Registered devices: [CPU, XLA_CPU]
E Registered kernels:
E <no registered kernels>
E
E [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]
E The checkpoint in deepspeech-0.6.0-checkpoint/best_dev-233784 does not match the shapes of the model. Did you change alphabet.txt or the --n_hidden parameter between train runs using the same checkpoint dir? Try moving or removing the contents of deepspeech-0.6.0-checkpoint/best_dev-233784.

l192423 on 18 Dec 2019

Have you passed the cudnn flags?

lissyx on 18 Dec 2019

Your error seems to suggest your cudnn installation is wrong.

lissyx on 18 Dec 2019

Yes sir, I passed the cudnn flags. I have tried both the --checkpoint_dir and --cudnn_checkpoint methods.

l192423 on 18 Dec 2019

@l192423 Please check your setup then, you lack the CUDNN kernels. Either your CUDA installation is broken / incomplete, or your TensorFlow is, or both.

lissyx on 18 Dec 2019
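
The diagnosis above can be read straight off the error log: the "Registered devices" line lists only CPU device types, so TensorFlow had no GPU on which to place the CuDNN op. A minimal sketch of that check follows. The helper names `registered_devices` and `gpu_is_registered` are hypothetical and are not part of DeepSpeech or TensorFlow; the sample log is abbreviated from the error quoted in this thread.

```python
import re


def registered_devices(log_text):
    """Return the device types from the first 'Registered devices: [...]' line, or []."""
    match = re.search(r"Registered devices:\s*\[([^\]]*)\]", log_text)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def gpu_is_registered(log_text):
    """True if any listed device type is a GPU type (e.g. 'GPU' or 'XLA_GPU')."""
    return any(name.endswith("GPU") for name in registered_devices(log_text))


# Abbreviated copy of the error reported in this thread.
log = """E No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams'
E Registered devices: [CPU, XLA_CPU]
E Registered kernels:
E <no registered kernels>"""

print(registered_devices(log))
print(gpu_is_registered(log))
```

If the second value is False, as it is for this log, the process did not see a GPU at all, which points at the CUDA / TensorFlow installation rather than at the checkpoint. In a live TensorFlow 1.x environment, `tf.test.is_gpu_available()` should give a comparable answer without parsing logs.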

OK, I will check my setup. Another thing: when I start training without a checkpoint, everything is fine and training continues without any issue. If my TensorFlow or CUDNN installation is broken, then why does it work when I am not using a checkpoint?
This is my current training status without a checkpoint.

Epoch 0 | Training | Elapsed Time: 1 day, 21:40:02 | Steps: 3786 | Loss: 88.452532

l192423 on 18 Dec 2019

Are you enabling cudnn in this case?

lissyx on 18 Dec 2019

In this case I am just using the --checkpoint_dir flag and starting training without any checkpoint.
But when I use this command with the checkpoint

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir /home/neeha/Tayyab/DeepSpeech/deepspeech-0.6.0-checkpoint/ --export_tflite --export_dir /home/neeha/Tayyab/export --epochs 1 --train_files /home/neeha/Tayyab/CV/clips/train.csv --dev_files /home/neeha/Tayyab/CV/clips/dev.csv --test_files /home/neeha/Tayyab/CV/clips/test.csv --learning_rate 0.0001 --use_cudnn_rnn true --use_allow_growth true

I receive this error:

E Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
E
E No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at DeepSpeech.py:118) with these attrs: [dropout=0, seed=4568, num_params=8, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", seed2=240]
E Registered devices: [CPU, XLA_CPU]
E Registered kernels:
E <no registered kernels>
E
E [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]
E The checkpoint in /home/neeha/Tayyab/DeepSpeech/deepspeech-0.6.0-checkpoint/best_dev-233784 does not match the shapes of the model. Did you change alphabet.txt or the --n_hidden parameter between train runs using the same checkpoint dir? Try moving or removing the contents of /home/neeha/Tayyab/DeepSpeech/deepspeech-0.6.0-checkpoint/best_dev-233784.

l192423 on 26 Dec 2019
