Deepspeech: "UnknownError: Failed to get convolution algorithm" from ./bin/run-ldc93s1.sh

Created on 25 Jun 2019 · 40 Comments · Source: mozilla/DeepSpeech

Have I written custom code (as opposed to running examples on an unmodified clone of the repository): No custom code
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): PopOS (Ubuntu derivative) 18.10
TensorFlow installed from (our builds, or upstream TensorFlow): pip commands from DeepSpeech Readme: pip3 uninstall tensorflow && pip3 install 'tensorflow-gpu==1.13.1'
TensorFlow version (use command below): tensorflow-gpu 1.13.1
Python version: 3.7.1
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version: cuda 10.0 / cudnn 7.5 (as specified in DeepSpeech readme)
GPU model and memory: GeForce RTX 2060, 5904MiB
Exact command to reproduce: ./bin/run-ldc93s1.sh (from DeepSpeech readme)

First off thanks to everyone working on DeepSpeech for a really awesome open source package.

I have the same issue as described here: https://github.com/mozilla/DeepSpeech/issues/2119 . That issue was closed with the instruction "please stick to Tensorflow recommended versions," although the user specified they used Tensorflow 1.13.1 which is (currently at least) the TF version specified in the DeepSpeech Readme. This appears to be a bug to me because another user and I are both getting the same error, from running a DeepSpeech-provided bin script for retraining model, after installing DeepSpeech with the specified versions of TF/cuda/cudnn.

I have followed the DeepSpeech readme installation instructions carefully and have installed all requirements, including correct versions of cuda/cudnn. I can run DeepSpeech command to do voice-to-text inference successfully using the downloaded pretrained model, but when retraining that model (using DeepSpeech Readme's "Training a Model" script: ./bin/run-ldc93s1.sh ), I get the following error:

tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. 
This is probably because cuDNN failed to initialize, so try looking to see if a warning log 
message was printed above.
     [[{{node tower_0/conv1d/Conv2D}}]]

Full log / stacktrace:

(dsenv) mepstein@pop-os:~/DeepSpeech$ ./bin/run-ldc93s1.sh
+ [ ! -f DeepSpeech.py ]
+ [ ! -f data/ldc93s1/ldc93s1.csv ]
+ [ -d  ]
+ python -c from xdg import BaseDirectory as xdg; print(xdg.save_data_path("deepspeech/ldc93s1"))
+ checkpoint_dir=/home/mepstein/.local/share/deepspeech/ldc93s1
+ export CUDA_VISIBLE_DEVICES=0
+ python -u DeepSpeech.py --noshow_progressbar --train_files data/ldc93s1/ldc93s1.csv --test_files data/ldc93s1/ldc93s1.csv --train_batch_size 1 --test_batch_size 1 --n_hidden 100 --epochs 200 --checkpoint_dir /home/mepstein/.local/share/deepspeech/ldc93s1
WARNING:tensorflow:From /home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
    tf.py_function, which takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.

WARNING:tensorflow:From /home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py:358: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/contrib/rnn/python/ops/lstm_ops.py:696: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
I Initializing variables...
I STARTING Optimization
I Training epoch 0...
Traceback (most recent call last):
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node tower_0/conv1d/Conv2D}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 829, in <module>
    tf.app.run(main)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 813, in main
    train()
  File "DeepSpeech.py", line 510, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "DeepSpeech.py", line 483, in run_set
    feed_dict=feed_dict)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]

Caused by op 'tower_0/conv1d/Conv2D', defined at:
  File "DeepSpeech.py", line 829, in <module>
    tf.app.run(main)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 813, in main
    train()
  File "DeepSpeech.py", line 400, in train
    gradients, loss = get_tower_results(iterator, optimizer, dropout_rates)
  File "DeepSpeech.py", line 253, in get_tower_results
    avg_loss = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "DeepSpeech.py", line 186, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse)
  File "DeepSpeech.py", line 119, in create_model
    batch_x = create_overlapping_windows(batch_x)
  File "DeepSpeech.py", line 56, in create_overlapping_windows
    batch_x = tf.nn.conv1d(batch_x, eye_filter, stride=1, padding='SAME')
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/ops/nn_ops.py", line 3482, in conv1d
    data_format=data_format)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]

MaxPowerWasTaken

Most helpful comment

Ok finally got it to work (still with CUDA_VISIBLE_DEVICES=0) by updating allow_growth as an environment variable. I added: os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true' to top of DeepSpeech.py

...and now ./bin/run-ldc93s1.sh trains without errors.

MaxPowerWasTaken on 25 Jun 2019

👍3 ❤1

All 40 comments

This appears to be a bug to me because another user and I are both getting the same error, from running a DeepSpeech-provided bin script for retraining model, after installing DeepSpeech with the specified versions of TF/cuda/cudnn.

I could argue that it's working fine for a lot of other people ...

export CUDA_VISIBLE_DEVICES=0

This hides CUDA devices; have you tried without that?

lissyx on 25 Jun 2019

tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]

This is really not a DeepSpeech level error.

lissyx on 25 Jun 2019

Hi lissyx, thanks for the quick reply! I did not mean to sound argumentative, I only included "this appears to be a bug to me" because the prompt I was shown for filling out an issue included "Please describe the problem clearly. Be sure to convey here why it's a bug or a feature request."

I will try to override that export CUDA_VISIBLE_DEVICES=0 now, thanks for the suggestion.

MaxPowerWasTaken on 25 Jun 2019

Hi lissyx, thanks for the quick reply! I did not mean to sound argumentative, I only included "this appears to be a bug to me" because the prompt I was shown for filling out an issue included "Please describe the problem clearly. Be sure to convey here why it's a bug or a feature request."

Yeah, but at some point, we can't reproduce the error, and it's a TensorFlow-level error, so I'm sorry, but when people start throwing unsupported cuDNN versions into the mix, it's really not easy to help ...

lissyx on 25 Jun 2019

There's also an upstream issue: https://github.com/tensorflow/tensorflow/issues/24828

lissyx on 25 Jun 2019

There's also an upstream issue: tensorflow/tensorflow#24828

Two seconds of googling would have turned that up, and you could try some of the fixes people there report as working.

lissyx on 25 Jun 2019

@MaxPowerWasTaken Note that the fixes documented there were posted after the original issue you refer to, and that the problem seems to be related to GPU memory when other processes are also using the GPU.

lissyx on 25 Jun 2019

Hi lissyx, I assure you I've spent more than two days trying to get past this error before filing this issue and have read many many threads on github, discourse, and nvidia devtalk. The Tensorflow issue you pointed to is full of complaints about cuda 9.0 with tensorflow 1.12 and multiple suggestions to downgrade to tf 1.8. Since (per DeepSpeech) I am running cuda 10.0 and tf 1.13, that doesn't seem to be a promising avenue.

It's hard to imagine my current issue is a GPU memory issue since my GPU memory has 573MiB / 5904MiB free and the DeepSpeech training script failed on the first epoch training on a single audio file.

I will keep looking into this and appreciate the suggestions but the presumption that I've not tried googling this issue or immediately jumped to posting an issue is just not accurate.

MaxPowerWasTaken on 25 Jun 2019

Are you using conda in the same system? It has a tendency to break everything that doesn't use the conda installed packages/tools. Check that you're not loading CUDA/cuDNN/NCCL from Conda instead of the upstream installs.

reuben on 25 Jun 2019

I will keep looking into this and appreciate the suggestions but the presumption that I've not tried googling this issue or immediately jumped to posting an issue is just not accurate.

And the presumption that we closed it without caring is also inaccurate. People report issues that we can't reproduce, with an unsupported cuDNN version and half of the required information not filled in; how can you expect us to be able to help then?

Hi lissyx, I assure you I've spent more than two days trying to get past this error before filing this issue and have read many many threads on github, discourse, and nvidia devtalk. The Tensorflow issue you pointed to is full of complaints about cuda 9.0 with tensorflow 1.12 and multiple suggestions to downgrade to tf 1.8. Since (per DeepSpeech) I am running cuda 10.0 and tf 1.13, that doesn't seem to be a promising avenue.

Some suggestions seem to apply to multiple versions. Have you tried them?

It's hard to imagine my current issue is a GPU memory issue since my GPU memory has 573MiB / 5904MiB free and the DeepSpeech training script failed on the first epoch training on a single audio file.

Well, I'm just inferring from the TensorFlow issue above and from TensorFlow's behavior of trying to allocate all GPU memory for itself.

lissyx on 25 Jun 2019

Ok thanks for the links and suggestions lissyx and reuben. Based on your suggestions and links, I will try the following things and then report back:

change export CUDA_VISIBLE_DEVICES=0 to export CUDA_VISIBLE_DEVICES=-1 (though I'm not sure where yet, but should be able to figure out quickly)
creating a new venv with tf-gpu downgraded to 1.8
"allow gpu memory growth" with...

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

removing miniconda from my system (I don't have full anaconda and used venv not conda for virtual env on my DeepSpeech env, but I do have miniconda python distribution) and then recreating venv and trying to retrain model again.

After trying all of these I will report back with what worked (hopefully one of these will), or if the problem persists.

MaxPowerWasTaken on 25 Jun 2019

creating a new venv with tf-gpu downgraded to 1.8

Don't, it's not going to work

"allow gpu memory growth" with...

This code might need to be adjusted; it seems to be written for TensorFlow v2.

lissyx on 25 Jun 2019
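
For reference, a TensorFlow 1.x style version of the "allow growth" snippet might look like the sketch below. It assumes the `tf.ConfigProto` and `gpu_options.allow_growth` names that appear elsewhere in this thread; the helper name `build_growth_config` is hypothetical, and the TensorFlow module is passed in as an argument so the helper has no import-time dependency of its own.

```python
def build_growth_config(tf_module):
    """Build a session config that asks TensorFlow to grow GPU memory on demand.

    `tf_module` is expected to be the imported TensorFlow 1.x module
    (e.g. `import tensorflow as tf`); the helper name is hypothetical.
    """
    config = tf_module.ConfigProto(allow_soft_placement=True)
    # Allocate GPU memory incrementally instead of reserving most of it up front.
    config.gpu_options.allow_growth = True
    return config


# Typical use under TensorFlow 1.x (not executed in this sketch):
#   import tensorflow as tf
#   session = tf.Session(config=build_growth_config(tf))
```

The option only takes effect if this config object is the one actually handed to the session that runs training.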

removing miniconda from my system (I don't have full anaconda and used venv not conda for virtual env on my DeepSpeech env, but I do have miniconda python distribution) and then recreating venv and trying to retrain model again.

I don't know what miniconda might do, but we do regularly see people run into issues when using it.

lissyx on 25 Jun 2019

So update (good news, and a question):

changing export CUDA_VISIBLE_DEVICES=0 to export CUDA_VISIBLE_DEVICES=-1 worked!

But I made that change within DeepSpeech's script ./bin/run-ldc93s1.sh, and the comment above suggests that's because training that dataset (a single audio clip) wouldn't work on GPUs:

# Force only one visible device because we have a single-sample dataset
# and when trying to run on multiple devices (like GPUs), this will break
export CUDA_VISIBLE_DEVICES=0

But the DeepSpeech readme suggests using that script to train on GPUs. Should the DeepSpeech Readme surrounding "Training a Model..../bin/run-ldc93s1.sh..." be updated in some manner (e.g. that if running on gpu, that one line of run-ldc93s1.sh needs to be edited first)? If so, I'd be happy to take out a PR to try and make that clarification in the readme, if that would be helpful.

Thanks again to both of you for the help!

MaxPowerWasTaken on 25 Jun 2019

That line does not need to be edited first, we run it without changes on systems with and without GPUs with no problems.

reuben on 25 Jun 2019

Interesting, I wonder why in my case it's necessary. I just confirmed that with original /bin/run-ldc93s1.sh I get the original error I posted, and that once again changing that single line of code and rerunning /bin/run-ldc93s1.sh trains successfully.

Just out of curiosity, do you run/test deepspeech on any RTX-series GPUs? I read somewhere today that the RTX 2060/2070 seem to be implicated frequently in these mystery cuda/cudnn errors, and I do have an RTX 2060...

In any case, my issue is solved so I will close this. Thanks again.

MaxPowerWasTaken on 25 Jun 2019

Yes, I'm frequently training on an RTX 2080 Ti. Have you tried the growth option?

lissyx on 25 Jun 2019

Interesting. I have not tried the growth option yet, since updating CUDA_VISIBLE_DEVICES solved my issue, but I will try it soon (with CUDA_VISIBLE_DEVICES=0) and report back here.

MaxPowerWasTaken on 25 Jun 2019

I don't think that change is solving your issue, it's just hiding all GPUs and thus disabling use of CUDA entirely.

reuben on 25 Jun 2019
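
To illustrate the point about hiding GPUs, here is a simplified model of how a `CUDA_VISIBLE_DEVICES` value maps to visible devices. It is only a sketch: it assumes plain integer indices (no GPU UUIDs), assumes that enumeration stops at the first invalid index, and the function name `visible_devices` is hypothetical rather than part of CUDA or TensorFlow.

```python
def visible_devices(value, physical_count):
    """Return the physical GPU indices a process would see (simplified model)."""
    if value is None:
        # Variable unset: every physical GPU is visible.
        return list(range(physical_count))
    visible = []
    for token in value.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= physical_count:
            # Invalid entry (for example "-1"): stop here, later entries are ignored.
            break
        visible.append(int(token))
    return visible


print(visible_devices("0", 1))   # the single GPU stays visible
print(visible_devices("-1", 1))  # no GPU is visible, so CUDA is not used at all
```

Under this model, `-1` makes the original error disappear only because the GPU is never used, which matches the explanation given above.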

Let's wait for this, because I don't buy this solution. As reuben said, we do train successfully with this script...

lissyx on 25 Jun 2019

Ok I tried to set allow_growth to True, but I'm still getting the same error with CUDA_VISIBLE_DEVICES=0. It's possible I am setting allow_growth wrong, but I don't think I am?

Since ./bin/run-ldc93s1.sh calls python -u DeepSpeech.py and DeepSpeech.py imports the session config from utils/config.py, I tried the following:

add c.session_config.gpu_options.allow_growth = True directly under the following line in utils/config.py:
c.session_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=FLAGS.log_placement,
  inter_op_parallelism_threads=FLAGS.inter_op_parallelism_threads,
  intra_op_parallelism_threads=FLAGS.intra_op_parallelism_threads)
reran ./bin/run-ldc93s1.sh from shell.

MaxPowerWasTaken on 25 Jun 2019

* add `c.session_config.gpu_options.allow_growth = True` directly under the following line in `utils/config.py`:

At least that looks like it is consistent with r1.13 codebase

@MaxPowerWasTaken maybe try setting LD_DEBUG=all in your environment when running, to catch where cuDNN is loaded from. Maybe there's a mix-up somewhere and the library that gets loaded is not the one you want.

lissyx on 25 Jun 2019

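Ok finally got it to work (still with CUDA_VISIBLE_DEVICES=0) by updating allow_growth as an environment variable. I added: os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true' to top of DeepSpeech.py

...and now ./bin/run-ldc93s1.sh trains without errors.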

MaxPowerWasTaken on 25 Jun 2019

👍3 ❤1
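
The environment-variable workaround reported in this thread can be sketched as follows. The variable name `TF_FORCE_GPU_ALLOW_GROWTH` and the value `'true'` come from the thread; the helper name is hypothetical, and the sketch assumes the variable has to be set before TensorFlow initializes the GPU, which is why it is applied before any `import tensorflow`.

```python
import os


def force_gpu_allow_growth(environ=os.environ):
    """Set TF_FORCE_GPU_ALLOW_GROWTH unless the caller already chose a value."""
    environ.setdefault("TF_FORCE_GPU_ALLOW_GROWTH", "true")
    return environ["TF_FORCE_GPU_ALLOW_GROWTH"]


# Call this first, then import TensorFlow (the import is not executed in this sketch).
force_gpu_allow_growth()
# import tensorflow as tf
```

Exporting the same variable in the shell before launching ./bin/run-ldc93s1.sh should presumably have the same effect without modifying DeepSpeech.py.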

Ok finally got it to work (still with CUDA_VISIBLE_DEVICES=0) by updating an environment variable. I added: os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true' to top of DeepSpeech.py

...and now ./bin/run-ldc93s1.sh trains without errors.

So that would confirm it is allow_growth, and that only your way of setting it was wrong. I'd like to understand better what that option does, and whether we should use it.

lissyx on 25 Jun 2019

👍1

Just out of curiosity, could you share the output of nvidia-smi before and during the bin/run-ldc93s1.sh execution?

reuben on 25 Jun 2019

(base) mepstein@pop-os:~$ nvidia-smi
Tue Jun 25 21:16:52 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    On   | 00000000:01:00.0 Off |                  N/A |
| N/A   42C    P8     3W /  N/A |    324MiB /  5904MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1552      G   /usr/lib/xorg/Xorg                            27MiB |
|    0      1719      G   /usr/bin/gnome-shell                          55MiB |
|    0      2795      G   /usr/lib/xorg/Xorg                           128MiB |
|    0      3083      G   cinnamon                                      60MiB |
|    0      3370      G   ...quest-channel-token=6465864532133321617    50MiB |
+-----------------------------------------------------------------------------+

...

(base) mepstein@pop-os:~$ cd DeepSpeech/
(base) mepstein@pop-os:~/DeepSpeech$ source dsenv/bin/activate
(dsenv) (base) mepstein@pop-os:~/DeepSpeech$ ./bin/run-ldc93s1.sh

...100 epochs in, while still training, in another shell tab:

(dsenv) (base) mepstein@pop-os:~/DeepSpeech$ nvidia-smi
Tue Jun 25 21:21:29 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    On   | 00000000:01:00.0 Off |                  N/A |
| N/A   41C    P5    12W /  N/A |    875MiB /  5904MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1552      G   /usr/lib/xorg/Xorg                            27MiB |
|    0      1719      G   /usr/bin/gnome-shell                          55MiB |
|    0      2795      G   /usr/lib/xorg/Xorg                           128MiB |
|    0      3083      G   cinnamon                                      59MiB |
|    0      3370      G   ...quest-channel-token=6465864532133321617    49MiB |
|    0     19352      C   python                                       541MiB |
+-----------------------------------------------------------------------------+

MaxPowerWasTaken on 26 Jun 2019
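
For a before/during comparison like the one above, the memory figures can also be captured programmatically. This sketch assumes `nvidia-smi` is on the PATH and relies on its `--query-gpu` and `--format=csv,noheader,nounits` options; the helper names are hypothetical.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' MiB pairs, one GPU per line, into a list of tuples."""
    readings = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Ask nvidia-smi for used/total memory in MiB (requires an NVIDIA driver)."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_memory_csv(output)


# Figures taken from the two nvidia-smi outputs in this thread.
before = parse_memory_csv("324, 5904")[0]
during = parse_memory_csv("875, 5904")[0]
print("extra MiB in use during training:", during[0] - before[0])
```

Calling `query_gpu_memory()` once before and once during training would give the same two readings without copying tables by hand.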

Ok finally got it to work (still with CUDA_VISIBLE_DEVICES=0) by updating allow_growth as an environment variable. I added: os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true' to top of DeepSpeech.py

...and now ./bin/run-ldc93s1.sh trains without errors.

Thanks !

BenXQ on 26 Jun 2019

👍1

So, from looking into the TensorFlow source code, this option grows GPU memory as needed, at the cost of fragmentation. I'm not a big fan of forcing that by default. And given the nvidia-smi outputs, I'm unsure why it failed in the first place.

@MaxPowerWasTaken Could you please do the LD_DEBUG=all thing? We still don't know if there's something else touching cuDNN.

lissyx on 26 Jun 2019

Hi lissyx,

Here's a link to my full log from rerunning ./bin/run-ldc93s1.sh after replacing os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true' with os.environ['LD_DEBUG'] = 'all' in my DeepSpeech.py file:

https://gist.github.com/MaxPowerWasTaken/27a578bd7592077c9658af5981aab996

MaxPowerWasTaken on 26 Jun 2019
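
One possible caveat with setting `LD_DEBUG` through `os.environ` inside an already running script: the glibc dynamic loader appears to read that variable when a process starts, so a value set later would mainly affect child processes. A sketch of setting it for a fresh process instead is below. The helper names and the list of library name fragments are assumptions for illustration, and `LD_DEBUG=libs` is used rather than `all` to keep the output smaller.

```python
import os
import subprocess
import sys


def run_with_ld_debug(command, categories="libs"):
    """Run `command` in a new process whose loader sees LD_DEBUG from startup."""
    env = dict(os.environ, LD_DEBUG=categories)
    return subprocess.run(
        command,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,  # glibc writes LD_DEBUG output to stderr by default
        universal_newlines=True,
    )


def lines_mentioning(text, fragments=("libcudnn", "libcuda", "libcublas")):
    """Keep only the log lines that mention one of the given library names."""
    return [line for line in text.splitlines() if any(f in line for f in fragments)]


# Hypothetical usage (add the usual training flags after the script name):
#   result = run_with_ld_debug([sys.executable, "-u", "DeepSpeech.py"])
#   print("\n".join(lines_mentioning(result.stderr)))
```

Running the shell script itself with `LD_DEBUG=libs` exported beforehand would be an equivalent way to get the variable in place from process start.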

Hi lissyx,

Here's a link to my full log from rerunning ./bin/run-ldc93s1.sh after replacing os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true' with os.environ['LD_DEBUG'] = 'all' in my DeepSpeech.py file:

https://gist.github.com/MaxPowerWasTaken/27a578bd7592077c9658af5981aab996

Weird, there is no mention at all of libcuda or libcudnn being loaded. But close to the error, we can see some miniconda3 libs being loaded and used:

     30172: binding file /home/mepstein/miniconda3/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so [0] to /home/mepstein/miniconda3/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so [0]: normal symbol `PyInit__posixsubprocess'

@MaxPowerWasTaken Could you try to make a conda-free virtual environment?

lissyx on 26 Jun 2019

Hi lissyx,

I completely removed miniconda from my machine, then reran. I seem to be getting a very similar log, though confirmed no references to conda anymore.

https://gist.github.com/MaxPowerWasTaken/71366e8db01e2d07883548b4844a7700

MaxPowerWasTaken on 26 Jun 2019

Hi lissyx,

I completely removed miniconda from my machine, then reran. I seem to be getting a very similar log, though confirmed no references to conda anymore.

https://gist.github.com/MaxPowerWasTaken/71366e8db01e2d07883548b4844a7700

Can you try a different Python version? But at some point, I would really advise sharing that in an upstream issue.

lissyx on 26 Jun 2019

Weird, there's no mention of libcudnn or any CUDA libraries at all in that log, like it's not even trying to load them.

reuben on 26 Jun 2019

Ok finally got it to work (still with CUDA_VISIBLE_DEVICES=0) by updating an environment variable. I added: os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true' to top of DeepSpeech.py

...and now ./bin/run-ldc93s1.sh trains without errors.

Thank you! I've been dealing with this problem for a LONG time and this finally solved it.

Dl2oWn on 26 Jun 2019

👍1

@MaxPowerWasTaken I guess a note in the README would be a good PR at least; we've already got two people chiming in on your solution ...

lissyx on 26 Jun 2019

Hi lissyx,

I ran with Python3.7. Happy to update my pipfile to another version if you have one in mind but I'm not optimistic it'll fix it?

I may chime in on one of the upstream tensorflow issues, but the issues I've found (one or two you linked to here) are closed and all proposed solutions seem to be either:

setting memory_growth to true, or
changing cuda or cudnn or tf version to something outside of the DeepSpeech required versions.

Hopefully eventually I'll come to enough of an understanding of how cuda/cudnn/tf work together to have more insight into this sort of thing. But at the moment it all seems very mysterious to me.

If you'd like me to submit a PR with a note in the readme with references to those upstream tf issues and the allow_growth "solution" that worked for me, let me know and I'll submit one.

MaxPowerWasTaken on 26 Jun 2019

changing cuda or cudnn or tf version to something outside of the DeepSpeech required versions.

There is no such thing, we just point to TensorFlow's versions.

If you'd like me to submit a PR with a note in the readme with references to those upstream tf issues and the allow_growth "solution" that worked for me, let me know and I'll submit one.

That could be a nice middle ground in the mean time, yes.

I may chime in on one of the upstream tensorflow issues, but the issues I've found (one or two you linked to here) are closed and all proposed solutions seem to be either:

It seems pretty obvious to me that the error message is misleading and that there is another root cause in your case.

lissyx on 26 Jun 2019

re:

changing cuda or cudnn or tf version to something outside of the DeepSpeech required versions.

^ There is no such thing, we just point to TensorFlow's versions.

...this might be a key? So the DeepSpeech readme says "The GPU capable builds (Python, NodeJS, C++, etc) depend on the same CUDA runtime as upstream TensorFlow. Currently with TensorFlow 1.13 it depends on CUDA 10.0 and CuDNN v7.5."

But this TF page (https://www.tensorflow.org/install/gpu - see "install cuda with apt" / "Ubuntu 18.04 (CUDA 10)" section) suggests installing cudnn7.6 (not 7.5) for tensorflow on gpu. Although that presumably refers to the tensorflow-gpu version on pypi, which is 1.14 not 1.13.

Would it be worthwhile for me to try installing tensorflow-gpu 1.14 and libcudnn7.6 and then rerunning DeepSpeech? Or would DeepSpeech not be expected to work on tensorflow-gpu 1.14?

MaxPowerWasTaken on 26 Jun 2019

But this TF page (https://www.tensorflow.org/install/gpu - see "install cuda with apt" / "Ubuntu 18.04 (CUDA 10)" section) suggests installing cudnn7.6 (not 7.5) for tensorflow on gpu. Although that presumably refers to the tensorflow-gpu version on pypi, which is 1.14 not 1.13.

We copied those versions from the TensorFlow pages at the time of r1.13, after having numerous issues opened by people who were not using the proper versions and not following our link to their docs. So unless we made a mistake, or they did, yes, it's much more likely that r1.14 is the reason for v7.6.

Would it be worthwhile for me to try installing tensorflow-gpu 1.14 and libcudnn7.6 and then rerunning DeepSpeech? Or would DeepSpeech not be expected to work on tensorflow-gpu 1.14?

Well, you may have trouble with the ds_ctcdecoder package, since it's built against r1.13, but that's the only thing that should really block you, if it does.

lissyx on 26 Jun 2019
