pip3 uninstall tensorflow && pip3 install 'tensorflow-gpu==1.13.1'
./bin/run-ldc93s1.sh (from DeepSpeech readme)
First off, thanks to everyone working on DeepSpeech for a really awesome open source package.
I have the same issue as described here: https://github.com/mozilla/DeepSpeech/issues/2119 . That issue was closed with the instruction "please stick to Tensorflow recommended versions," although the user specified they used Tensorflow 1.13.1 which is (currently at least) the TF version specified in the DeepSpeech Readme. This appears to be a bug to me because another user and I are both getting the same error, from running a DeepSpeech-provided bin script for retraining model, after installing DeepSpeech with the specified versions of TF/cuda/cudnn.
I have followed the DeepSpeech readme installation instructions carefully and have installed all requirements, including correct versions of cuda/cudnn. I can run DeepSpeech command to do voice-to-text inference successfully using the downloaded pretrained model, but when retraining that model (using DeepSpeech Readme's "Training a Model" script: ./bin/run-ldc93s1.sh ), I get the following error:
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm.
This is probably because cuDNN failed to initialize, so try looking to see if a warning log
message was printed above.
[[{{node tower_0/conv1d/Conv2D}}]]
Full log / stacktrace:
(dsenv) mepstein@pop-os:~/DeepSpeech$ ./bin/run-ldc93s1.sh
+ [ ! -f DeepSpeech.py ]
+ [ ! -f data/ldc93s1/ldc93s1.csv ]
+ [ -d ]
+ python -c from xdg import BaseDirectory as xdg; print(xdg.save_data_path("deepspeech/ldc93s1"))
+ checkpoint_dir=/home/mepstein/.local/share/deepspeech/ldc93s1
+ export CUDA_VISIBLE_DEVICES=0
+ python -u DeepSpeech.py --noshow_progressbar --train_files data/ldc93s1/ldc93s1.csv --test_files data/ldc93s1/ldc93s1.csv --train_batch_size 1 --test_batch_size 1 --n_hidden 100 --epochs 200 --checkpoint_dir /home/mepstein/.local/share/deepspeech/ldc93s1
WARNING:tensorflow:From /home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
tf.py_function, which takes a python function which manipulates tf eager
tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
an ndarray (just call tensor.numpy()) but having access to eager tensors
means `tf.py_function`s can use accelerators such as GPUs as well as
being differentiable using a gradient tape.
WARNING:tensorflow:From /home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py:358: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/contrib/rnn/python/ops/lstm_ops.py:696: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
I Initializing variables...
I STARTING Optimization
I Training epoch 0...
Traceback (most recent call last):
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node tower_0/conv1d/Conv2D}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "DeepSpeech.py", line 829, in <module>
tf.app.run(main)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "DeepSpeech.py", line 813, in main
train()
File "DeepSpeech.py", line 510, in train
train_loss, _ = run_set('train', epoch, train_init_op)
File "DeepSpeech.py", line 483, in run_set
feed_dict=feed_dict)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]
Caused by op 'tower_0/conv1d/Conv2D', defined at:
File "DeepSpeech.py", line 829, in <module>
tf.app.run(main)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "DeepSpeech.py", line 813, in main
train()
File "DeepSpeech.py", line 400, in train
gradients, loss = get_tower_results(iterator, optimizer, dropout_rates)
File "DeepSpeech.py", line 253, in get_tower_results
avg_loss = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
File "DeepSpeech.py", line 186, in calculate_mean_edit_distance_and_loss
logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse)
File "DeepSpeech.py", line 119, in create_model
batch_x = create_overlapping_windows(batch_x)
File "DeepSpeech.py", line 56, in create_overlapping_windows
batch_x = tf.nn.conv1d(batch_x, eye_filter, stride=1, padding='SAME')
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 574, in new_func
return func(*args, **kwargs)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 574, in new_func
return func(*args, **kwargs)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/ops/nn_ops.py", line 3482, in conv1d
data_format=data_format)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]
This appears to be a bug to me because another user and I are both getting the same error, from running a DeepSpeech-provided bin script for retraining model, after installing DeepSpeech with the specified versions of TF/cuda/cudnn.
I could argue that it's working fine for a lot of other people ...
- export CUDA_VISIBLE_DEVICES=0
This hides CUDA devices; have you tried without that?
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]
This is really not a DeepSpeech level error.
Hi lissyx, thanks for the quick reply! I did not mean to sound argumentative, I only included "this appears to be a bug to me" because the prompt I was shown for filling out an issue included "Please describe the problem clearly. Be sure to convey here why it's a bug or a feature request."
I will try to override that export CUDA_VISIBLE_DEVICES=0 now, thanks for the suggestion.
Hi lissyx, thanks for the quick reply! I did not mean to sound argumentative, I only included "this appears to be a bug to me" because the prompt I was shown for filling out an issue included "Please describe the problem clearly. Be sure to convey here why it's a bug or a feature request."
Yeah, but at some point, we can't reproduce the error, and it's a TensorFlow-level error, so I'm sorry, but when people start adding unsupported cuDNN versions to the mix, it's really not easy to help ...
There's also an upstream issue: https://github.com/tensorflow/tensorflow/issues/24828
Two seconds of googling would unveil that, and you could try some of the fixes people report as working.
@MaxPowerWasTaken You can notice that the proposed documented fixes were posted after the original issue you refer to, and that it seems to be related to a GPU memory issue when some other processes are using it.
Hi lissyx, I assure you I've spent more than two days trying to get past this error before filing this issue and have read many many threads on github, discourse, and nvidia devtalk. The Tensorflow issue you pointed to is full of complaints about cuda 9.0 with tensorflow 1.12 and multiple suggestions to downgrade to tf 1.8. Since (per DeepSpeech) I am running cuda 10.0 and tf 1.13, that doesn't seem to be a promising avenue.
It's hard to imagine my current issue is a GPU memory issue since my GPU memory has 573MiB / 5904MiB free and the DeepSpeech training script failed on the first epoch training on a single audio file.
I will keep looking into this and appreciate the suggestions but the presumption that I've not tried googling this issue or immediately jumped to posting an issue is just not accurate.
Are you using conda in the same system? It has a tendency to break everything that doesn't use the conda installed packages/tools. Check that you're not loading CUDA/cuDNN/NCCL from Conda instead of the upstream installs.
I will keep looking into this and appreciate the suggestions but the presumption that I've not tried googling this issue or immediately jumped to posting an issue is just not accurate.
And the presumption that we closed without caring is also inaccurate. People report issues that we can't reproduce, with unsupported cuDNN versions, and with half of the required information not filled in; how can you expect us to be able to help then?
Hi lissyx, I assure you I've spent more than two days trying to get past this error before filing this issue and have read many many threads on github, discourse, and nvidia devtalk. The Tensorflow issue you pointed to is full of complaints about cuda 9.0 with tensorflow 1.12 and multiple suggestions to downgrade to tf 1.8. Since (per DeepSpeech) I am running cuda 10.0 and tf 1.13, that doesn't seem to be a promising avenue.
Some suggestions seem to apply to multiple versions. Have you tried them?
It's hard to imagine my current issue is a GPU memory issue since my GPU memory has 573MiB / 5904MiB free and the DeepSpeech training script failed on the first epoch training on a single audio file.
Well, I'm just inferring from the above TensorFlow issue and the behavior of TensorFlow to try to allocate all memory for itself.
Ok thanks for the links and suggestions lissyx and reuben. Based on your suggestions and links, I will try the following things and then report back:
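* changing export CUDA_VISIBLE_DEVICES=0 to export CUDA_VISIBLE_DEVICES=-1 (though I'm not sure where yet, but should be able to figure out quickly)
* creating a new venv with tf-gpu downgraded to 1.8
* "allow gpu memory growth" with:
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession
config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
* removing miniconda from my system (I don't have full anaconda and used venv not conda for virtual env on my DeepSpeech env, but I do have miniconda python distribution) and then recreating venv and trying to retrain model again.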
After trying all of these I will report back with what worked (hopefully one of these will), or if the problem persists.
creating a new venv with tf-gpu downgraded to 1.8
Don't, it's not going to work
"allow gpu memory growth" with...
this code might need to be adjusted, it seems to be for TensorFlow v2
removing miniconda from my system (I don't have full anaconda and used venv not conda for virtual env on my DeepSpeech env, but I do have miniconda python distribution) and then recreating venv and trying to retrain model again.
I don't know what miniconda might do, but we do recurrently get people with issues who are using it.
So update (good news, and a question):
Changing export CUDA_VISIBLE_DEVICES=0 to export CUDA_VISIBLE_DEVICES=-1 worked!
But I made that change within DeepSpeech's script ./bin/run-ldc93s1.sh, and the comment above that line suggests that's because training that dataset (a single audio clip) wouldn't work on GPUs:
# Force only one visible device because we have a single-sample dataset
# and when trying to run on multiple devices (like GPUs), this will break
export CUDA_VISIBLE_DEVICES=0
But the DeepSpeech readme suggests using that script to train on GPUs. Should the DeepSpeech Readme surrounding "Training a Model..../bin/run-ldc93s1.sh..." be updated in some manner (e.g. that if running on gpu, that one line of run-ldc93s1.sh needs to be edited first)? If so, I'd be happy to take out a PR to try and make that clarification in the readme, if that would be helpful.
Thanks again to both of you for the help!
That line does not need to be edited first, we run it without changes on systems with and without GPUs with no problems.
Interesting, I wonder why in my case it's necessary. I just confirmed that with original /bin/run-ldc93s1.sh I get the original error I posted, and that once again changing that single line of code and rerunning /bin/run-ldc93s1.sh trains successfully.
Just out of curiosity, do you run/test deepspeech on any RTX-series GPUs? I read somewhere today that the RTX 2060/2070 seem to be implicated frequently in these mystery cuda/cudnn errors, and I do have an RTX 2060...
In any case, my issue is solved so I will close this. Thanks again.
Yes, I'm frequently training on RTX2080TI. have you tried the growth option?
Interesting. I have not tried the growth option yet, since updating CUDA_VISIBLE_DEVICES solved my issue, but I will try it soon (with CUDA_VISIBLE_DEVICES=0) and report back here.
I don't think that change is solving your issue, it's just hiding all GPUs and thus disabling use of CUDA entirely.
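To make the difference between CUDA_VISIBLE_DEVICES=0 and CUDA_VISIBLE_DEVICES=-1 concrete, here is a minimal sketch of how a CUDA process is generally understood to interpret that variable: indices are read left to right, and reading stops at the first invalid one, so -1 leaves no GPU visible. The function name visible_gpu_indices is hypothetical, and the sketch deliberately ignores GPU UUIDs, which the real variable also accepts.

```python
def visible_gpu_indices(value, device_count):
    """Approximate which physical GPU indices a CUDA process would see.

    Simplified model (assumption): only integer indices are handled,
    and parsing stops at the first invalid entry.
    """
    if value is None:
        # Variable unset: every physical device is visible.
        return list(range(device_count))
    visible = []
    for token in value.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= device_count:
            # Invalid entry (such as -1): this and all later entries are ignored.
            break
        visible.append(int(token))
    return visible


# On a single-GPU machine such as the RTX 2060 laptop in this thread:
print(visible_gpu_indices("0", 1))   # [0]  -> the GPU is used
print(visible_gpu_indices("-1", 1))  # []   -> CPU only
```

Under that model, setting -1 would not fix the cuDNN failure; training would simply run on the CPU and never reach cuDNN.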
Let's wait for this, because I don't buy this solution. As reuben said, we do train successfully this script...
Ok I tried to set allow_growth to True, but I'm still getting the same error with CUDA_VISIBLE_DEVICES=0. It's possible I am setting allow_growth wrong, but I don't think I am?
Since ./bin/run-ldc93s1.sh calls python -u DeepSpeech.py and DeepSpeech.py imports the session config from utils/config.py, I tried the following:
add c.session_config.gpu_options.allow_growth = True directly under the following line in utils/config.py:
reran ./bin/run-ldc93s1.sh from shell.
* add `c.session_config.gpu_options.allow_growth = True` directly under the following line in `utils/config.py`:
At least that looks like it is consistent with r1.13 codebase
@MaxPowerWasTaken maybe try setting LD_DEBUG=all in your environment when running, to catch where cuDNN is loaded from. Maybe there's some mixup somewhere, and the one that loads is not the one you want.
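A rough sketch of one way to follow this suggestion from Python. As far as I know, the glibc dynamic loader reads LD_DEBUG when a process starts, so the variable is placed in the environment of a child process instead of being set inside the script under test. The helper names (loader_env, cuda_lines, trace_cuda_loading) are hypothetical, and LD_DEBUG=libs is used instead of all only to keep the output smaller.

```python
import os
import subprocess
import sys


def loader_env(debug="libs"):
    """Copy the current environment and enable dynamic-loader tracing."""
    env = dict(os.environ)
    env["LD_DEBUG"] = debug
    return env


def cuda_lines(loader_output):
    """Keep only loader lines that mention CUDA-related libraries."""
    keywords = ("libcuda", "libcudnn", "libcublas")
    return [line for line in loader_output.splitlines()
            if any(key in line for key in keywords)]


def trace_cuda_loading(cmd):
    """Run cmd with LD_DEBUG set and return the CUDA-related loader lines.

    The loader writes its trace to stderr by default.
    """
    proc = subprocess.run(cmd, env=loader_env(),
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.PIPE,
                          universal_newlines=True)
    return cuda_lines(proc.stderr)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python trace_cuda.py ./bin/run-ldc93s1.sh
    for line in trace_cuda_loading(sys.argv[1:]):
        print(line)
```

The paths printed for libcudnn would show whether the library comes from the system install or from somewhere else, such as a conda directory.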
Ok finally got it to work (still with CUDA_VISIBLE_DEVICES=0) by updating allow_growth as an environment variable. I added: os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true' to top of DeepSpeech.py
...and now ./bin/run-ldc93s1.sh trains without errors.
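For anyone landing here, a minimal sketch of that workaround. The variable name TF_FORCE_GPU_ALLOW_GROWTH and the value 'true' are taken from the comment above; the helper force_gpu_allow_growth is a hypothetical wrapper, and the assumption is that the variable has to be set before TensorFlow initializes the GPU, which is why it goes at the very top of the script.

```python
import os


def force_gpu_allow_growth(environ=os.environ):
    """Ask TensorFlow to grow GPU memory on demand instead of reserving it all.

    An existing value is left untouched so a user can still override it
    from the shell.
    """
    environ.setdefault("TF_FORCE_GPU_ALLOW_GROWTH", "true")
    return environ["TF_FORCE_GPU_ALLOW_GROWTH"]


# Call this before TensorFlow is imported, e.g. at the top of DeepSpeech.py.
force_gpu_allow_growth()
# import tensorflow as tf  # TensorFlow imports would follow here
```

Exporting the same variable in the shell before running ./bin/run-ldc93s1.sh should have the same effect without editing DeepSpeech.py.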
So that would confirm it's this allow_growth, and just that your way of setting it was wrong. I'd like to understand better what that option does, and whether we should use it.
Just out of curiosity, could you share the output of nvidia-smi before and during the bin/run-ldc93s1.sh execution?
(base) mepstein@pop-os:~$ nvidia-smi
Tue Jun 25 21:16:52 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2060 On | 00000000:01:00.0 Off | N/A |
| N/A 42C P8 3W / N/A | 324MiB / 5904MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1552 G /usr/lib/xorg/Xorg 27MiB |
| 0 1719 G /usr/bin/gnome-shell 55MiB |
| 0 2795 G /usr/lib/xorg/Xorg 128MiB |
| 0 3083 G cinnamon 60MiB |
| 0 3370 G ...quest-channel-token=6465864532133321617 50MiB |
+-----------------------------------------------------------------------------+
...
(base) mepstein@pop-os:~$ cd DeepSpeech/
(base) mepstein@pop-os:~/DeepSpeech$ source dsenv/bin/activate
(dsenv) (base) mepstein@pop-os:~/DeepSpeech$ ./bin/run-ldc93s1.sh
...100 epochs in, while still training, in another shell tab:
(dsenv) (base) mepstein@pop-os:~/DeepSpeech$ nvidia-smi
Tue Jun 25 21:21:29 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2060 On | 00000000:01:00.0 Off | N/A |
| N/A 41C P5 12W / N/A | 875MiB / 5904MiB | 23% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1552 G /usr/lib/xorg/Xorg 27MiB |
| 0 1719 G /usr/bin/gnome-shell 55MiB |
| 0 2795 G /usr/lib/xorg/Xorg 128MiB |
| 0 3083 G cinnamon 59MiB |
| 0 3370 G ...quest-channel-token=6465864532133321617 49MiB |
| 0 19352 C python 541MiB |
+-----------------------------------------------------------------------------+
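To compare the two nvidia-smi snapshots above without reading the tables by eye, here is a small sketch that pulls the "used / total" memory field out of a status line. The function name memory_usage is hypothetical; the two input lines are copied from the outputs above.

```python
import re

# Matches the Memory-Usage column, e.g. "324MiB / 5904MiB".
MEMORY_FIELD = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def memory_usage(nvidia_smi_line):
    """Return (used_mib, total_mib) parsed from one nvidia-smi status line."""
    match = MEMORY_FIELD.search(nvidia_smi_line)
    if match is None:
        raise ValueError("no 'used / total' memory field found")
    used, total = (int(group) for group in match.groups())
    return used, total


before = memory_usage("| N/A 42C P8 3W / N/A | 324MiB / 5904MiB | 5% Default |")
during = memory_usage("| N/A 41C P5 12W / N/A | 875MiB / 5904MiB | 23% Default |")

print("free before training:", before[1] - before[0], "MiB")    # 5580 MiB
print("growth during training:", during[0] - before[0], "MiB")  # 551 MiB
```

The growth of 551 MiB is close to the 541 MiB listed for the python process, so with memory growth enabled the training run uses only a small part of the card.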
Ok finally got it to work (still with CUDA_VISIBLE_DEVICES=0) by updating allow_growth as an environment variable. I added: os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true' to top of DeepSpeech.py
...and now ./bin/run-ldc93s1.sh trains without errors.
Thanks !
So from looking into tensorflow source code: Grow GPU memory as needed at the cost of fragmentation. I'm not a big fan of forcing that by default. And given the nvidia-smi outputs, I'm unsure why it failed in the first place.
@MaxPowerWasTaken Could you please do the LD_DEBUG=all thing? We still don't know if there's something else touching cudnn.
Hi lissyx,
Here's a link to my full log from rerunning ./bin/run-ldc93s1.sh after replacing os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true' with os.environ['LD_DEBUG'] = 'all' in my DeepSpeech.py file:
https://gist.github.com/MaxPowerWasTaken/27a578bd7592077c9658af5981aab996
Weird, there is no mention at all of libcuda or libcudnn being loaded. But close to the error, we can see some miniconda3 libs being loaded and used:
30172: binding file /home/mepstein/miniconda3/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so [0] to /home/mepstein/miniconda3/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so [0]: normal symbol `PyInit__posixsubprocess'
@MaxPowerWasTaken Could you try to make a conda-free virtual environment ?
Hi lissyx,
I completely removed miniconda from my machine, then reran. I seem to be getting a very similar log, though confirmed no references to conda anymore.
https://gist.github.com/MaxPowerWasTaken/71366e8db01e2d07883548b4844a7700
Can you try a different Python version? But at some point, I would really advise sharing that in an upstream issue.
Weird, there's no mention of libcudnn or any CUDA libraries at all in that log, like it's not even trying to load them.
Ok finally got it to work (still with CUDA_VISIBLE_DEVICES=0) by updating an environment variable. I added: os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true' to top of DeepSpeech.py
...and now ./bin/run-ldc93s1.sh trains without errors.
Thank you! I've been dealing with this problem for a LONG time and this finally solved it.
@MaxPowerWasTaken I guess a note in the README would be a good PR at least; we've already got two people chiming in on your solution ...
Hi lissyx,
I ran with Python3.7. Happy to update my pipfile to another version if you have one in mind but I'm not optimistic it'll fix it?
I may chime in on one of the upstream tensorflow issues, but the issues I've found (one or two you linked to here) are closed and all proposed solutions seem to be either:
Hopefully eventually I'll come to enough of an understanding of how cuda/cudnn/tf work together to have more insight into this sort of thing. But at the moment it all seems very mysterious to me.
If you'd like me to submit a PR with a note in the readme with references to those upstream tf issues and the allow_growth "solution" that worked for me, let me know and I'll submit one.
changing cuda or cudnn or tf version to something outside of the DeepSpeech required versions.
There is no such thing, we just point to TensorFlow's versions.
If you'd like me to submit a PR with a note in the readme with references to those upstream tf issues and the allow_growth "solution" that worked for me, let me know and I'll submit one.
That could be a nice middle ground in the mean time, yes.
I may chime in on one of the upstream tensorflow issues, but the issues I've found (one or two you linked to here) are closed and all proposed solutions seem to be either:
it seems pretty obvious to me the error message is misleading and there is another root in your case.
re:
changing cuda or cudnn or tf version to something outside of the DeepSpeech required versions.
^ There is no such thing, we just point to TensorFlow's versions.
...this might be a key? So the DeepSpeech readme says "The GPU capable builds (Python, NodeJS, C++, etc) depend on the same CUDA runtime as upstream TensorFlow. Currently with TensorFlow 1.13 it depends on CUDA 10.0 and CuDNN v7.5."
But this TF page (https://www.tensorflow.org/install/gpu - see "install cuda with apt" / "Ubuntu 18.04 (CUDA 10)" section) suggests installing cudnn7.6 (not 7.5) for tensorflow on gpu. Although that presumably refers to the tensorflow-gpu version on pypi, which is 1.14 not 1.13.
Would it be worthwhile for me to try installing tensorflow-gpu 1.14 and libcudnn7.6 and then rerunning DeepSpeech? Or would DeepSpeech not be expected to work on tensorflow-gpu 1.14?
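To keep the version pairing straight while experimenting, here is a small sketch of a consistency check. The only pairing in the table is the one quoted from the DeepSpeech readme above (TensorFlow 1.13 with CUDA 10.0 and CuDNN v7.5); the function names and the idea of passing version strings in by hand are assumptions for illustration, and nothing here queries the installed libraries.

```python
# Pairing quoted from the DeepSpeech readme earlier in this thread.
README_REQUIREMENTS = {"1.13": {"cuda": "10.0", "cudnn": "7.5"}}


def major_minor(version):
    """Reduce a version such as '1.13.1' to '1.13'."""
    return ".".join(version.split(".")[:2])


def mismatches(tf_version, cuda_version, cudnn_version,
               table=README_REQUIREMENTS):
    """List the components whose major.minor version differs from the table."""
    expected = table.get(major_minor(tf_version))
    if expected is None:
        return ["no known pairing for TensorFlow " + major_minor(tf_version)]
    found = {"cuda": major_minor(cuda_version),
             "cudnn": major_minor(cudnn_version)}
    return ["{}: expected {}, found {}".format(name, expected[name], found[name])
            for name in ("cuda", "cudnn")
            if expected[name] != found[name]]


print(mismatches("1.13.1", "10.0.130", "7.5.0"))  # []
print(mismatches("1.13.1", "10.0.130", "7.6.0"))  # ['cudnn: expected 7.5, found 7.6']
```

A TensorFlow 1.14 entry is deliberately left out of the table, since the thread only presumes which cuDNN version it expects.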
But this TF page (https://www.tensorflow.org/install/gpu - see "install cuda with apt" / "Ubuntu 18.04 (CUDA 10)" section) suggests installing cudnn7.6 (not 7.5) for tensorflow on gpu. Although that presumably refers to the tensorflow-gpu version on pypi, which is 1.14 not 1.13.
We copied from the TensorFlow pages at the time of r1.13, after having numerous issues opened by people not using the proper version and not following our link to their docs. So unless we made a mistake or they did, yes, it's much more likely that r1.14 is the reason for v7.6.
Would it be worthwhile for me to try installing tensorflow-gpu 1.14 and libcudnn7.6 and then rerunning DeepSpeech? Or would DeepSpeech not be expected to work on tensorflow-gpu 1.14?
Well, you may have trouble with the ds_ctcdecoder package, since it's built against r1.13, but that's the only thing that should really block you, if it does.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.