Wav2letter: cannot run Decoding on Tesla T4

Created on 18 Jun 2019 · 14 Comments · Source: flashlight/wav2letter

Hello!

First, I want to thank you for the great work you have done!

I have already trained my model successfully and had no problems running the Decoder (CUDA) in the CUDA docker image on a Titan GTX.
But doing the same on a Tesla T4 results in an error (with the CPU docker image it works correctly).

After running ./Decoder --flagsfile decode_flags.cfg:

what(): ArrayFire Exception (Internal error:998):
In function cuda::Kernel cuda::buildKernel(int, const string&, const string&, const std::vector >&, bool)
In file src/backend/cuda/nvrtc/cache.cpp:160
NVRTC Error(5): NVRTC_ERROR_INVALID_OPTION

In function T* af::array::device() const [with T = void]
In file src/api/cpp/array.cpp:941
*** Aborted at 1560852631 (unix time) try "date -d @1560852631" if you are using GNU date ***
PC: @ 0x7f8d46aa0428 gsignal
*** SIGABRT (@0x6e) received by PID 110 (TID 0x7f8d9093d600) from PID 110; stack trace: ***
@ 0x7f8d4e54f390 (unknown)
@ 0x7f8d46aa0428 gsignal
@ 0x7f8d46aa202a abort
@ 0x7f8d473e384d __gnu_cxx::__verbose_terminate_handler()
@ 0x7f8d473e16b6 (unknown)
@ 0x7f8d473e1701 std::terminate()
@ 0x7f8d473e1919 __cxa_throw
@ 0x7f8d69a05588 af::array::device<>()
@ 0x6165a5 fl::DevicePtr::DevicePtr()
@ 0x64e41e fl::conv2d()
@ 0x629116 fl::Conv2D::forward()
@ 0x636c7f fl::UnaryModule::forward()
@ 0x6280e2 fl::Sequential::forward()
@ 0x41b0f4 main
@ 0x7f8d46a8b830 __libc_start_main
@ 0x475d19 _start
@ 0x0 (unknown)

NVRTC Error(5): NVRTC_ERROR_INVALID_OPTION

Info about my system:
NVIDIA-SMI 430.14 Driver Version: 430.14 CUDA Version: 10.2
NVRM version: NVIDIA UNIX x86_64 Kernel Module 430.14 Wed May 8 01:10:53 UTC 2019
GCC version: gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)

When running the tests in wav2letter:
The following tests FAILED:
1 - W2lCommonTest (SEGFAULT)
2 - CriterionTest (SEGFAULT)
3 - Seq2SeqTest (SEGFAULT)
4 - AttentionTest (SEGFAULT)
5 - WindowTest (SEGFAULT)
6 - DataTest (Failed)
18 - W2lModuleTest (SEGFAULT)
19 - RuntimeTest (Failed)
Errors while running CTest
Makefile:104: recipe for target 'test' failed
make: *** [test] Error 8

When running the tests in flashlight:
The following tests FAILED:
1 - AutogradTest (SEGFAULT)
2 - OptimTest (SEGFAULT)
3 - ModuleTest (SEGFAULT)
4 - SerializationTest (SEGFAULT)
5 - UtilsTest (Failed)
6 - DatasetTest (SEGFAULT)
7 - MeterTest (Failed)
8 - AllReduceTest (SEGFAULT)
9 - ContribModuleTest (SEGFAULT)
10 - ContribSerializationTest (Failed)
Errors while running CTest
Makefile:71: recipe for target 'test' failed

Switching to cuda-0b16293, as suggested in #314, did not work for me.

I will be grateful for any help. Thank you!

All 14 comments

Hi @nestyme,

This error could be caused by the GPU driver. Please check https://github.com/facebookresearch/wav2letter/issues/229, which looks like the same issue. Could you repeat the steps suggested there?

Hi @tlikhomanenko!
Thanks a lot for your help.
I solved this problem by building ArrayFire 3.6.4 and the other libraries from source. The problem appeared because the Tesla T4 supports only CUDA 10.0 and 10.1, but the ArrayFire version used in wav2letter++ is <3.6.2.
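
The mismatch described above can be checked up front. The following is only a rough sketch: the table is a small hand-written assumption (not an official compatibility list), and `toolkit_can_target` is a hypothetical helper name. It encodes the fact that Turing GPUs such as the Tesla T4 have compute capability 7.5, which CUDA toolkits began targeting with 10.0, so a runtime compiler from an older toolkit would likely reject the architecture option.

```python
# Minimal sketch: is a CUDA toolkit new enough to target a given GPU?
# The table is an illustrative assumption, not an official list.
MIN_TOOLKIT_FOR_CC = {
    (6, 0): (8, 0),   # Pascal
    (6, 1): (8, 0),   # Pascal
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing (e.g. Tesla T4, RTX 2080 Ti)
}


def toolkit_can_target(toolkit, compute_capability):
    """Return True if `toolkit` (major, minor) is new enough for the GPU."""
    required = MIN_TOOLKIT_FOR_CC.get(compute_capability)
    if required is None:
        raise KeyError(f"unknown compute capability {compute_capability}")
    # Tuples compare element-wise, so (9, 2) < (10, 0).
    return toolkit >= required


if __name__ == "__main__":
    t4 = (7, 5)
    for toolkit in [(9, 2), (10, 0)]:
        ok = toolkit_can_target(toolkit, t4)
        print(f"CUDA {toolkit[0]}.{toolkit[1]} can target a T4: {ok}")
```

Under these assumptions, a CUDA 9.2 based image fails the check for a T4, while a CUDA 10.0 based one passes.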

@nestyme

The latest docker images are built with arrayfire 3.6.4 so you can use them now too.

Hello,
I have the same problem. I use nvidia-docker (sudo docker run --runtime=nvidia --rm -itd --ipc=host --name w2l wav2letter/wav2letter:cuda-latest). The system specifications are:
- Ubuntu 18.04
- Nvidia 2080 Ti
- Driver Version: 418.67
- CUDA 10

When running the tests:

Running tests ...

Test project /root/wav2letter/build
Start 1: W2lCommonTest
1/19 Test #1: W2lCommonTest .................... ***Exception: SegFault 2.19 sec
Start 2: CriterionTest
2/19 Test #2: CriterionTest .................... ***Exception: SegFault 1.29 sec
Start 3: Seq2SeqTest
3/19 Test #3: Seq2SeqTest ...................... ***Exception: SegFault 1.25 sec
Start 4: AttentionTest
4/19 Test #4: AttentionTest .................... ***Failed 1.65 sec
Start 5: WindowTest
5/19 Test #5: WindowTest ....................... ***Exception: SegFault 1.19 sec
Start 6: DataTest
6/19 Test #6: DataTest ......................... ***Exception: Other 1.34 sec
Start 7: DecoderTest
7/19 Test #7: DecoderTest ...................... Passed 1.02 sec
Start 8: CeplifterTest
8/19 Test #8: CeplifterTest .................... Passed 0.09 sec
Start 9: DctTest
9/19 Test #9: DctTest .......................... Passed 0.18 sec
Start 10: DerivativesTest
10/19 Test #10: DerivativesTest .................. Passed 0.11 sec
Start 11: DitherTest
11/19 Test #11: DitherTest ....................... Passed 8.10 sec
Start 12: MfccTest
12/19 Test #12: MfccTest ......................... Passed 0.21 sec
Start 13: PreEmphasisTest
13/19 Test #13: PreEmphasisTest .................. Passed 0.10 sec
Start 14: SoundTest
14/19 Test #14: SoundTest ........................ Passed 0.14 sec
Start 15: SpeechUtilsTest
15/19 Test #15: SpeechUtilsTest .................. Passed 2.43 sec
Start 16: TriFilterbankTest
16/19 Test #16: TriFilterbankTest ................ Passed 0.13 sec
Start 17: WindowingTest
17/19 Test #17: WindowingTest .................... Passed 0.05 sec
Start 18: W2lModuleTest
18/19 Test #18: W2lModuleTest .................... ***Exception: SegFault 3.54 sec
Start 19: RuntimeTest
19/19 Test #19: RuntimeTest ...................... ***Failed 9.23 sec

58% tests passed, 8 tests failed out of 19

Total Test time (real) = 34.31 sec

The following tests FAILED:
1 - W2lCommonTest (SEGFAULT)
2 - CriterionTest (SEGFAULT)
3 - Seq2SeqTest (SEGFAULT)
4 - AttentionTest (Failed)
5 - WindowTest (SEGFAULT)
6 - DataTest (OTHER_FAULT)
18 - W2lModuleTest (SEGFAULT)
19 - RuntimeTest (Failed)
Errors while running CTest
Makefile:104: recipe for target 'test' failed
make: *** [test] Error 8

nvcc -V inside docker: Cuda compilation tools, release 9.2, V9.2.148
nvcc -V outside docker: Cuda compilation tools, release 10.0, V10.0.130

I do not know whether this difference matters.

Thanks so much for reading.

Hi @C5YS,

Could you run each test separately to make sure that the error looks like this NVRTC Error(5): NVRTC_ERROR_INVALID_OPTION?
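
One way to do this, sketched below under the assumption that the `make test` output has been saved to a string, is to pull the failing names out of CTest's summary and turn each into its own `ctest -R` invocation (to be run from the build directory). `failed_tests` and `rerun_commands` are hypothetical helper names, not part of wav2letter or flashlight.

```python
import re


def failed_tests(ctest_output):
    """Extract test names from CTest's 'The following tests FAILED:' summary."""
    names = []
    in_summary = False
    for line in ctest_output.splitlines():
        if "The following tests FAILED" in line:
            in_summary = True
            continue
        if in_summary:
            # Summary lines look like: "  3 - CriterionTest (SEGFAULT)"
            m = re.match(r"\s*\d+\s*-\s*(\S+)\s*\(", line)
            if m:
                names.append(m.group(1))
            elif line.strip():
                break  # first non-empty, non-matching line ends the summary
    return names


def rerun_commands(names):
    """Build one ctest command per test, printing output for failures."""
    # The regex is anchored so that e.g. "ModuleTest" does not also
    # select "ContribModuleTest".
    return [f"ctest -R '^{name}$' --output-on-failure" for name in names]


if __name__ == "__main__":
    sample = (
        "The following tests FAILED:\n"
        "1 - W2lCommonTest (SEGFAULT)\n"
        "6 - DataTest (Failed)\n"
        "Errors while running CTest\n"
    )
    for command in rerun_commands(failed_tests(sample)):
        print(command)
```

Running each printed command separately should show whether every failure carries the same NVRTC message or whether some tests fail for a different reason.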

The CUDA compatibility table is at https://github.com/NVIDIA/nvidia-docker/wiki/CUDA; you need a driver version that supports CUDA 9.2 in order to use it in docker. But the actual problem comes from this:

> I solved this problem by building ArrayFire 3.6.4 and the other libraries from source. The problem appeared because the Tesla T4 supports only CUDA 10.0 and 10.1, but the ArrayFire version used in wav2letter++ is <3.6.2.

So your GPU supports only CUDA 10. I think the simplest way is to rebuild all images from the Dockerfiles (for flashlight, base and gpu; then for wav2letter, base and gpu), changing the version of the nvidia docker base image to CUDA 10. Could you do this? Do you need more detailed instructions?
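
As a rough illustration of the base-image change, the sketch below rewrites the `FROM nvidia/cuda:<tag>` line of a Dockerfile. The Dockerfile text in the example is a made-up stand-in, not the actual flashlight or wav2letter Dockerfile, and `retarget_base_image` is a hypothetical helper; the CUDA 10 tag is the one mentioned later in this thread.

```python
import re


def retarget_base_image(dockerfile_text, new_image):
    """Point the first `FROM nvidia/cuda:<tag>` line at `new_image`."""
    pattern = re.compile(r"^(FROM\s+)nvidia/cuda:\S+", flags=re.MULTILINE)
    updated, count = pattern.subn(
        lambda m: m.group(1) + new_image, dockerfile_text, count=1
    )
    if count == 0:
        raise ValueError("no 'FROM nvidia/cuda:<tag>' line found")
    return updated


if __name__ == "__main__":
    # Hypothetical Dockerfile contents, for illustration only.
    example = (
        "FROM nvidia/cuda:9.2-cudnn7-devel-ubuntu16.04\n"
        "RUN apt-get update\n"
    )
    print(retarget_base_image(example, "nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04"))
```

After a change like this, the dependent images would still need to be rebuilt in order, flashlight first and then wav2letter.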

Thank you very much for answering, @tlikhomanenko.
Could you give me more detailed instructions on rebuilding all the images from the Dockerfiles for CUDA 10, please?
I am new to this, and I really appreciate your help.

hi @C5YS
Maybe it will be easier to build everything from source. I found a detailed tutorial on how to do that: https://medium.com/@shaheenkader/how-to-install-wav2letter-dc94c3b74e97

hi @C5YS,

I have built docker images with CUDA 10.0 for you. Please give them a try with:

sudo docker run --runtime=nvidia --rm -itd --ipc=host --name w2l wav2letter/wav2letter:cuda-10-latest

Hello, thank you very much everyone for the help.

I installed the CUDA 10 docker image, and training now produces the following error:

*** Aborted at 1563327542 (unix time) try "date -d @1563327542" if you are using GNU date ***
PC: @ 0x7f400e5b5740 GpuCTC<>::setup_gpu_metadata()
*** SIGSEGV (@0xffffffff5e79461c) received by PID 1238 (TID 0x7f400ee72600) from PID 1585006108; stack trace: ***
    @ 0x7f3fc953c390 (unknown)
    @ 0x7f400e5b5740 GpuCTC<>::setup_gpu_metadata()
    @ 0x7f400e5b59f2 GpuCTC<>::compute_cost_and_score()
    @ 0x7f400e5b2d5d compute_ctc_loss
    @ 0x56bd8a w2l::ConnectionistTemporalClassificationCriterion::forward()
    @ 0x47df7c _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddbiE3_clES2_S5_S7_S9_S9_S9_ddbi.constprop.11262
    @ 0x41b752 main
    @ 0x7f3fc167a830 __libc_start_main
    @ 0x479279 _start
    @ 0x0 (unknown)
Segmentation fault (core dumped)

This is the same error as #223.

When I change "--criterion" from ctc to asg, I get the following error:

terminate called after throwing an instance of 'std::invalid_argument'
  what():  Unknown index in dictionary '1024674700'
*** Aborted at 1563327648 (unix time) try "date -d @1563327648" if you are using GNU date ***
PC: @     0x7fd0db681428 gsignal
*** SIGABRT (@0x5c2) received by PID 1474 (TID 0x7fd128e64600) from PID 1474; stack trace: ***
    @     0x7fd0e352e390 (unknown)
    @     0x7fd0db681428 gsignal
    @     0x7fd0db68302a abort
    @     0x7fd0dbfc484d __gnu_cxx::__verbose_terminate_handler()
    @     0x7fd0dbfc26b6 (unknown)
    @     0x7fd0dbfc2701 std::terminate()
    @     0x7fd0dbfc2919 __cxa_throw
    @           0x5580db _ZNK3w2l10Dictionary8getEntryB5cxx11Ei
    @           0x564171 _ZN3w2l10tknIdx2LtrB5cxx11ERKSt6vectorIiSaIiEERKNS_10DictionaryE
    @           0x56619d _ZN3w2l17tknPrediction2LtrB5cxx11ESt6vectorIiSaIiEERKNS_10DictionaryE
    @           0x47b087 _ZZ4mainENKUlRKN2af5arrayES2_RN3w2l13DatasetMetersEE1_clES2_S2_S5_
    @           0x47e188 _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddbiE3_clES2_S5_S7_S9_S9_ddbi.constprop.11262
    @           0x41b752 main
    @     0x7fd0db66c830 __libc_start_main
    @           0x479279 _start
    @                0x0 (unknown)
Aborted (core dumped)

The same as: #349

I do not think the problem is the data set with which I train (it has worked on other occasions with cpu and gpu-cuda 9.2).

Hi @C5YS,

Could you first run all the tests from flashlight and wav2letter to make sure that the previous problem with the CUDA version is resolved?

If all tests now pass, please open a new issue with the errors from the comments above (and specify which docker image and GPU type you are using).

Hi @tlikhomanenko.
When I run "cd /root/wav2letter/build && make test", it shows the following:
```
Running tests...
Test project /root/wav2letter/build
Start 1: W2lCommonTest
1/20 Test #1: W2lCommonTest .................... Passed 10.60 sec
Start 2: DictionaryTest
2/20 Test #2: DictionaryTest ................... Passed 0.14 sec
Start 3: CriterionTest
3/20 Test #3: CriterionTest .................... ***Exception: SegFault 1.69 sec
Start 4: Seq2SeqTest
4/20 Test #4: Seq2SeqTest ...................... Passed 13.38 sec
Start 5: AttentionTest
5/20 Test #5: AttentionTest .................... Passed 3.73 sec
Start 6: WindowTest
6/20 Test #6: WindowTest ....................... Passed 2.78 sec
Start 7: DataTest
7/20 Test #7: DataTest ......................... Passed 1.38 sec
Start 8: SoundTest
8/20 Test #8: SoundTest ........................ Passed 0.30 sec
Start 9: DecoderTest
9/20 Test #9: DecoderTest ...................... Passed 1.08 sec
Start 10: CeplifterTest
10/20 Test #10: CeplifterTest .................... Passed 0.11 sec
Start 11: DctTest
11/20 Test #11: DctTest .......................... Passed 0.26 sec
Start 12: DerivativesTest
12/20 Test #12: DerivativesTest .................. Passed 0.11 sec
Start 13: DitherTest
13/20 Test #13: DitherTest ....................... Passed 8.11 sec
Start 14: MfccTest
14/20 Test #14: MfccTest ......................... Passed 0.38 sec
Start 15: PreEmphasisTest
15/20 Test #15: PreEmphasisTest .................. Passed 0.22 sec
Start 16: SpeechUtilsTest
16/20 Test #16: SpeechUtilsTest .................. Passed 1.34 sec
Start 17: TriFilterbankTest
17/20 Test #17: TriFilterbankTest ................ Passed 0.14 sec
Start 18: WindowingTest
18/20 Test #18: WindowingTest .................... Passed 0.10 sec
Start 19: W2lModuleTest
19/20 Test #19: W2lModuleTest .................... Passed 3.62 sec
Start 20: RuntimeTest
20/20 Test #20: RuntimeTest ...................... Passed 2.06 sec

95% tests passed, 1 tests failed out of 20

Total Test time (real) = 51.63 sec

The following tests FAILED:
3 - CriterionTest (SEGFAULT)
Errors while running CTest
Makefile:104: recipe for target 'test' failed
make: *** [test] Error 8

Test from flashlight:

~/flashlight/build# make test
Running tests...
Test project /root/flashlight/build
Start 1: AutogradTest
1/10 Test #1: AutogradTest ..................... Passed 40.54 sec
Start 2: OptimTest
2/10 Test #2: OptimTest ........................ Passed 1.53 sec
Start 3: ModuleTest
3/10 Test #3: ModuleTest ....................... Passed 4.35 sec
Start 4: SerializationTest
4/10 Test #4: SerializationTest ................ Passed 8.11 sec
Start 5: UtilsTest
5/10 Test #5: UtilsTest ........................ Passed 0.94 sec
Start 6: DatasetTest
6/10 Test #6: DatasetTest ...................... Passed 2.88 sec
Start 7: MeterTest
7/10 Test #7: MeterTest ........................ Passed 0.99 sec
Start 8: AllReduceTest
8/10 Test #8: AllReduceTest .................... Passed 1.98 sec
Start 9: ContribModuleTest
9/10 Test #9: ContribModuleTest ................ Passed 3.58 sec
Start 10: ContribSerializationTest
10/10 Test #10: ContribSerializationTest ......... Passed 3.05 sec

100% tests passed, 0 tests failed out of 10

Total Test time (real) = 67.96 sec
```
Thanks for the help, I really appreciate it.

@C5YS,
the issue with what(): Unknown index in dictionary '1024674700' is resolved; see https://github.com/facebookresearch/wav2letter/issues/349.

I haven't updated the docker images yet, but you can go into the container, update the wav2letter folder, and rerun cmake and make inside it.

For the issue with CTC, please look at https://github.com/facebookresearch/wav2letter/issues/370 (still in progress).

> @C5YS,
> the issue with what(): Unknown index in dictionary '1024674700' is resolved (#349).
>
> I haven't updated the docker images yet, but you can go into the container, update the wav2letter folder, and rerun cmake and make inside it.

For anyone rebuilding from inside the CUDA 10 docker: apart from pulling the latest wav2letter, you'll also need to pull, build, and install the latest flashlight.

When building the flashlight base image using "nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04", you may need to remove the very last line:
"ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1"

At least when I pulled the cuda:10.0 image, the file already existed, so you'll get a "File exists" error when linking.
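
An alternative to deleting the line is to make the linking step tolerate an existing file. The sketch below shows the idea in Python with a hypothetical `ensure_symlink` helper; it demonstrates on a temporary directory rather than the real /usr/lib path.

```python
import os
import tempfile


def ensure_symlink(target, link_path):
    """Create link_path -> target unless something already exists there.

    Returns True if a new link was created, False if the path was left alone.
    """
    if os.path.lexists(link_path):  # also True for dangling symlinks
        return False
    os.symlink(target, link_path)
    return True


if __name__ == "__main__":
    # Demonstrate in a scratch directory instead of touching system paths.
    with tempfile.TemporaryDirectory() as scratch:
        target = os.path.join(scratch, "libcuda.so")
        open(target, "w").close()
        link = os.path.join(scratch, "libcuda.so.1")
        print(ensure_symlink(target, link))  # first call creates the link
        print(ensure_symlink(target, link))  # second call is a no-op
```

In the Dockerfile itself, a comparable effect could probably be had by guarding the `ln -s` with an existence test or by using `ln -sf`, though the latter overwrites whatever file the image already ships, which may not be what you want.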
