Deeplearning4j: cudaStreamSynchronize(...) failed

Created on 19 Apr 2019  路  31Comments  路  Source: eclipse/deeplearning4j

Issue Description

Possible dupe of #6892.

When running an LSTM network on GPU, it crashes with a RuntimeException:

CUDA error at /mnt/jenkins/workspace/deeplearning4j-master-linux-x86_64-cuda-10.0/libnd4j/blas/cuda/NativeOps.cu:1904 code=77(<unknown>) "dZ" 
CUDA error at /mnt/jenkins/workspace/deeplearning4j-master-linux-x86_64-cuda-10.0/libnd4j/blas/cuda/NativeOps.cu:1904 code=77(<unknown>) "dZ" 
CUDA error at /mnt/jenkins/workspace/deeplearning4j-master-linux-x86_64-cuda-10.0/libnd4j/blas/cuda/NativeOps.cu:1904 code=77(<unknown>) "dZ" 
Exception in thread "main" java.lang.RuntimeException: cudaStreamSynchronize(...) failed
        at org.nd4j.nativeblas.Nd4jCuda$NativeOps.streamSynchronize(Native Method)
        at org.nd4j.linalg.jcublas.context.CudaContext.syncOldStream(CudaContext.java:131)
        at org.nd4j.linalg.jcublas.context.CudaContext.syncOldStream(CudaContext.java:121)
        at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.commit(CudaExecutioner.java:1868)
        at org.nd4j.linalg.memory.abstracts.Nd4jWorkspace.close(Nd4jWorkspace.java:602)
        at org.deeplearning4j.nn.graph.ComputationGraph.calcBackpropGradients(ComputationGraph.java:2725)
        at org.deeplearning4j.nn.graph.ComputationGraph.computeGradientAndScore(ComputationGraph.java:1378)
        at org.deeplearning4j.nn.graph.ComputationGraph.computeGradientAndScore(ComputationGraph.java:1338)
        at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(BaseOptimizer.java:160)
        at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize(StochasticGradientDescent.java:63)
        at org.deeplearning4j.optimize.Solver.optimize(Solver.java:52)
        at org.deeplearning4j.nn.graph.ComputationGraph.doTruncatedBPTT(ComputationGraph.java:3612)
        at org.deeplearning4j.nn.graph.ComputationGraph.fitHelper(ComputationGraph.java:1153)
        at org.deeplearning4j.nn.graph.ComputationGraph.fit(ComputationGraph.java:1112)
        at org.deeplearning4j.nn.graph.ComputationGraph.fit(ComputationGraph.java:1079)
        at org.deeplearning4j.nn.graph.ComputationGraph.fit(ComputationGraph.java:1015)
        at com.luiwammes.pc.trainer.Application.main(Application.java:71)
        Suppressed: java.lang.RuntimeException: cudaStreamSynchronize(...) failed
                at org.nd4j.nativeblas.Nd4jCuda$NativeOps.streamSynchronize(Native Method)
                at org.nd4j.linalg.jcublas.context.CudaContext.syncOldStream(CudaContext.java:131)
                at org.nd4j.linalg.jcublas.context.CudaContext.syncOldStream(CudaContext.java:121)
                at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.commit(CudaExecutioner.java:1868)
                at org.nd4j.linalg.memory.abstracts.Nd4jWorkspace.close(Nd4jWorkspace.java:602)
                at org.deeplearning4j.nn.graph.ComputationGraph.computeGradientAndScore(ComputationGraph.java:1419)
                ... 10 more

Additional output:

10:19:35.276 [main] INFO  org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
10:19:40.482 [main] INFO  org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for NativeOps: 32
10:19:48.892 [main] INFO  org.nd4j.nativeblas.Nd4jBlas - Number of threads used for BLAS: 0
10:19:48.909 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Linux]
10:19:48.909 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Cores: [4]; Memory: [4.9GB];
10:19:48.909 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [CUBLAS]
10:19:48.911 [main] INFO  o.n.l.j.o.e.CudaExecutioner - Device Name: [Tesla K80]; CC: [3.7]; Total/free memory: [11996954624]
10:19:48.983 [main] INFO  o.d.nn.graph.ComputationGraph - Starting ComputationGraph with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
10:19:59.515 [main] INFO  com.luiwammes.pc.trainer.Application - 
========================================================================================================================
VertexName (VertexType)           nIn,nOut   TotalParams   ParamsShape                               Vertex Inputs      
========================================================================================================================
input (InputVertex)               -,-        -             -                                         -                  
layer-0 (LSTM)                    80,512     1214464       W:{80,2048}, RW:{512,2048}, b:{1,2048}    [input]            
layer-1 (LSTM)                    512,512    2099200       W:{512,2048}, RW:{512,2048}, b:{1,2048}   [layer-0]          
outputLayer-merge (MergeVertex)   -,-        -             -                                         [layer-0, layer-1] 
outputLayer (RnnOutputLayer)      1024,80    82000         W:{1024,80}, b:{1,80}                     [outputLayer-merge]
------------------------------------------------------------------------------------------------------------------------
            Total Parameters:  3395664
        Trainable Parameters:  3395664
           Frozen Parameters:  0
========================================================================================================================

Version Information

Using latest snapshots for dl4j/nd4j.

Nvidia stuff:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

$ nvidia-smi
Fri Apr 19 10:25:30 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P8    31W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

OS

$ uname -a
Linux gpu-instance 4.15.0-1029-gcp #31-Ubuntu SMP Thu Mar 21 09:40:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Bug LIBND4J

Most helpful comment

@saudet confirmed and fixed in my branch. Will merge with other fixed later today.

All 31 comments

Please send me some code I could run to reproduce your issue

I'll try to prepare a minimal example next week.

Thank you, that would be amazing

Finally got around to this. It happens for me when running the LSTMCharModellingExample verbatim. Output when running it:

File downloaded to /tmp/Shakespeare.txt
Loaded and converted file: 5459809 valid characters of 5465100 total characters (5291 removed)
15:03:26,449 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]
15:03:26,450 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback.groovy]
15:03:26,451 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback.xml] at [jar:file:/home/tschut/textrnn/target/whatsapp-generator-1.0-SNAPSHOT-bin.jar!/logback.xml]
15:03:26,466 |-INFO in ch.qos.logback.core.joran.spi.ConfigurationWatchList@43c1b556 - URL [jar:file:/home/tschut/textrnn/target/whatsapp-generator-1.0-SNAPSHOT-bin.jar!/logback.xml] is not of type file
15:03:26,550 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
15:03:26,551 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
15:03:26,556 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]
15:03:26,613 |-WARN in ch.qos.logback.core.ConsoleAppender[STDOUT] - This appender no longer admits a layout as a sub-component, set an encoder instead.
15:03:26,613 |-WARN in ch.qos.logback.core.ConsoleAppender[STDOUT] - To ensure compatibility, wrapping your layout in LayoutWrappingEncoder.
15:03:26,613 |-WARN in ch.qos.logback.core.ConsoleAppender[STDOUT] - See also http://logback.qos.ch/codes.html#layoutInsteadOfEncoder for details
15:03:26,614 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [com.luiwammes.pc] to TRACE
15:03:26,614 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.eclipse] to WARN
15:03:26,614 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to INFO
15:03:26,614 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]
15:03:26,615 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
15:03:26,616 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@587e5365 - Registering current configuration as safe fallback point

15:03:26.742 [main] INFO  org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
15:03:31.566 [main] INFO  org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for NativeOps: 32
15:03:32.010 [main] INFO  org.nd4j.nativeblas.Nd4jBlas - Number of threads used for BLAS: 0
15:03:32.016 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Linux]
15:03:32.016 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Cores: [4]; Memory: [4.9GB];
15:03:32.016 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [CUBLAS]
15:03:32.017 [main] INFO  o.n.l.j.o.e.CudaExecutioner - Device Name: [Tesla K80]; CC: [3.7]; Total/free memory: [11996954624]
15:03:32.064 [main] INFO  o.d.nn.multilayer.MultiLayerNetwork - Starting MultiLayerNetwork with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
Number of parameters in layer 0: 222400
Number of parameters in layer 1: 320800
Number of parameters in layer 2: 15477
Total number of network parameters: 558677
CUDA error at /mnt/jenkins/workspace/deeplearning4j-master-linux-x86_64-cuda-10.0/libnd4j/blas/cuda/NativeOps.cu:1904 code=77(<unknown>) "dZ" 
CUDA error at /mnt/jenkins/workspace/deeplearning4j-master-linux-x86_64-cuda-10.0/libnd4j/blas/cuda/NativeOps.cu:1904 code=77(<unknown>) "dZ" 
CUDA error at /mnt/jenkins/workspace/deeplearning4j-master-linux-x86_64-cuda-10.0/libnd4j/blas/cuda/NativeOps.cu:1904 code=77(<unknown>) "dZ" 
CUDA error at /mnt/jenkins/workspace/deeplearning4j-master-linux-x86_64-cuda-10.0/libnd4j/blas/cuda/NativeOps.cu:1904 code=77(<unknown>) "dZ" 
Exception in thread "main" java.lang.RuntimeException: cudaStreamSynchronize(...) failed
        at org.nd4j.nativeblas.Nd4jCuda$NativeOps.streamSynchronize(Native Method)
        at org.nd4j.linalg.jcublas.context.CudaContext.syncOldStream(CudaContext.java:131)
        at org.nd4j.linalg.jcublas.context.CudaContext.syncOldStream(CudaContext.java:121)
        at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.commit(CudaExecutioner.java:1868)
        at org.nd4j.linalg.memory.abstracts.Nd4jWorkspace.close(Nd4jWorkspace.java:602)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.calcBackpropGradients(MultiLayerNetwork.java:1952)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2684)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2627)
        at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(BaseOptimizer.java:160)
        at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize(StochasticGradientDescent.java:63)
        at org.deeplearning4j.optimize.Solver.optimize(Solver.java:52)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.doTruncatedBPTT(MultiLayerNetwork.java:2019)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fitHelper(MultiLayerNetwork.java:2223)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:2189)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:2252)
        at com.luiwammes.pc.test.LSTMCharModellingExample.main(LSTMCharModellingExample.java:97)
        Suppressed: java.lang.RuntimeException: cudaStreamSynchronize(...) failed
                at org.nd4j.nativeblas.Nd4jCuda$NativeOps.streamSynchronize(Native Method)
                at org.nd4j.linalg.jcublas.context.CudaContext.syncOldStream(CudaContext.java:131)
                at org.nd4j.linalg.jcublas.context.CudaContext.syncOldStream(CudaContext.java:121)
                at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.commit(CudaExecutioner.java:1868)
                at org.nd4j.linalg.memory.abstracts.Nd4jWorkspace.close(Nd4jWorkspace.java:602)
                at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2701)
                ... 9 more

Spent 15 minutes training with and without cuDNN. No issues. Windows 10, RTX 2080 Ti, CUDA 10.1

Stop. Are you running beta3?

If you're running beta3 - you're in trouble then. You have cuda 10.0 and cuda 10.1 installed in the same time, and 10.1 is running. So it's probably just runtime versions conflict.

p.s. beta3 doesn't support CUDA 10.1. Only snapshots do.

No, latest snapshot. This is my pom https://gist.github.com/tschut/ec49468559f91d6b4e44c601927ba1e2. Note there's some non-related stuff in there since I basically plopped the CharacterExample in my project that has this issue.

It fails immediately btw when training starts, only on gpu.

Edit: duh of course only on gpu xD

Wait I see indeed there's 10.0 and 10.1... weird, I don't remember installing multiple versions and it used to run fine. Will verify tomorrow.

You definitely have both. look at nvcc output and nvidia-smi output. They report different versions. nvcc reports version of the file, but nvidia-smi reports active driver version.

I completely re-installed cuda and cuDNN 7.5.1, but am still having the problem. Only other thing I can think of is to just create a new cloud instance and start over fresh.

Versions now:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105

$ nvidia-smi
Thu Apr 25 07:42:59 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:04.0 Off |                    0 |
| N/A   59C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Anything else I can check before I go the new instance route?

Btw, also updated my pom to use 10.1.

Show me new output from console.

Also, are you using cuDNN?

To be sure I created a new instance anyway. Installed cuda 10.1 and cuDNN 7.5.1, ran mvn and the program, same results:

Using existing text file at /tmp/Shakespeare.txt
Loaded and converted file: 5459809 valid characters of 5465100 total characters (5291 removed)
10:16:43,845 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]
10:16:43,846 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback.groovy]
10:16:43,846 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback.xml] at [jar:file:/home/tschut/textrnn/target/whatsapp-generator-1.0-SNAPSHOT-bin.jar!/logback.xml]
10:16:43,858 |-INFO in ch.qos.logback.core.joran.spi.ConfigurationWatchList@2833cc44 - URL [jar:file:/home/tschut/textrnn/target/whatsapp-generator-1.0-SNAPSHOT-bin.jar!/logback.xml] is not of type file
10:16:43,930 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
10:16:43,930 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
10:16:43,936 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]
10:16:43,998 |-WARN in ch.qos.logback.core.ConsoleAppender[STDOUT] - This appender no longer admits a layout as a sub-component, set an encoder instead.
10:16:43,998 |-WARN in ch.qos.logback.core.ConsoleAppender[STDOUT] - To ensure compatibility, wrapping your layout in LayoutWrappingEncoder.
10:16:43,998 |-WARN in ch.qos.logback.core.ConsoleAppender[STDOUT] - See also http://logback.qos.ch/codes.html#layoutInsteadOfEncoder for details
10:16:43,998 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [com.luiwammes.pc] to TRACE
10:16:43,998 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.eclipse] to WARN
10:16:43,999 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to INFO
10:16:43,999 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]
10:16:43,999 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
10:16:44,000 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@33f88ab - Registering current configuration as safe fallback point

10:16:44.147 [main] INFO  org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
10:16:49.224 [main] INFO  org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for NativeOps: 32
10:16:49.723 [main] INFO  org.nd4j.nativeblas.Nd4jBlas - Number of threads used for BLAS: 0
10:16:49.732 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Linux]
10:16:49.732 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Cores: [8]; Memory: [7.4GB];
10:16:49.733 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [CUBLAS]
10:16:49.734 [main] INFO  o.n.l.j.o.e.CudaExecutioner - Device Name: [Tesla K80]; CC: [3.7]; Total/free memory: [11996954624]
10:16:49.777 [main] INFO  o.d.nn.multilayer.MultiLayerNetwork - Starting MultiLayerNetwork with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
Number of parameters in layer 0: 222400
Number of parameters in layer 1: 320800
Number of parameters in layer 2: 15477
Total number of network parameters: 558677
CUDA error at /mnt/jenkins/workspace/deeplearning4j-master-linux-x86_64-cuda-10.1/libnd4j/blas/cuda/NativeOps.cu:1904 code=700(<unknown>) "dZ" 
CUDA error at /mnt/jenkins/workspace/deeplearning4j-master-linux-x86_64-cuda-10.1/libnd4j/blas/cuda/NativeOps.cu:1904 code=700(<unknown>) "dZ" 
CUDA error at /mnt/jenkins/workspace/deeplearning4j-master-linux-x86_64-cuda-10.1/libnd4j/blas/cuda/NativeOps.cu:1904 code=700(<unknown>) "dZ" 
CUDA error at /mnt/jenkins/workspace/deeplearning4j-master-linux-x86_64-cuda-10.1/libnd4j/blas/cuda/NativeOps.cu:1904 code=700(<unknown>) "dZ" 
Exception in thread "main" java.lang.RuntimeException: cudaStreamSynchronize(...) failed
        at org.nd4j.nativeblas.Nd4jCuda$NativeOps.streamSynchronize(Native Method)
        at org.nd4j.linalg.jcublas.context.CudaContext.syncOldStream(CudaContext.java:131)
        at org.nd4j.linalg.jcublas.context.CudaContext.syncOldStream(CudaContext.java:121)
        at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.commit(CudaExecutioner.java:1869)
        at org.nd4j.linalg.memory.abstracts.Nd4jWorkspace.close(Nd4jWorkspace.java:602)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.calcBackpropGradients(MultiLayerNetwork.java:1952)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2684)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2627)
        at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(BaseOptimizer.java:160)
        at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize(StochasticGradientDescent.java:63)
        at org.deeplearning4j.optimize.Solver.optimize(Solver.java:52)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.doTruncatedBPTT(MultiLayerNetwork.java:2019)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fitHelper(MultiLayerNetwork.java:2223)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:2189)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:2252)
        at com.luiwammes.pc.test.LSTMCharModellingExample.main(LSTMCharModellingExample.java:97)
        Suppressed: java.lang.RuntimeException: cudaStreamSynchronize(...) failed
                at org.nd4j.nativeblas.Nd4jCuda$NativeOps.streamSynchronize(Native Method)
                at org.nd4j.linalg.jcublas.context.CudaContext.syncOldStream(CudaContext.java:131)
                at org.nd4j.linalg.jcublas.context.CudaContext.syncOldStream(CudaContext.java:121)
                at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.commit(CudaExecutioner.java:1869)
                at org.nd4j.linalg.memory.abstracts.Nd4jWorkspace.close(Nd4jWorkspace.java:602)
                at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2701)
                ... 9 more

I installed cuDNN, but am unsure about how to verify that it's being used?

I installed cuDNN, but am unsure about how to verify that it's being used?

The simple answer: If you're not getting logging about it being absent, it's being used.
Or check that net.getLayer(0).getHelper() is not null (after net has been used for fit/inference).

Ok, it's being used then. I could try to remove cuDNN and see if the problem persists? Otherwise perhaps it makes sense to downgrade to cuda 10.0 instead of 10.1?

Yeah, try without cuDNN. Also post the exact cuDNN version you are using.

I've tested both of the character examples locally, no issues for me.
Windows + Titan X + CUDA 10.1: Fine with and without cuDNN (Pretty sure I'm running cudnn-10.1 v7.5.0.56)
Windows + GTX970 + CUDA 10.0: Fine without cuDNN (haven't tested with yet)

I have libcudnn7_7.5.1.10-1. Removed it and now there's indeed a bunch of warnings about it not being able to load cudnn stuff. The error remains though :(

So with or without cudnn you have same exception?

Yes. I'm now trying to downgrade to 10.0.

Downgraded to 10.0 (cuda, driver, and cudnn), both nvcc and nvidia-smi report 10.0. Still get the same error, with and without cudnn. I'm so confused now.

huh... you're no alone confused here

Tried with another GPU (P100), also no luck.

Ok, I've managed to better isolate the problem.

Current environment:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

$ nvidia-smi
Thu Apr 25 14:26:08 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

cuDNN version: libcudnn7_7.5.1.10-1+cuda10.0_amd64.deb

If I build with dl4j versions 1.0.0-beta3, the example runs fine. If I build with 1.0.0-SNAPSHOT (no other changes), the crash happens.

My dependencies:

       <dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>nd4j-cuda-10.0</artifactId>
            <version>${dl4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-cuda-10.0</artifactId>
            <version>${dl4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>nd4j-cuda-10.0</artifactId>
            <version>${dl4j.version}</version>
            <classifier>linux-x86_64</classifier>
        </dependency>
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-nlp</artifactId>
            <version>${dl4j.version}</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-core</artifactId>
            <version>${dl4j.version}</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-ui-model</artifactId>
            <version>${dl4j.version}</version>
            <scope>compile</scope>
        </dependency>

I think I understand what the issue is. libnd4j is no longer getting prebuilt for compute capability 3.7 and the PTX code doesn't appear to get linked either:
https://github.com/deeplearning4j/deeplearning4j/blob/master/libnd4j/blas/CMakeLists.txt#L165
Maybe we could let it include the PTX code for compute_35 and see what that gives?

@saudet confirmed and fixed in my branch. Will merge with other fixed later today.

Awesome! At least we're not crazy :)

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

AlexDBlack picture AlexDBlack  路  4Comments

AlexDBlack picture AlexDBlack  路  5Comments

atuzhykov picture atuzhykov  路  4Comments

maxgfr picture maxgfr  路  4Comments

Storm-cev picture Storm-cev  路  5Comments