Hello,
Lately I've been struggling with the following error:
Exception in thread "UniGC thread 4" Exception in thread "UniGC thread 2" Exception in thread "UniGC thread 3" Exception in thread "UniGC thread 5" Exception in thread "UniGC thread 1" java.lang.RuntimeException: cudaEventSynchronize(...) failed
at org.nd4j.nativeblas.Nd4jCuda$NativeOps.eventSynchronize(Native Method)
at org.nd4j.jita.allocator.pointers.cuda.cudaEvent_t.synchronize(cudaEvent_t.java:69)
at org.nd4j.jita.flow.impl.SynchronousFlowController.waitTillFinished(SynchronousFlowController.java:132)
at org.nd4j.jita.flow.impl.GridFlowController.waitTillFinished(GridFlowController.java:63)
at org.nd4j.jita.flow.impl.SynchronousFlowController.waitTillReleased(SynchronousFlowController.java:229)
at org.nd4j.jita.flow.impl.GridFlowController.waitTillReleased(GridFlowController.java:78)
at org.nd4j.jita.allocator.impl.AtomicAllocator$UnifiedGarbageCollectorThread.run(AtomicAllocator.java:716)
java.lang.RuntimeException: cudaEventSynchronize(...) failed
at org.nd4j.nativeblas.Nd4jCuda$NativeOps.eventSynchronize(Native Method)
at org.nd4j.jita.allocator.pointers.cuda.cudaEvent_t.synchronize(cudaEvent_t.java:69)
at org.nd4j.jita.flow.impl.SynchronousFlowController.waitTillFinished(SynchronousFlowController.java:132)
at org.nd4j.jita.flow.impl.GridFlowController.waitTillFinished(GridFlowController.java:63)
at org.nd4j.jita.flow.impl.SynchronousFlowController.waitTillReleased(SynchronousFlowController.java:229)
at org.nd4j.jita.flow.impl.GridFlowController.waitTillReleased(GridFlowController.java:78)
at org.nd4j.jita.allocator.impl.AtomicAllocator$UnifiedGarbageCollectorThread.run(AtomicAllocator.java:716)
java.lang.RuntimeException: cudaEventSynchronize(...) failed
at org.nd4j.nativeblas.Nd4jCuda$NativeOps.eventSynchronize(Native Method)
at org.nd4j.jita.allocator.pointers.cuda.cudaEvent_t.synchronize(cudaEvent_t.java:69)
at org.nd4j.jita.flow.impl.SynchronousFlowController.waitTillFinished(SynchronousFlowController.java:132)
at org.nd4j.jita.flow.impl.GridFlowController.waitTillFinished(GridFlowController.java:63)
at org.nd4j.jita.flow.impl.SynchronousFlowController.waitTillReleased(SynchronousFlowController.java:229)
at org.nd4j.jita.flow.impl.GridFlowController.waitTillReleased(GridFlowController.java:78)
at org.nd4j.jita.allocator.impl.AtomicAllocator$UnifiedGarbageCollectorThread.run(AtomicAllocator.java:716)
java.lang.RuntimeException: cudaEventSynchronize(...) failed
at org.nd4j.nativeblas.Nd4jCuda$NativeOps.eventSynchronize(Native Method)
at org.nd4j.jita.allocator.pointers.cuda.cudaEvent_t.synchronize(cudaEvent_t.java:69)
at org.nd4j.jita.flow.impl.SynchronousFlowController.waitTillFinished(SynchronousFlowController.java:132)
at org.nd4j.jita.flow.impl.GridFlowController.waitTillFinished(GridFlowController.java:63)
at org.nd4j.jita.flow.impl.SynchronousFlowController.waitTillReleased(SynchronousFlowController.java:229)
at org.nd4j.jita.flow.impl.GridFlowController.waitTillReleased(GridFlowController.java:78)
at org.nd4j.jita.allocator.impl.AtomicAllocator$UnifiedGarbageCollectorThread.run(AtomicAllocator.java:716)
java.lang.RuntimeException: cudaEventSynchronize(...) failed
at org.nd4j.nativeblas.Nd4jCuda$NativeOps.eventSynchronize(Native Method)
at org.nd4j.jita.allocator.pointers.cuda.cudaEvent_t.synchronize(cudaEvent_t.java:69)
at org.nd4j.jita.flow.impl.SynchronousFlowController.waitTillFinished(SynchronousFlowController.java:132)
at org.nd4j.jita.flow.impl.GridFlowController.waitTillFinished(GridFlowController.java:63)
at org.nd4j.jita.flow.impl.SynchronousFlowController.waitTillReleased(SynchronousFlowController.java:229)
at org.nd4j.jita.flow.impl.GridFlowController.waitTillReleased(GridFlowController.java:78)
at org.nd4j.jita.allocator.impl.AtomicAllocator$UnifiedGarbageCollectorThread.run(AtomicAllocator.java:716)
Exception in thread "UniGC thread 0" java.lang.RuntimeException: cudaEventSynchronize(...) failed
at org.nd4j.nativeblas.Nd4jCuda$NativeOps.eventSynchronize(Native Method)
at org.nd4j.jita.allocator.pointers.cuda.cudaEvent_t.synchronize(cudaEvent_t.java:69)
at org.nd4j.jita.flow.impl.SynchronousFlowController.waitTillFinished(SynchronousFlowController.java:132)
at org.nd4j.jita.flow.impl.GridFlowController.waitTillFinished(GridFlowController.java:63)
at org.nd4j.jita.flow.impl.SynchronousFlowController.waitTillReleased(SynchronousFlowController.java:229)
at org.nd4j.jita.flow.impl.GridFlowController.waitTillReleased(GridFlowController.java:78)
at org.nd4j.jita.allocator.impl.AtomicAllocator$UnifiedGarbageCollectorThread.run(AtomicAllocator.java:716)
I haven't found a pattern that explains why my neural network stops working during training; it runs as part of an optimization problem. The dataset is not big at all (a 400 KB file). In fact, I've been training the same neural network on larger files without many problems. The same error did appear occasionally with the larger files, but with the small dataset it is so frequent that I can't finish the optimization. The neural network is quite simple:
Branch 1: Input 1-> Sequential Embeddings + GlobalPooling -> Out1
Branch 2: Input 2
Common branch: Merger(Out1 + Input2) + Dense Layer + Output Layer
As you can see, I'm using a ComputationGraph.
At the beginning I thought I was running out of memory, but I'm not sure that is the case, since the dataset is smaller (although the richness of the input might be greater).
Is there any way to determine whether the cause is a lack of memory?
Some information that you might need:
Aha! Link: https://skymindai.aha.io/features/ND4J-149
This exception means a CUDA kernel crashed.
Please post a gist of the console output and of the neural network you have there. The pom.xml as well.
I have the same issue with DL4J beta3, Win 10, an RTX 2080 Ti and CUDA 10.0.
@sascha08-15 can you send me some code that reproduces your problem?
I could recreate a similar (same) issue under Linux (Ubuntu).
Invoking concat many times seems to be related (see https://github.com/deeplearning4j/deeplearning4j/issues/6479, which includes a unit test that recreates the bug). I'll try to isolate the problem mentioned here next.
Hm.
Are you 100% sure your issue is caused by repeated concat calls?
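```java
import org.nd4j.linalg.api.buffer.DataBuffer;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

void testBug6663() {
    Nd4j.setDataType(DataBuffer.Type.DOUBLE);
    INDArray arr = Nd4j.rand(14000, 1);
    INDArray row = Nd4j.ones(1);
    INDArray newArr = arr;
    // Append a single row 20,000 times; every vstack call goes through concat.
    for (int i = 0; i < 20_000; i++) {
        newArr = Nd4j.vstack(newArr, row);
    }
}
```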
This recreates the situation.
Thanks. We're testing a new concat implementation right now.
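To give a rough sense of how much work the loop above generates, here is a plain-Java sketch (no ND4J calls) that adds up the buffer sizes involved. It assumes each `vstack` call allocates a fresh result holding all existing rows plus the new one; the class and method names are hypothetical and only for illustration. It estimates cumulative allocation, not peak usage, and does not by itself explain the kernel crash.

```java
// Plain-Java sketch: estimates the total buffer space allocated by a loop
// that appends one row per iteration, under the assumption that every
// append allocates a new array containing all rows so far plus the new one.
class VstackCost {

    /** Total elements allocated across all appends (single-column array). */
    static long totalElementsAllocated(long initialRows, long iterations) {
        long total = 0;
        long rows = initialRows;
        for (long i = 0; i < iterations; i++) {
            rows += 1;      // the result of this append has one more row
            total += rows;  // a new buffer of that size is allocated
        }
        return total;
    }

    public static void main(String[] args) {
        // Values taken from the reproduction: 14000 initial rows, 20_000 appends.
        long elements = totalElementsAllocated(14_000, 20_000);
        long bytes = elements * Double.BYTES; // DOUBLE data type, 8 bytes per element
        System.out.println(elements + " elements, " + bytes + " bytes");
    }
}
```

With the numbers from the reproduction this comes to 480,010,000 elements, or roughly 3.84 GB of short-lived double buffers over the whole loop, so a lot of discarded arrays are left for the allocator to release even though the final array is small.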
New concat implementation was merged, issue should be resolved now.