Deeplearning4j: Core dump from benchmark repo with v100 snapshots, CPU

Created on 16 Jul 2019 · 20 Comments · Source: eclipse/deeplearning4j

Issue Description

The JVM core dumps when running the BenchmarkCustom class from the benchmarking repo here, with the v100snapshots and CPU profiles. Attaching the made-up images I was using to test.
images.tar.gz

Additional Information

StdOut: https://gist.github.com/eraly/8336c076ddd0fc76f8bd35200b1c0e53
pid log file: https://gist.github.com/eraly/95d327b257e759f160c83b645271fc83

Bug DevOps LIBND4J Release

Source

eraly

All 20 comments

@sshepel Can you check if we can reproduce that on the Mac machine used by Jenkins?

saudet on 16 Jul 2019

I tried locally (Windows) - with the provided dataset - no issues.
This was locally built (SkymindIO/deeplearning4j master), not using sonatype snapshots.
--datasetPath c:/Temp/benchmark/images --numLabels 3 --trainBatchSize 64

@eraly Was this locally built, or with snapshots?

AlexDBlack on 17 Jul 2019

> @sshepel Can you check if we can reproduce that on the Mac machine used by Jenkins?

yeap

sshepel on 17 Jul 2019

I built with snapshots.

eraly on 17 Jul 2019

@sshepel Any updates, what does it look like?

saudet on 18 Jul 2019

How often do we release snapshots now? Wondering when current master will make it into snapshots.

eraly on 18 Jul 2019

@eraly we are doing this on a daily basis...

sshepel on 18 Jul 2019

Yes, snapshots go out daily (at least) but at present a lot of our work is on the SkymindIO/deeplearning4j fork. That means snapshots are based on when we last merged back, not on the latest work.

AlexDBlack on 19 Jul 2019

@eraly @saudet I managed to run the benchmark on macOS with the native backend, without any crash...

o.d.b.BaseBenchmark - ===== Benchmarking forward/backward pass =====
o.d.b.BaseBenchmark - Completed 100 iterations
o.d.b.BaseBenchmark - Completed 200 iterations
o.d.b.BaseBenchmark - =============================
o.d.b.BaseBenchmark - ===== Benchmark Results =====
o.d.b.BaseBenchmark - =============================

==========================================================================================
LayerName (LayerType)                 nIn,nOut    TotalParams   ParamsShape               
==========================================================================================
cnn1 (ConvolutionLayer)               3,96        34944         b:{1,96}, W:{96,3,11,11}  
layer1 (LocalResponseNormalization)   -,-         0             -                         
maxpool1 (SubsamplingLayer)           -,-         0             -                         
cnn2 (ConvolutionLayer)               96,256      614656        b:{1,256}, W:{256,96,5,5} 
maxpool2 (SubsamplingLayer)           -,-         0             -                         
layer5 (LocalResponseNormalization)   -,-         0             -                         
cnn3 (ConvolutionLayer)               256,384     885120        b:{1,384}, W:{384,256,3,3}
cnn4 (ConvolutionLayer)               384,384     1327488       b:{1,384}, W:{384,384,3,3}
cnn5 (ConvolutionLayer)               384,256     884992        b:{1,256}, W:{256,384,3,3}
maxpool3 (SubsamplingLayer)           -,-         0             -                         
ffn1 (DenseLayer)                     9216,4096   37752832      W:{9216,4096}, b:{1,4096} 
ffn2 (DenseLayer)                     4096,4096   16781312      W:{4096,4096}, b:{1,4096} 
output (OutputLayer)                  4096,1000   4097000       W:{4096,1000}, b:{1,1000} 
------------------------------------------------------------------------------------------
            Total Parameters:  62378344
        Trainable Parameters:  62378344
           Frozen Parameters:  0
==========================================================================================

Version                                    1.0.0-beta3                                  
Name                                       ALEXNET                                      
Description                                SIMULATEDCNN 32x3x224x224                    
Operating System                           Apple macOS 10.12.6                          
Devices                                    Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz    
CPU Cores                                  8                                            
CPU                                        Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz    
System Memory                              17179869184 - 16 G                           
Memory Config - XMX                        3817865216 - 3.56 G                          
Memory Config - JavaCPP MaxPhysicalBytes   7635730432 - 7.11 G                          
Backend                                    CPU                                          
ND4J DataType                              FLOAT                                        
BLAS Vendor                                MKL                                          
CUDA Version                               n/a                                          
CUDNN Version                              n/a                                          
Periodic GC enabled                        true                                         
Periodic GC frequency                      5000                                         
Occasional GC Freq                         0                                            
Parallel Wrapper                           false                                        
Total Params                               62378344                                     
Total Layers                               13                                           
Avg Feedforward (ms)                       1065.35                                      
Avg Backprop (ms)                          2111.14                                      
Avg Fit (ms)                               3386.84                                      
Avg Iteration (ms)                         3370.25                                      
Avg Samples/sec                            9.41                                         
Avg Batches/sec                            0.3                                          
Batch size                                 32

sshepel on 19 Jul 2019

@sshepel The core dump is from the v100 snapshot profile. The run log says this run used 1.0.0-beta3. Can you rerun with the snapshot profile? I see it crash pretty consistently on my Mac.

eraly on 19 Jul 2019

Also, Intel is reporting this same crash during their runs. I am not sure which Linux distribution they are running on.

eraly on 19 Jul 2019

I know @sshepel decided to upgrade to GCC 8 on Mac recently, but I doubt that works well with Xcode 8.3. We'll probably need to upgrade Xcode to something more recent as well.

saudet on 21 Jul 2019

@sshepel Intel got back to us to let us know they are running on Ubuntu 18.04.2 LTS when they see the core dump. This is a zip of what they are running.
Without_MKLDNN.zip

eraly on 29 Jul 2019

Rebuilt it a couple of times, manually and through Maven. The only issue I see here is suspicious memory fluctuation. Also, I wasn't able to run with a batch size of 128: there was just not enough memory. For batch sizes 16 and 32 there are no issues for me on Ubuntu 18.04.

raver119 on 30 Jul 2019
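
Since both the benchmark report above (the "Memory Config" rows) and the batch-size-128 failure come down to JVM and off-heap limits, here is a minimal sketch for checking what a given JVM was actually started with. It is not part of dl4j-benchmarks; the class name `MemoryConfigCheck` and the helper `formatBytes` are hypothetical, and the output format only approximates the report's. `org.bytedeco.javacpp.maxphysicalbytes` is JavaCPP's system property for the physical memory limit; the sketch reads it as a plain string and does not ask JavaCPP for its computed default.

```java
import java.util.Locale;

// Hypothetical helper, not part of dl4j-benchmarks.
class MemoryConfigCheck {

    /** Formats a byte count roughly like the report rows, e.g. "3817865216 - 3.56 G". */
    static String formatBytes(long bytes) {
        double gib = bytes / (1024.0 * 1024.0 * 1024.0);
        return String.format(Locale.ROOT, "%d - %.2f G", bytes, gib);
    }

    public static void main(String[] args) {
        // Maximum heap the JVM will attempt to use (set with -Xmx).
        long xmx = Runtime.getRuntime().maxMemory();
        System.out.println("XMX: " + formatBytes(xmx));

        // Only reports an explicitly set -Dorg.bytedeco.javacpp.maxphysicalbytes=...;
        // when unset, JavaCPP picks its own default, which this sketch does not compute.
        String limit = System.getProperty("org.bytedeco.javacpp.maxphysicalbytes", "(not set explicitly)");
        System.out.println("JavaCPP MaxPhysicalBytes property: " + limit);
    }
}
```

It can be run with, for example, `java -Xmx4g -Dorg.bytedeco.javacpp.maxphysicalbytes=8g MemoryConfigCheck`. In the report above, the JavaCPP limit (7635730432) is exactly twice the XMX value (3817865216), so raising only one of the two may not be enough for larger batch sizes.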

New snapshots with a bunch of fixes should have gone up yesterday (we merged the Skymind fork back to eclipse master). It's quite possible this has been fixed in that round, so it's worth re-testing with the latest snapshots.

AlexDBlack on 6 Aug 2019

I'll update the issue when Intel gets back to us.

eraly on 7 Aug 2019

They are still seeing the issue. They are running on an "Intel Xeon scalable processor powered CPU cluster on Colfax."

eraly on 8 Aug 2019

Okay. Let's spin up an F-class machine on Azure and try to reproduce there.

raver119 on 8 Aug 2019

We reproduced this on the Azure instance.

eraly on 14 Aug 2019

Issue resolved and fixed in dl4j-benchmarks.

raver119 on 14 Aug 2019
