Deeplearning4j: Core dump from benchmark repo with v100 snapshots, CPU

Created on 16 Jul 2019 · 20 Comments · Source: eclipse/deeplearning4j

Issue Description

I get a core dump when running the BenchmarkCustom class from the benchmarking repo here, with profile v100snapshots and CPU. I am attaching the made-up images I was using to test.
images.tar.gz

Additional Information

StdOut: https://gist.github.com/eraly/8336c076ddd0fc76f8bd35200b1c0e53
pid log file: https://gist.github.com/eraly/95d327b257e759f160c83b645271fc83

Bug DevOps LIBND4J Release

All 20 comments

@sshepel Can you check if we can reproduce that on the Mac machine used by Jenkins?

I tried locally (Windows) - with the provided dataset - no issues.
This was locally built (SkymindIO/deeplearning4j master), not using sonatype snapshots.
--datasetPath c:/Temp/benchmark/images --numLabels 3 --trainBatchSize 64

@eraly Was this locally built, or with snapshots?

@sshepel Can you check if we can reproduce that on the Mac machine used by Jenkins?

Yep.

I built with snapshots.

@sshepel Any updates, what does it look like?

How often do we release snapshots now? Wondering when current master will make it into snapshots.

@eraly We are doing this on a daily basis...

Yes, snapshots go out daily (at least) but at present a lot of our work is on the SkymindIO/deeplearning4j fork. That means snapshots are based on when we last merged back, not on the latest work.

@eraly @saudet I managed to run the benchmark on macOS with the native backend, without any crash:

o.d.b.BaseBenchmark - ===== Benchmarking forward/backward pass =====
o.d.b.BaseBenchmark - Completed 100 iterations
o.d.b.BaseBenchmark - Completed 200 iterations
o.d.b.BaseBenchmark - =============================
o.d.b.BaseBenchmark - ===== Benchmark Results =====
o.d.b.BaseBenchmark - =============================

==========================================================================================
LayerName (LayerType)                 nIn,nOut    TotalParams   ParamsShape               
==========================================================================================
cnn1 (ConvolutionLayer)               3,96        34944         b:{1,96}, W:{96,3,11,11}  
layer1 (LocalResponseNormalization)   -,-         0             -                         
maxpool1 (SubsamplingLayer)           -,-         0             -                         
cnn2 (ConvolutionLayer)               96,256      614656        b:{1,256}, W:{256,96,5,5} 
maxpool2 (SubsamplingLayer)           -,-         0             -                         
layer5 (LocalResponseNormalization)   -,-         0             -                         
cnn3 (ConvolutionLayer)               256,384     885120        b:{1,384}, W:{384,256,3,3}
cnn4 (ConvolutionLayer)               384,384     1327488       b:{1,384}, W:{384,384,3,3}
cnn5 (ConvolutionLayer)               384,256     884992        b:{1,256}, W:{256,384,3,3}
maxpool3 (SubsamplingLayer)           -,-         0             -                         
ffn1 (DenseLayer)                     9216,4096   37752832      W:{9216,4096}, b:{1,4096} 
ffn2 (DenseLayer)                     4096,4096   16781312      W:{4096,4096}, b:{1,4096} 
output (OutputLayer)                  4096,1000   4097000       W:{4096,1000}, b:{1,1000} 
------------------------------------------------------------------------------------------
            Total Parameters:  62378344
        Trainable Parameters:  62378344
           Frozen Parameters:  0
==========================================================================================
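As a sanity check on the summary table above, the per-layer counts can be reproduced from the nIn,nOut values and the kernel sizes in ParamsShape. This is only a sketch of the arithmetic, not code from the benchmark repo; the class and method names (AlexNetParamCount, conv, dense) are made up for illustration.

```java
class AlexNetParamCount {
    // Convolution layer: one kH x kW kernel per (nIn, nOut) pair, plus one bias per output channel.
    static long conv(long nIn, long nOut, long kH, long kW) {
        return nOut * nIn * kH * kW + nOut;
    }

    // Dense or output layer: an nIn x nOut weight matrix, plus one bias per output unit.
    static long dense(long nIn, long nOut) {
        return nIn * nOut + nOut;
    }

    // Sum over the eight parameterized layers listed in the summary table.
    // The pooling and LocalResponseNormalization layers contribute 0 parameters.
    static long total() {
        return conv(3, 96, 11, 11)      // cnn1
             + conv(96, 256, 5, 5)      // cnn2
             + conv(256, 384, 3, 3)     // cnn3
             + conv(384, 384, 3, 3)     // cnn4
             + conv(384, 256, 3, 3)     // cnn5
             + dense(9216, 4096)        // ffn1
             + dense(4096, 4096)        // ffn2
             + dense(4096, 1000);       // output
    }

    public static void main(String[] args) {
        System.out.println("cnn1  = " + conv(3, 96, 11, 11)); // 34944
        System.out.println("ffn1  = " + dense(9216, 4096));   // 37752832
        System.out.println("total = " + total());             // 62378344
    }
}
```

The total agrees with the "Total Parameters" line, so the model in this run is the standard 1000-class AlexNet configuration shown in the table.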

Version                                    1.0.0-beta3                                  
Name                                       ALEXNET                                      
Description                                SIMULATEDCNN 32x3x224x224                    
Operating System                           Apple macOS 10.12.6                          
Devices                                    Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz    
CPU Cores                                  8                                            
CPU                                        Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz    
System Memory                              17179869184 - 16 G                           
Memory Config - XMX                        3817865216 - 3.56 G                          
Memory Config - JavaCPP MaxPhysicalBytes   7635730432 - 7.11 G                          
Backend                                    CPU                                          
ND4J DataType                              FLOAT                                        
BLAS Vendor                                MKL                                          
CUDA Version                               n/a                                          
CUDNN Version                              n/a                                          
Periodic GC enabled                        true                                         
Periodic GC frequency                      5000                                         
Occasional GC Freq                         0                                            
Parallel Wrapper                           false                                        
Total Params                               62378344                                     
Total Layers                               13                                           
Avg Feedforward (ms)                       1065.35                                      
Avg Backprop (ms)                          2111.14                                      
Avg Fit (ms)                               3386.84                                      
Avg Iteration (ms)                         3370.25                                      
Avg Samples/sec                            9.41                                         
Avg Batches/sec                            0.3                                          
Batch size                                 32     
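For anyone comparing memory settings between machines: the "G" figures in the report above are the byte counts divided by 2^30, and in this run the JavaCPP MaxPhysicalBytes value is exactly twice the XMX value. A minimal sketch of that conversion follows; the class name MemoryReport and the helper toGiB are hypothetical, and only Runtime.maxMemory() is a real JDK call.

```java
import java.util.Locale;

class MemoryReport {
    static final double GIB = 1024.0 * 1024.0 * 1024.0;

    // Format a byte count as GiB with two decimals, matching the "G" column in the report.
    static String toGiB(long bytes) {
        return String.format(Locale.ROOT, "%.2f", bytes / GIB);
    }

    public static void main(String[] args) {
        // Values copied from the benchmark report above.
        long xmx = 3817865216L;
        long maxPhysical = 7635730432L;
        System.out.println("XMX              = " + toGiB(xmx) + " G");         // 3.56 G
        System.out.println("MaxPhysicalBytes = " + toGiB(maxPhysical) + " G"); // 7.11 G
        System.out.println("ratio            = " + (maxPhysical / xmx));       // 2

        // Heap limit of the JVM running this sketch, for comparison with the report.
        long localXmx = Runtime.getRuntime().maxMemory();
        System.out.println("local max heap   = " + toGiB(localXmx) + " G");
    }
}
```

Running this on the machine that crashes shows whether its heap limit differs from the 3.56 G used in this run.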

@sshepel The core dump is from the v100 snapshot profile. The run log says this run used 1.0.0-beta3. Can you rerun with the snapshot profile? I see it crash pretty consistently on my Mac.

Also, Intel is reporting this same crash during their runs. I am not sure which Linux distribution they are running on.

I know @sshepel decided to upgrade to GCC 8 on Mac recently, but I doubt that works well with Xcode 8.3. We'll probably need to upgrade Xcode to something more recent as well.

@sshepel Intel got back to us to let us know they are running on Ubuntu 18.04.2 LTS when they see the coredump. This is a zip of what they are running.
Without_MKLDNN.zip

I rebuilt it a couple of times, manually and through Maven. The only issue I see here is a suspicious memory fluctuation. Also, I wasn't able to run with a batch size of 128: there was just not enough memory. For batch sizes 16 and 32 there are no issues for me on Ubuntu 18.04.

New snapshots with a bunch of fixes should have gone up yesterday (we merged the Skymind fork back to eclipse master). It's quite possible this has been fixed in that round, so it's worth re-running/re-testing with the latest snapshots.

I'll update the issue when Intel gets back to us.

They are still seeing the issue. They are running on an "Intel Xeon scalable processor powered CPU cluster on Colfax."

Okay. Let's spin up an F-class machine on Azure and try to reproduce it there.

We reproduced this on the Azure instance.

Issue resolved and fixed in dl4j-benchmarks.
