Core dumps when running the BenchmarkCustom class from the benchmarking repo here. Running with the v100snapshots and CPU profiles. Attaching the made-up images I was using to test.
images.tar.gz
StdOut: https://gist.github.com/eraly/8336c076ddd0fc76f8bd35200b1c0e53
pid log file: https://gist.github.com/eraly/95d327b257e759f160c83b645271fc83
I tried locally (Windows) - with the provided dataset - no issues.
This was locally built (SkymindIO/deeplearning4j master), not using sonatype snapshots.
--datasetPath c:/Temp/benchmark/images --numLabels 3 --trainBatchSize 64
@eraly Was this locally built, or with snapshots?
@sshepel Can you check if we can reproduce that on the Mac machine used by Jenkins?
yeap
I built with snapshots.
@sshepel Any updates, what does it look like?
How often do we release snapshots now? Wondering when current master will make it into snapshots.
@eraly we are doing this on a daily basis...
Yes, snapshots go out daily (at least) but at present a lot of our work is on the SkymindIO/deeplearning4j fork. That means snapshots are based on when we last merged back, not on the latest work.
@eraly @saudet I managed to run the benchmark on macOS with the native backend, without any crash...
o.d.b.BaseBenchmark - ===== Benchmarking forward/backward pass =====
o.d.b.BaseBenchmark - Completed 100 iterations
o.d.b.BaseBenchmark - Completed 200 iterations
o.d.b.BaseBenchmark - =============================
o.d.b.BaseBenchmark - ===== Benchmark Results =====
o.d.b.BaseBenchmark - =============================
==========================================================================================
LayerName (LayerType)                 nIn,nOut    TotalParams   ParamsShape
==========================================================================================
cnn1 (ConvolutionLayer)               3,96        34944         b:{1,96}, W:{96,3,11,11}
layer1 (LocalResponseNormalization)   -,-         0             -
maxpool1 (SubsamplingLayer)           -,-         0             -
cnn2 (ConvolutionLayer)               96,256      614656        b:{1,256}, W:{256,96,5,5}
maxpool2 (SubsamplingLayer)           -,-         0             -
layer5 (LocalResponseNormalization)   -,-         0             -
cnn3 (ConvolutionLayer)               256,384     885120        b:{1,384}, W:{384,256,3,3}
cnn4 (ConvolutionLayer)               384,384     1327488       b:{1,384}, W:{384,384,3,3}
cnn5 (ConvolutionLayer)               384,256     884992        b:{1,256}, W:{256,384,3,3}
maxpool3 (SubsamplingLayer)           -,-         0             -
ffn1 (DenseLayer)                     9216,4096   37752832      W:{9216,4096}, b:{1,4096}
ffn2 (DenseLayer)                     4096,4096   16781312      W:{4096,4096}, b:{1,4096}
output (OutputLayer)                  4096,1000   4097000       W:{4096,1000}, b:{1,1000}
------------------------------------------------------------------------------------------
Total Parameters: 62378344
Trainable Parameters: 62378344
Frozen Parameters: 0
==========================================================================================
Version 1.0.0-beta3
Name ALEXNET
Description SIMULATEDCNN 32x3x224x224
Operating System Apple macOS 10.12.6
Devices Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
CPU Cores 8
CPU Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
System Memory 17179869184 - 16 G
Memory Config - XMX 3817865216 - 3.56 G
Memory Config - JavaCPP MaxPhysicalBytes 7635730432 - 7.11 G
Backend CPU
ND4J DataType FLOAT
BLAS Vendor MKL
CUDA Version n/a
CUDNN Version n/a
Periodic GC enabled true
Periodic GC frequency 5000
Occasional GC Freq 0
Parallel Wrapper false
Total Params 62378344
Total Layers 13
Avg Feedforward (ms) 1065.35
Avg Backprop (ms) 2111.14
Avg Fit (ms) 3386.84
Avg Iteration (ms) 3370.25
Avg Samples/sec 9.41
Avg Batches/sec 0.3
Batch size 32
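As a sanity check on the model summary in the log above, the per-layer parameter counts can be reproduced from the reported shapes. This is only a sketch: the class and method names (`AlexNetParamCount`, `conv`, `dense`) are made up for illustration and are not part of dl4j-benchmarks. It assumes each convolution layer has nOut x nIn x kH x kW weights plus one bias per output channel, and each dense/output layer has nIn x nOut weights plus one bias per output, which matches the W and b shapes printed in the summary.

```java
public class AlexNetParamCount {

    // Convolution layer: W has shape {nOut, nIn, kH, kW}, b has shape {1, nOut}.
    static long conv(long nIn, long nOut, long kH, long kW) {
        return nOut * nIn * kH * kW + nOut;
    }

    // Dense or output layer: W has shape {nIn, nOut}, b has shape {1, nOut}.
    static long dense(long nIn, long nOut) {
        return nIn * nOut + nOut;
    }

    // Sum over the eight parameterized layers listed in the summary.
    // Pooling and LocalResponseNormalization layers contribute 0 parameters.
    static long total() {
        return conv(3, 96, 11, 11)      // cnn1
             + conv(96, 256, 5, 5)      // cnn2
             + conv(256, 384, 3, 3)     // cnn3
             + conv(384, 384, 3, 3)     // cnn4
             + conv(384, 256, 3, 3)     // cnn5
             + dense(9216, 4096)        // ffn1
             + dense(4096, 4096)        // ffn2
             + dense(4096, 1000);       // output
    }

    public static void main(String[] args) {
        System.out.println("cnn1  = " + conv(3, 96, 11, 11));   // 34944
        System.out.println("ffn1  = " + dense(9216, 4096));     // 37752832
        System.out.println("total = " + total());               // 62378344
    }
}
```

The total agrees with the "Total Parameters: 62378344" line, so the network that ran on macOS appears to be the standard AlexNet configuration rather than something altered by the benchmark setup.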
@sshepel The core dump is from the v100 snapshot profile. The run log says this run used 1.0.0-beta3. Can you rerun with the snapshot profile? I see it crash pretty consistently on my Mac.
Also, Intel is reporting this same crash during their runs. I am not sure which Linux distribution they are running on.
I know @sshepel decided to upgrade to GCC 8 on Mac recently, but I doubt that works well with Xcode 8.3. We'll probably need to upgrade Xcode to something more recent as well.
@sshepel Intel got back to us to let us know they are running on Ubuntu 18.04.2 LTS when they see the coredump. This is a zip of what they are running.
Without_MKLDNN.zip
Rebuilt it a couple of times, manually and through Maven. The only issue I see here is suspicious memory fluctuation. Also, I wasn't able to run with a batch size of 128: just not enough memory. For batch sizes 16 and 32 there are no issues for me on Ubuntu 18.04.
New snapshots with a bunch of fixes should have gone up yesterday (we merged the Skymind fork back to eclipse master). It's quite possible this has been fixed in that round, so it's worth re-running/re-testing with the latest snapshots.
I'll update the issue when Intel gets back to us.
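When comparing runs across machines where memory is a suspect, it can help to print the limits the JVM is actually running with, in the same "bytes - G" style the benchmark log uses for "Memory Config - XMX" and "JavaCPP MaxPhysicalBytes". The following is a minimal sketch, not code from dl4j-benchmarks: the class and method names are hypothetical, and it only reads the `org.bytedeco.javacpp.maxphysicalbytes` system property if it was passed on the command line (it does not query JavaCPP itself, so a default chosen by JavaCPP would not show up here).

```java
import java.util.Locale;

public class MemoryLimits {

    // Convert a byte count to GiB (2^30 bytes).
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    // Format as "<bytes> - <GiB with two decimals> G", similar to the benchmark log.
    static String describe(long bytes) {
        return String.format(Locale.ROOT, "%d - %.2f G", bytes, toGiB(bytes));
    }

    public static void main(String[] args) {
        // Maximum heap the JVM will attempt to use (governed by -Xmx).
        long heapMax = Runtime.getRuntime().maxMemory();
        System.out.println("JVM max heap: " + describe(heapMax));

        // Only set if given explicitly, e.g. -Dorg.bytedeco.javacpp.maxphysicalbytes=8G
        String physical = System.getProperty("org.bytedeco.javacpp.maxphysicalbytes");
        System.out.println("JavaCPP maxphysicalbytes property: "
                + (physical == null ? "not set on command line" : physical));

        // Values from the macOS run in this thread, for comparison.
        System.out.println("Logged XMX:              " + describe(3817865216L));
        System.out.println("Logged MaxPhysicalBytes: " + describe(7635730432L));
    }
}
```

In the macOS log the physical-bytes limit is exactly twice the heap limit (2 x 3817865216 = 7635730432), so a run that fails at batch size 128 on a different machine may simply be hitting a smaller limit rather than the reported crash.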
They are still seeing the issue. They are running on an "Intel Xeon scalable processor powered CPU cluster on Colfax."
Okay. Let's spin up an F-class machine on Azure and try to reproduce there.
We reproduced this on the Azure instance.
Issue resolved and fixed in dl4j-benchmarks.