This is not a bug, but I noticed that the processing time of Convolution.im2col() and a.mmul(b) in nd4j-1.0.0-beta2 is longer than the processing time in nd4j-0.9.1.
The tests were performed using below ndarrays on the same hardware. I read this page https://deeplearning4j.org/docs/latest/deeplearning4j-config-performance-debugging, but I still have no idea how to deal with this. If this is not a proper place to ask a question, please suggest where to post my question. Appreciate that.
Convolutional2DLayer with below arrays
Processing time of running 1000 cycles in Convolutional2DLayer.
The hardware is the same.
May 15, 2019 12:49:21 PM org.nd4j.linalg.factory.Nd4jBackend load
INFO: Loaded [CpuBackend] backend
May 15, 2019 12:49:21 PM org.reflections.Reflections scan
INFO: Reflections took 165 ms to scan 64 urls, producing 40181 keys and 43883 values
May 15, 2019 12:49:21 PM org.nd4j.nativeblas.NativeOpsHolder <init>
INFO: Number of threads used for NativeOps: 4
May 15, 2019 12:49:21 PM org.reflections.Reflections scan
INFO: Reflections took 137 ms to scan 11 urls, producing 31 keys and 227 values
May 15, 2019 12:49:21 PM org.nd4j.nativeblas.Nd4jBlas <init>
INFO: Number of threads used for BLAS: 4
May 15, 2019 12:49:21 PM org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner printEnvironmentInformation
INFO: Backend used: [CPU]; OS: [Linux]
INFO: Cores: [4]; Memory: [1.7GB];
INFO: Blas vendor: [OPENBLAS]
May 15, 2019 12:33:58 PM org.nd4j.linalg.factory.Nd4jBackend load
INFO: Loaded [CpuBackend] backend
May 15, 2019 12:33:59 PM org.nd4j.nativeblas.NativeOpsHolder <init>
INFO: Number of threads used for NativeOps: 4
May 15, 2019 12:33:59 PM org.nd4j.nativeblas.Nd4jBlas <init>
INFO: Number of threads used for BLAS: 4
May 15, 2019 12:33:59 PM org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner printEnvironmentInformation
INFO: Backend used: [CPU]; OS: [Linux]
INFO: Cores: [4]; Memory: [1.7GB];
INFO: Blas vendor: [OPENBLAS]
package test;
import org.junit.Test;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.api.shape.Shape;
import org.nd4j.linalg.convolution.Convolution;
import org.nd4j.linalg.factory.Nd4j;
public class testBenchmark {
private final int[] inputSize = new int[]{ 1046, 446, 16 };
private final int[] outputSize = new int[]{ 1042, 442, 32 };
private final int[] kernelSize = new int[]{ 5, 5 };
private final int[] strideSize = new int[]{ 1, 1 };
private final int[] weightShape = new int[]{ 5, 5, 16, 32 };
private final int[] biasShape = new int[]{ 1, 32 };
private final int outW = outputSize[0];
private final int outH = outputSize[1];
private final int outDepth = outputSize[2];
private final int inDepth = inputSize[2];
private final int sW = strideSize[0];
private final int sH = strideSize[1];
private final int kW = kernelSize[0];
private final int kH = kernelSize[1];
private final INDArray weights = Nd4j.ones(weightShape);
private final INDArray bias = Nd4j.ones(biasShape);
private final VERSION version = VERSION.V091;
public enum VERSION {
V091, V100BETA2
}
@Test
public void testConv2DLayer() {
for (int i = 0; i < 1000; i++) {
// input
INDArray input = Nd4j.zeros(inputSize);
input = input.reshape(new int[]{ inputSize[0], inputSize[1], inputSize[2], 1 });
input.permutei(3, 2, 1, 0);
// weights
INDArray reshapedW = weights.reshape('f', kW * kH * inDepth, outDepth);
// im2col
INDArray col = Nd4j.createUninitialized(new int[]{ 1, outH, outW, inDepth, kH, kW }, 'c');
INDArray col2 = col.permute(0, 3, 4, 5, 1, 2);
long im2col1 = System.currentTimeMillis();
if (version == VERSION.V100BETA2) {
Convolution.im2col(input, kH, kW, sH, sW, 0, 0, 1, 1, false, col2);
} else {
Convolution.im2col(input, kH, kW, sH, sW, 0, 0, false, col2);
}
long im2col2 = System.currentTimeMillis();
INDArray im2col2d =
Shape.newShapeNoCopy(col, new int[]{ 1 * outH * outW, inDepth * kH * kW }, false);
// matrix multiplication
long mmul1 = System.currentTimeMillis();
INDArray output = im2col2d.mmul(reshapedW);
long mmul2 = System.currentTimeMillis();
// bias
output.addiRowVector(bias);
System.out.println("im2col used " + (im2col2 - im2col1) + "mmul used " + (mmul2 - mmul1));
}
}
}
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>test</groupId>
<artifactId>simple-test</artifactId>
<version>0.0.1-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
<!-- nd4j 0.9.1 -->
<!-- <dependency> -->
<!-- <groupId>org.nd4j</groupId> -->
<!-- <artifactId>nd4j-native-platform</artifactId> -->
<!-- <version>0.9.1</version> -->
<!-- </dependency> -->
<!-- <dependency> -->
<!-- <groupId>org.slf4j</groupId> -->
<!-- <artifactId>slf4j-jdk14</artifactId> -->
<!-- <version>1.7.10</version> -->
<!-- </dependency> -->
<!-- nd4j 1.0.0-beta2 -->
<dependency>
<groupId>org.nd4j</groupId>
<artifactId>nd4j-native-platform</artifactId>
<version>1.0.0-beta4</version>
<exclusions>
<exclusion>
<groupId>org.bytedeco.javacpp-presets</groupId>
<artifactId>mkl-dnn-platform</artifactId>
</exclusion>
<exclusion>
<groupId>org.bytedeco.javacpp-presets</groupId>
<artifactId>mkl-platform</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-jdk14</artifactId>
<version>1.7.21</version>
</dependency>
</dependencies>
</project>
Please try again with 1.0.0-beta4.
Yeah, we're on 1.0.0-beta4 now
There's no point benchmarking an old version (except to compare against the latest).
We benchmarked and optimized a number of operations in beta4, and added MKL-DNN support.
@saudet @AlexDBlack,
Thanks you for the suggestion. I was choosing 1.0.0-beta2 because I cannot disable mkl in 1.0.0-beta4 with the flag -Dorg.bytedeco.javacpp.openblas.nomkl=true which is mentioned in https://github.com/bytedeco/javacpp-presets/blob/master/CHANGELOG.md
So my benchmark result will be on different BLAS if using 1.0.0-beta4. Was I missing other configurations to force choosing OpenBLAS?
Unit test
@Before
public void prepare() {
System.out.println("org.bytedeco.javacpp.openblas.nomkl=" + System.getProperty("org.bytedeco.javacpp.openblas.nomkl"));
}
Printed result
May 15, 2019 5:38:06 PM org.nd4j.linalg.factory.Nd4jBackend load
INFO: Loaded [CpuBackend] backend
May 15, 2019 5:38:07 PM org.nd4j.nativeblas.NativeOpsHolder <init>
INFO: Number of threads used for NativeOps: 4
May 15, 2019 5:38:07 PM org.nd4j.nativeblas.Nd4jBlas <init>
INFO: Number of threads used for BLAS: 4
May 15, 2019 5:38:07 PM org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner printEnvironmentInformation
INFO: Backend used: [CPU]; OS: [Linux]
May 15, 2019 5:38:07 PM org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner printEnvironmentInformation
INFO: Cores: [4]; Memory: [1.7GB];
May 15, 2019 5:38:07 PM org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner printEnvironmentInformation
INFO: Blas vendor: [MKL]
org.bytedeco.javacpp.openblas.nomkl=true
That property has been enhanced to support other generic BLAS libraries.
Use -Dorg.bytedeco.openblas.load=openblas instead.
@saudet, thank you for the help. With the configuration -Dorg.bytedeco.openblas.load=openblas, the test now is running on OpenBLAS and the benchmark result is good.
Btw, I have another question and I am not sure if I can put the question here. If not, I will create a new issue.
The unit test depended on an uber jar which was built from nd4j-native-platform v0.9.1. The uber jar was created by following the similar shading process here https://github.com/deeplearning4j/dl4j-examples/blob/master/standalone-sample-project/pom.xml which is to shade all the files of nd4j-native-platform into one uber jar.
In nd4j both 0.9.1 and 1.0.0-beta2, the nd4j environment could be initialized successfully with one shaded uber jar. However, in 1.0.0-beta4, the unit test throws below errors while trying to initialize nd4j-native-platform from the uber jar v1.0.0-beta4 which is also built by maven-shade-plugin.
java.lang.ExceptionInInitializerError
at org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.<init>(NativeOpExecutioner.java:88)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at org.nd4j.linalg.factory.Nd4j.initWithBackend(Nd4j.java:5858)
at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5766)
at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:202)
at test.testBenchmark.<init>(testBenchmark.java:27)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.junit.runners.BlockJUnit4ClassRunner.createTest(BlockJUnit4ClassRunner.java:217)
at org.junit.runners.BlockJUnit4ClassRunner$1.runReflectiveCall(BlockJUnit4ClassRunner.java:266)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.BlockJUnit4ClassRunner.methodBlock(BlockJUnit4ClassRunner.java:263)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:538)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:760)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:460)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:206)
Caused by: java.lang.RuntimeException: ND4J is probably missing dependencies. For more information, please refer to: http://nd4j.org/getstarted.html
at org.nd4j.nativeblas.NativeOpsHolder.<init>(NativeOpsHolder.java:68)
at org.nd4j.nativeblas.NativeOpsHolder.<clinit>(NativeOpsHolder.java:36)
... 32 more
Caused by: java.lang.UnsatisfiedLinkError: no jnind4jcpu in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at java.lang.System.loadLibrary(System.java:1122)
at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1336)
at org.bytedeco.javacpp.Loader.load(Loader.java:1077)
at org.bytedeco.javacpp.Loader.load(Loader.java:947)
at org.nd4j.nativeblas.Nd4jCpu.<clinit>(Nd4jCpu.java:10)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.bytedeco.javacpp.Loader.load(Loader.java:1006)
at org.bytedeco.javacpp.Loader.load(Loader.java:947)
at org.nd4j.nativeblas.Nd4jCpu$NativeOps.<clinit>(Nd4jCpu.java:457)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.nd4j.nativeblas.NativeOpsHolder.<init>(NativeOpsHolder.java:46)
... 33 more
Caused by: java.lang.UnsatisfiedLinkError: /home/user1/.javacpp/cache/nd4j-uberjar-1.0.0-beta4.jar/org/nd4j/nativeblas/linux-x86_64/libjnind4jcpu.so: libmkldnn.so.0: cannot open shared object file: No such file or directory
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
at java.lang.Runtime.load0(Runtime.java:809)
at java.lang.System.load(System.java:1086)
at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1316)
... 44 more
libnd4j now also links with MKL-DNN by default so I'm afraid this is a new required dependency, unless you're on a platform where MKL-DNN isn't supported like ARM or POWER. It's always possible to create custom builds from source without MKL-DNN though.
Most helpful comment
@saudet, thank you for the help. With the configuration
-Dorg.bytedeco.openblas.load=openblas, the test now is running on OpenBLAS and the benchmark result is good.The average processing time of im2col is 266.496 ms (std 18ms)
The average processing time of mmul is 201.33 ms (std 25ms)
The average processing time of im2col is 193.304 ms (std 18ms)
The average processing time of mmul is 191.008 ms (std 16ms)
Btw, I have another question and I am not sure if I can put the question here. If not, I will create a new issue.
Question
The unit test depended on an uber jar which was built from nd4j-native-platform v0.9.1. The uber jar was created by following the similar shading process here https://github.com/deeplearning4j/dl4j-examples/blob/master/standalone-sample-project/pom.xml which is to shade all the files of nd4j-native-platform into one uber jar.
In nd4j both 0.9.1 and 1.0.0-beta2, the nd4j environment could be initialized successfully with one shaded uber jar. However, in 1.0.0-beta4, the unit test throws below errors while trying to initialize nd4j-native-platform from the uber jar v1.0.0-beta4 which is also built by
maven-shade-plugin.