Deeplearning4j: Question of benchmark between two versions

Created on 15 May 2019  路  6Comments  路  Source: eclipse/deeplearning4j

Issue Description

This is not a bug, but I noticed that the processing time of Convolution.im2col() and a.mmul(b) in nd4j-1.0.0-beta2 is longer than the processing time in nd4j-0.9.1.
The tests were performed using below ndarrays on the same hardware. I read this page https://deeplearning4j.org/docs/latest/deeplearning4j-config-performance-debugging, but I still have no idea how to deal with this. If this is not a proper place to ask a question, please suggest where to post my question. Appreciate that.

Convolutional2DLayer with below arrays

  • input size [1046, 446, 16]
  • output size [1042, 442, 32]
  • weight size [5, 5, 16, 32]

Processing time of running 1000 cycles in Convolutional2DLayer.

  • In nd4j native platform 0.9.1
    The average processing time of im2col is 266.496 ms (std 18ms)
    The average processing time of mmul is 201.33 ms (std 25ms)
  • In nd4j native platform 1.0.0-beta2
    The average processing time of im2col is 384.456 ms (std 43ms)
    The average processing time of mmul is 222.959 ms (std 38ms)

Version Information

The hardware is the same.

  • Log in nd4j native platform 0.9.1
May 15, 2019 12:49:21 PM org.nd4j.linalg.factory.Nd4jBackend load
INFO: Loaded [CpuBackend] backend
May 15, 2019 12:49:21 PM org.reflections.Reflections scan
INFO: Reflections took 165 ms to scan 64 urls, producing 40181 keys and 43883 values 
May 15, 2019 12:49:21 PM org.nd4j.nativeblas.NativeOpsHolder <init>
INFO: Number of threads used for NativeOps: 4
May 15, 2019 12:49:21 PM org.reflections.Reflections scan
INFO: Reflections took 137 ms to scan 11 urls, producing 31 keys and 227 values 
May 15, 2019 12:49:21 PM org.nd4j.nativeblas.Nd4jBlas <init>
INFO: Number of threads used for BLAS: 4
May 15, 2019 12:49:21 PM org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner printEnvironmentInformation
INFO: Backend used: [CPU]; OS: [Linux]
INFO: Cores: [4]; Memory: [1.7GB];
INFO: Blas vendor: [OPENBLAS]
  • Log in nd4j native platform 1.0.0-beta2
May 15, 2019 12:33:58 PM org.nd4j.linalg.factory.Nd4jBackend load
INFO: Loaded [CpuBackend] backend
May 15, 2019 12:33:59 PM org.nd4j.nativeblas.NativeOpsHolder <init>
INFO: Number of threads used for NativeOps: 4
May 15, 2019 12:33:59 PM org.nd4j.nativeblas.Nd4jBlas <init>
INFO: Number of threads used for BLAS: 4
May 15, 2019 12:33:59 PM org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner printEnvironmentInformation
INFO: Backend used: [CPU]; OS: [Linux]
INFO: Cores: [4]; Memory: [1.7GB];
INFO: Blas vendor: [OPENBLAS]

Testing Code

package test;

import org.junit.Test;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.api.shape.Shape;
import org.nd4j.linalg.convolution.Convolution;
import org.nd4j.linalg.factory.Nd4j;

public class testBenchmark {

    private final int[] inputSize = new int[]{ 1046, 446, 16 };
    private final int[] outputSize = new int[]{ 1042, 442, 32 };
    private final int[] kernelSize = new int[]{ 5, 5 };
    private final int[] strideSize = new int[]{ 1, 1 };
    private final int[] weightShape = new int[]{ 5, 5, 16, 32 };
    private final int[] biasShape = new int[]{ 1, 32 };
    private final int outW = outputSize[0];
    private final int outH = outputSize[1];
    private final int outDepth = outputSize[2];
    private final int inDepth = inputSize[2];
    private final int sW = strideSize[0];
    private final int sH = strideSize[1];
    private final int kW = kernelSize[0];
    private final int kH = kernelSize[1];
    private final INDArray weights = Nd4j.ones(weightShape);
    private final INDArray bias = Nd4j.ones(biasShape);
    private final VERSION version = VERSION.V091;

    public enum VERSION {
        V091, V100BETA2
    }

    @Test
    public void testConv2DLayer() {
        for (int i = 0; i < 1000; i++) {
            // input
            INDArray input = Nd4j.zeros(inputSize);
            input = input.reshape(new int[]{ inputSize[0], inputSize[1], inputSize[2], 1 });
            input.permutei(3, 2, 1, 0);

            // weights
            INDArray reshapedW = weights.reshape('f', kW * kH * inDepth, outDepth);

            // im2col
            INDArray col = Nd4j.createUninitialized(new int[]{ 1, outH, outW, inDepth, kH, kW }, 'c');
            INDArray col2 = col.permute(0, 3, 4, 5, 1, 2);
            long im2col1 = System.currentTimeMillis();
            if (version == VERSION.V100BETA2) {
                Convolution.im2col(input, kH, kW, sH, sW, 0, 0, 1, 1, false, col2);
            } else {
                Convolution.im2col(input, kH, kW, sH, sW, 0, 0, false, col2);
            }
            long im2col2 = System.currentTimeMillis();
            INDArray im2col2d =
                Shape.newShapeNoCopy(col, new int[]{ 1 * outH * outW, inDepth * kH * kW }, false);

            // matrix multiplication
            long mmul1 = System.currentTimeMillis();
            INDArray output = im2col2d.mmul(reshapedW);
            long mmul2 = System.currentTimeMillis();

            // bias
            output.addiRowVector(bias);
            System.out.println("im2col used " + (im2col2 - im2col1) + "mmul used " + (mmul2 - mmul1));
        }
    }
}

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>test</groupId>
    <artifactId>simple-test</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>

        <!-- nd4j 0.9.1 -->
        <!-- <dependency> -->
        <!-- <groupId>org.nd4j</groupId> -->
        <!-- <artifactId>nd4j-native-platform</artifactId> -->
        <!-- <version>0.9.1</version> -->
        <!-- </dependency> -->
        <!-- <dependency> -->
        <!-- <groupId>org.slf4j</groupId> -->
        <!-- <artifactId>slf4j-jdk14</artifactId> -->
        <!-- <version>1.7.10</version> -->
        <!-- </dependency> -->

        <!-- nd4j 1.0.0-beta2 -->
        <dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>nd4j-native-platform</artifactId>
            <version>1.0.0-beta4</version>
            <exclusions>
                <exclusion>
                    <groupId>org.bytedeco.javacpp-presets</groupId>
                    <artifactId>mkl-dnn-platform</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.bytedeco.javacpp-presets</groupId>
                    <artifactId>mkl-platform</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-jdk14</artifactId>
            <version>1.7.21</version>
        </dependency>
    </dependencies>
</project>
ND4J Question

Most helpful comment

@saudet, thank you for the help. With the configuration -Dorg.bytedeco.openblas.load=openblas, the test now is running on OpenBLAS and the benchmark result is good.

  • In nd4j native platform 0.9.1
    The average processing time of im2col is 266.496 ms (std 18ms)
    The average processing time of mmul is 201.33 ms (std 25ms)
  • In nd4j native platform 1.0.0-beta4
    The average processing time of im2col is 193.304 ms (std 18ms)
    The average processing time of mmul is 191.008 ms (std 16ms)

Btw, I have another question and I am not sure if I can put the question here. If not, I will create a new issue.

Question

The unit test depended on an uber jar which was built from nd4j-native-platform v0.9.1. The uber jar was created by following the similar shading process here https://github.com/deeplearning4j/dl4j-examples/blob/master/standalone-sample-project/pom.xml which is to shade all the files of nd4j-native-platform into one uber jar.

In nd4j both 0.9.1 and 1.0.0-beta2, the nd4j environment could be initialized successfully with one shaded uber jar. However, in 1.0.0-beta4, the unit test throws below errors while trying to initialize nd4j-native-platform from the uber jar v1.0.0-beta4 which is also built by maven-shade-plugin.

java.lang.ExceptionInInitializerError
    at org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.<init>(NativeOpExecutioner.java:88)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at java.lang.Class.newInstance(Class.java:442)
    at org.nd4j.linalg.factory.Nd4j.initWithBackend(Nd4j.java:5858)
    at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5766)
    at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:202)
    at test.testBenchmark.<init>(testBenchmark.java:27)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.junit.runners.BlockJUnit4ClassRunner.createTest(BlockJUnit4ClassRunner.java:217)
    at org.junit.runners.BlockJUnit4ClassRunner$1.runReflectiveCall(BlockJUnit4ClassRunner.java:266)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.BlockJUnit4ClassRunner.methodBlock(BlockJUnit4ClassRunner.java:263)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:538)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:760)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:460)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:206)
Caused by: java.lang.RuntimeException: ND4J is probably missing dependencies. For more information, please refer to: http://nd4j.org/getstarted.html
    at org.nd4j.nativeblas.NativeOpsHolder.<init>(NativeOpsHolder.java:68)
    at org.nd4j.nativeblas.NativeOpsHolder.<clinit>(NativeOpsHolder.java:36)
    ... 32 more
Caused by: java.lang.UnsatisfiedLinkError: no jnind4jcpu in java.library.path
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
    at java.lang.Runtime.loadLibrary0(Runtime.java:870)
    at java.lang.System.loadLibrary(System.java:1122)
    at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1336)
    at org.bytedeco.javacpp.Loader.load(Loader.java:1077)
    at org.bytedeco.javacpp.Loader.load(Loader.java:947)
    at org.nd4j.nativeblas.Nd4jCpu.<clinit>(Nd4jCpu.java:10)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.bytedeco.javacpp.Loader.load(Loader.java:1006)
    at org.bytedeco.javacpp.Loader.load(Loader.java:947)
    at org.nd4j.nativeblas.Nd4jCpu$NativeOps.<clinit>(Nd4jCpu.java:457)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at org.nd4j.nativeblas.NativeOpsHolder.<init>(NativeOpsHolder.java:46)
    ... 33 more
Caused by: java.lang.UnsatisfiedLinkError: /home/user1/.javacpp/cache/nd4j-uberjar-1.0.0-beta4.jar/org/nd4j/nativeblas/linux-x86_64/libjnind4jcpu.so: libmkldnn.so.0: cannot open shared object file: No such file or directory
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
    at java.lang.Runtime.load0(Runtime.java:809)
    at java.lang.System.load(System.java:1086)
    at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1316)
    ... 44 more

All 6 comments

Please try again with 1.0.0-beta4.

Yeah, we're on 1.0.0-beta4 now
There's no point benchmarking an old version (except to compare against the latest).
We benchmarked and optimized a number of operations in beta4, and added MKL-DNN support.

@saudet @AlexDBlack,

Thanks you for the suggestion. I was choosing 1.0.0-beta2 because I cannot disable mkl in 1.0.0-beta4 with the flag -Dorg.bytedeco.javacpp.openblas.nomkl=true which is mentioned in https://github.com/bytedeco/javacpp-presets/blob/master/CHANGELOG.md
So my benchmark result will be on different BLAS if using 1.0.0-beta4. Was I missing other configurations to force choosing OpenBLAS?

Unit test

    @Before
    public void prepare() {
        System.out.println("org.bytedeco.javacpp.openblas.nomkl=" + System.getProperty("org.bytedeco.javacpp.openblas.nomkl"));
    }

Printed result

May 15, 2019 5:38:06 PM org.nd4j.linalg.factory.Nd4jBackend load
INFO: Loaded [CpuBackend] backend
May 15, 2019 5:38:07 PM org.nd4j.nativeblas.NativeOpsHolder <init>
INFO: Number of threads used for NativeOps: 4
May 15, 2019 5:38:07 PM org.nd4j.nativeblas.Nd4jBlas <init>
INFO: Number of threads used for BLAS: 4
May 15, 2019 5:38:07 PM org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner printEnvironmentInformation
INFO: Backend used: [CPU]; OS: [Linux]
May 15, 2019 5:38:07 PM org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner printEnvironmentInformation
INFO: Cores: [4]; Memory: [1.7GB];
May 15, 2019 5:38:07 PM org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner printEnvironmentInformation
INFO: Blas vendor: [MKL]
org.bytedeco.javacpp.openblas.nomkl=true

That property has been enhanced to support other generic BLAS libraries.
Use -Dorg.bytedeco.openblas.load=openblas instead.

@saudet, thank you for the help. With the configuration -Dorg.bytedeco.openblas.load=openblas, the test now is running on OpenBLAS and the benchmark result is good.

  • In nd4j native platform 0.9.1
    The average processing time of im2col is 266.496 ms (std 18ms)
    The average processing time of mmul is 201.33 ms (std 25ms)
  • In nd4j native platform 1.0.0-beta4
    The average processing time of im2col is 193.304 ms (std 18ms)
    The average processing time of mmul is 191.008 ms (std 16ms)

Btw, I have another question and I am not sure if I can put the question here. If not, I will create a new issue.

Question

The unit test depended on an uber jar which was built from nd4j-native-platform v0.9.1. The uber jar was created by following the similar shading process here https://github.com/deeplearning4j/dl4j-examples/blob/master/standalone-sample-project/pom.xml which is to shade all the files of nd4j-native-platform into one uber jar.

In nd4j both 0.9.1 and 1.0.0-beta2, the nd4j environment could be initialized successfully with one shaded uber jar. However, in 1.0.0-beta4, the unit test throws below errors while trying to initialize nd4j-native-platform from the uber jar v1.0.0-beta4 which is also built by maven-shade-plugin.

java.lang.ExceptionInInitializerError
    at org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.<init>(NativeOpExecutioner.java:88)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at java.lang.Class.newInstance(Class.java:442)
    at org.nd4j.linalg.factory.Nd4j.initWithBackend(Nd4j.java:5858)
    at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5766)
    at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:202)
    at test.testBenchmark.<init>(testBenchmark.java:27)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.junit.runners.BlockJUnit4ClassRunner.createTest(BlockJUnit4ClassRunner.java:217)
    at org.junit.runners.BlockJUnit4ClassRunner$1.runReflectiveCall(BlockJUnit4ClassRunner.java:266)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.BlockJUnit4ClassRunner.methodBlock(BlockJUnit4ClassRunner.java:263)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:538)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:760)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:460)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:206)
Caused by: java.lang.RuntimeException: ND4J is probably missing dependencies. For more information, please refer to: http://nd4j.org/getstarted.html
    at org.nd4j.nativeblas.NativeOpsHolder.<init>(NativeOpsHolder.java:68)
    at org.nd4j.nativeblas.NativeOpsHolder.<clinit>(NativeOpsHolder.java:36)
    ... 32 more
Caused by: java.lang.UnsatisfiedLinkError: no jnind4jcpu in java.library.path
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
    at java.lang.Runtime.loadLibrary0(Runtime.java:870)
    at java.lang.System.loadLibrary(System.java:1122)
    at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1336)
    at org.bytedeco.javacpp.Loader.load(Loader.java:1077)
    at org.bytedeco.javacpp.Loader.load(Loader.java:947)
    at org.nd4j.nativeblas.Nd4jCpu.<clinit>(Nd4jCpu.java:10)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.bytedeco.javacpp.Loader.load(Loader.java:1006)
    at org.bytedeco.javacpp.Loader.load(Loader.java:947)
    at org.nd4j.nativeblas.Nd4jCpu$NativeOps.<clinit>(Nd4jCpu.java:457)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at org.nd4j.nativeblas.NativeOpsHolder.<init>(NativeOpsHolder.java:46)
    ... 33 more
Caused by: java.lang.UnsatisfiedLinkError: /home/user1/.javacpp/cache/nd4j-uberjar-1.0.0-beta4.jar/org/nd4j/nativeblas/linux-x86_64/libjnind4jcpu.so: libmkldnn.so.0: cannot open shared object file: No such file or directory
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
    at java.lang.Runtime.load0(Runtime.java:809)
    at java.lang.System.load(System.java:1086)
    at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1316)
    ... 44 more

libnd4j now also links with MKL-DNN by default so I'm afraid this is a new required dependency, unless you're on a platform where MKL-DNN isn't supported like ARM or POWER. It's always possible to create custom builds from source without MKL-DNN though.

Was this page helpful?
0 / 5 - 0 ratings