Deeplearning4j: DataVec Python: incorrect address returned for python object + no validation

Created on 20 Feb 2020  路  8Comments  路  Source: eclipse/deeplearning4j

So I'm trying to run the datavec-python tests on locally on Windows, in IntelliJ.
Happens for both CPU and CUDA.
Windows 10, Python 3.7.4 is also installed and on the Path via Anaconda (though happens without Anaconda installed too).
Perhaps it's a system configuration issue - I'm pretty sure I ran these tests OK on my old machine.

Running on branch sa/new-datatypes-pythonexecutioner - https://github.com/KonduitAI/deeplearning4j/pull/241

As far as I can tell, the python object (numpy array) is valid - note the [[1 1 1]\n [1 1 1]] in the debugger - but PyLong_AsLong(pyAddress) is returning an address of -1 for me.

Additionally this isn't picked up as an invalid address in the DataVec NumpyArray; the first time we see that a problem has occurred is the cudaMemcpyAsync from the invalid address for CUDA (or a JVM crash for CPU). We should have at least basic validation on the Java side.

image

Failed on [FFFFFFFFFFFFFFFF] -> [0000002A01EA2600], size: [48], direction: [1], dZ: [1]

java.lang.RuntimeException: cudaMemcpyAsync failed; Error code: 1

    at org.nd4j.jita.allocator.pointers.cuda.cudaStream_t.synchronize(cudaStream_t.java:43)
    at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.<init>(BaseCudaDataBuffer.java:140)
    at org.nd4j.linalg.jcublas.buffer.CudaLongDataBuffer.<init>(CudaLongDataBuffer.java:49)
    at org.nd4j.linalg.jcublas.buffer.factory.CudaDataBufferFactory.create(CudaDataBufferFactory.java:691)
    at org.nd4j.linalg.factory.Nd4j.createBuffer(Nd4j.java:1127)
    at org.datavec.python.NumpyArray.setND4JArray(NumpyArray.java:94)
    at org.datavec.python.NumpyArray.<init>(NumpyArray.java:63)
    at org.datavec.python.NumpyArray.<init>(NumpyArray.java:82)
    at org.datavec.python.PythonObject.toNumpy(PythonObject.java:341)
    at org.datavec.python.PythonType$7.toJava(PythonType.java:208)
    at org.datavec.python.PythonType$7.toJava(PythonType.java:201)
    at org.datavec.python.PythonExecutioner.getVariable(PythonExecutioner.java:197)
    at org.datavec.python.PythonExecutioner.getVariables(PythonExecutioner.java:205)
    at org.datavec.python.PythonExecutioner.exec(PythonExecutioner.java:244)
    at org.datavec.python.Python.exec(Python.java:262)
    at org.datavec.python.TestPythonExecutioner.testNDArrayLong(TestPythonExecutioner.java:220)
Bug

All 8 comments

This is using ctypes, which isn't the most efficient API out there. If we use NumPy's C API everywhere, I don't think we'd need to use ctypes anyway?

So I've tried Java 8 (OpenJDK 8) in addition to Java 11 that I was using when I opened the issue.
Same result.

Tried a couple of other random things, all that didn't help:

  • Not inheriting the system environment variables (PATH etc)
  • Setting OMP_NUM_THREADS=8
  • Clearing JavaCPP cache

All result in the same -1 address and hence the JVM crash (CPU) or memcpy issue (CUDA).

So still not sure what's up here...

This may be Windows specific somehow.
The same machine (3990x) on Linux (Clear Linux 32370) runs the tests fine on CPU - no crashes, I'm getting a valid pointer address.

@farizrahman4u Can we try to replace ctypes with numpy everywhere?

@AlexDBlack can you pull sa/new-datatypes-pythonexecutioner and try again?

So, retesting this: I've run a bunch of times each on CPU and CUDA on windows, for branch sa/new-datatypes-pythonexecutioner - no crashes :tada:

I am however seeing a test failure on TestPythonExecutioner.testByteBufferInplace (IIRC Shams flagged this already; now that I can run without crashing I'll look into it)

sa/new-datatypes-pythonexecutioner isn't merged yet, but it's close... I'll close this issue as resolved.

Was this page helpful?
0 / 5 - 0 ratings