Deeplearning4j: Keras model import is unstable

Created on 4 Apr 2017  路  11Comments  路  Source: eclipse/deeplearning4j

Keras model importer crashes due to issues in libhdf5.so

We have used Keras model importer tools to import Inception-V3 model to DL4J.
It works most of the times, however, occasionally JVM crashes due to issue in the native code.

Out of 100 times to test the model-import and to classify a test image, DL4J crashed 6 times for the following reason:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fddefb988e4, pid=15318, tid=0x00007fddf410d700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libhdf5.so.100+0x1138e4]  H5FL_reg_free+0x84
#
# Core dump written. Default location: /home/tg/work/projects/apache/tika/core or core.15318
#
# An error report file with more information is saved as:
# /home/tg/work/projects/apache/tika/hs_err_pid15318.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

By setting a system property -Dorg.bytedeco.javacpp.nopointergc=true, the crash can be avoided, however it results in memory leak and demands excessive memory.

Details to reproduce the bug are here https://github.com/apache/tika/pull/165#issuecomment-291394161 and https://github.com/apache/tika/pull/165#issuecomment-291402383

Version Information

  • Deeplearning4j version : 0.8.0
  • platform information :
$ uname -a
Linux hackb0x 4.4.0-21-generic #37-Ubuntu SMP Mon Apr 18 18:33:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 compressed oops)
  • CUDA version : Not used
  • NVIDIA driver version: Not used
Bug

All 11 comments

This is most likely caused by an object from HDF5 being garbage collected prematurely.

thanks for checking into this @saudet and for reporting @thammegowda

I found one place in the code that might cause this kind of error to happen, fixed that in the commit above, but I would need to test this some more. Would you have an easy way to reproduce that outside of Tika?

@saudet Thanks.
Yes, there is a simpler way::

# get this code and model
git clone [email protected]:USCDataScience/dl4j-kerasimport-examples.git
cd dl4j-kerasimport-examples/dl4j-import-example/
# here you may have to edit pom.xml to set dl4j version to SNAPSHOT
# build
mvn clean package
# get a sample image
wget http://www.lamorindabaseball.org/wp-content/uploads/2015/09/Trojans3.jpg
# run a test 
alias imagerec="java -Xmx400m -jar target/dl4j-keras-imports-example-1.0-SNAPSHOT-jar-with-dependencies.jar"
imagerec Trojans3.jpg

## Now test it in a loop 
for i in {1..100}; do echo $i; echo "==$i==" >> out; imagerec Trojans3.jpg >> out 2>&1; done
echo "JVM crashed `ls hs_err_pid*.log| wc -l` times"

P.S.
observation: JVM crashes more often in Linux than in OSX. So it seems a lot easier to reproduce in Linux-x86_64 than in OSX

Great, thanks! With 0.8.0 it crashes right away, but with 0.8.1-SNAPSHOT, which includes the changes in the commit above, I couldn't get it to crash even once in 150 executions. I'm considering this fixed!

@saudet this is awesome. Thanks.

FRICKIN AWESOME. Thanks @thammegowda and @saudet !

Hi!

I think I've just got bitten by this bug. I've got over 700 hdf5 models that I try to import and I get a big ugly core dump after a few hundreds (see attachment).

C  [libhdf5.100.dylib+0x1d363c]  H5SL_search+0xb5c
C  [libhdf5.100.dylib+0x126028]  H5I_object+0x68
C  [libhdf5.100.dylib+0xebcf8]  H5G_loc+0x208
C  [libhdf5.100.dylib+0xddd34]  H5Gget_info+0x94
C  [libhdf5_cpp.100.dylib+0x2204f]  H5::CommonFG::getNumObjs() const+0x1f
C  [libjnihdf5.dylib+0x2b167]  Java_org_bytedeco_javacpp_hdf5_00024CommonFG_getNumObjs+0x47
J 3150  org.bytedeco.javacpp.hdf5$CommonFG.getNumObjs()J (0 bytes) @ 0x00000001159db3a6 [0x00000001159db300+0xa6]

hs_err_pid65474.log

@alexweil Try again with DL4J 1.0.0-beta, this should be fixed. If not, please open another issue.

thanks @saudet
it works ok in 1.0.0-beta

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

AlexDBlack picture AlexDBlack  路  5Comments

tintinxue1 picture tintinxue1  路  5Comments

AlexanderSerbul picture AlexanderSerbul  路  3Comments

atuzhykov picture atuzhykov  路  4Comments

turambar picture turambar  路  3Comments