We used the Keras model import tools to import the Inception-V3 model into DL4J.
It works most of the time, but the JVM occasionally crashes because of an issue in native code.
In 100 runs that imported the model and classified a test image, DL4J crashed 6 times with the following error:
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fddefb988e4, pid=15318, tid=0x00007fddf410d700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libhdf5.so.100+0x1138e4] H5FL_reg_free+0x84
#
# Core dump written. Default location: /home/tg/work/projects/apache/tika/core or core.15318
#
# An error report file with more information is saved as:
# /home/tg/work/projects/apache/tika/hs_err_pid15318.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
Setting the system property -Dorg.bytedeco.javacpp.nopointergc=true avoids the crash, but it causes a memory leak and demands excessive memory.
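For anyone who wants to try the workaround, here is a minimal sketch of how the property could be passed on the command line. The jar name `app.jar` and the image name `test.jpg` are placeholders, not the actual Tika artifacts. The snippet only builds and prints the command rather than running it.

```shell
# Placeholder names (assumptions): substitute your own jar and input image.
JAR="app.jar"
IMAGE="test.jpg"

# JVM options such as -D must appear before -jar; anything after the jar
# path is passed to the application as an ordinary argument instead.
CMD="java -Dorg.bytedeco.javacpp.nopointergc=true -jar $JAR $IMAGE"

# Print the command so it can be inspected before running it.
echo "$CMD"
```

Since this setting leaks memory, it is probably only suitable for short-lived runs.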
Details to reproduce the bug are here https://github.com/apache/tika/pull/165#issuecomment-291394161 and https://github.com/apache/tika/pull/165#issuecomment-291402383
$ uname -a
Linux hackb0x 4.4.0-21-generic #37-Ubuntu SMP Mon Apr 18 18:33:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 compressed oops)
This is most likely caused by an object from HDF5 being garbage collected prematurely.
thanks for checking into this @saudet and for reporting @thammegowda
I found one place in the code that might cause this kind of error and fixed it in the commit above, but I need to test it some more. Would you have an easy way to reproduce that outside of Tika?
@saudet Thanks.
Yes, there is a simpler way:
# get this code and model
git clone git@github.com:USCDataScience/dl4j-kerasimport-examples.git
cd dl4j-kerasimport-examples/dl4j-import-example/
# here you may have to edit pom.xml to set dl4j version to SNAPSHOT
# build
mvn clean package
# get a sample image
wget http://www.lamorindabaseball.org/wp-content/uploads/2015/09/Trojans3.jpg
# run a test
alias imagerec="java -Xmx400m -jar target/dl4j-keras-imports-example-1.0-SNAPSHOT-jar-with-dependencies.jar"
imagerec Trojans3.jpg
## Now test it in a loop
for i in {1..100}; do echo $i; echo "==$i==" >> out; imagerec Trojans3.jpg >> out 2>&1; done
echo "JVM crashed $(ls hs_err_pid*.log 2>/dev/null | wc -l) times"
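Beyond counting the crashes, it may help to see whether they all land in the same native frame. The sketch below assumes each hs_err log has a `# Problematic frame:` line followed by the frame itself, as in the report above. The function name `summarize_frames` is made up for this example.

```shell
# Print the frame listed under "# Problematic frame:" in each given log,
# then count how often each distinct frame occurs (most frequent first).
summarize_frames() {
  awk '/^# Problematic frame:/ {
         if ((getline line) > 0) { sub(/^# */, "", line); print line }
       }' "$@" | sort | uniq -c | sort -rn
}

# Only summarize when at least one crash log exists in the current directory.
set -- hs_err_pid*.log
if [ -e "$1" ]; then
  summarize_frames "$@"
else
  echo "no hs_err_pid*.log files found"
fi
```

If every crash reports the same HDF5 frame, that points to a single root cause rather than several unrelated ones.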
P.S.
Observation: the JVM crashes more often on Linux than on OS X, so the bug is much easier to reproduce on Linux x86_64.
Great, thanks! With 0.8.0 it crashes right away, but with 0.8.1-SNAPSHOT, which includes the changes in the commit above, I couldn't get it to crash even once in 150 executions. I'm considering this fixed!
@saudet this is awesome. Thanks.
FRICKIN AWESOME. Thanks @thammegowda and @saudet !
Hi!
I think I've just been bitten by this bug. I have over 700 HDF5 models that I'm trying to import, and I get a big ugly core dump after a few hundred (see attachment).
C [libhdf5.100.dylib+0x1d363c] H5SL_search+0xb5c
C [libhdf5.100.dylib+0x126028] H5I_object+0x68
C [libhdf5.100.dylib+0xebcf8] H5G_loc+0x208
C [libhdf5.100.dylib+0xddd34] H5Gget_info+0x94
C [libhdf5_cpp.100.dylib+0x2204f] H5::CommonFG::getNumObjs() const+0x1f
C [libjnihdf5.dylib+0x2b167] Java_org_bytedeco_javacpp_hdf5_00024CommonFG_getNumObjs+0x47
J 3150 org.bytedeco.javacpp.hdf5$CommonFG.getNumObjs()J (0 bytes) @ 0x00000001159db3a6 [0x00000001159db300+0xa6]
@alexweil Try again with DL4J 1.0.0-beta, this should be fixed. If not, please open another issue.
thanks @saudet
it works ok in 1.0.0-beta