Random crashes while loading libmxnet-scala.so with both Mxnet 1.2.0 and 1.3.0
Mxnet 1.3.0
CentOS Linux release 7.4.1708
JRE version: Java(TM) SE Runtime Environment (8.0_112-b15) (build 1.8.0_112-b15)
Package used:
Scala
For Scala user, please provide:
Compiler (gcc/clang/mingw/visual studio):
MXNet commit hash:
b3be92f4a48bce62a5a8424271871c2f81c8f7f1
Build config:
default config.mk
[2018-10-10 20:46:29,040] INFO - pool-1-thread-1 - NativeLibraryLoader - - Loading libmxnet-scala.so from /lib/native/ copying to mxnet-scala
A fatal error has been detected by the Java Runtime Environment:
SIGILL (0x4) at pc=0x00007fe87774dff4, pid=4035, tid=0x00007fe93039d700
JRE version: Java(TM) SE Runtime Environment (8.0_112-b15) (build 1.8.0_112-b15)
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.112-b15 mixed mode linux-amd64 compressed oops)
Problematic frame:
C [mxnet-scala+0x341aff4] mxnet::op::OperatorTune<mshadow::half::half_t>::Initialize()+0x2c4
Core dump written. Default location: /xyz/core or core.4035
An error report file with more information is saved as:
/xyz/hs_err_pid4035.log
Random crash while loading libmxnet-scala.so
This crash occurs while running Scala tests.
We are trying to reproduce this issue with a testcase.
Thank you for reporting this issue. We will look into it.
@mxnet-label-bot [Scala, Breaking]
@sameer-kapoor Thanks for your question. Please provide some detailed information on the crash message. Have you tried to build from source or use the jar package on Maven?
@mxnet-label-bot [Pending Requester Info]
@lanking520 We build from source using SBT, on the same OS version as the one the tests run on.
We haven't tried the Maven jar package since we use Scala 2.12.
As far as the crash goes, there is nothing else available to us other than what I have posted. If you'd like, I can post the thread dump from the hs_err_pid.log.
@sameer-kapoor Could you please share the steps you have taken to build from source?
Note: We haven't tested the Scala package build on CentOS. We test on Ubuntu 16.04 and above.
Here's a reference of the Scala Package building steps :
https://github.com/apache/incubator-mxnet/tree/master/scala-package#build
@sameer-kapoor Can you try modifying the MSHADOW Flags as mentioned here before building MXNet from source?
https://philipskokoh.github.io/blog/mxnet-on-centos
As a prerequisite, can you also install opencv-devel and atlas-devel before running the make scalapkg command?
Let me know if this works for you.
Hello, to build the C++ library, we simply run 'make' on a CentOS 7 machine. We build the Scala package using a custom SBT script so that we can create artifacts for Scala 2.12. We will run some tests on a sample project using Scala 2.11 to see if the same issue arises when we run make scalapkg. That way we can rule out whether there's an issue with our Scala build process.
As to @piyushghai's last comment, opencv-devel and atlas-devel are already installed on the machine. Additionally, modifying the mshadow linker flags is no longer needed. I submitted a change a while back that sets those flags automatically for CentOS machines, and I can see during the build process that the flags are indeed set.
Without those packages and linker flags, there would be a linker failure. Currently, the entire compilation and linkage completes successfully.
@sameer-kapoor @milandesai We are trying to reproduce the issue you may be facing, using the steps from the CI:
bash ci/docker/install/centos7_core.sh
yum install -y maven
yum install -y java-1.8.0-openjdk
And build the MXNet backend with the following command:
https://github.com/apache/incubator-mxnet/blob/master/ci/docker/runtime_functions.sh#L239-L253
We tested the Python install and it succeeds with:
pip install -U -e .
However, the Scala build has linking issues:
undefined symbol: _ZN2ps4Meta6kEmptyE (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: __gcov_merge_add (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: _ZN2ps8Customer11WaitRequestEi (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: _ZN2ps10PostofficeC1Ev (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: _ZN2ps10Postoffice18GetServerKeyRangesEv (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: __gcov_init (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: _ZN2ps8CustomerC1EiiRKSt8functionIFvRKNS_7MessageEEE (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: _ZN2ps8Customer10NewRequestEi (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: _ZN2ps10Postoffice12GetDeadNodesEi (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: _ZN2ps8Customer11NumResponseEi (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: _ZN2ps8Customer11AddResponseEii (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: _ZN2ps10Postoffice8FinalizeEib (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: _ZN2ps10Postoffice5StartEiPKcb (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: _ZN2ps8CustomerD1Ev (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: _ZN2ps10Postoffice7BarrierEii (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: _ZN2ps3Van4SendERKNS_7MessageE (./libmxnet-init-scala-linux-x86_64.so)
Based on community feedback from @szha, the ps-lite and test-coverage dependencies are missing. We are diagnosing the issue to see if we can get it to work.
In the meantime, can you check with
ldd -r libmxnet.so
to see if there is any linking problem?
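For longer `ldd -r` output, a small script can group the undefined symbols by likely origin. The following is only a sketch: the function name `group_undefined_symbols` is hypothetical, and the prefix rules are assumptions based on the symbols listed above (`_ZN2ps` is the mangled form of the C++ namespace `ps`, and `__gcov_` symbols come from gcov coverage instrumentation).

```python
import re
from collections import defaultdict

# Matches lines such as:
#   undefined symbol: __gcov_init (./libmxnet-init-scala-linux-x86_64.so)
UNDEFINED_RE = re.compile(r"undefined symbol:\s*(\S+)\s*\((.+)\)")


def group_undefined_symbols(ldd_output):
    """Group undefined symbols from `ldd -r` text by a guessed origin."""
    groups = defaultdict(list)
    for line in ldd_output.splitlines():
        match = UNDEFINED_RE.search(line)
        if not match:
            continue  # not an "undefined symbol" line
        symbol = match.group(1)
        if symbol.startswith("_ZN2ps"):
            origin = "ps-lite"  # assumption: namespace `ps` belongs to ps-lite
        elif symbol.startswith("__gcov_"):
            origin = "gcov"  # assumption: coverage runtime symbols
        else:
            origin = "other"
        groups[origin].append(symbol)
    return dict(groups)


# Three lines copied from the output above, used as sample input.
sample = """\
undefined symbol: _ZN2ps4Meta6kEmptyE (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: __gcov_merge_add (./libmxnet-init-scala-linux-x86_64.so)
undefined symbol: __gcov_init (./libmxnet-init-scala-linux-x86_64.so)
"""
print(group_undefined_symbols(sample))
```

To use it on a real library, the text of `ldd -r libmxnet.so` (including stderr) can be captured to a file and passed in as a string.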
Hi @lanking520 , I ran the ldd command and there was no linking issue. Also, I simplified our set up to rule out any issues with our specific application logic and custom Scala 2.12 build process. To do this, I created a simple Scala 2.11 project that creates and prints out a small NDArray. The MXNet library for CentOS was built by installing opencv-devel and atlas-devel on a CentOS 7 VM followed by running make and make scalapkg. Here are our observations:
Given that the SIGILL occurs only in our Docker container and not on a VM, we are examining the image to see whether anything environment-specific could be triggering the error.
@milandesai Thanks for your reply. I now understand what is going on: the crashes appear random, and the SIGILL comes from the backend. To verify CentOS compatibility going forward, we will try to add a Scala build for CentOS to CI. Hopefully this will improve the stability of the Scala package in future releases.
Since the issue occurs in a Docker container, could you please try an ubuntu:16.04 build to see if you can reproduce a similar issue? The Scala package has better support there.
Hi @lanking520 , the issue was not reproduced on the Ubuntu build. I have another update. We isolated a specific physical node on which the crash was occurring. The crash only occurs when Kubernetes Jenkins runs the application on a pod on that specific CentOS 7 node, hence the intermittency. When it does run on that node, MXNet crashes every time. At first glance, there is no difference between the faulty node and other nodes on which the job succeeds, but a more in-depth investigation may reveal something. Unfortunately, getting access to that node will not be easy due to security policies, but I will try my best to figure out why MXNet doesn't work there so that a bug or restriction can be documented for future users. In the meantime, we are unblocked by simply not running the CI job on that node. Thanks!
Since we are unblocked and can't provide details about the faulty node in the short term, feel free to close this ticket.
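SIGILL is raised when the CPU encounters an instruction it cannot execute, so one avenue worth checking is whether the faulty node's CPU lacks a feature flag that the working nodes have. This is an assumption, not something confirmed in this thread. A minimal sketch that compares the `flags` line of `/proc/cpuinfo` from two nodes follows; the helper names and the sample flag lists are hypothetical and do not describe the actual nodes.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())  # first "flags" line is enough
    return set()


def missing_flags(good_cpuinfo, faulty_cpuinfo):
    """Flags present on a working node but absent on the faulty one."""
    return sorted(cpu_flags(good_cpuinfo) - cpu_flags(faulty_cpuinfo))


# Hypothetical sample data, made up for illustration only.
good = "processor\t: 0\nflags\t\t: fpu sse sse2 avx f16c avx2\n"
faulty = "processor\t: 0\nflags\t\t: fpu sse sse2 avx\n"
print(missing_flags(good, faulty))
```

In practice the two inputs would be the contents of `/proc/cpuinfo` collected on a working node and on the faulty node. An empty result would suggest looking elsewhere.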
Thanks @milandesai for your investigation here. Happy to know that you are unblocked by it.
@sandeep-krishnamurthy Please close this issue as indicated by Milan here.
Nice!! Thanks for your contribution to the community and we will ship CentOS build/test very soon 👍