Xgboost: [jvm-package] xgboost for JVM has test failures on OS X

Created on 17 Jun 2017 · 20 Comments · Source: dmlc/xgboost

I can build the regular R, C and Python packages just fine. Even the first test cases of the JVM build, the ones that test DMatrix, are green. However, once rabit/JNI is used on Spark, I see only JNI error messages:
https://gist.github.com/geoHeil/bc88c2b849eca875e580b8ff170fd598

Environment info

Operating System: mac osx 10.12.5

Compiler: gcc7

Package used (python/R/jvm/C++): JVM

xgboost version used: current master branch

If installing from source, please provide

  1. The commit hash (git rev-parse HEAD) cd7659937b2c6a4a82988a72761a7f21d9b53743
  2. Logs will be helpful (If logs are large, please upload as attachment). https://gist.github.com/geoHeil/bc88c2b849eca875e580b8ff170fd598

If you are using jvm package, please

gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.1.0 (clang-802.0.42)
Target: x86_64-apple-darwin16.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Steps to reproduce

  1. checkout latest xgboost from master branch
  2. build the JVM package

All 20 comments

@CodingCat again a problem with rabit on OS X. Any thoughts on this? The regular JNI test cases succeed; these problems occur only when Spark/rabit is involved.

The log shows "Tracker started, with env={}", so it is still the issue with network address binding.

I do not have bandwidth to work on it for now.....

https://github.com/dmlc/xgboost/issues/1004 suggests:

RabitTracker calls Runtime to exec a command like "python ...", which depends on the PATH environment variable. If there is an exception or an error, the return of getEnv() will be empty. Setting the correct Python version by adding its path to the beginning of PATH fixes this issue.
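As a rough illustration of the PATH point above, the following Python sketch shows how putting a directory first on PATH changes which executable a bare `python` command resolves to. The helper names and the directory `/opt/python/bin` are hypothetical and not taken from the linked issue; this is not RabitTracker code.

```python
import os
import shutil


def prepend_to_path(directory, env=None):
    """Return a copy of env with `directory` placed first on PATH."""
    env = dict(os.environ if env is None else env)
    old_path = env.get("PATH", "")
    env["PATH"] = directory + (os.pathsep + old_path if old_path else "")
    return env


def which_python(env, name="python"):
    """Report which executable a bare `python` command would resolve to."""
    return shutil.which(name, path=env.get("PATH", ""))


if __name__ == "__main__":
    # "/opt/python/bin" is a hypothetical location, used only for illustration.
    env = prepend_to_path("/opt/python/bin")
    print("python resolves to:", which_python(env))
```

If `which_python` returns `None`, or an interpreter other than the intended one, a process that launches `python ...` through the same environment would likely hit the same problem.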

Maybe using the experimental Scala rabit implementation will help out here.

I am not 100% sure but can you check that your hostname resolves to 127.0.0.1 in /etc/hosts?

@superbobry

hostname
Georgs-MacBook-Pro.local

and the hosts file

% cat /etc/hosts
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1   localhost
255.255.255.255 broadcasthost
::1             localhost

# https://github.com/docker/compose/issues/3419
# /etc/hosts
127.0.0.1 localunixsocket.local

Is this what you mean?

edit

However, nslookup Georgs-MacBook-Pro.local fails.

@superbobry unfortunately,

127.0.0.1        Georgs-MacBook-Pro.local

adapting the hosts file does not fix the problem.

Okay, could you try compiling with clang instead of gcc7? I imagine you're overriding CC/CXX with GCC, right?

Indeed, I am compiling with:

export CC=gcc-7
export CXX=g++-7

which settings would you suggest here?

edit

When unset CC; unset CXX is applied, I see the same errors.

I suggest just to use the OS X defaults. It should build fine, but the resulting binary wouldn't have OMP support (hence single-thread only).

Update: sorry, didn't spot your edit. Could you also remove xgboost/build directory to make sure you'd build from scratch?

Why should changing the compiler fix the RABIT networking issues?

Still failing. However, I observed a different error message this time:

Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=192.168.5.160, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=8}
17/07/03 13:40:32 INFO RabitTracker$TrackerProcessLogger: 2017-07-03 13:40:32,249 WARNING gethostbyname(socket.getfqdn()) failed... trying on hostname()
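The warning above suggests that the tracker first tries to resolve the fully qualified domain name and then falls back to the plain hostname. A minimal Python sketch of that order, assuming nothing about the tracker's actual code (`resolve_first` is a hypothetical helper), can be used to check which name resolves on a given machine:

```python
import socket


def resolve_first(candidates, resolver=socket.gethostbyname):
    """Return (name, ip) for the first candidate that resolves, else None."""
    for name in candidates:
        try:
            return name, resolver(name)
        except OSError:  # socket.gaierror and socket.herror are subclasses
            continue
    return None
```

Calling `resolve_first([socket.getfqdn(), socket.gethostname()])` mirrors the order named in the warning. A result of `None` would mean that neither name resolves, which is the situation the /etc/hosts suggestion earlier in the thread tries to address.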

Do you have any other error messages in the output?

Why should changing the compiler fix the RABIT networking issues?

It shouldn't, but I've observed the same error locally with GCC7, while the clang build worked fine. Also, Travis is able to build and test xgboost4j using clang.

Interesting:

17/07/03 05:37:44 INFO RabitTracker$TrackerProcessLogger: 2017-07-03 05:37:44,489 WARNING gethostbyname(socket.getfqdn()) failed... trying on hostname()

is displayed on Travis as well, but no problem shows up afterwards: https://travis-ci.org/dmlc/xgboost/jobs/249494987#L2249

clang did not help me.

Could you confirm that you're having exactly the same issue as before?

Please see https://gist.github.com/geoHeil/c7a67b31b1f5b3eb390b35008f552855#file-errors-txt-L1110 for yourself

Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=192.168.0.18, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=8}
Check failed: base_score > 0.0f && base_score < 1.0f base_score must be in (0,1) for logistic loss

And then the usual exception, with at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48) as the source.

I am using Ubuntu 16.04
And I have the same error.
Check failed: base_score > 0.0f && base_score < 1.0f base_score must be in (0,1) for logistic loss.

Any way to fix it?

You can manually apply the patch in dmlc/dmlc-core#351.

Hi,
I have the same error too: https://gist.github.com/mizotm/914e146538c5720885e6e854eb97f07e
I'm on Ubuntu 16.04. The fix suggested by @superbobry allows the tests to run, but the Scala code doesn't compile afterwards with the fix applied.

my error is fixed with the patch suggested by @superbobry and it works in spark as well.

thank you sooooo much!

@superbobry Sorry I was mistaken, your fix does actually work. Thanks a lot!

I want to point out that fixing this issue doesn't require applying the not-yet-merged patch. The original patch issue highlights that the root cause is locale-dependent code for parameter parsing: 0.5 is parsed as 0 because the input is expected to be 0,5 under certain locales (for example, Russian). You can avoid this error by enforcing the en_US locale (especially LC_NUMERIC):

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
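Settings like these can also be applied to a single child process instead of the whole shell. The Python sketch below builds such an environment; `numeric_safe_env` is a hypothetical helper, and it assumes the en_US.UTF-8 locale is installed on the machine.

```python
import os


def numeric_safe_env(base=None, locale_name="en_US.UTF-8"):
    """Copy an environment, dropping inherited LC_* values and pinning the locale."""
    base = os.environ if base is None else base
    # Dropping every LC_* key also removes LC_ALL, which would otherwise
    # take precedence over LC_NUMERIC.
    env = {k: v for k, v in base.items() if not k.startswith("LC_")}
    env["LANG"] = locale_name
    env["LC_NUMERIC"] = locale_name
    return env
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching the build command.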

The build passes with these settings. In my case, the locale slipped in from another host through an SSH session because of the SendEnv LANG LC_* setting in /etc/ssh/ssh_config.
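To illustrate why a comma-decimal locale turns 0.5 into 0, here is a simplified Python emulation of a C-style prefix parser, combined with the range condition quoted in the error message. It is a sketch under the assumption that the parameter is parsed with a locale-aware routine that stops at the first unexpected character; it is not the actual dmlc-core code, and both function names are hypothetical.

```python
import re


def parse_like_strtod(text, decimal_point="."):
    """Parse the longest numeric prefix of text, as a C-style strtod would.

    Simplified emulation: optional sign, digits, optional fraction.
    Characters after the recognised prefix are silently ignored.
    """
    pattern = r"[+-]?\d+(?:" + re.escape(decimal_point) + r"\d*)?"
    match = re.match(pattern, text.strip())
    if match is None:
        return 0.0
    return float(match.group(0).replace(decimal_point, "."))


def check_base_score(value):
    """Mirror the range condition quoted in the error message."""
    return 0.0 < value < 1.0


if __name__ == "__main__":
    for point in (".", ","):
        value = parse_like_strtod("0.5", decimal_point=point)
        print(repr(point), value, check_base_score(value))
```

With `.` as the decimal point, the string "0.5" parses to 0.5 and passes the check. With `,` the parser stops at the dot, returns 0.0, and the check fails, which matches the "base_score must be in (0,1)" message seen in the thread.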
