In the Python wrappers, even with the number of threads set to 1, if colsample_bytree is not 1 I get answers that are reproducible on each platform but different between OSX and a Linux VM. Here is the code to reproduce it:
import os

# OMP_NUM_THREADS only takes effect if it is set before the OpenMP
# runtime is loaded, i.e. before xgboost is imported.
os.environ["OMP_NUM_THREADS"] = "1"

import pandas as pd
from xgboost import XGBClassifier

url = "http://goo.gl/j0Rvxq"
dta = pd.read_csv(url, header=None)
y = dta.pop(dta.columns[-1])  # last column is the label

xgb = XGBClassifier(n_estimators=2, seed=1, nthread=1, colsample_bytree=.25)
xgb.fit(dta, y)
xgb.predict_proba(dta)
OSX:
array([[ 0.51461989, 0.48538011],
[ 0.57365119, 0.42634878],
[ 0.459674 , 0.540326 ],
...,
[ 0.55308044, 0.44691953],
[ 0.54698753, 0.4530125 ],
[ 0.55835366, 0.44164631]], dtype=float32)
Linux:
array([[ 0.50747007, 0.49252993],
[ 0.56461108, 0.43538889],
[ 0.49435556, 0.50564444],
...,
[ 0.54864073, 0.45135927],
[ 0.51801658, 0.48198345],
[ 0.55179334, 0.44820669]], dtype=float32)
You are using colsample_bytree, which introduces randomness during training.
Indeed, but I think the randomness should be reproducible if I set a seed.
It depends. You don't know how each system implements its random number generator. On the same system, a seed ensures the same random number sequence, but across systems it may not.
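To make that concrete, here is a minimal C++ sketch (not XGBoost code) of why a shared seed is not enough: the C standard specifies rand()'s interface but leaves its algorithm implementation-defined, so glibc on Linux and the BSD libc on macOS are free to produce different sequences from the same srand() call.

#include <cstdio>
#include <cstdlib>

int main() {
    // Seed the C library's PRNG identically on every platform.
    std::srand(1);
    // The algorithm behind rand() is implementation-defined, so these
    // values can differ between libcs even though the seed is the same.
    for (int i = 0; i < 3; ++i)
        std::printf("%d\n", std::rand());
    return 0;
}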
Why not use a random number generator that does ensure reproducibility across platforms?
The PRNG's implementation is not necessarily the same across systems.
I think we might want to switch to C++11's PRNG when the compilation option is on; hopefully the underlying algorithm will then be the same across systems.
FWIW, I think NumPy uses the Mersenne Twister provided by randomkit [1, 2]. I don't know what R does, but I assume it uses some version of the Mersenne Twister too [3]. There may be some more forward-looking alternatives [4, 5]. I'm not sure whether the new Mersenne Twister will be the same across C++11 libraries.
I wish I could pitch in here, but I'd have to do a lot of getting up to speed to do it right.
[1] http://js2007.free.fr/code/index.html#RandomKit
[2] https://github.com/numpy/numpy/tree/master/numpy/random
[3] http://thread.gmane.org/gmane.comp.python.numeric.general/57465
[4] http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt64.html
[5] http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/
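On that last point, the C++11 standard does pin std::mt19937 down exactly: [rand.predef] requires that the 10000th consecutive invocation of a default-constructed engine produce the value 4123659995, so every conforming standard library must yield the identical sequence. A minimal check:

#include <cassert>
#include <random>

int main() {
    std::mt19937 gen;   // default seed is 5489
    gen.discard(9999);  // skip the first 9999 outputs
    // C++11 [rand.predef] mandates this exact 10000th output,
    // so it holds on any conforming implementation.
    assert(gen() == 4123659995u);
    return 0;
}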
C++11's random should be the Mersenne Twister. I think we want to minimize the dependency on additional libraries, so I hope we do not introduce a PRNG dependency here.
In R, the random seed was simply redirected to R's internal RNG. I do not know if that is possible in Python (i.e., reusing what NumPy or Python's random uses).
I don't think being able to pass the seed to R or Python/NumPy will solve the problem. I just meant to point to how they solved having cross-platform-compatible RNGs.
You could ship the new PRNG if the C++11 random stuff turns out to be platform- or standard-library-dependent. I've seen conflicting information about this. It doesn't appear to be a lot of code. In any case, I think having cross-platform deterministic results is pretty important.
Is the possible "simple" fix to change rand -> std::mt19937 if C++11 is being used?
I think we should try that; there is a macro guard that allows us to detect the existence of C++11.
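Roughly, the swap could look like the sketch below. This is an illustration, not XGBoost's actual code: the RandomIndex helper is a hypothetical stand-in, and the __cplusplus guard is just one way to detect C++11 (the real macro may differ; MSVC, for instance, does not report __cplusplus accurately by default).

#include <cstdio>
#include <cstdlib>
#if __cplusplus >= 201103L
#include <random>
#endif

// Hypothetical helper: a random integer in [0, n), using the portable
// C++11 engine when available and falling back to rand() otherwise.
inline int RandomIndex(int n) {
#if __cplusplus >= 201103L
    static std::mt19937 gen(1);  // fixed seed -> same sequence everywhere
    return static_cast<int>(gen() % static_cast<unsigned int>(n));
#else
    return std::rand() % n;      // implementation-defined across libcs
#endif
}

int main() {
    // With C++11 available, this prints the same indices on every platform.
    for (int i = 0; i < 5; ++i)
        std::printf("%d ", RandomIndex(10));
    std::printf("\n");
    return 0;
}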
This is now fixed by the new refactor #736
:+1:
Hello, I used pip to install xgboost 0.6a2 and ran the script below in Python. When colsample_bytree is less than 1, the resulting probabilities are close but different between Linux and Mac (when it is equal to 1, everything matches). Maybe I'm doing something wrong, or maybe the problem remains? Thanks for your answer and your advice.
import xgboost as xgb_real
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Fixed train/test split on the iris data.
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dtrain = xgb_real.DMatrix(X_train, label=y_train)
dtest = xgb_real.DMatrix(X_test, label=y_test)

param = {
    'max_depth': 3,
    'eta': 0.3,
    'silent': 1,
    'objective': 'multi:softprob',
    'subsample': 1,
    'colsample_bytree': 0.5,  # < 1 triggers the cross-platform difference
    'seed': 12,
    'num_class': 3}
num_round = 20

bst = xgb_real.train(param, dtrain, num_round)
preds = bst.predict(dtest)
print(list(preds[:2]))
# Mac : [array([ 0.00745611, 0.95884031, 0.03370364], dtype=float32),
# array([ 0.98071623, 0.01367668, 0.00560714], dtype=float32)]
# Linux : [array([ 0.00905498, 0.96994174, 0.02100325], dtype=float32),
# array([ 0.97895575, 0.01572219, 0.00532213], dtype=float32)]