XGBoost: random numbers not reproducible across different OSs

Created on 13 May 2015 · 13 comments · Source: dmlc/xgboost

In the Python wrapper, even if I set the number of threads to 1, if colsample_bytree is not 1 I get answers that are reproducible within each OS but different between OSX and a Linux VM. This is the code to reproduce it.

import os

import pandas as pd
from xgboost import XGBClassifier

# Load the dataset; the last column is the binary label.
url = "http://goo.gl/j0Rvxq"
dta = pd.read_csv(url, header=None)
y = dta.pop(dta.columns[-1])

# Pin training to a single thread and fix the seed.
os.environ["OMP_NUM_THREADS"] = "1"
xgb = XGBClassifier(n_estimators=2, seed=1, nthread=1, colsample_bytree=.25)
xgb.fit(dta, y)
xgb.predict_proba(dta)

OSX:

array([[ 0.51461989,  0.48538011],
       [ 0.57365119,  0.42634878],
       [ 0.459674  ,  0.540326  ],
       ...,
       [ 0.55308044,  0.44691953],
       [ 0.54698753,  0.4530125 ],
       [ 0.55835366,  0.44164631]], dtype=float32)

Linux:

array([[ 0.50747007,  0.49252993],
       [ 0.56461108,  0.43538889],
       [ 0.49435556,  0.50564444],
       ...,
       [ 0.54864073,  0.45135927],
       [ 0.51801658,  0.48198345],
       [ 0.55179334,  0.44820669]], dtype=float32)

Most helpful comment

This is now fixed by the new refactor #736

All 13 comments

You are using colsample_bytree, which introduces randomness during training.

Indeed, but I think the randomness should be reproducible if I set a seed.

It depends. You don't control how each system implements its random number generator. On the same system, a seed ensures the same random number sequence, but across systems it may not.

Why not use a random number generator that does ensure reproducibility across platforms?

The PRNG's implementation is not necessarily the same across systems.
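
A minimal illustration of the point, assuming nothing about xgboost itself: the C standard specifies only rand()'s interface, not its algorithm, so the same seed yields different sequences under glibc (Linux) and the BSD libc (OSX).

#include <cstdio>
#include <cstdlib>

int main() {
    std::srand(1);  // same seed everywhere
    // The algorithm behind rand() is implementation-defined, so these three
    // values differ between, e.g., glibc (Linux) and the BSD libc (OSX) --
    // exactly the symptom reported in this issue.
    for (int i = 0; i < 3; ++i) {
        std::printf("%d\n", std::rand());
    }
    return 0;
}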

I think we might want to switch to C++11's PRNG when the compilation option is on; the underlying algorithm should then be the same across systems.
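
For what it's worth, the C++11 engines are fully specified by the standard: the same seed must yield the same sequence on every conforming implementation, and the standard ([rand.predef]) even pins down a check value for std::mt19937. A minimal sketch, not xgboost code:

#include <iostream>
#include <random>

int main() {
    // C++11 [rand.predef]: the 10000th value drawn from a default-constructed
    // std::mt19937 must be 4123659995 on every conforming implementation.
    std::mt19937 gen;  // default seed is 5489u
    std::mt19937::result_type x = 0;
    for (int i = 0; i < 10000; ++i) {
        x = gen();
    }
    std::cout << x << std::endl;  // prints 4123659995 on Linux and OSX alike
    return 0;
}

One caveat: only the engines are pinned down by the standard; the distribution templates (std::uniform_int_distribution and friends) are implementation-defined, so strictly deterministic cross-platform code should consume the raw engine output.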

FWIW, I think NumPy uses the Mersenne Twister provided by randomkit [1, 2]. I don't know what R does, but I assume it uses some version of the Mersenne Twister too [3]. There may be some more forward-looking alternatives [4, 5]. I'm not sure whether the new Mersenne Twister will be the same across C++11 libraries.

I wish I could pitch in here, but I'd have to do a lot of getting up to speed to do it right.

[1] http://js2007.free.fr/code/index.html#RandomKit
[2] https://github.com/numpy/numpy/tree/master/numpy/random
[3] http://thread.gmane.org/gmane.comp.python.numeric.general/57465
[4] http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt64.html
[5] http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/

C++11's <random> includes a Mersenne Twister. I think we want to minimize the dependency on additional libraries, so I hope we do not introduce a PRNG dependency here.
In R, the random seed was simply redirected to R's internal RNG. I do not know if that is possible in Python (i.e., reusing what NumPy or Python's random module uses).

I don't think being able to pass the seed through to R or Python/NumPy will solve the problem. I just meant to look at how they achieve cross-platform-compatible RNGs.

You could ship the PRNG yourself if the C++11 <random> engines turn out to be platform- or standard-library-dependent. I've seen conflicting information about this. It doesn't appear to be a lot of code. In any case, I think having cross-platform deterministic results is pretty important.

Is the possible "simple" fix to change rand() -> std::mt19937 when C++11 is being used?

I think we should try that; there is a macro guard that allows us to detect the existence of C++11.
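
A rough sketch of what that could look like. The ColumnSampler name and the raw __cplusplus test are hypothetical stand-ins, since the codebase's actual class and detection macro aren't shown in this thread:

#include <cstdlib>
#if __cplusplus >= 201103L
#include <random>
#endif

// Hypothetical sketch of the guarded engine choice. When C++11 is available,
// the standard-specified std::mt19937 is used, so the same seed gives the
// same column sample on every platform; otherwise fall back to rand().
class ColumnSampler {
 public:
  explicit ColumnSampler(unsigned seed) {
#if __cplusplus >= 201103L
    gen_.seed(seed);
#else
    std::srand(seed);
#endif
  }
  unsigned NextRandom() {
#if __cplusplus >= 201103L
    return static_cast<unsigned>(gen_());        // portable sequence
#else
    return static_cast<unsigned>(std::rand());   // implementation-defined
#endif
  }

 private:
#if __cplusplus >= 201103L
  std::mt19937 gen_;
#endif
};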

This is now fixed by the new refactor #736

:+1:

Hello, I used pip to install xgboost 0.6a2 and ran the script below in Python. When colsample_bytree is less than 1, the resulting probabilities are close but different between Linux and Mac (when it is equal to 1, everything matches perfectly). Maybe I'm doing something wrong, or the problem remains. Thanks for your answer and advice.

import xgboost as xgb_real
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dtrain = xgb_real.DMatrix(X_train, label=y_train)
dtest = xgb_real.DMatrix(X_test, label=y_test)

param = {
    'max_depth': 3,
    'eta': 0.3,
    'silent': 1,
    'objective': 'multi:softprob',
    'subsample': 1,
    'colsample_bytree': 0.5,
    'seed': 12,
    'num_class': 3}

num_round = 20

bst = xgb_real.train(param, dtrain, num_round)
preds = bst.predict(dtest)

print(list(preds[:2]))

# Mac : [array([ 0.00745611,  0.95884031,  0.03370364], dtype=float32),
# array([ 0.98071623,  0.01367668,  0.00560714], dtype=float32)]

# Linux : [array([ 0.00905498,  0.96994174,  0.02100325], dtype=float32),
# array([ 0.97895575,  0.01572219,  0.00532213], dtype=float32)]
