In the Python wrappers, even with the number of threads set to 1, if colsample_bytree is not 1 I get answers that are reproducible on each platform but different between OSX and a Linux VM. Here is the code to reproduce it:
import os

# OMP_NUM_THREADS only takes effect if it is set before the OpenMP
# runtime is loaded, i.e. before xgboost is imported.
os.environ["OMP_NUM_THREADS"] = "1"

import pandas as pd
from xgboost import XGBClassifier

url = "http://goo.gl/j0Rvxq"
dta = pd.read_csv(url, header=None)
y = dta.pop(dta.columns[-1])  # last column is the label

xgb = XGBClassifier(n_estimators=2, seed=1, nthread=1, colsample_bytree=.25)
xgb.fit(dta, y)
xgb.predict_proba(dta)
OSX:
array([[ 0.51461989, 0.48538011],
[ 0.57365119, 0.42634878],
[ 0.459674 , 0.540326 ],
...,
[ 0.55308044, 0.44691953],
[ 0.54698753, 0.4530125 ],
[ 0.55835366, 0.44164631]], dtype=float32)
Linux:
array([[ 0.50747007, 0.49252993],
[ 0.56461108, 0.43538889],
[ 0.49435556, 0.50564444],
...,
[ 0.54864073, 0.45135927],
[ 0.51801658, 0.48198345],
[ 0.55179334, 0.44820669]], dtype=float32)
You are using colsample_bytree, which introduces randomness during training.
Indeed, but I think the randomness should be reproducible if I set a seed.
It depends. You don't know how each system implements its random number generator. On the same system, a seed ensures the same random number sequence, but across systems it may not.
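To make that concrete, here is a minimal C++ sketch (not XGBoost code) of why a shared seed is not enough: the C standard specifies rand()'s interface but leaves its algorithm implementation-defined, so glibc on Linux and the BSD libc on macOS are free to produce different sequences from the same srand() call.

#include <cstdio>
#include <cstdlib>

int main() {
    // Seed the C library's PRNG identically on every platform.
    std::srand(1);
    // The algorithm behind rand() is implementation-defined, so these
    // values can differ between libcs even though the seed is the same.
    for (int i = 0; i < 3; ++i)
        std::printf("%d\n", std::rand());
    return 0;
}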
Why not use a random number generator that does ensure reproducibility across platforms?
The PRNG's implementation is not necessarily the same across systems.
I think we might want to switch to C++11's PRNG when the compilation option is on; hopefully the underlying algorithm will then be the same across systems.
FWIW, I think NumPy uses the Mersenne Twister provided by randomkit [1, 2]. I don't know what R does, but I assume it uses some version of the Mersenne Twister too [3]. There may be some more forward-looking alternatives [4, 5]. I'm not sure whether the new Mersenne Twister will be the same across C++11 libraries.
I wish I could pitch in here, but I'd have to do a lot of getting up to speed to do it right.
[1] http://js2007.free.fr/code/index.html#RandomKit
[2] https://github.com/numpy/numpy/tree/master/numpy/random
[3] http://thread.gmane.org/gmane.comp.python.numeric.general/57465
[4] http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt64.html
[5] http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/
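On that last point, the C++11 standard does pin std::mt19937 down exactly: [rand.predef] requires that the 10000th consecutive invocation of a default-constructed engine produce the value 4123659995, so every conforming standard library must yield the identical sequence. A minimal check:

#include <cassert>
#include <random>

int main() {
    std::mt19937 gen;   // default seed is 5489
    gen.discard(9999);  // skip the first 9999 outputs
    // C++11 [rand.predef] mandates this exact 10000th output,
    // so it holds on any conforming implementation.
    assert(gen() == 4123659995u);
    return 0;
}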
C++11's random should be the Mersenne Twister. I think we want to minimize the dependency on additional libraries, so I hope we do not introduce a PRNG dependency here.
In R, the random seed was simply redirected to R's internal RNG. I do not know if that is possible in Python (i.e., reusing what NumPy or Python's random uses).
I don't think being able to pass the seed to R or Python/NumPy will solve the problem. I just meant to point to how they solved having cross-platform-compatible RNGs.
You could ship the new PRNG if the C++11 random stuff turns out to be platform- or standard-library-dependent. I've seen conflicting information about this. It doesn't appear to be a lot of code. In any case, I think having cross-platform deterministic results is pretty important.
Is the possible "simple" fix to change rand -> std::mt19937 if C++11 is being used?
I think we should try that; there is a macro guard that allows us to detect the existence of C++11.
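Roughly, the swap could look like the sketch below. This is an illustration, not XGBoost's actual code: the RandomIndex helper is a hypothetical stand-in, and the __cplusplus guard is just one way to detect C++11 (the real macro may differ; MSVC, for instance, does not report __cplusplus accurately by default).

#include <cstdio>
#include <cstdlib>
#if __cplusplus >= 201103L
#include <random>
#endif

// Hypothetical helper: a random integer in [0, n), using the portable
// C++11 engine when available and falling back to rand() otherwise.
inline int RandomIndex(int n) {
#if __cplusplus >= 201103L
    static std::mt19937 gen(1);  // fixed seed -> same sequence everywhere
    return static_cast<int>(gen() % static_cast<unsigned int>(n));
#else
    return std::rand() % n;      // implementation-defined across libcs
#endif
}

int main() {
    // With C++11 available, this prints the same indices on every platform.
    for (int i = 0; i < 5; ++i)
        std::printf("%d ", RandomIndex(10));
    std::printf("\n");
    return 0;
}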
This is now fixed by the new refactor #736
:+1:
Hello, I used pip to install xgboost 0.6a2 and ran the script below in Python. When colsample_bytree is less than 1, the resulting probabilities are close but different between Linux and Mac (when it is equal to 1, everything matches). Maybe I'm doing something wrong, or maybe the problem remains? Thanks for your answer and your advice.
import xgboost as xgb_real
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Fixed train/test split on the iris data.
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dtrain = xgb_real.DMatrix(X_train, label=y_train)
dtest = xgb_real.DMatrix(X_test, label=y_test)

param = {
    'max_depth': 3,
    'eta': 0.3,
    'silent': 1,
    'objective': 'multi:softprob',
    'subsample': 1,
    'colsample_bytree': 0.5,  # < 1 triggers the cross-platform difference
    'seed': 12,
    'num_class': 3}
num_round = 20

bst = xgb_real.train(param, dtrain, num_round)
preds = bst.predict(dtest)
print(list(preds[:2]))
# Mac : [array([ 0.00745611, 0.95884031, 0.03370364], dtype=float32),
# array([ 0.98071623, 0.01367668, 0.00560714], dtype=float32)]
# Linux : [array([ 0.00905498, 0.96994174, 0.02100325], dtype=float32),
# array([ 0.97895575, 0.01572219, 0.00532213], dtype=float32)]