Training on Apache Hadoop YARN takes more time when the number of workers in the configuration is 2 or more
Training appears to get slower when the number of workers (nodes) in the configuration is increased from 1 to 2 or more. Can anyone tell me why this happens? Is there anything wrong with my configuration?
The configuration:
booster = gbtree
objective = multi:softmax
eta = 0.5
max_depth = 5
num_class = 10
num_round = 50
save_period = 0
eval_train = 1
The Shell Script:
../../dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=4 --worker-cores=2 \
    ../../xgboost parameter.conf nthread=16 \
    data=hdfs://hadoop01:8020/xgb-demo/train \
    eval[test]=hdfs://hadoop01:8020/xgb-demo/test \
    model_dir=hdfs://hadoop01:8020/xgb-demo/model
Hello @wallyell, I'm facing the same problem here when trying to train a dataset in Apache Spark.
Have you found a solution for this?
Here it stops after reaching line #156 of XGBoost:
val returnVal = tracker.waitFor()
The tracker seems to take too long to return a value. My config params are:
"silent" -> 1,
"objective" -> "reg:linear",
"booster" -> "gbtree",
"eta" -> 0.0225,
"max_depth" -> 26,
"subsample" -> 0.63,
"colsample_btree" -> 0.63,
"min_child_weight" -> 9,
"gamma" -> 0,
"eval_metric" -> "rmse",
"tree_method" -> "auto"
It seems that, when using more than one worker, the connection to the RabitTracker doesn't work as required; it freezes.
I also tested with the gblinear booster, but it still freezes.
To reproduce this issue you only need to set up a Java XGBoost job with 2 or more workers.
This issue can be reproduced using this test.
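As a stopgap while debugging, here is a small diagnostic sketch of my own (an assumption on my part, not something from the XGBoost codebase): wrapping the blocking fit() call in a Future with a timeout makes a stuck tracker surface as a TimeoutException instead of hanging the driver indefinitely. params and train are the same hypothetical values as in the sketch above.

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Run training on a separate thread so the driver can enforce a deadline.
val training = Future {
  new XGBoostRegressor(params).fit(train)
}

try {
  val model = Await.result(training, 10.minutes)
} catch {
  case _: java.util.concurrent.TimeoutException =>
    // If num_workers >= 2 and the executors cannot reach the RabitTracker
    // (e.g. blocked ports between executors and the driver), fit() never
    // returns; failing fast here at least makes the freeze visible in the logs.
    println("Training timed out: workers likely never reached the Rabit tracker")
}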