I am not quite sure how xgboost works in theory, but since xgboost is a tree-based classifier, is it OK to assume that no normalization of the features is needed?
No, you do not have to normalize the features.
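As a quick illustration (my own sketch, not from the thread): because tree splits only depend on the ordering of feature values, rescaling a feature by a positive constant should leave the fitted model essentially unchanged, up to small numerical differences from histogram binning.

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

# fit once on the raw features
model_raw = xgb.XGBRegressor(random_state=0).fit(X, y)

# fit again after blowing up one feature by a large constant factor
X_scaled = X.copy()
X_scaled[:, 0] *= 1e6
model_scaled = xgb.XGBRegressor(random_state=0).fit(X_scaled, y)

# predictions should be (near-)identical, since the feature ordering is preserved
diff = np.abs(model_raw.predict(X) - model_scaled.predict(X_scaled)).max()
print('max |prediction difference| after feature scaling:', diff)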
I think I understand that in principle there is no need for normalization when boosting trees.
However, scaling the target y has quite a noticeable impact, especially with 'reg:gamma', but also (to a lesser extent) with 'reg:linear' (the default). What is the reason for this?
Example for the Boston Housing dataset:
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston
boston = load_boston()
y = boston['target']
X = boston['data']
for scale in np.logspace(-6, 6, 7):
    # train on the rescaled target, then map predictions back to the original scale
    xgb_model = xgb.XGBRegressor().fit(X, y / scale)
    predictions = xgb_model.predict(X) * scale
    print('{} (scale={})'.format(mean_squared_error(y, predictions), scale))
2.3432734454908335 (scale=1e-06)
2.343273977065266 (scale=0.0001)
2.3432793874455315 (scale=0.01)
2.290595204136888 (scale=1.0)
2.528513393507719 (scale=100.0)
7.228978353091473 (scale=10000.0)
272.29640759874474 (scale=1000000.0)
The impact of scaling y is far larger when using 'reg:gamma':
for scale in np.logspace(-6, 6, 7):
    xgb_model = xgb.XGBRegressor(objective='reg:gamma').fit(X, y / scale)
    predictions = xgb_model.predict(X) * scale
    print('{} (scale={})'.format(mean_squared_error(y, predictions), scale))
591.6509503519147 (scale=1e-06)
545.8298971540023 (scale=0.0001)
37.68688286293508 (scale=0.01)
4.039819858716935 (scale=1.0)
2.505477263590776 (scale=100.0)
198.94093800190453 (scale=10000.0)
592.1469169959003 (scale=1000000.0)
@tqchen Reading your great Introduction to Boosted Trees, it is not clear to me why feature scaling is unnecessary in mathematical terms.
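For what it's worth, here is a sketch of the usual argument (my own summary, not from the linked introduction): a tree split only compares a feature value against a threshold, so the model depends on each feature only through the ordering of its values. For any strictly increasing transform \phi (feature scaling is the special case \phi(x) = (x - \mu) / \sigma with \sigma > 0), the threshold \phi(t) induces exactly the same partition of the data:

    x_{ij} \le t \iff \phi(x_{ij}) \le \phi(t)

Hence the set of candidate splits, the greedy split search, and the resulting ensemble are unchanged by rescaling the features, which is why normalization has no effect in theory (only minor numerical differences, e.g. from histogram binning, in practice).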