LightGBM: MAE custom loss function

Created on 22 Apr 2019 · 15 comments · Source: microsoft/LightGBM

Hello,

I'm trying to replicate the MAE loss using a custom objective function, but apparently it doesn't work. I've tried both a hessian of zeros and a hessian of ones. Could you please help me with this?

import numpy as np

def my_loss(preds, dtrain):
    y_true = dtrain.get_label()
    d = preds - y_true
    grad = np.sign(d)            # constant gradient of |d|
    hess = np.ones(preds.shape)
    return grad, hess

metrics = []
for i in my_cv:
    X_train = X.loc[i[0], :]
    y_train = y.loc[i[0]]
    X_test = X.loc[i[1], :]
    y_test = y.loc[i[1]]

    dtrain = xgb.Dataset(X_train, label=y_train, free_raw_data=False)

    params = {'max_depth': 10, 'learning_rate': 0.05, 'objective': None,
              'num_leaves': 150, 'min_child_samples': 5, 'nround': 1,
              'monotone_constraints': lst_mon}

    mm = xgb.train(params, dtrain, fobj=my_loss)
    y_pred = mm.predict(X_train)


All 15 comments

Sorry, firstly, is that LightGBM (not xgb)?
I think nround=1 with learning_rate=0.05 cannot work, even for other objective functions.
LightGBM also has a built-in MAE loss, and it performs better than a custom one in most cases.
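
For reference, a minimal sketch of using the built-in objective (the variable names here are hypothetical; 'regression_l1' is the built-in MAE objective, also aliased as 'mae' / 'l1'):

import lightgbm as lgb

# built-in MAE: no custom fobj needed
params = {'objective': 'regression_l1', 'learning_rate': 0.05, 'metric': 'l1'}
bst = lgb.train(params, dtrain, num_boost_round=200)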

Sorry, I was experimenting with nround. What is the best learning rate? I know that LightGBM has its own MAE, but I'm trying to replicate it and then further apply some changes to it. Do you know what the problem could be?

@Sirorezka ensure that nrounds * learning_rate is at least greater than 1.
There is no single optimal learning rate.
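
For example (hypothetical numbers, not from this thread): num_boost_round = 100 with learning_rate = 0.05 gives 100 * 0.05 = 5 > 1, whereas the original nround = 1 gives only 1 * 0.05 = 0.05, so the constant-magnitude sign gradients of MAE can barely move the predictions.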

It also does not work with my code:

def smape_loss(preds, train_data):
    print('************************************')
    print(preds)
    labels = train_data.get_label()
    grad = np.zeros(shape=len(preds), dtype=np.float64)
    for x in range(len(preds)):
        if preds[x] >= labels[x]:
            grad[x] = 1 / labels[x]
        else:
            grad[x] = -1 / labels[x]
    hess = np.zeros(shape=len(preds), dtype=np.float64)  # note: all-zero hessian
    print(grad)
    print(hess)
    return grad, hess


params = {
    'boosting': 'gbdt',
    'objective': 'binary',
    'num_leaves': 64,
    'min_child_weight': 1.1,
    'max_depth': 6,
    'lambda': 10,
    'metric': 'l1,l2',
    'subsample': 0.75,
    'colsample_bytree': 0.75,
    'num_iterations': 4000,
    'eta': 10,
    'n_jobs': 10,
    'seed': 0,
    'verbose': -1
}

bst = lgb.train(params, train_set, fobj=smape_loss, feval=smape, valid_sets=[valid_set], verbose_eval=10, early_stopping_rounds=100)

[0. 0. 0. ... 0. 0. 0.]
[-0.5 -0.5 -1. ... -0.00892857 -0.16666667 -0.16666667]
[0. 0. 0. ... 0. 0. 0.]

[0. 0. 0. ... 0. 0. 0.]
[-0.5 -0.5 -1. ... -0.00892857 -0.16666667 -0.16666667]
[0. 0. 0. ... 0. 0. 0.]

The preds are always 0.

@huaileiseu the hessian cannot be all zeros.

@guolinke thanks

I'm not all that familiar with LightGBM, but I've been playing around with quantile regression in CatBoost (quantile regression is a general form of MAE, with MAE corresponding to alpha = 0.5); see http://jmarkhou.com/lgbqr/

Most high-performance boosted tree algorithms approximate the leaf values with L'/L'' and the split gain with L'^2/L'', as mentioned in the link.
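
Concretely, the Newton-style formulas referenced above are, per leaf (sketched from the standard derivation; the regularization term lambda is my addition and not discussed in this thread):

    leaf value = -(sum of L') / (sum of L'' + lambda)
    split gain ∝ (sum of L')^2 / (sum of L'' + lambda)

With L'' identically 0, as for MAE, both expressions degenerate, which is why a fallback is needed.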

But L' is constant and L'' is 0 for MAE, so this doesn't work; you need to fall back to another option. CatBoost, for example, uses standard gradient descent, which is slower as it requires the loss function to be evaluated multiple times, but it only requires L' (and since L' is constant here, my observation is that it needs quite a few leaf iterations for some problems). The link above explains what LightGBM does for quantile regression, which I assume is also used for MAE, but I don't really know.

With CatBoost you can specify which method is used to calculate the leaf estimates (Newton or gradient descent). Not sure if LightGBM has this; it may help?

@david-waterworth you can set L'' to ones, and that is equivalent to gradient descent.
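
A minimal sketch of that pattern (variable names are hypothetical; dtrain is assumed to be an lgb.Dataset):

import numpy as np
import lightgbm as lgb

def mae_obj(preds, dataset):
    # MAE: L = |preds - y|, so L' = sign(preds - y); use ones as the "hessian"
    residual = preds - dataset.get_label()
    return np.sign(residual), np.ones_like(residual)

# 'objective' must be 'none' (or None) when a custom fobj is supplied
params = {'objective': 'none', 'learning_rate': 0.05, 'metric': 'l1'}
bst = lgb.train(params, dtrain, num_boost_round=500, fobj=mae_obj)

Note that this only reproduces the tree structure part of MAE; as explained further below, LightGBM's built-in MAE additionally re-computes the leaf outputs from residual percentiles.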

It also does not work in my code, and there are no zeros in the hessian, yet the preds are still always 0. Has anyone successfully used fobj?

def smape_obj(preds, dtrain):
    labels = dtrain.get_label()
    print(preds)
    diff = preds - labels
    epsilon = 0.0001  # for numerical stability
    pl_mn = np.where(diff >= 0, 1, -1)  # sign of the residual
    grad = np.divide(labels * 2 + epsilon, np.power(preds + labels + epsilon, 2))
    hess = np.divide(-labels * 4 - epsilon * 2, np.power(preds + labels + epsilon, 3))
    grad = pl_mn * grad
    hess = pl_mn * hess  # note: this hessian can be negative
    print(hess)
    return grad, hess

clf = lgb.train(param, trn_data, num_round, valid_sets=[trn_data, val_data], verbose_eval=1, early_stopping_rounds=500, categorical_feature=cate, fobj=smape_obj, feval=smape_eval)

@Sirorezka For MAE in LightGBM, the source code is here if you want to know exactly how LightGBM performs MAE, with the exact math formulas required (both unweighted and weighted cases):

@Laurae2 As you can see from my code, I'm using hess = 1, but the loss function doesn't work as it's supposed to. I believe custom loss functions don't work if they aren't twice differentiable.

@Laurae2 @Sirorezka
For objective functions whose first-order gradient is constant, LightGBM has a special treatment:
https://github.com/microsoft/LightGBM/blob/master/src/objective/regression_objective.hpp#L235-L265

It will use the constant gradient for the tree structure learning, but use the residuals for the leaf output calculation, via a percentile function (e.g., the 50th percentile for MAE). This solution is from sklearn, and it is proven to work in many benchmarks.

However, such a solution is not easy to write as a custom loss function in Python, as you need to re-assign the leaf outputs at each iteration.
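
To illustrate what that re-assignment could look like, here is a rough sketch (my reading of the approach, not code from this thread; it relies on Booster.update, predict(pred_leaf=True), and Booster.set_leaf_output, and the names X_train, y_train, dtrain, params, num_rounds are hypothetical):

import numpy as np
import lightgbm as lgb

def mae_obj(preds, dataset):
    residual = preds - dataset.get_label()
    return np.sign(residual), np.ones_like(residual)

booster = lgb.Booster(params=params, train_set=dtrain)  # params needs 'objective': 'none'
lr = params.get('learning_rate', 0.1)
for _ in range(num_rounds):
    pred_before = booster.predict(X_train)              # raw scores before the new tree
    booster.update(fobj=mae_obj)                        # grow one tree on the sign gradients
    tree_id = booster.num_trees() - 1
    leaves = booster.predict(X_train, pred_leaf=True)[:, tree_id]
    for leaf in np.unique(leaves):
        rows = leaves == leaf
        # sklearn-style MAE fix-up: leaf output = lr * median residual in that leaf
        new_value = lr * np.median(y_train[rows] - pred_before[rows])
        booster.set_leaf_output(tree_id, int(leaf), new_value)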

@QAZASDEDC
It is possible that you have very small gradients or very large hessians, or that sum_gradient / sum_hessian is close to zero.


So how do you make the SMAPE example above work?

The hess must be ones, not zeros!
