LightGBM: [feature requests] support utf-8 characters in feature name

Created on 30 Sep 2019 · 26 Comments · Source: microsoft/LightGBM

LightGBM fails with "Do not support non-ascii characters in feature name".
Could you please consider backward compatibility?

I use XGBoost, CatBoost, and scikit-learn side by side, and only LightGBM has encoding compatibility problems.

Thanks.

feature request

OnlyFor

👍8

All 26 comments

I guess the feature_name function in xgb/cat is maintained on the Python side, so UTF-8 encoding is easy there. But that approach requires a separate implementation in each language package, and a different model save/load solution.

In LightGBM, feature names are maintained on the C++ side and saved in the model file, which makes UTF-8 hard to support.

If we want to support UTF-8 feature names, the model save/load logic may have to change, which would cause more backward compatibility problems.

guolinke on 30 Sep 2019

👍1

A workaround is to save an additional file for the feature names and force its name to be the model file name + ".fn". That file could be encoded in UTF-8 and loaded automatically by Python/R itself when loading the model file, not by the C++ code.
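
On the Python side, the sidecar-file idea could look roughly like this. It is only a sketch: `save_feature_names` and `load_feature_names` are hypothetical helpers, not part of LightGBM, and storing the names as a JSON list is an assumed file layout.

```python
import json
from pathlib import Path

FN_SUFFIX = ".fn"  # suffix proposed in this comment


def save_feature_names(model_path, feature_names):
    """Write the names to '<model_path>.fn' as UTF-8 (hypothetical helper)."""
    sidecar = Path(str(model_path) + FN_SUFFIX)
    with sidecar.open("w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps the names human-readable in the file
        json.dump(list(feature_names), fh, ensure_ascii=False)


def load_feature_names(model_path):
    """Return the stored names, or None if no sidecar file exists."""
    sidecar = Path(str(model_path) + FN_SUFFIX)
    if not sidecar.exists():
        return None
    with sidecar.open("r", encoding="utf-8") as fh:
        return json.load(fh)
```

A wrapper would call `save_feature_names` right after saving the model and `load_feature_names` right after loading it, so the C++ model file itself never sees the UTF-8 names.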

guolinke on 30 Sep 2019

Could LightGBM automatically replace and restore UTF-8 feature names on the Python/R side, before and after the C++ part? Or maintain the feature-name transformation in Python/R and support UTF-8 indirectly? The transformation dict could also be written to the model file.
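
A minimal sketch of such a replace-and-restore step, assuming the wrapper swaps every name for an ASCII placeholder before training and maps them back afterwards. The helper names and the `feat_<i>` placeholder scheme are hypothetical, not LightGBM behavior.

```python
def to_ascii_placeholders(feature_names):
    """Return (placeholders, mapping); mapping restores the original names."""
    placeholders = [f"feat_{i}" for i in range(len(feature_names))]
    mapping = dict(zip(placeholders, feature_names))
    return placeholders, mapping


def restore_names(placeholders, mapping):
    """Map placeholder names back to the original (possibly UTF-8) names."""
    return [mapping.get(name, name) for name in placeholders]


original = ["价格", "größe", "age"]
safe_names, mapping = to_ascii_placeholders(original)
# safe_names would be passed to the C++ side; mapping stays in Python/R
restored = restore_names(safe_names, mapping)
```

Replacing every name, not only the non-ASCII ones, avoids collisions between a placeholder and a real feature name.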

OnlyFor on 30 Sep 2019

@OnlyFor I am not familiar with string encoding, but I think that is not trivial.
Maybe we can use something like base64 to encode and decode the feature names.
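
As a rough illustration of the base64 idea, the following sketch round-trips a name through an ASCII-only token using Python's standard `base64` module. The function names are hypothetical, and nothing here is part of LightGBM.

```python
import base64


def encode_name(name):
    """Encode a UTF-8 feature name as an ASCII-only base64 token."""
    return base64.urlsafe_b64encode(name.encode("utf-8")).decode("ascii")


def decode_name(token):
    """Recover the original feature name from its base64 token."""
    return base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")


token = encode_name("年龄")
original = decode_name(token)
```

The tokens are pure ASCII, but they are opaque to a person reading the model file.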

guolinke on 30 Sep 2019

@guolinke Encoding feature names will hurt the model file readability for humans, I guess.

StrikerRUS on 2 Oct 2019

@guolinke Should this issue be included in #2302?

StrikerRUS on 9 Oct 2019

https://github.com/dmlc/xgboost/pull/4937#issuecomment-541330515.

StrikerRUS on 13 Oct 2019

@guolinke WDYT https://github.com/microsoft/LightGBM/issues/2478#issuecomment-540021056?

StrikerRUS on 6 Nov 2019

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Contributions of this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing this feature.

StrikerRUS on 11 Nov 2019

Hello everyone, I get the error below. How can I solve it?
LightGBMError: Do not support non-ascii characters in feature name.

PointCloudNiphon on 14 Nov 2019

@PointCloudNiphon Hi!

For now, there cannot be any non-ASCII symbols in string model representation. So, you should simply rename your feature names before passing them into LightGBM.
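
Before renaming, it helps to know which names are affected. A small sketch using `str.isascii()` (Python 3.7+); `find_non_ascii` is a hypothetical helper and the sample names are made up:

```python
def find_non_ascii(feature_names):
    """Return the feature names that contain at least one non-ASCII character."""
    return [name for name in feature_names if not name.isascii()]


offending = find_non_ascii(["Age", "Größe", "Club Logo"])
# offending == ["Größe"]
```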

StrikerRUS on 14 Nov 2019

How can I solve this?

rajibrj43 on 15 Nov 2019

@rajibrj43 Hi! Indeed, there are no non-ASCII symbols in your feature names. And I cannot reproduce your issue: LightGBM trains just fine with those feature names.

import numpy as np
import lightgbm as lgb

feature_names_from_comment = "| ID | Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | Club Logo | Value | Wage | Special | Preferred Foot | International Reputation | Weak Foot | Skill Moves | Work Rate | Body Type | Real Face | Position | Jersey Number | Joined | Loaned From | Contract Valid Until | Height | Weight | LS | ST | RS | LW | LF | CF | RF | RW | LAM | CAM | RAM | LM | LCM | CM | RCM | RM | LWB | LDM | CDM | RDM | RWB | LB | LCB | CB | RCB | RB | Crossing | Finishing | HeadingAccuracy | ShortPassing | Volleys | Dribbling | Curve | FKAccuracy | LongPassing | BallControl | Acceleration | SprintSpeed | Agility | Reactions | Balance | ShotPower | Jumping | Stamina | Strength | LongShots | Aggression | Interceptions | Positioning | Vision | Penalties | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause"
feature_names = [i.strip() for i in feature_names_from_comment.split('|') if i]

X = np.random.random((100, len(feature_names)))
y = np.random.random((100,))

lgb.LGBMRegressor().fit(X, y, feature_name=feature_names)

StrikerRUS on 15 Nov 2019

Hello,

I noticed this issue recently, and I think the current behavior is not great.
However, I also agree with https://github.com/microsoft/LightGBM/issues/2478#issuecomment-536375659 and https://github.com/microsoft/LightGBM/issues/2226#issuecomment-502113232.

So for now, I have written a workaround for Python.

import types

# gbm is an instance of LGBMModel.
# you have feature_names
gbm.booster_._feature_names = feature_names
gbm.booster_.feature_name = types.MethodType(lambda self: self._feature_names, gbm.booster_)
# NOTICE: `pickle` can't dump `lambda`, so you can use `dill` or `cloudpickle`

In the future, I (or someone else) will rework the Python package so that it stores feature_names itself (outside the C++ code).

henry0312 on 18 Nov 2019

OK, I will give it a try, thanks.

PointCloudNiphon on 18 Nov 2019

👍1

> @PointCloudNiphon Hi!
>
> For now, there cannot be any non-ASCII symbols in string model representation. So, you should simply rename your feature names before passing them into LightGBM.

@StrikerRUS: This bug is more frequent than you seem to realize. It arises not because the original database columns / feature names are non-ASCII, but from the use of a one-hot encoder (e.g. get_dummies() from pandas), which appends non-ASCII feature levels (variable values) to these ASCII column names. So after such encoding, even categorical feature levels cannot contain regional characters... and they usually do (outside of the US). Your closest competitor, XGBoost, does not impose such arbitrary and US-centric restrictions.
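
To make the mechanism concrete, here is a pure-Python sketch that imitates the `<column>_<level>` naming that pandas' get_dummies() applies by default. `one_hot_column_names` is a hypothetical stand-in so the example runs without pandas, and the sample values are made up.

```python
def one_hot_column_names(column, values, prefix_sep="_"):
    """Build one '<column><sep><level>' name per distinct level."""
    levels = sorted(set(values))
    return [f"{column}{prefix_sep}{level}" for level in levels]


column = "Nationality"  # the column name itself is plain ASCII
values = ["Poland", "Côte d'Ivoire", "Poland", "Türkiye"]

dummy_names = one_hot_column_names(column, values)
# Names that inherit non-ASCII characters from the category levels
offending = [name for name in dummy_names if not name.isascii()]
```

Here `offending` contains the two derived names built from "Côte d'Ivoire" and "Türkiye", even though "Nationality" is ASCII.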

mirekphd on 28 Jan 2020

👍1

> @rajibrj43 Hi! Indeed, there are no non-ASCII symbols in your feature names. And I cannot reproduce your issue: LightGBM trains just fine with those feature names.

Because you used random numbers for all variables, including Name :) Try some non-ASCII characters with a one-hot encoder... @StrikerRUS, once you replicate, could you please /reopen and repair?

mirekphd on 28 Jan 2020

@mirekphd I see, get_dummies() adds some headache and requires one more manual step for renaming column names. But please note that LightGBM doesn't require one-hot encoding for categorical variables and normally you won't use that function during a preprocessing phase: https://lightgbm.readthedocs.io/en/latest/Quick-Start.html#categorical-feature-support.

Speaking about reopening, please refer to https://github.com/microsoft/LightGBM/issues/2478#issuecomment-552484900.

StrikerRUS on 29 Jan 2020

@henry0312 Can we mark this issue as resolved via #2976, or is it better to wait for @jameslamb's PR for the R part?

StrikerRUS on 10 Apr 2020

I believe we can, but it's better to wait until the R tests pass, because I'm not an expert in R.

henry0312 on 10 Apr 2020

👍1

I already created #2983 to capture the R-specific work @henry0312

jameslamb on 10 Apr 2020

Thank you @jameslamb! Sorry, I didn't notice it. Then I'm putting a tick in our feature requests hub for this issue, because the model file supports UTF-8 after #2976. R-specific progress will be tracked in your separate issue.

StrikerRUS on 11 Apr 2020

from math import sqrt

import lightgbm as lgb
import numpy as np
from sklearn.metrics import mean_squared_log_error

# X_train, y_train, X_cv, y_cv are assumed to come from an earlier train/validation split.
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_cv, label=y_cv)

param = {
    'objective': 'regression',
    'boosting': 'gbdt',
    'metric': 'l2_root',
    'learning_rate': 0.05,
    'num_iterations': 350,
    'num_leaves': 31,
    'max_depth': -1,
    'min_data_in_leaf': 15,
    'bagging_fraction': 0.85,
    'bagging_freq': 1,
    'feature_fraction': 0.55,
}

lgbm = lgb.train(params=param,
                 verbose_eval=50,
                 train_set=train_data,
                 valid_sets=[test_data])

y_pred_lgbm = lgbm.predict(X_cv)
print('RMSLE:', sqrt(mean_squared_log_error(np.exp(y_cv), np.exp(y_pred_lgbm))))

I'm getting this error:
LightGBMError: Do not support non-ASCII characters in feature name.

Shubhammishra-21 on 27 Apr 2020

@Shubhammishra-21 please use the latest master branch.

guolinke on 27 Apr 2020

Recode the variable names that contain tildes and other symbols that are not used in American English.
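
One way to do that recoding, sketched with the standard `unicodedata` module: decompose each character, drop the combining marks (tildes, accents), and replace anything still non-ASCII with an underscore. `to_ascii` is a hypothetical helper, and the underscore fallback is an arbitrary choice.

```python
import unicodedata


def to_ascii(name, fallback="_"):
    """Strip accents and tildes; replace any remaining non-ASCII character."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Combining marks are the accent/tilde parts split off by NFKD
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return "".join(ch if ch.isascii() else fallback for ch in without_marks)


renamed = [to_ascii(n) for n in ["año", "São Paulo", "Age"]]
# renamed == ["ano", "Sao Paulo", "Age"]
```

Different names can collapse to the same ASCII form (for example "año" and "ano"), so the result should be checked for duplicates before it is passed to LightGBM.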

mercedesmedaly on 19 Jul 2020

I got the error "LightGBMError: Do not support special JSON characters in feature name." when using LightGBM 3.0 on a Windows 10 machine. It seems that the issues with special characters were fixed in this release, but perhaps not on Windows?
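
A possible workaround for this second error, as a sketch only: it assumes the rejected characters are the JSON structural ones (`"`, `,`, `:`, `[`, `]`, `{`, `}`), which should be checked against the LightGBM version in use. `replace_json_special` is a hypothetical helper.

```python
import re

# Assumed set of rejected characters: " , : [ ] { }
JSON_SPECIAL = re.compile(r'[",:\[\]{}]')


def replace_json_special(name, repl="_"):
    """Replace JSON structural characters in a feature name."""
    return JSON_SPECIAL.sub(repl, name)


cleaned = [replace_json_special(n) for n in ["price[usd]", "a:b,c", "plain name"]]
# cleaned == ["price_usd_", "a_b_c", "plain name"]
```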