Lightgbm: [feature requests] support utf-8 characters in feature name

Created on 30 Sep 2019  ·  26 Comments  ·  Source: microsoft/LightGBM

Do not support non-ascii characters in feature name?
Could you please consider backward compatibility?

I use xgboost, catboost, and sklearn at the same time, and only lightgbm has encoding compatibility problems...

Thanks.

feature request

All 26 comments

I guess the feature_name function in xgb/cat is maintained on the Python side, so UTF-8 encoding is easy there. But this requires a different implementation in each language package, and a different model save/load solution.

In LightGBM, feature names are maintained on the cpp side and saved in the model file, so UTF-8 is hard to support.

If we want to support UTF-8 feature names, the model save/load logic may have to change, which would cause more backward compatibility problems.

A workaround is to save an additional file for the feature names, and force its name to ,+".fn". The encoding of that file could be UTF-8, and it would be autoloaded by Python/R itself when loading the model file, not by cpp.
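
A minimal sketch of that sidecar-file idea in plain Python, for illustration only. The helper names, the JSON format, and the choice of "model path + .fn" as the file name are assumptions, not anything LightGBM provides.

```python
import json
import os

# Assumption: the sidecar file is named "<model path>.fn".
FEATURE_NAME_SUFFIX = ".fn"


def save_feature_names(model_path, feature_names):
    """Write the feature names next to the model file as UTF-8 JSON."""
    with open(model_path + FEATURE_NAME_SUFFIX, "w", encoding="utf-8") as f:
        json.dump(list(feature_names), f, ensure_ascii=False)


def load_feature_names(model_path):
    """Return the saved names, or None when no sidecar file exists."""
    sidecar = model_path + FEATURE_NAME_SUFFIX
    if not os.path.exists(sidecar):
        return None
    with open(sidecar, encoding="utf-8") as f:
        return json.load(f)
```

The model file itself stays ASCII-only in this scheme; only the language wrapper reads and writes the extra file.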

Could lightgbm automatically replace and restore UTF-8 feature names on the Python/R side, before and after the cpp part? Or maintain the feature transformation in Python/R and support UTF-8 indirectly? The transformation dict could also be written to the model file.
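
The replace-and-restore idea could look roughly like the sketch below. It is only an illustration: `make_ascii_aliases` and the `f0`, `f1`, ... placeholder scheme are hypothetical and not part of LightGBM.

```python
def make_ascii_aliases(feature_names):
    """Return ASCII placeholders ('f0', 'f1', ...) and a dict to map them back."""
    aliases = [f"f{i}" for i in range(len(feature_names))]
    to_original = dict(zip(aliases, feature_names))
    return aliases, to_original


original = ["年龄", "città", "Age"]
aliases, to_original = make_ascii_aliases(original)

# `aliases` would be handed to the cpp side; the wrapper translates back
# whenever names are shown to the user.
restored = [to_original[a] for a in aliases]
```

Because the placeholders are positional, duplicate original names do not collide, but the mapping has to be stored somewhere to survive a save/load cycle.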

@OnlyFor I am not familiar with string encoding, but I think that is not trivial.
Maybe we can use something like base64 to encode and decode feature names.
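
For reference, a base64 round trip for a feature name can be sketched with the standard library alone. The helper names are hypothetical, and the URL-safe alphabet is just one possible choice.

```python
import base64


def encode_name(name):
    """Encode a (possibly non-ASCII) feature name as an ASCII-only string."""
    return base64.urlsafe_b64encode(name.encode("utf-8")).decode("ascii")


def decode_name(encoded):
    """Recover the original feature name from its base64 form."""
    return base64.urlsafe_b64decode(encoded.encode("ascii")).decode("utf-8")
```

The round trip is lossless, but even a plain ASCII name changes: `encode_name("Age")` gives `QWdl`.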

@guolinke Encoding feature names will hurt the model file readability for humans, I guess.

@guolinke Should this issue be included in #2302?

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.

Hello everyone, I am getting the error below. How can I solve it?
LightGBMError: Do not support non-ascii characters in feature name.

@PointCloudNiphon Hi!

For now, there cannot be any non-ASCII symbols in string model representation. So, you should simply rename your feature names before passing them into LightGBM.
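
One possible way to do that renaming is sketched below with the standard library only. `to_ascii_name` is a hypothetical helper: it drops accents where Unicode provides a decomposition and otherwise writes the code point as `uXXXX`.

```python
import unicodedata


def to_ascii_name(name):
    """Return an ASCII-only version of a feature name."""
    out = []
    # NFKD splits accented letters into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFKD", name):
        if ch.isascii():
            out.append(ch)
        elif unicodedata.combining(ch):
            continue  # drop the accent itself
        else:
            out.append(f"u{ord(ch):04X}")  # no ASCII base letter available
    return "".join(out)
```

This is lossy: two different names (for example with and without an accent) can map to the same result, so it is worth checking the renamed list for duplicates before training.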

Hello everyone, I am getting the error below. How can I solve it?
LightGBMError: Do not support non-ascii characters in feature name.
Below are my features. I cannot find any non-ASCII character in them.
| ID | Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | Club Logo | Value | Wage | Special | Preferred Foot | International Reputation | Weak Foot | Skill Moves | Work Rate | Body Type | Real Face | Position | Jersey Number | Joined | Loaned From | Contract Valid Until | Height | Weight | LS | ST | RS | LW | LF | CF | RF | RW | LAM | CAM | RAM | LM | LCM | CM | RCM | RM | LWB | LDM | CDM | RDM | RWB | LB | LCB | CB | RCB | RB | Crossing | Finishing | HeadingAccuracy | ShortPassing | Volleys | Dribbling | Curve | FKAccuracy | LongPassing | BallControl | Acceleration | SprintSpeed | Agility | Reactions | Balance | ShotPower | Jumping | Stamina | Strength | LongShots | Aggression | Interceptions | Positioning | Vision | Penalties | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause

How can I solve this?

@rajibrj43 Hi! Indeed, there are no non-ASCII symbols in your feature names, and I cannot reproduce your issue: LightGBM trains just fine with those feature names.

import numpy as np
import lightgbm as lgb

feature_names_from_comment = "| ID | Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | Club Logo | Value | Wage | Special | Preferred Foot | International Reputation | Weak Foot | Skill Moves | Work Rate | Body Type | Real Face | Position | Jersey Number | Joined | Loaned From | Contract Valid Until | Height | Weight | LS | ST | RS | LW | LF | CF | RF | RW | LAM | CAM | RAM | LM | LCM | CM | RCM | RM | LWB | LDM | CDM | RDM | RWB | LB | LCB | CB | RCB | RB | Crossing | Finishing | HeadingAccuracy | ShortPassing | Volleys | Dribbling | Curve | FKAccuracy | LongPassing | BallControl | Acceleration | SprintSpeed | Agility | Reactions | Balance | ShotPower | Jumping | Stamina | Strength | LongShots | Aggression | Interceptions | Positioning | Vision | Penalties | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause"
feature_names = [i.strip() for i in feature_names_from_comment.split('|') if i]

X = np.random.random((100, len(feature_names)))
y = np.random.random((100,))

lgb.LGBMRegressor().fit(X, y, feature_name=feature_names)

Hello,

I have noticed this issue recently and I think the current behavior is not great.
However, I also agree with https://github.com/microsoft/LightGBM/issues/2478#issuecomment-536375659 and https://github.com/microsoft/LightGBM/issues/2226#issuecomment-502113232.

So for now, I have created a workaround for Python.

import types

# gbm is an instance of LGBMModel.
# you have feature_names
gbm.booster_._feature_names = feature_names
gbm.booster_.feature_name = types.MethodType(lambda self: self._feature_names, gbm.booster_)
# NOTICE: `pickle` can't dump `lambda`, so you can use `dill` or `cloudpickle`

In the future, I (or someone) will rework the python-package to keep feature_names in it (outside cpp).

OK, I will give it a try, thanks.


@PointCloudNiphon Hi!

For now, there cannot be any non-ASCII symbols in string model representation. So, you should simply rename your feature names before passing them into LightGBM.

@StrikerRUS : This bug is more frequent than you seem to realize. It arises not because the original database columns / feature names are non-ASCII, but from the use of a one-hot encoder (e.g. get_dummies() from pandas), which appends non-ASCII feature levels (variable values) to these ASCII column names. So now, after such encoding, even categorical feature levels cannot contain regional characters... and they usually do (outside of the US). Your closest competitor, XGBoost, does not impose such arbitrary and US-centric restrictions.
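
To illustrate how that happens, here is a small pure-Python sketch. It only mimics the `<column>_<level>` naming that pandas `get_dummies()` uses by default; both helper functions and the sample values are hypothetical.

```python
def one_hot_column_names(column, levels, sep="_"):
    """Build one-hot style column names of the form '<column><sep><level>'."""
    return [f"{column}{sep}{level}" for level in levels]


def non_ascii_names(names):
    """Return the names that contain at least one non-ASCII character."""
    return [n for n in names if not n.isascii()]


# The column name is ASCII, but two of the category levels are not.
names = one_hot_column_names("Nationality", ["Poland", "Côte d'Ivoire", "España"])
offending = non_ascii_names(names)
```

Running `non_ascii_names` over the column names of the encoded frame is a quick way to find which features trigger the error.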

@rajibrj43 Hi! Indeed, there are no non-ASCII symbols in your feature names, and I cannot reproduce your issue: LightGBM trains just fine with those feature names.

Because you used random numbers for all variables, including Name:) Try some non-ASCII characters with one-hot encoder... @StrikerRUS, once you replicate, could you please /reopen and repair?

@mirekphd I see, get_dummies() adds some headache and requires one more manual step for renaming column names. But please note that LightGBM doesn't require one-hot encoding for categorical variables and normally you won't use that function during a preprocessing phase: https://lightgbm.readthedocs.io/en/latest/Quick-Start.html#categorical-feature-support.
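
As a rough sketch of that alternative, the category levels can be turned into integer codes so that they never end up in a feature name. `encode_levels` is a hypothetical helper; the resulting integer column would keep its ASCII name and be declared through LightGBM's `categorical_feature` parameter.

```python
def encode_levels(values):
    """Return integer codes for the values and the list of levels they index."""
    levels = sorted(set(values))
    code_of = {level: i for i, level in enumerate(levels)}
    return [code_of[v] for v in values], levels


# Non-ASCII levels stay in the data (as codes), not in the column names.
codes, levels = encode_levels(["España", "Polska", "España", "Česko"])
```

Keeping `levels` around lets you translate the codes back to readable values after prediction or when inspecting the model.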

Speaking about reopening, please refer to https://github.com/microsoft/LightGBM/issues/2478#issuecomment-552484900.

@henry0312 Can we mark this issue as resolved via #2976 or is it better to wait for @jameslamb's PR for the R part?

I believe that we can, but it's better to wait until the R tests pass, because I'm not an expert in R.

I already created #2983 to capture the R-specific work @henry0312

Thank you @jameslamb ! Sorry, I didn't notice it. Then I'm putting a tick in our feature requests hub for this issue, because the model file supports UTF-8 after #2976. R-specific progress will be tracked in your separate issue.

from math import sqrt

import lightgbm as lgb
import numpy as np
from sklearn.metrics import mean_squared_log_error

# X_train, y_train, X_cv and y_cv are assumed to be defined earlier (not shown in the comment).
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_cv, label=y_cv)

param = {
    'objective': 'regression',
    'boosting': 'gbdt',
    'metric': 'l2_root',
    'learning_rate': 0.05,
    'num_iterations': 350,
    'num_leaves': 31,
    'max_depth': -1,
    'min_data_in_leaf': 15,
    'bagging_fraction': 0.85,
    'bagging_freq': 1,
    'feature_fraction': 0.55,
}

lgbm = lgb.train(params=param,
                 verbose_eval=50,
                 train_set=train_data,
                 valid_sets=[test_data])

y_pred_lgbm = lgbm.predict(X_cv)
# np.exp suggests the targets were log-transformed before training.
print('RMSLE:', sqrt(mean_squared_log_error(np.exp(y_cv), np.exp(y_pred_lgbm))))

I'm getting this error:
LightGBMError: Do not support non-ASCII characters in feature name.

@Shubhammishra-21 please use the latest master branch.

Rename the variables whose names contain tildes or other symbols that are not used in American English (that is, non-ASCII characters).

I got the error LightGBMError: Do not support special JSON characters in feature name. when using LightGBM 3.0 on a Windows 10 machine. It seems that issues with special characters were fixed in this release, but perhaps not on Windows?
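
A possible workaround sketch for this second error, under a stated assumption: the rejected characters are taken here to be JSON's structural characters (`"`, `,`, `:`, `[`, `]`, `{`, `}`), which is a guess based on the error text, so check it against your LightGBM version. `strip_json_special` is a hypothetical helper.

```python
import re

# Assumption: these are the characters LightGBM treats as "special JSON characters".
JSON_SPECIAL = re.compile(r'[\[\]{}":,]')


def strip_json_special(name, replacement="_"):
    """Replace each assumed JSON special character in a feature name."""
    return JSON_SPECIAL.sub(replacement, name)
```

Non-ASCII letters are left untouched by this helper, so it targets only the JSON-character error and not the earlier non-ASCII one.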
