LightGBM does not yet use the training data to inform the way it handles missing values. Instead, it seems missing values are simply treated as 0s, leading to worse predictions than frameworks such as XGBoost that explicitly process missing values.
XGBoost's solution works quite well and the gist of it seems compatible with LightGBM's procedures, so this would be a good start. The general idea is to alter the best-threshold functions to run twice: once with missing values assigned to the left node, and once with missing values assigned to the right node, keeping track of the best missing-value assignment for each node. More detailed information about XGBoost's implementation is available here: https://arxiv.org/pdf/1603.02754v3.pdf.
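To make the two-pass idea concrete, here is a minimal sketch assuming an exact (pre-histogram) split search and an XGBoost-style gain with regularization term `lam`. The function name `best_split_with_missing` and its signature are illustrative, not LightGBM's API:

```python
import numpy as np

def best_split_with_missing(x, g, h, lam=1.0):
    """Sketch of XGBoost-style split finding with missing values:
    scan thresholds twice, sending missing values left then right,
    and keep the direction that yields the higher gain.
    x: feature values (np.nan = missing); g/h: per-row gradients/hessians."""
    def score(G, H):
        return G * G / (H + lam)

    present = ~np.isnan(x)
    G_all, H_all = g.sum(), h.sum()
    G_miss, H_miss = g[~present].sum(), h[~present].sum()
    order = np.argsort(x[present])
    xs, gs, hs = x[present][order], g[present][order], h[present][order]

    best = (-np.inf, None, None)  # (gain, threshold, missing_in_left)
    for miss_left in (True, False):
        # Seed the left accumulators with the missing bucket (or not).
        GL = G_miss if miss_left else 0.0
        HL = H_miss if miss_left else 0.0
        for i in range(len(xs) - 1):
            GL += gs[i]
            HL += hs[i]
            if xs[i] == xs[i + 1]:
                continue  # cannot split between equal values
            gain = score(GL, HL) + score(G_all - GL, H_all - HL) - score(G_all, H_all)
            if gain > best[0]:
                best = (gain, (xs[i] + xs[i + 1]) / 2, miss_left)
    return best
```

The outer loop is the whole trick: the same threshold scan runs twice, differing only in which child the missing bucket's gradient statistics are seeded into.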
@Allardvm
I think it is not hard to adapt this feature. Following are some key points:
Missing values could be represented as nan or na. Welcome to contribute to this feature; I think it may have more problems, and we can have more discussions during the development.
Thanks!
Yes, it shouldn't be too hard to implement the algorithm itself, but it's gonna be tricky to avoid subtle bugs. As for your points:
Some other points that should probably be tackled:
Add a `bool missing_in_left;` to SplitInfo that is true when missing values for a particular feature should be assigned to the left leaf and false when they should be assigned to the right leaf.

Moreover, it seems that columns with NA + one other value are considered constant and therefore dropped:
"[LightGBM] [Warning] Ignoring Column_XX , only has one value"
And therefore I get worse results than with XGBoost.
One major trick in XGBoost: when you have a column with only 2 values (Sex: Man/Woman, or a dummied categorical feature with no pre-existing NA), you can encode one of the two values as NA. XGBoost handles this perfectly, and the per-column complexity is reduced from nlog(n) to n_{not na}log(n_{not na}). There is therefore a huge boost in speed without a loss of accuracy.
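A hypothetical illustration of that encoding trick in pandas (the column and its values are made up for the example): for a binary column with no pre-existing NAs, re-encode one of the two values as NaN, so a sparse-aware learner only scans the n_{not na} non-missing rows instead of all n rows.

```python
import numpy as np
import pandas as pd

# Binary feature with no real missing values.
sex = pd.Series(["Man", "Woman", "Woman", "Man", "Woman"])

# "Woman" -> 1.0, "Man" -> NaN; the split-finding loop now only needs
# to touch the rows where the value is present.
encoded = sex.map({"Woman": 1.0, "Man": np.nan})
```

The information content is unchanged (the learner can still route the "NaN" rows to either child), but the per-column scan shrinks to the stored rows only.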
@guolinke What is the current state of missing values in LightGBM?
I've seen commits for it but I'm not sure what belongs to what currently.
@Laurae2 It just treats NA as 0 for now.
@guolinke I don't think it is good to treat zero values as missing values, because categorical features are often dummied and represented as 0 or 1. Zero doesn't mean missing.
@henry0312 I thought about that.
@guolinke I'm sorry, I don't understand what you mean....
1 for missing, 0 for not missing
Why does 1 mean missing?
(I think dummy often do one-hot encoding)
To treat zero as zero after #516, do many users have to convert zero values manually so that their absolute values are greater than 1e-20f?
I wonder if there is an another way to treat missing values (such as NaN)
@henry0312
For one-hot coding without missing values, the current solution works well, since zero will only be treated as zero.
For one-hot coding with missing values, we can use an additional column to represent missing or not. This still works.
If you don't have missing values, you can still treat zero as zero; there is no need to change anything, since zero will still be treated as zero internally.
If you have both missing values and zero values, you can actually just set both of them to zero.
Keeping both NaN and zero for every feature would break many designs in LightGBM and cause a slowdown.
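The additional-column workaround for one-hot coding with missing values can be sketched like this (the `color` feature is a made-up example):

```python
import pandas as pd

# Categorical feature where some rows are genuinely missing.
color = pd.Series(["red", "blue", None, "red"])

# One-hot encode the levels; missing rows get all zeros in these columns.
dummies = pd.get_dummies(color, prefix="color", dtype=int)

# Extra indicator column so "missing" is distinguishable from "all zeros".
dummies["color_missing"] = color.isna().astype(int)
```

With the indicator column, a zero in the one-hot columns unambiguously means "not this level", so zero never has to double as "missing".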
@guolinke Thank you so much for your explanation!
If you don't have missing values, you can still treat zero as zero; there is no need to change anything, since zero will still be treated as zero internally.
If you have both missing values and zero values, you can actually just set both of them to zero.
I'm glad to know that.
@guolinke By the way,
If you have both missing values and zero values, you can actually just set both of them to zero.
Why will this work well? Is there some clever algorithm behind it?
@henry0312
What I mean is that it doesn't have much effect on accuracy if you set both missing and zero to zero.
They cannot be distinguished internally, since their values are the same.
@guolinke
I got it.
It is an acceptable compromise.
@henry0312 I think what @guolinke was explaining is that the user should treat inputs to LightGBM as if they were sparse matrices, which means 0 for NA. This was LightGBM's previous behavior, but the other way around.
@guolinke
They cannot be distinguished internally, since their values are the same.
Even if the overall impact tends to be small, this treatment still seems less than ideal, especially if the user is unaware that true zeros must be manually distinguished in advance. Would either of these options be a reasonable alternative?
a) Instead of internally mapping missing values to 0, could a less common number be used? I.e., could missing values be internally mapped to -1.7e308 instead? Since zero tends to be a far more common number in datasets than -1.7e308, this would likely be less harmful.
b) Could true zeros be automatically changed to 1e-20f to distinguish them from missing values? Being unable to distinguish 0 and 1e-20f seems less harmful than being unable to distinguish 0 and missing values.
Thoughts?
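Option (b) above can be sketched as a tiny preprocessing step; `preserve_zeros` is a hypothetical helper, not part of any library:

```python
import numpy as np

def preserve_zeros(x, eps=1e-20):
    """Shift exact zeros to a tiny positive value so they survive a
    pipeline that maps 0 to 'missing'. eps is well below any realistic
    measurement precision, so split points are effectively unaffected."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(x == 0.0, eps, x)
```

After this transform, a remaining 0 can only mean "missing", while true zeros live on as `eps`.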
@rgranvil
the main difficulty of doing that is that missing and zero are conflated when the input data is in a sparse format (like libsvm or CSR). Normally, the non-represented value is zero, right? However, when the number of missing values is large, we may want to let the non-represented value be missing.
I think we can have a parameter named zero_is_missing. If zero_is_missing=true, we treat all zeros as missing values (including all non-represented values in sparse format). Otherwise, we use NA to represent missing.
I will figure out a way to support this when I have time.
@guolinke I have some problems with poor accuracy after #516.
I tried adding very small values to zeros, and flags to represent missing, but I couldn't get better accuracy than before #516.
I want a switch option for whether we support missing values or not...
@henry0312
OK, I can add it.
BTW, does the accuracy on the training set improve? It may be over-fitting.
I appreciate it.
BTW, does the accuracy on the training set improve? It may be over-fitting.
I'll check.
By the way, xgboost seems to have a missing option, though I don't know the details of it.
missing : float, optional
Value in the data which needs to be present as a missing value. If None, defaults to np.nan.
https://xgboost.readthedocs.io/en/latest/python/python_api.html
@henry0312 In xgboost, if the input is dense then you can specify which value means missing (NA by default). If using sparse input, the missing value is the value that is not represented in the sparse matrix (you may use a value other than 0 for missing in a sparse matrix).
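The ambiguity described above comes from the sparse format itself. A minimal sketch with a hand-rolled sparse row (parallel index/value lists, like one row of a CSR matrix; `sparse_get` is a hypothetical helper):

```python
def sparse_get(indices, values, j):
    """Read column j of a sparse row stored as (index, value) pairs."""
    for idx, val in zip(indices, values):
        if idx == j:
            return val
    return 0.0  # non-stored entry: indistinguishable from a true zero

# Dense row would be [1.0, 0.0, 3.0]; only non-zeros are stored.
row_indices, row_values = [0, 2], [1.0, 3.0]
```

Reading a non-stored column yields 0.0, and nothing in the format says whether that 0.0 means "the value is zero" or "the value is missing"; that is exactly why a zero_is_missing-style switch has to be a global convention rather than per-cell information.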
@guolinke
BTW, does the accuracy on training set have improvement ? it may be over-fitting.
When I tried adding a small value to zeros so they could be distinguished from missing, or adding flags to represent the missing category, training accuracy seemed better (maybe over-fitting), but test accuracy wasn't better than before.
When I tried neither of these, training accuracy wasn't good, but test accuracy was also bad.
@Laurae2 Thanks! I got it.
@guolinke for svmlight/libsvm we may enforce missing as missing value, while for dense data we should let the user choose what is a missing value.
Default could be NA (R) / NaN (Python) / non numeric (CLI), and we let the wrappers choose the missing value identification by adding a parameter for dataset construction.
@guolinke How about trying mean(feature_values) in addition to min, 0 and max?
I guess using the mean would give better accuracy.
@henry0312
It is not so easy to use the mean, since we don't store the original feature values in the histogram-based algorithm.
Update:
Since a tree node only splits into 2 child nodes, using min and max is enough.
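The min/max argument can be illustrated with a small sketch (`fill_missing` is a hypothetical helper, not LightGBM code): for any threshold, bucketing missing with the minimum sends those rows to the left child, and bucketing with the maximum sends them right, which together cover both possible routings of a two-child split.

```python
import numpy as np

def fill_missing(x, direction):
    """Route missing values by mapping NaN to the feature's min ('left')
    or max ('right'); any threshold then sends them to that side."""
    x = np.asarray(x, dtype=np.float64)
    fill = np.nanmin(x) if direction == "left" else np.nanmax(x)
    return np.where(np.isnan(x), fill, x)
```

Trying both fills during split evaluation is equivalent to the explicit two-pass search, but needs no access to the raw values beyond the histogram's bin edges.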
@guolinke Is this issue fixed?
@Laurae2 not yet.
Can we have a dataset to test the effect of treating missing and zero values as the same value?
@guolinke I can generate one synthetic worst case scenario dataset and upload it here.
Moved to #744.