LightGBM does not yet use the training data to inform the way it handles missing values. Instead, it seems missing values are simply treated as 0s, leading to worse predictions than frameworks such as XGBoost that explicitly process missing values.
XGBoost's solution works quite well and the gist of it seems compatible with LightGBM's procedures, so this would be a good start. The general idea is to alter the best-threshold functions to run twice: once with missing values assigned to the left node, and once with missing values assigned to the right node, keeping track of the best missing-value assignment for each node. More detailed information about XGBoost's implementation is available here: https://arxiv.org/pdf/1603.02754v3.pdf.
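To make the two-pass idea concrete, here is a minimal sketch assuming an exact (pre-histogram) split search and an XGBoost-style gain with regularization term `lam`. The function name `best_split_with_missing` and its signature are illustrative, not LightGBM's API:

```python
import numpy as np

def best_split_with_missing(x, g, h, lam=1.0):
    """Sketch of XGBoost-style split finding with missing values:
    scan thresholds twice, sending missing values left then right,
    and keep the direction that yields the higher gain.
    x: feature values (np.nan = missing); g/h: per-row gradients/hessians."""
    def score(G, H):
        return G * G / (H + lam)

    present = ~np.isnan(x)
    G_all, H_all = g.sum(), h.sum()
    G_miss, H_miss = g[~present].sum(), h[~present].sum()
    order = np.argsort(x[present])
    xs, gs, hs = x[present][order], g[present][order], h[present][order]

    best = (-np.inf, None, None)  # (gain, threshold, missing_in_left)
    for miss_left in (True, False):
        # Seed the left accumulators with the missing bucket (or not).
        GL = G_miss if miss_left else 0.0
        HL = H_miss if miss_left else 0.0
        for i in range(len(xs) - 1):
            GL += gs[i]
            HL += hs[i]
            if xs[i] == xs[i + 1]:
                continue  # cannot split between equal values
            gain = score(GL, HL) + score(G_all - GL, H_all - HL) - score(G_all, H_all)
            if gain > best[0]:
                best = (gain, (xs[i] + xs[i + 1]) / 2, miss_left)
    return best
```

The outer loop is the whole trick: the same threshold scan runs twice, differing only in which child the missing bucket's gradient statistics are seeded into.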
@Allardvm
I think it is not hard to adapt this feature. Following are some key points:
Missing values could be represented as nan or na. Welcome to contribute to this feature; I think it may have more problems, and we can have more discussions during the development.
Thanks!
Yes, it shouldn't be too hard to implement the algorithm itself, but it's gonna be tricky to avoid subtle bugs. As for your points:
Some other points that should probably be tackled:
Add a `bool missing_in_left;` to SplitInfo that is true when missing values for a particular feature should be assigned to the left leaf and false when they should be assigned to the right leaf.

Moreover, it seems that columns with NA + one other value are considered constant and therefore dropped:
"[LightGBM] [Warning] Ignoring Column_XX , only has one value"
And therefore I get worse results than with XGBoost.
One major trick in XGBoost: when you have a column with only 2 values (Sex: Man/Woman, or a dummied categorical feature with no pre-existing NA), you can encode one of the two values as NA. XGBoost handles this perfectly, and the per-column complexity is reduced from nlog(n) to n_{not na}log(n_{not na}). There is therefore a huge boost in speed without a loss of accuracy.
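A hypothetical illustration of that encoding trick in pandas (the column and its values are made up for the example): for a binary column with no pre-existing NAs, re-encode one of the two values as NaN, so a sparse-aware learner only scans the n_{not na} non-missing rows instead of all n rows.

```python
import numpy as np
import pandas as pd

# Binary feature with no real missing values.
sex = pd.Series(["Man", "Woman", "Woman", "Man", "Woman"])

# "Woman" -> 1.0, "Man" -> NaN; the split-finding loop now only needs
# to touch the rows where the value is present.
encoded = sex.map({"Woman": 1.0, "Man": np.nan})
```

The information content is unchanged (the learner can still route the "NaN" rows to either child), but the per-column scan shrinks to the stored rows only.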
@guolinke What is the current state of missing values in LightGBM?
I've seen commits for it but I'm not sure what belongs to what currently.
@Laurae2 It just treats NA as 0 for now.
@guolinke I don't think it is good to treat zero values as missing values, because categorical features are often dummied and represented as 0 or 1. Zero doesn't mean missing.
@henry0312 I thought about that.
@guolinke I'm sorry, I don't understand what you mean....
1 for missing, 0 for not missing
Why does 1 mean missing?
(I think dummy often do one-hot encoding)
To treat zero as zero after #516, do many users have to convert zero values manually so that their absolute values are greater than 1e-20f?
I wonder if there is an another way to treat missing values (such as NaN)
@henry0312
For one-hot coding without missing values, the current solution works well, since zero will only be treated as zero.
For one-hot coding with missing values, we can use an additional column to represent missing or not. This still works.
If you don't have missing values, you can still treat zero as zero; there is no need to change anything, since zero will still be treated as zero internally.
If you have both missing values and zero values, you can actually just set both of them to zero.
Keeping both NaN and zero for every feature would break many designs in LightGBM and cause a slowdown.
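The additional-column workaround for one-hot coding with missing values can be sketched like this (the `color` feature is a made-up example):

```python
import pandas as pd

# Categorical feature where some rows are genuinely missing.
color = pd.Series(["red", "blue", None, "red"])

# One-hot encode the levels; missing rows get all zeros in these columns.
dummies = pd.get_dummies(color, prefix="color", dtype=int)

# Extra indicator column so "missing" is distinguishable from "all zeros".
dummies["color_missing"] = color.isna().astype(int)
```

With the indicator column, a zero in the one-hot columns unambiguously means "not this level", so zero never has to double as "missing".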
@guolinke Thank you so much for your explanation!
If you don't have missing values, you can still treat zero as zero; there is no need to change anything, since zero will still be treated as zero internally.
If you have both missing values and zero values, you can actually just set both of them to zero.
I'm glad to know that.
@guolinke By the way,
If you have both missing values and zero values, you can actually just set both of them to zero.
Why will this work well? Is there some clever algorithm behind it?
@henry0312
What I mean is that it doesn't have much effect on accuracy if you set both missing and zero to zero.
They cannot be distinguished internally, since their values are the same.
@guolinke
I got it.
It is an acceptable compromise.
@henry0312 I think what @guolinke was explaining is that the user should treat inputs to LightGBM as if they were sparse matrices, which means 0 for NA. This was LightGBM's previous behavior, but the other way around.
@guolinke
They cannot be distinguished internally, since their values are the same.
Even if the overall impact tends to be small, this treatment still seems less than ideal, especially if the user is unaware that true zeros must be manually distinguished in advance. Would either of these options be a reasonable alternative?
a) Instead of internally mapping missing values to 0, could a less common number be used? I.e., could missing values be internally mapped to -1.7e308 instead? Since zero tends to be a far more common number in datasets than -1.7e308, this would likely be less harmful.
b) Could true zeros be automatically changed to 1e-20f to distinguish them from missing values? Being unable to distinguish 0 and 1e-20f seems less harmful than being unable to distinguish 0 and missing values.
Thoughts?
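Option (b) above can be sketched as a tiny preprocessing step; `preserve_zeros` is a hypothetical helper, not part of any library:

```python
import numpy as np

def preserve_zeros(x, eps=1e-20):
    """Shift exact zeros to a tiny positive value so they survive a
    pipeline that maps 0 to 'missing'. eps is well below any realistic
    measurement precision, so split points are effectively unaffected."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(x == 0.0, eps, x)
```

After this transform, a remaining 0 can only mean "missing", while true zeros live on as `eps`.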
@rgranvil
the main difficulty of doing that is that missing and zero are conflated when the input data is in a sparse format (like libsvm or CSR). Normally, the non-represented value is zero, right? However, when the number of missing values is large, we may want to let the non-represented value be missing.
I think we can have a parameter named zero_is_missing. If zero_is_missing=true, we treat all zeros as missing values (including all non-represented values in sparse format). Otherwise, we use NA to represent missing.
I will figure out a way to support this when I have time.
@guolinke I have some problems with poor accuracy after #516.
I tried adding very small values to zeros, and flags to represent missing, but I couldn't get better accuracy than before #516.
I want a switch option for whether we support missing values or not...
@henry0312
OK, I can add it.
BTW, does the accuracy on the training set improve? It may be over-fitting.
I appreciate it.
BTW, does the accuracy on the training set improve? It may be over-fitting.
I'll check.
By the way, xgboost seems to have a missing option, though I don't know the details of it.
missing : float, optional
Value in the data which needs to be present as a missing value. If None, defaults to np.nan.
https://xgboost.readthedocs.io/en/latest/python/python_api.html
@henry0312 In xgboost, if the input is dense then you can specify which value means missing (NA by default). If using sparse input, the missing value is the value that is not represented in the sparse matrix (you may use a value other than 0 for missing in a sparse matrix).
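The ambiguity described above comes from the sparse format itself. A minimal sketch with a hand-rolled sparse row (parallel index/value lists, like one row of a CSR matrix; `sparse_get` is a hypothetical helper):

```python
def sparse_get(indices, values, j):
    """Read column j of a sparse row stored as (index, value) pairs."""
    for idx, val in zip(indices, values):
        if idx == j:
            return val
    return 0.0  # non-stored entry: indistinguishable from a true zero

# Dense row would be [1.0, 0.0, 3.0]; only non-zeros are stored.
row_indices, row_values = [0, 2], [1.0, 3.0]
```

Reading a non-stored column yields 0.0, and nothing in the format says whether that 0.0 means "the value is zero" or "the value is missing"; that is exactly why a zero_is_missing-style switch has to be a global convention rather than per-cell information.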
@guolinke
BTW, does the accuracy on training set have improvement ? it may be over-fitting.
When I tried adding a small value to zeros so they could be distinguished from missing, or adding flags to represent the missing category, training accuracy seemed better (maybe over-fitting), but test accuracy wasn't better than before.
When I tried neither of these, training accuracy wasn't good, but test accuracy was also bad.
@Laurae2 Thanks! I got it.
@guolinke for svmlight/libsvm we may enforce missing as missing value, while for dense data we should let the user choose what is a missing value.
Default could be NA (R) / NaN (Python) / non numeric (CLI), and we let the wrappers choose the missing value identification by adding a parameter for dataset construction.
@guolinke How about trying mean(feature_values) in addition to min, 0 and max?
I guess using the mean would give better accuracy.
@henry0312
It is not so easy to use the mean, since we don't store the original feature values in the histogram-based algorithm.
Update:
Since a tree node only splits into 2 child nodes, using min and max is enough.
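The min/max argument can be illustrated with a small sketch (`fill_missing` is a hypothetical helper, not LightGBM code): for any threshold, bucketing missing with the minimum sends those rows to the left child, and bucketing with the maximum sends them right, which together cover both possible routings of a two-child split.

```python
import numpy as np

def fill_missing(x, direction):
    """Route missing values by mapping NaN to the feature's min ('left')
    or max ('right'); any threshold then sends them to that side."""
    x = np.asarray(x, dtype=np.float64)
    fill = np.nanmin(x) if direction == "left" else np.nanmax(x)
    return np.where(np.isnan(x), fill, x)
```

Trying both fills during split evaluation is equivalent to the explicit two-pass search, but needs no access to the raw values beyond the histogram's bin edges.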
@guolinke Is this issue fixed?
@Laurae2 not yet.
Can we have a dataset to test the effect of treating missing and zero values as the same value?
@guolinke I can generate one synthetic worst case scenario dataset and upload it here.
Moved to #744.