LightGBM: Some category values are missing when saving a model to a text file

Created on 31 Jan 2019 · 8 Comments · Source: microsoft/LightGBM

Hi, I have a problem with saving a LightGBM model to a text file.

My purpose: after training a model in Python, I want to save it to a text file and then use JPMML to convert that text file to PMML for a Java application.

When saving the model to a text file, I noticed that for some categorical features the text file is missing some values. For example, a feature in the training data can take the values {1, 2, 3, 4, 5, 6}, but the feature_infos section of the text file lists only {1, 2, 3, 5, 6} ({4} is missing). I don't know why this happens.
I tried two LightGBM models, the sklearn one and the standalone one, but neither text file contains {4}.

My questions are:
1. Is this a bug in LightGBM?
2. Is feature_infos important? Does changing it manually have any impact on the model?

I provide the code (jupyter-notebook) and the data here

Thank you.

chaupmcs

All 8 comments

To improve the generalization ability, some categories with little data will be ignored in the model.
This is a feature, not a bug.
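
To see which values were dropped for each feature, the model text file can be compared against the training data. The following is a minimal sketch, assuming the header of the text file has a feature_names line and a feature_infos line in which a categorical feature is written as colon-separated values (such as 1:2:3:5:6), a numerical feature as [min:max], and an unused feature as none. The helper names and the sample header are hypothetical and are not part of LightGBM.

```python
def parse_feature_infos(model_text):
    """Map each feature name to the set of category values listed in feature_infos.

    Numerical features (written as "[min:max]") and unused features ("none")
    map to None. The layout is an assumption about the text file format.
    """
    names, infos = [], []
    for line in model_text.splitlines():
        if line.startswith("feature_names="):
            names = line.split("=", 1)[1].split()
        elif line.startswith("feature_infos="):
            infos = line.split("=", 1)[1].split()
    result = {}
    for name, info in zip(names, infos):
        if info == "none" or info.startswith("["):
            result[name] = None
        else:
            result[name] = {int(v) for v in info.split(":")}
    return result


def missing_categories(model_text, observed):
    """Return, per feature, the observed training values absent from the model file."""
    stored = parse_feature_infos(model_text)
    return {
        name: sorted(set(values) - stored[name])
        for name, values in observed.items()
        if stored.get(name) is not None and set(values) - stored[name]
    }


# Hypothetical two-feature header, shaped like the case in the question.
sample = "feature_names=Age Occupation\nfeature_infos=[17:90] 1:2:3:5:6\n"
print(missing_categories(sample, {"Occupation": [1, 2, 3, 4, 5, 6]}))
```

In practice, model_text would be the content of the saved model file and observed would be built from the unique values of each categorical column in the training data.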

guolinke on 31 Jan 2019

@guolinke Thanks for the quick response.
Is there any way to save the text file with all of the values? I need the full information before converting the text file to PMML.
As for the second question: if there is no official way to keep all the values, is it safe to manually change the values in the feature_infos section of the text file, for example by manually adding the value "4" to the occupation feature in the example?
Thanks.

chaupmcs on 31 Jan 2019

This issue describes a workflow Python/Scikit-Learn -> LGBM text file -> JPMML-LightGBM, which relies on model "data schema" information as stored in the LGBM text file.

A much better workflow would be Python/Scikit-Learn -> SkLearn2PMML, which gets the model "data schema" straight from the Scikit-Learn pipeline. So, even if the LightGBM algorithm decides that some category levels are insignificant and does not store them in the intermediate LGBM text file, the SkLearn2PMML converter still knows about them!

Example workflow:

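from lightgbm import LGBMRegressor
from sklearn.preprocessing import LabelBinarizer
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain
from sklearn2pmml.pipeline import PMMLPipeline

# X is a pandas DataFrame containing "my_categorical_column"; y is the target
pipeline = PMMLPipeline([
  ("mapper", DataFrameMapper([
    ("my_categorical_column", [CategoricalDomain(), LabelBinarizer()])
  ])),
  ("regressor", LGBMRegressor())
])
pipeline.fit(X, y)

# Convert the fitted pipeline straight to PMML
sklearn2pmml(pipeline, "pipeline.pmml")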

vruusmann on 31 Jan 2019

@vruusmann: Thank you so much for your advice. It really helped me a lot.
However, I still have a problem with the new method. I applied your suggestion to the example (at the end of the file).

I copy some pieces of the code below. When I calculate the log_loss for the pipeline model, it returns a different result from the _normal_ way. How can we check that the model created by the pipeline is the same as the one LightGBM produces directly? The second thing I need help with is why the two models return different log_loss values. What am I missing here? Thanks a lot.

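import lightgbm as lgb
from sklearn.metrics import log_loss
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline

mapper = DataFrameMapper([
   ('Employment', CategoricalDomain()),
   ('Education', CategoricalDomain()),
   ('Marital', CategoricalDomain()),
   ('Occupation', CategoricalDomain()),
   ('Gender', CategoricalDomain()),
   ('Deductions', CategoricalDomain()),
   (['Hours', 'Income', 'Age'], ContinuousDomain(with_data = False))
])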

classifier = lgb.LGBMClassifier(n_estimators=2, learning_rate=0.1, num_leaves=10, max_depth=2)

pipeline = PMMLPipeline([
   ("mapper", mapper),
   ("classifier", classifier)
])

pipeline.fit(X = X_train, y = y_train)
sklearn2pmml(pipeline, base_link + "/pipeline_2.pmml")

## compare the log_loss:
log_loss(y_train, pipeline.predict_proba(X_train))    # 0.509684592242956
log_loss(y_train, lgb_sklearn_model.predict_proba(X_train))    # 0.4975041326312271
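
Besides comparing a single aggregate metric, one way to check whether two fitted models agree is to compare their predicted probabilities element by element. This is a minimal sketch using only the standard library; the function names, the tolerance, and the toy values are assumptions for illustration, and the inputs stand in for the two predict_proba outputs above.

```python
def max_abs_difference(probs_a, probs_b):
    """Largest element-wise gap between two equally shaped probability tables."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both inputs must have the same number of rows")
    return max(
        abs(a - b)
        for row_a, row_b in zip(probs_a, probs_b)
        for a, b in zip(row_a, row_b)
    )


def same_predictions(probs_a, probs_b, tol=1e-9):
    """True when no predicted probability differs by more than tol."""
    return max_abs_difference(probs_a, probs_b) <= tol


# Toy values standing in for the two predict_proba outputs.
pipeline_probs = [[0.75, 0.25], [0.25, 0.75]]
direct_probs = [[0.75, 0.25], [0.5, 0.5]]
print(max_abs_difference(pipeline_probs, direct_probs))
print(same_predictions(pipeline_probs, pipeline_probs))
```

A nonzero gap on the training data indicates that the two models were fitted differently, which is more direct evidence than two different log_loss values.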

chaupmcs on 31 Jan 2019

@chaupmcs I had forgotten about this, but you also need to suppress LGBM's default "categorical feature auto-detection" algorithm by supplying the list of categorical column indices as the categorical_feature fit parameter.

So, my above Python code example should really look like this:

pipeline = PMMLPipeline([
  ("mapper", DataFrameMapper([ .. ])),
  ("regressor", LGBMRegressor())
])
# THIS: specify '<estimator step name>__categorical_feature' kwarg
pipeline.fit(X, y, regressor__categorical_feature = [0])
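
The double-underscore naming works because a scikit-learn pipeline splits each fit keyword on the first "__" and hands the remainder to the step with the matching name. The following is a simplified, standard-library illustration of that convention, not the scikit-learn implementation; split_fit_params is a hypothetical name.

```python
def split_fit_params(fit_params):
    """Group '<step name>__<parameter>' keyword arguments by step name."""
    routed = {}
    for key, value in fit_params.items():
        if "__" not in key:
            raise ValueError(
                "expected '<step name>__<parameter>', got {!r}".format(key)
            )
        # Split on the first "__" only, so parameter names may contain "__".
        step, param = key.split("__", 1)
        routed.setdefault(step, {})[param] = value
    return routed


print(split_fit_params({"regressor__categorical_feature": [0]}))
```

The step name in the keyword must therefore match the name given in the pipeline definition: "regressor" in the example above, or "classifier" for a pipeline whose final step is named that way.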

Complete example here:
https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L206-L226

Pay attention to this line:
https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L226

vruusmann on 31 Jan 2019

@vruusmann Thanks for pointing out the problem.
Now the pipeline model returns the same log_loss as the LightGBM model does.
However, I can no longer save the model in PMML format.

params = {'classifier__categorical_feature': [2, 3, 4, 5] }   # add this line
pipeline.fit(X = X_train, y = y_train, **params)     # put the params into  fit() 
pipeline = make_pmml_pipeline(pipeline, X_train.columns.values, y_train.name)      # add this line
pipeline.predict_proba(X_train)   # ok, no errors, returns the same with lightgbm model

sklearn2pmml(pipeline, base_link + "/pipeline_2.pmml")    # Error here
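
One thing worth checking here, offered as an assumption rather than a diagnosis of the error: the final estimator in a pipeline receives the mapper's output, so integer indices in categorical_feature refer to column positions in that output and not in the original DataFrame. The sketch below derives the indices from the column order of the mapper shown earlier in the thread. The helper name is hypothetical, and treating all six CategoricalDomain columns as categorical is an assumption that the thread does not confirm.

```python
def categorical_indices(columns):
    """Return the positions of categorical columns.

    columns: list of (name, is_categorical) pairs in mapper output order.
    """
    return [i for i, (_, is_categorical) in enumerate(columns) if is_categorical]


# Column order as produced by the mapper defined earlier in the thread.
mapper_columns = [
    ("Employment", True),
    ("Education", True),
    ("Marital", True),
    ("Occupation", True),
    ("Gender", True),
    ("Deductions", True),
    ("Hours", False),
    ("Income", False),
    ("Age", False),
]
print(categorical_indices(mapper_columns))
```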

In the link you gave me, I see you're using store_pkl to dump to pickle files, but there is no example for the PMML format. I don't know where the error comes from. Please help me fix it. Thank you!

---- The error is below (the full example code and result are here)

Standard output is empty
Standard error:
Jan 31, 2019 5:58:04 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Jan 31, 2019 5:58:04 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 30 ms.
Jan 31, 2019 5:58:04 PM org.jpmml.sklearn.Main run
INFO: Converting..
Jan 31, 2019 5:58:04 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IndexOutOfBoundsException: Index: 13, Size: 13
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at org.jpmml.lightgbm.Tree.selectValues(Tree.java:240)
at org.jpmml.lightgbm.Tree.encodeNode(Tree.java:151)
at org.jpmml.lightgbm.Tree.encodeNode(Tree.java:186)
at org.jpmml.lightgbm.Tree.encodeTreeModel(Tree.java:94)
at org.jpmml.lightgbm.ObjectiveFunction.createMiningModel(ObjectiveFunction.java:66)
at org.jpmml.lightgbm.BinomialLogisticRegression.encodeMiningModel(BinomialLogisticRegression.java:49)
at org.jpmml.lightgbm.GBDT.encodeMiningModel(GBDT.java:287)
at lightgbm.sklearn.BoosterUtil.encodeModel(BoosterUtil.java:58)
at lightgbm.sklearn.LGBMClassifier.encodeModel(LGBMClassifier.java:39)
at lightgbm.sklearn.LGBMClassifier.encodeModel(LGBMClassifier.java:26)
at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:215)
at org.jpmml.sklearn.Main.run(Main.java:145)
at org.jpmml.sklearn.Main.main(Main.java:94)

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 13, Size: 13
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at org.jpmml.lightgbm.Tree.selectValues(Tree.java:240)
at org.jpmml.lightgbm.Tree.encodeNode(Tree.java:151)
at org.jpmml.lightgbm.Tree.encodeNode(Tree.java:186)
at org.jpmml.lightgbm.Tree.encodeTreeModel(Tree.java:94)
at org.jpmml.lightgbm.ObjectiveFunction.createMiningModel(ObjectiveFunction.java:66)
at org.jpmml.lightgbm.BinomialLogisticRegression.encodeMiningModel(BinomialLogisticRegression.java:49)
at org.jpmml.lightgbm.GBDT.encodeMiningModel(GBDT.java:287)
at lightgbm.sklearn.BoosterUtil.encodeModel(BoosterUtil.java:58)
at lightgbm.sklearn.LGBMClassifier.encodeModel(LGBMClassifier.java:39)
at lightgbm.sklearn.LGBMClassifier.encodeModel(LGBMClassifier.java:26)
at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:215)
at org.jpmml.sklearn.Main.run(Main.java:145)
at org.jpmml.sklearn.Main.main(Main.java:94)

RuntimeError Traceback (most recent call last)
in ()
28 pipeline = make_pmml_pipeline(pipeline, X_train.columns.values, y_train.name)
29
---> 30 sklearn2pmml(pipeline, base_link + "/pipeline_2.pmml")
31
32 pipeline.predict_proba(X_train)

~/.local/lib/python3.6/site-packages/sklearn2pmml/__init__.py in sklearn2pmml(pipeline, pmml, user_classpath, with_repr, debug, java_encoding)
244 print("Standard error is empty")
245 if retcode:
--> 246 raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams")
247 finally:
248 if debug:
RuntimeError: The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams

chaupmcs on 31 Jan 2019

@chaupmcs It's probably this line, which is completely unnecessary: pipeline = make_pmml_pipeline(pipeline)

Anyway, this discussion is getting off-topic, and we should move it to JPMML's "namespace" instead.

vruusmann on 31 Jan 2019

@vruusmann Thank you for the advice. I tried commenting out the line, but the error still occurred. I will close this issue and open a new one in JPMML. Thanks again 👍

chaupmcs on 31 Jan 2019
