Package used (python/R/jvm/C++): python / R
Hi,
Thank you for this awesome work! I am working on a project and have used XGBoost as my ML algorithm. Normally, the next step would be to use this model to optimize some variables; for example, with linear regression we can model the problem and then use linear programming to obtain the input values that optimize the target variable.
However, with XGBoost we do not have an explicit equation. Is there any method I can use to obtain the variable values that give the best (highest) target value, i.e. some form of optimization?
Regards
Germayne
Glad to hear that you found XGBoost to be useful. First of all, you should definitely try out gblinear, which gives you a linear regression model. There is active development for supporting linear boosting on the GPU.
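To illustrate the gblinear route, here is a minimal sketch with synthetic data (the parameter choices are illustrative only): with booster='gblinear' the fitted model is just a bias plus one weight per feature, which you can read off the dump and feed into standard linear-programming tooling.

import numpy as np
import xgboost

# Synthetic regression data, purely for illustration
rng = np.random.RandomState(0)
X = rng.rand(500, 14)
y = X.dot(rng.rand(14)) + 0.1 * rng.rand(500)
dtrain = xgboost.DMatrix(X, label=y)

# A linear booster: prediction = bias + sum_i weight_i * x_i
params = {'booster': 'gblinear'}
bst = xgboost.train(params, dtrain, num_boost_round=50)

# The dump lists the bias followed by one weight per feature
print(bst.get_dump()[0])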
As for the tree model, here's what you may want to try:
1. Gather all split thresholds used in all the trees. Let threshold[fid] represent the list of distinct thresholds used for feature fid. You should be able to do this by dumping the model.

2. Sort each threshold[fid] in ascending order.

3. Pick values x[fid][0], x[fid][1], x[fid][2], ... so as to satisfy the inequalities

       x[fid][0] < threshold[fid][0]
       threshold[fid][0] <= x[fid][1] < threshold[fid][1]
       threshold[fid][1] <= x[fid][2] < threshold[fid][2]
       ...
       threshold[fid][n-1] <= x[fid][n] < threshold[fid][n]
       threshold[fid][n] <= x[fid][n+1]

   An easy way to do this is to take the midpoint of each interval. Repeat this step for every threshold list threshold[fid].
4. Evaluate the model at every point of the Cartesian product

       { x[0][i] : i = 0, 1, ... } x { x[1][i] : i = 0, 1, ... } x ... x { x[d-1][i] : i = 0, 1, ... }

   (d = number of features) and find the input that maximizes the output. This will consider n^d possibilities, where n is an upper bound on the number of thresholds per feature. (A toy example of these steps is sketched just below.)
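As a toy illustration of steps 1-4 (made-up thresholds, not taken from a real model): suppose two features with threshold[0] = [0.5, 2.0] and threshold[1] = [10.0]. The candidate values are the interval midpoints plus one value below the smallest and one above the largest threshold, and the candidate grid is their Cartesian product:

import itertools

threshold = {0: [0.5, 2.0], 1: [10.0]}   # made-up thresholds, already sorted
BIG = 1e30                               # stand-in for the -inf / +inf endpoints

candidates = {}
for fid, ts in threshold.items():
    mids = [(ts[i] + ts[i + 1]) / 2.0 for i in range(len(ts) - 1)]
    candidates[fid] = [-BIG] + mids + [BIG]

# Feature 0 has 3 candidates, feature 1 has 2, so the grid has 3 * 2 = 6 points
for point in itertools.product(candidates[0], candidates[1]):
    print(point)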
A sample script (may be slow):
import xgboost
import json
import itertools
import numpy as np

def combine_thresh_table(old, new):
    # Merge the threshold sets of `new` into `old`, feature by feature
    for k, v in new.items():
        if k not in old:
            old[k] = set()
        old[k] |= v
    return old

def get_thresh(tree):
    # Recursively collect {feature: set of split thresholds} from a single tree
    table = dict()
    if 'children' in tree:
        fid = tree['split']
        val = tree['split_condition']
        if fid not in table:
            table[fid] = set()
        table[fid].add(val)
        table = combine_thresh_table(table, get_thresh(tree['children'][0]))
        table = combine_thresh_table(table, get_thresh(tree['children'][1]))
    return table

def get_thresh_ensemble(ensemble):
    # Combine the threshold tables of every tree in the ensemble
    table = dict()
    for tree in ensemble:
        table = combine_thresh_table(table, get_thresh(tree))
    return table

def get_ensemble_json_dump(bst):
    # Dump the booster as a list of per-tree JSON objects
    tree_dump = []
    for dump in bst.get_dump(with_stats=True, dump_format='json'):
        tree_dump.append(json.loads(dump))
    return tree_dump

model_file = 'xgb_labels_model.bin'
num_feature = 14

bst = xgboost.Booster()
bst.load_model(model_file)
ensemble = get_ensemble_json_dump(bst)
threshold = get_thresh_ensemble(ensemble)
threshold = {k: sorted(v) for k, v in threshold.items()}  # sort thresholds ascending (step 2)

# Candidate values per feature: a point below the smallest threshold, the midpoint
# of each interval, and a point above the largest threshold (step 3)
x = {k: [-np.finfo('f').max] +
        [(v[i] + v[i + 1]) / 2.0 for i in range(len(v) - 1)] +
        [np.finfo('f').max]
     for k, v in threshold.items()}
# Order the candidate lists by feature index; features that never appear in any
# split get a single np.nan candidate. (Assumes the dump identifies features by
# integer index, which is the default when no feature names are attached.)
x = [(x[fid] if fid in x else [np.nan]) for fid in range(num_feature)]
print(x)

max_pred = -np.inf
max_pred_input = None
for e in itertools.product(*x):
    dtest = xgboost.DMatrix(np.array([e]))
    pred = bst.predict(dtest, output_margin=True)[0]
    if pred > max_pred:
        max_pred_input = e[:]
        max_pred = pred
        print('max_pred so far = {}, with {}'.format(max_pred, max_pred_input))
UPDATE. Revised the script so that feature values appear in the correct order in the test matrix.
Hi @hcho3,
Thank you for providing this fantastic answer. I do have some more questions that I would love to clarify. Thanks to this, my understanding of XGBoost and of tree algorithms in general is also much better!
Some questions:
1) From my understanding, the reason we create x and then e using the split thresholds is that we want a spread of values that covers every possibility, i.e. passes through every possible path in the trees.
Is this understanding correct? For example, if there is only a single tree with a split at x = 500, we only need to create two values, one below 500 and one above 500, to reach both paths of that split. It doesn't matter whether the values are 400, 600 or 100, 700, since they lead to the same scores down the tree.
@germayneng I found a couple issues with my script and updated it. The issues were:
* Converted x into a list of lists so that every vector emitted by itertools.product is of the form [value for feature 0, value for feature 1, value for feature 2, ...].
* Some features never appear in any split condition. For these features, simply assign np.nan.

@germayneng Yes, your understanding is correct. We generate x and e so as to simulate all possible traversal paths. In your single-tree example with a split at x = 500, it doesn't matter whether we pick (400, 600) or (100, 700); either pair will exhaust the two possible outcomes.
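As a trivial sketch of that point (a hand-written stand-in for a single-split tree with made-up leaf values, not an actual XGBoost model):

def single_split_tree(x, split=500.0, left_leaf=-1.0, right_leaf=2.0):
    # Mimics one tree whose only split is at x = 500
    return left_leaf if x < split else right_leaf

assert single_split_tree(400) == single_split_tree(100)   # both take the left branch
assert single_split_tree(600) == single_split_tree(700)   # both take the right branch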
Let me know if you have any more questions.
@germayneng I just found this interesting paper: https://arxiv.org/pdf/1705.10883.pdf. In cases where brute-force iteration over the candidate space (the Cartesian product I mentioned above) is infeasible, you may be able to formulate a mixed integer program (MIP) and solve it approximately.
@hcho3 Thank you for the fast reply.
May I get more clarification on this:
Some features never appear in any split condition. For these features, simply assign np.nan.
I did realize that on my end there are some features that are not in threshold. I have 30 variables, but only 28 of them appear as keys in threshold.
So what I did was to manually add them:
# add missing variable
threshold['f8'] = [0.5]
threshold['f13'] = [0.5]
Concept-wise, what is the reason these features are not in threshold? They are also the same variables that are missing from feature_importance. These two variables are in fact part of the one-hot-encoded features (they are dummy variables).
Based on what you mentioned above, should I instead do this:
# add missing variable
threshold['f8'] = [np.nan]
threshold['f13'] = [np.nan]
Is this correct?
Edit: I was thinking about this, and I suspect the reason these features do not have any split conditions is that they are not important or meaningful enough. Hence, in terms of feature importance they are never used, and so they do not appear in threshold (no splits involve them at all).
@germayneng Please refer to the revised script, which handles missing features. Conceptually, each split is chosen in a greedy fashion to improve the global loss function. Intuitively, the features that best help us predict the target will be chosen for the splits. There is no guarantee that all features get picked. In fact, if you have high-dimensional data, say 1000 features, and you fit a tree with only 50 splits, some features are certain to be left out.
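One way to see this on your own model (assuming the same bst booster and num_feature as in the script above): get_score() only reports features that appear in at least one split, so the features it omits are exactly the ones missing from threshold and from the feature importances.

used = bst.get_score(importance_type='weight')   # feature -> number of splits using it
print('features used in at least one split:', sorted(used))
print('number of features never used:', num_feature - len(used))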
@hcho3 this makes perfect sense! Thank you.
On a side note: the research paper you linked is very interesting. I will spend some time looking through it. However, I do have a question. When you say forming a MIP to solve, do I use modules that solve MIP problems and apply them to the xgb model?
@germayneng The paper presents a MIP formulation for a tree ensemble. So yes, you will have to extract the necessary information from the xgb model, construct the MIP, and solve it using an optimization package.
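For concreteness, here is a rough sketch of one such construction in the spirit of the paper (a simplified variant I put together, not the paper's exact formulation; it assumes the PuLP package for MIP modelling, reuses the JSON dump structure from the script above, and ignores the handling of missing values). One binary variable per (feature, threshold) pair encodes "the input is below that threshold", one binary per leaf marks which leaf each tree ends in, consistency constraints tie the two together along every root-to-leaf path, and the objective sums the selected leaf values.

import json
from collections import defaultdict

import pulp       # generic MIP modelling package (pip install pulp)
import xgboost

def leaf_paths(node, path, out):
    # Collect (leaf_value, path) pairs; path = [(feature, threshold, goes_left), ...]
    if 'leaf' in node:
        out.append((node['leaf'], list(path)))
        return
    fid, thr = node['split'], node['split_condition']
    kids = {c['nodeid']: c for c in node['children']}
    leaf_paths(kids[node['yes']], path + [(fid, thr, True)], out)
    leaf_paths(kids[node['no']], path + [(fid, thr, False)], out)

bst = xgboost.Booster()
bst.load_model('xgb_labels_model.bin')           # same model file as before
trees = [json.loads(d) for d in bst.get_dump(dump_format='json')]

all_leaves = []                                  # per-tree list of (value, path)
thresholds = defaultdict(set)                    # feature -> set of thresholds
for tree in trees:
    out = []
    leaf_paths(tree, [], out)
    all_leaves.append(out)
    for _, path in out:
        for fid, thr, _ in path:
            thresholds[fid].add(thr)
thresholds = {fid: sorted(v) for fid, v in thresholds.items()}

prob = pulp.LpProblem('maximize_margin', pulp.LpMaximize)

# y[fid][j] == 1  <=>  the input's value for feature fid is below thresholds[fid][j]
y = {fid: [pulp.LpVariable('y_{}_{}'.format(fid, j), cat=pulp.LpBinary)
           for j in range(len(ts))]
     for fid, ts in thresholds.items()}
for fid, ts in thresholds.items():
    for j in range(len(ts) - 1):
        prob += y[fid][j] <= y[fid][j + 1]       # below t_j implies below t_{j+1}

objective = []
for t, leaves in enumerate(all_leaves):
    z = [pulp.LpVariable('z_{}_{}'.format(t, l), cat=pulp.LpBinary)
         for l in range(len(leaves))]
    prob += pulp.lpSum(z) == 1                   # exactly one active leaf per tree
    for l, (value, path) in enumerate(leaves):
        objective.append(value * z[l])
        for fid, thr, goes_left in path:
            j = thresholds[fid].index(thr)
            # a leaf may only be active if every split on its path agrees with y
            prob += z[l] <= (y[fid][j] if goes_left else 1 - y[fid][j])

prob += pulp.lpSum(objective)                    # objective: total leaf contribution
prob.solve()
print('status:', pulp.LpStatus[prob.status])
print('maximum margin (sum of leaf values):', pulp.value(prob.objective))

Recovering an actual input vector from the solution amounts to reading off the y variables: for each feature they pin down an interval between consecutive thresholds, and any value in that interval (e.g. the midpoint, as before) attains the optimum.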
New discussions are now welcome at https://discuss.xgboost.ai/