Xgboost: return maximum output value

Created on 22 May 2018 · 11 comments · Source: dmlc/xgboost

Package used (python/R/jvm/C++): python / R

Hi,

Thank you for this awesome work! I am working on a project and have used xgboost as my ML algorithm. Normally, the next step is to use the fitted model to optimize some variables; e.g. with linear regression we can model the problem and then use linear programming to obtain input values that optimize the target variable.

However, with xgboost we do not have an explicit equation. Is there any method I can use to obtain the variable values that give the best (highest) target value, i.e. some form of optimization?

Regards
Germayne

All 11 comments

Glad to hear that you found XGBoost to be useful. First of all, you should definitely try out gblinear, which gives you a linear regression model. There is active development for supporting linear boosting on the GPU.
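For example, here is a minimal sketch of fitting a gblinear booster; the data below is a random placeholder, and newer XGBoost releases rename the objective to 'reg:squarederror':

import numpy as np
import xgboost

# placeholder data: 200 rows, 14 features
X = np.random.rand(200, 14)
y = X.dot(np.random.rand(14))

dtrain = xgboost.DMatrix(X, label=y)
params = {'booster': 'gblinear', 'objective': 'reg:linear'}
bst = xgboost.train(params, dtrain, num_boost_round=50)

# For gblinear the dump is just the bias and one weight per feature,
# i.e. the coefficients of a linear model you can hand to an LP solver.
print(bst.get_dump()[0])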

As for the tree model, here's what you may want to try:

  1. Gather all split thresholds used in all the trees. Let threshold[fid] represent the list of distinct thresholds used for feature fid. You should be able to do this by dumping the model.

  2. Sort each threshold[fid] in ascending order.

  3. Pick values x[fid][0], x[fid][1], x[fid][2], ... so as to satisfy the inequalities

                       x[fid][0]   < threshold[fid][0]
threshold[fid][0]   <= x[fid][1]   < threshold[fid][1]
threshold[fid][1]   <= x[fid][2]   < threshold[fid][2]
...
threshold[fid][n-1] <= x[fid][n]   < threshold[fid][n]
threshold[fid][n]   <= x[fid][n+1]

An easy way to do this is to take the midpoint of each interval (a small worked example follows this list). Repeat this step for every threshold list threshold[fid].

  4. Now iterate over the Cartesian product
{ x[0][i] : i = 0, 1, ... } x { x[1][i] : i = 0, 1, ... } x ... x { x[d-1][i] : i = 0, 1, ...}
(d = number of features)

and find the input that maximizes the output. This will consider on the order of n^d possibilities, where n is an upper bound on the number of thresholds per feature.
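For concreteness, here is a tiny sketch of steps 2 and 3 for a single feature; the threshold values are made up for illustration:

thresh = sorted([5.0, 2.0, 9.0])                 # step 2: [2.0, 5.0, 9.0]
candidates = ([thresh[0] - 1.0] +                # one value below the smallest threshold
              [(a + b) / 2.0 for a, b in zip(thresh, thresh[1:])] +  # interval midpoints
              [thresh[-1] + 1.0])                # one value above the largest threshold
print(candidates)                                # [1.0, 3.5, 7.0, 10.0]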

A sample script (may be slow):

import xgboost
import json
import itertools
import numpy as np

def combine_thresh_table(old, new):
  # Merge the threshold sets in `new` into `old`, keyed by feature name
  for k, v in new.items():
    if k not in old:
      old[k] = set()
    old[k] |= v
  return old

def get_thresh(tree):
  # Recursively collect {feature name: set of split thresholds} from one tree
  # (a dict parsed from the JSON dump)
  table = dict()
  if 'children' in tree:
    fid = tree['split']
    val = tree['split_condition']
    if fid not in table:
      table[fid] = set()
    table[fid].add(val)
    table = combine_thresh_table(table, get_thresh(tree['children'][0]))
    table = combine_thresh_table(table, get_thresh(tree['children'][1]))
  return table

def get_thresh_ensemble(ensemble):
  # Combine the threshold tables of every tree in the ensemble (step 1)
  table = dict()
  for tree in ensemble:
    table = combine_thresh_table(table, get_thresh(tree))
  return table

def get_ensemble_json_dump(bst):
  # Dump each tree of the booster as a parsed JSON object
  tree_dump = []
  for dump in bst.get_dump(with_stats=True, dump_format='json'):
    tree_dump.append(json.loads(dump))
  return tree_dump

model_file = 'xgb_labels_model.bin'
num_feature = 14

bst = xgboost.Booster()
bst.load_model(model_file)
ensemble = get_ensemble_json_dump(bst)
threshold = get_thresh_ensemble(ensemble)
threshold = {k: sorted(v) for k, v in threshold.items()}  # step 2: sort thresholds in ascending order
# step 3: one candidate value per interval (interval midpoints, plus one value
# below the smallest threshold and one above the largest)
x = {k: [-np.finfo('f').max] +
        [(v[i] + v[i+1]) / 2.0 for i in range(len(v)-1)] +
        [np.finfo('f').max] for k, v in threshold.items()}
# The dump keys features as 'f0', 'f1', ...; features that never appear in any
# split get the single placeholder value np.nan
x = [x.get('f{}'.format(fid), [np.nan]) for fid in range(num_feature)]
print(x)

# step 4: brute-force search over the Cartesian product of candidate values
max_pred = -np.inf   # margins can be negative, so do not start from 0
max_pred_input = None
for e in itertools.product(*x):
  dtest = xgboost.DMatrix(np.array([e]))
  pred = bst.predict(dtest, output_margin=True)[0]
  if pred > max_pred:
      max_pred_input = e[:]
      max_pred = pred
      print('max_pred so far = {}, with {}'.format(max_pred, max_pred_input))

UPDATE. Revised the script so that feature values appear in the correct order in the test matrix.

Hi @hcho3,

Thank you for providing this fantastic answer. I do have a few more questions that I would love to clarify. Thanks to this, my understanding of xgb and tree algorithms in general is much better!

Some questions:

1) From my understanding, we create x (and then e) from the split thresholds because we want a spread of values that covers every possibility, i.e. passes through every possible path in the trees.

Is this understanding correct? For example, if there is only a single tree with a split at x = 500, we only need to create 2 values, one below 500 and one above 500, to reach the 2 possible paths down this split. It doesn't matter whether the values are 400, 600 vs 100, 700, since they lead to the same scores down the tree.

@germayneng I found a couple of issues with my script and updated it. The issues were:

  • The feature values were not in the correct order in the test matrix. I converted x into a list of lists so that every vector emitted by itertools is of the form [value for feature 0, value for feature 1, value for feature 2, ...].
  • Some features never appear in any split condition. For these features, simply assign np.nan.

@germayneng Yes, your understanding is correct. We generate x and e so as to simulate all possible traversal paths. In your single-tree example with split x=500, it doesn't matter whether we pick (400, 600) or (100, 700); either pair will exhaust the two possible outcomes.
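To make this concrete, here is a toy hand-written tree (loosely mimicking the dump structure; the split feature, threshold, and leaf scores are made up) showing that any two values on the same side of the split reach the same leaf:

# toy single tree: split on f0 at 500; children[0] is the "feature < threshold" branch
tree = {'split': 'f0', 'split_condition': 500.0,
        'children': [{'leaf': 1.3},     # taken when f0 < 500
                     {'leaf': -0.7}]}   # taken when f0 >= 500

def traverse(node, value):
    # follow the split decisions down to a leaf and return its score
    if 'leaf' in node:
        return node['leaf']
    branch = 0 if value < node['split_condition'] else 1
    return traverse(node['children'][branch], value)

print(traverse(tree, 400), traverse(tree, 100))   # both give 1.3
print(traverse(tree, 600), traverse(tree, 700))   # both give -0.7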

Let me know if you have any more questions.

@germayneng I just found this interesting paper: https://arxiv.org/pdf/1705.10883.pdf. In cases where brute-force iteration over the candidate space (the Cartesian product I mentioned above) is infeasible, you may be able to formulate a mixed integer program (MIP) and solve it approximately.

@hcho3 Thank you for the fast reply.

May I get more clarification on

Some features never appear in any split condition. For these features, simply assign np.nan.

I did realize that on my end there are some features that are not in threshold. I have 30 variables, but only 28 of them appear as keys in threshold.
So what I did was to manually add them:

# add missing variable 
threshold['f8'] = [0.5]
threshold['f13'] = [0.5]

Concept-wise, what is the reason these features are not in threshold? They are also the same variables that are missing from feature_importance. These 2 variables are in fact part of one-hot-encoded features (they are dummy variables).

As mentioned by you above, should I instead do this:

# add missing variable 
threshold['f8'] = [np.nan]
threshold['f13'] = [np.nan]

Is this correct?

Edit: I was thinking about this, and I believe the reason these features do not have any split conditions is that they are not important or meaningful enough. Hence, feature-importance-wise they are never used, and so they do not appear in threshold (no splits are made on them at all).

@germayneng Please refer to the revised script, which deals with the missing features. Conceptually, each split is chosen in a greedy fashion to improve the global loss function. Intuitively, the features that best help predict the target are the ones chosen for splits, so there is no guarantee that every feature gets picked. In fact, if you have high-dimensional data, say 1000 features, and you fit a tree with only 50 splits, some features are certain to be left out.
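As a quick check on your own model, a small snippet like this (reusing the threshold dict and num_feature from the script above, and assuming the default feature names f0, f1, ...) lists the features that never appear in any split:

unused = ['f{}'.format(i) for i in range(num_feature)
          if 'f{}'.format(i) not in threshold]
print('features never used in a split:', unused)   # e.g. ['f8', 'f13']
# The revised script gives each of these the single candidate value np.nan,
# since no split ever tests them and their value cannot change the prediction.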

@hcho3 this makes perfect sense! Thank you.

On a side note: the research paper you linked is very interesting, and I will spend some time looking through it. However, I do have a question: when you say forming a MIP to solve, do I use modules that solve MIP problems and apply them to the xgb model?

@germayneng The paper presents a MIP formulation for a tree ensemble. So yes, you will have to extract the necessary information from the xgb model, construct the MIP, and solve it with an optimization package.
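Here is a rough, hedged sketch of such a construction with the PuLP package (leaves_under and splits_in are helper functions written here, not xgboost API; it reuses the ensemble and threshold objects from the script above, ignores missing-value routing, and follows the general idea of the paper's formulation rather than its exact one):

import pulp

def leaves_under(node):
    # all (nodeid, leaf_value) pairs in the subtree rooted at node
    if 'leaf' in node:
        return [(node['nodeid'], node['leaf'])]
    return [p for child in node['children'] for p in leaves_under(child)]

def splits_in(node):
    # all internal (split) nodes in the subtree rooted at node
    if 'leaf' in node:
        return []
    return [node] + [s for child in node['children'] for s in splits_in(child)]

prob = pulp.LpProblem('maximize_ensemble_output', pulp.LpMaximize)

# y[fid][j] = 1 means "feature fid takes a value below its j-th smallest threshold"
sorted_thresh = {fid: sorted(vals) for fid, vals in threshold.items()}
y = {fid: [pulp.LpVariable('y_{}_{}'.format(fid, j), cat='Binary')
           for j in range(len(vals))]
     for fid, vals in sorted_thresh.items()}
for fid, ys in y.items():
    for j in range(len(ys) - 1):
        prob += ys[j] <= ys[j + 1]   # below a smaller threshold implies below a larger one

objective_terms = []
for t, tree in enumerate(ensemble):
    # z[leaf] = 1 means tree t's prediction lands in that leaf; exactly one leaf per tree
    leaf_list = leaves_under(tree)
    z = {lid: pulp.LpVariable('z_{}_{}'.format(t, lid), lowBound=0, upBound=1)
         for lid, _ in leaf_list}
    prob += pulp.lpSum(z.values()) == 1
    objective_terms += [val * z[lid] for lid, val in leaf_list]
    for split in splits_in(tree):
        fid, thr = split['split'], split['split_condition']
        j = sorted_thresh[fid].index(thr)
        left, right = split['children']
        if left['nodeid'] != split['yes']:     # 'yes' child = branch taken when fid < thr
            left, right = right, left
        prob += pulp.lpSum(z[lid] for lid, _ in leaves_under(left)) <= y[fid][j]
        prob += pulp.lpSum(z[lid] for lid, _ in leaves_under(right)) <= 1 - y[fid][j]

prob += pulp.lpSum(objective_terms)   # objective: total leaf contribution (excludes base_score)
prob.solve()
print('maximum margin (excluding base_score):', pulp.value(prob.objective))
# Any feature vector consistent with the optimal y values attains this margin.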

New discussions are now welcome at https://discuss.xgboost.ai/
