Xgboost: trees_to_dataframe causing index out of bounds exception

Created on 12 Mar 2020 · 2Comments · Source: dmlc/xgboost

I noticed that trees_to_dataframe retuns an exception (index out of bounds) if I train boolean variables.

The problem shows in core.py, between lines 1807 -- 1830 (in xgb 1.0.2). The way trees_to_dataframe works is first by dumping the trees into a string and then parsing that string using regexes! (urgh!)

Now, considering non-leaf nodes, the string looks like
[myFeature1<3.1415926] yes=1,no=2,missing=1,gain=4704.1665,cover=80187.7188
This string is then split here:

core.py, Line 1809
                    fid = arr[1].split(']')
                    parse = fid[0].split('<')
                    stats = re.split('=|,', fid[1])

Thus fid[0] is the cut and parse further contains left and right side of the < symbol in parse[0] and parse[1].

Now, for boolean variables I get a string like this:
[myBooleanFeature] yes=200,no=199,gain=82.0015259,cover=739.93811
Note: there is no < symbol! (it seems in previous versions it was <0.5 instead, if I remember correctly)
Since there is no <, the split only returns 1 element and parse[1] is out of bounds.
Also the statistics do not contain missing=1234,, i.e. missing is missing... And therefore the stats array has two fewer elements. Thus later in the code, covers.append(float(stats[9])) fails since the maximum index is only 7.

As for fixing: it is easy to put in a quick fix but I'd really like to see the magic numbers (indices of the stats array) removed. It would be much more readable to say something like

stats = dict([tuple(kvpair.split("=")) for kvpair in fid[1].split(",")])
# now use stats['cover'] instead of stats[9]
# and stats.get('missing', np.nan) instead of stats[5] (or some other default value)
# or fix it by always storing 'missing' values

Not that I like the general idea of parsing strings though :-)

bug

Source

behrenhoff

Most helpful comment

I think we should eliminate string parsing entirely. It would be better to use JSON representation of the tree and then manipulate the resulting JSON object.

hcho3 on 12 Mar 2020

👍2

All 2 comments

I think we should eliminate string parsing entirely. It would be better to use JSON representation of the tree and then manipulate the resulting JSON object.

hcho3 on 12 Mar 2020

👍2

Consolidated to #6091

hcho3 on 8 Sep 2020

Was this page helpful?

0 / 5 - 0 ratings