I noticed that trees_to_dataframe
retuns an exception (index out of bounds) if I train boolean variables.
The problem shows in core.py
, between lines 1807 -- 1830 (in xgb 1.0.2). The way trees_to_dataframe works is first by dumping the trees into a string and then parsing that string using regexes! (urgh!)
Now, considering non-leaf nodes, the string looks like
[myFeature1<3.1415926] yes=1,no=2,missing=1,gain=4704.1665,cover=80187.7188
This string is then split here:
core.py, Line 1809
fid = arr[1].split(']')
parse = fid[0].split('<')
stats = re.split('=|,', fid[1])
Thus fid[0]
is the cut and parse
further contains left and right side of the <
symbol in parse[0]
and parse[1]
.
Now, for boolean variables I get a string like this:
[myBooleanFeature] yes=200,no=199,gain=82.0015259,cover=739.93811
Note: there is no <
symbol! (it seems in previous versions it was <0.5
instead, if I remember correctly)
Since there is no <
, the split only returns 1 element and parse[1]
is out of bounds.
Also the statistics do not contain missing=1234,
, i.e. missing
is missing... And therefore the stats
array has two fewer elements. Thus later in the code, covers.append(float(stats[9]))
fails since the maximum index is only 7.
As for fixing: it is easy to put in a quick fix but I'd really like to see the magic numbers (indices of the stats array) removed. It would be much more readable to say something like
stats = dict([tuple(kvpair.split("=")) for kvpair in fid[1].split(",")])
# now use stats['cover'] instead of stats[9]
# and stats.get('missing', np.nan) instead of stats[5] (or some other default value)
# or fix it by always storing 'missing' values
Not that I like the general idea of parsing strings though :-)
I think we should eliminate string parsing entirely. It would be better to use JSON representation of the tree and then manipulate the resulting JSON object.
Consolidated to #6091
Most helpful comment
I think we should eliminate string parsing entirely. It would be better to use JSON representation of the tree and then manipulate the resulting JSON object.