TPOT: XGBoost feature mismatch

Created on 8 Aug 2018 · 20 comments · Source: EpistasisLab/tpot

When running the TPOT classifier, if XGBoost is selected as the final model I get the following error:

feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30'] ['Interval90-120_Ratio', 'Interval90-120_OtherRatio', ...]
expected f20, f10, f11, f9, f1, f0, f6, f29, f21, f13, f4, f12, f24, f14, f3, f25, f2, f28, f16, f15, f19, f8, f18, f5, f22, f26, f7, f17, f23, f30, f27 in input data
training data did not have the following fields: Interval90-120_Ratio, Interval90-120_OtherRatio, ....

There seems to be a problem with XGBoost when the column names contain certain characters.

Running TPOT==0.9.3, xgboost==0.72 on mac.

question

jaksmid

All 20 comments

Hmm, what is the exported pipeline?

weixuanfu on 8 Aug 2018

Let me export the pipeline once the error reappears.

But I think it is related to this: https://github.com/dmlc/xgboost/issues/2334
Although that issue was closed, it concluded with a tip on how to handle such cases.

jaksmid on 9 Aug 2018

@weixuanfu Should this be labeled as a question, though?
The pipeline can run for days and then crash in the last step.

jaksmid on 9 Aug 2018

@jaksmid I labeled this one as a question for now since there is no clear description of the issue so far. I will relabel it if I think the issue is caused by a bug in TPOT. Please let me know the pipeline for some clues about this issue (for example, why was 'Interval90-120_Ratio' generated?)

weixuanfu on 9 Aug 2018

@weixuanfu Sorry for not being clearer. I am passing a pandas DataFrame whose column names consist of characters from the set {a..zA..Z0..9-_} into fit and predict. Usually the process completes without any problem, but if XGBoost is in the final pipeline I get an exception that seems to be related to the characters I am using in the DataFrame's column names. I have observed the error several times; let me dump the stack trace once I observe it again.

jaksmid on 9 Aug 2018

I think this is the same issue that I have with ELI5 using XGBoost, DataFrame and sklearn: https://github.com/TeamHG-Memex/eli5/issues/256

The above bug links to the xgboost #2334 that you've noted above. With ELI5 I've had to use .values to get an ndarray out of the DataFrame; otherwise the PermutationImportance function complains with the same error you note.

As noted in #2334 I guess this is somewhere in the interface between sklearn and XGBoost, but I don't know where.

ianozsvald on 14 Aug 2018

👍2

I have just run into the same issue as well.

The exported pipeline is below:

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import make_pipeline
from tpot.builtins import StackingEstimator
from xgboost import XGBRegressor

exported_pipeline = make_pipeline(
    StackingEstimator(estimator=XGBRegressor(learning_rate=0.01, max_depth=8, min_child_weight=16, n_estimators=100, nthread=1, subsample=0.6000000000000001)),
    ExtraTreesRegressor(bootstrap=False, max_features=0.45, min_samples_leaf=1, min_samples_split=10, n_estimators=100)
)

GinoWoz1 on 8 Sep 2018

@GinoWoz1 my guess would be that if you pass in a numpy array (i.e. using df.values) at the start, you won't see this problem, but if you pass in the DataFrame then you get this issue. Could you confirm?
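
A minimal sketch of that workaround, assuming a hypothetical DataFrame with a `target` column. The column names and data are made up, and the scikit-learn regressor is only a stand-in for the TPOT object. The point is that the DataFrame is converted once, before both fit and predict, so no column names ever reach the model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Hypothetical data; the column names mimic the ones in the report.
rng = np.random.RandomState(0)
df = pd.DataFrame(
    rng.rand(40, 3),
    columns=["Interval90-120_Ratio", "Interval90-120_OtherRatio", "target"],
)

feature_cols = [c for c in df.columns if c != "target"]

# Convert once, up front: the model only ever sees plain ndarrays.
X = df[feature_cols].values
y = df["target"].values

X_train, X_test = X[:30], X[30:]
y_train = y[:30]

# Stand-in estimator (assumption); with TPOT these would be the
# fit and predict calls on the TPOT object instead.
model = ExtraTreesRegressor(n_estimators=10, random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_test)

# Column names are kept separately for reporting.
print(type(X_train).__name__, preds.shape, feature_cols)
```

Keeping `feature_cols` around means the names are still available later, for example to label a feature importance plot.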

ianozsvald on 8 Sep 2018

Thanks @ianozsvald , that was it! Now it is working fine.

GinoWoz1 on 9 Sep 2018

👍3 🎉2

I have the same problem, but .values doesn't work :/

DataCrane on 20 Nov 2018

@DataCrane could you provide a demo to reproduce this issue?

weixuanfu on 20 Nov 2018

I am facing the mismatch problem when making a prediction with a trained XGB model, but not when training the model with the pandas DataFrame. (My env: XGBoost 0.81, pandas 0.23.4, scikit-learn 0.20.2, Python 3.6, and Anaconda on a Mac.)

I have been struggling with this problem for a couple of days. Please help.

Here is my code:

import numpy as np
import xgboost as xgb
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# prep_dataset, selected_x and selected_y are not shown here
df = prep_dataset()

training_set, test_set = train_test_split(df, test_size=0.30, random_state=1)
training_x = training_set[selected_x]
training_y = training_set[selected_y]
test_x = test_set[selected_x]
test_y = test_set[selected_y]

# Both DMatrix objects are built from DataFrames
dtrain = xgb.DMatrix(training_x, label=training_y)
dtest = xgb.DMatrix(test_x, label=test_y)

param = {
    'max_depth': 3,
    'eta': 0.3,
    'silent': 1,
    'objective': 'multi:softprob',
    'num_class': 3,
}
num_round = 500

bst = xgb.train(param, dtrain, num_round)
preds = bst.predict(dtest)
# Pick the class with the highest predicted probability for each row
best_preds = np.asarray([np.argmax(line) for line in preds])
print("Overall Precision Score = ", precision_score(test_y, best_preds, average='macro'))

So far there has been no problem. Now here goes the mismatch problem part:

pred_single = xgb.DMatrix(np.asmatrix([120, 1, 1, 31, 12, 7]))
best_preds = np.asarray([np.argmax(bst.predict(pred_single))])
print("Single Point Prediction = ", best_preds )

Then, it generates the "ValueError: feature_names mismatch: ...." problem.

chocone on 2 Jan 2019

@chocone can you try passing df.values instead of df to train_test_split, so that you build the DMatrix from the underlying numpy object and not from a DataFrame object? I had to use the numpy objects (with no DataFrames anywhere) to get rid of this problem.
Your situation is different to mine, but removing the DataFrame seems like a sensible first test.
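
If giving up the DataFrame is not practical, another option may be to give the single prediction row the same column names as the training data. The sketch below assumes the mismatch comes from the training DMatrix carrying the DataFrame's column names while a DMatrix built from a bare matrix gets default names such as f0, f1, and so on. The helper `single_row_frame` and the feature names are hypothetical; only pandas is used so the snippet runs without xgboost.

```python
import pandas as pd


def single_row_frame(training_columns, values):
    """Build a one-row DataFrame with the same columns, in the same
    order, as the training data."""
    columns = list(training_columns)
    if len(values) != len(columns):
        raise ValueError(
            "expected %d values, got %d" % (len(columns), len(values))
        )
    return pd.DataFrame([list(values)], columns=columns)


# Hypothetical feature names standing in for `selected_x` from the post.
selected_x = ["feat_a", "feat_b", "feat_c", "feat_d", "feat_e", "feat_g"]
row = single_row_frame(selected_x, [120, 1, 1, 31, 12, 7])
print(row)

# With xgboost installed, this frame would replace the bare matrix:
#   pred_single = xgb.DMatrix(row)
#   bst.predict(pred_single)
```

The length check is there because a wrong number of values would otherwise surface later as a less obvious shape error.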

ianozsvald on 2 Jan 2019

@ianozsvald Big thanks! I see your point. Although giving up the DataFrame is not a trivial change, your suggestion should work for me. ^^

chocone on 3 Jan 2019

Ok I'm still having this issue and none of what I've read so far seems to be helping my code.

If I leave the X_train and X_test features as DataFrames and XGBRegressor is in my pipeline, I get the "feature names mismatch" error with a bunch of "f0, f1, f2" names. So I convert the features of both sets to numpy arrays (trying both .values and .as_matrix()), which works, but then if RandomForestRegressor or DecisionTreeRegressor is used in my pipeline, running .predict(X_test) gives:

ValueError: Number of features of the model must match the input. Model n_features is 117 and input n_features is 118

...even though printing the shape of the X_test array before .predict() is showing 117!!??

Check out my Colab workbook here where I've saved the error of the latter issue. You can make comments there. The code is VERY well commented.

I'm at a loss as to what to do. I'd like to be able to have both XGBRegressor and all the tree regressors in my project, but it seems that they both prefer having their data sent to them in different ways. It's frustrating running a pipeline for a while, only to have it crash at the predict() phase, EVEN THOUGH IT SHOWS THE X_TEST ARRAY SHAPE AS BEING 117, NOT 118!?!? Any suggestions? Thanks.

Also see my question here:

windowshopr on 15 Aug 2019

exctracted_best_model = best_model.fitted_pipeline_.steps[-1][1]

The line above causes the issue. I think best_model.fitted_pipeline_ contains a StackingEstimator step, which adds the predictions of another regressor as a synthetic feature, so the last step was trained on one more feature than the raw input. If you only need the regressor in the last step, you should refit it. Otherwise, just use the whole pipeline via exctracted_best_model = best_model.fitted_pipeline_
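
The off-by-one feature count can be reproduced with a small sketch. `AppendPrediction` is a hypothetical stand-in for a stacking step, not TPOT's own class, and the data is random. It shows that the final estimator only accepts the transformed input, while the whole pipeline accepts the raw input.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline


class AppendPrediction(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a stacking step: adds the wrapped
    estimator's prediction to X as one extra column."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        pred = self.estimator.predict(X).reshape(-1, 1)
        return np.hstack((pred, X))


rng = np.random.RandomState(0)
X = rng.rand(50, 4)
y = X.sum(axis=1)

pipeline = make_pipeline(
    AppendPrediction(LinearRegression()),
    ExtraTreesRegressor(n_estimators=10, random_state=0),
)
pipeline.fit(X, y)

# Input seen by the final step: run X through every earlier step.
Xt = X
for _, step in pipeline.steps[:-1]:
    Xt = step.transform(Xt)

print(X.shape[1], Xt.shape[1])  # raw vs. transformed feature count

final_step = pipeline.steps[-1][1]
whole_preds = pipeline.predict(X)     # works on the raw input
final_preds = final_step.predict(Xt)  # needs the transformed input
```

Calling the extracted final step on the raw array is what triggers the "number of features" error, because that step was fitted on the array with the extra column.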

weixuanfu on 15 Aug 2019

@weixuanfu Thanks for your input, but I'm not 100% sure what you're talking about with the StackingEstimator or what it's used for.

The section with the exctracted_best_model = best_model.fitted_pipeline_.steps[-1][1] is to be able to find the feature importances using the pipeline that was created with TPOT, as per this answer on StackOverflow. Is there a better way for me to get the feature importances after the best pipeline has been created?

I'm going to move that entire section:

# Extract the final step of the best pipeline and fit it to the training set
# to get an idea of which features were most important to the model.
# Is there a way to do this during the first .fit() call so that
# we don't have to run it a second time here? Although it's not a big deal.
print('Now fit the best pipeline to the training data to find feature importances...')
exctracted_best_model = best_model.fitted_pipeline_.steps[-1][1]

# Train only the `exctracted_best_model` using the training/validation set
exctracted_best_model.fit(X_train, Y_train)

# plot model's feature importance and save the plot for later
feature_importance = exctracted_best_model.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos        = np.arange(sorted_idx.shape[0]) + .5
plt.figure(figsize=(40,20)) # play with this to adjust the size of the plot to your specs
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, df.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.savefig("feature_importance.png")
plt.clf()
plt.close()
print('Done!')

...to after the section about predicting on the "prediction dataset" so that the exctracted_best_model = best_model.fitted_pipeline_.steps[-1][1] doesn't interfere with any of the steps before the first .predict() function and see if that does anything?

windowshopr on 16 Aug 2019

@weixuanfu Moving that section to the bottom of the code seemed to do the trick, as well as refitting the model to the "features, label" sets in the predictions dataframe part before calling .predict() there. If I spot another issue, I'll bring it up after a few more tests. Thanks a lot!! Saved me such a headache!

windowshopr on 16 Aug 2019

@windowshopr no problem! As for another way to get feature importances from pipelines, you could try estimating permutation importance.
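
A minimal sketch of that idea using scikit-learn's `permutation_importance`, which scores the whole pipeline on the raw input so no step has to be extracted. The pipeline and data here are stand-ins (assumptions), not the TPOT pipeline from this thread.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.inspection import permutation_importance
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
# Only the first two columns influence the target.
y = 3.0 * X[:, 0] + X[:, 1] + 0.01 * rng.rand(200)

# Stand-in for best_model.fitted_pipeline_ (assumption).
pipeline = make_pipeline(
    StandardScaler(),
    ExtraTreesRegressor(n_estimators=50, random_state=0),
)
pipeline.fit(X, y)

# Shuffle each raw input column in turn and measure the drop in score.
result = permutation_importance(pipeline, X, y, n_repeats=5, random_state=0)

# List features from most to least important.
for idx in np.argsort(result.importances_mean)[::-1]:
    print("feature %d: %.3f" % (idx, result.importances_mean[idx]))
```

Because the columns that get shuffled are the raw input columns, the result has one importance per original feature even when an intermediate step adds synthetic ones. Scoring on held-out data rather than the training set would usually give a more honest estimate.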

weixuanfu on 16 Aug 2019

I'll look into those @weixuanfu , thanks again!

windowshopr on 16 Aug 2019
