I am trying out tpot.fit on the data from the Kaggle playground competition "Ghouls, Goblins, and Ghosts... Boo!"
and it keeps failing with
ValueError: There was an error in the TPOT optimization process. This could be because the data was
not formatted properly, or because data for a regression problem was provided to the TPOTClassifier
object. Please make sure you passed the data to TPOT correctly.
The features look like this:
     combined_var  rotting_flesh  bone_length  has_soul  hair_length
219      0.531771       0.415485     0.489446  0.503475     0.825953
262      0.387865       0.497112     0.209997  0.327212     0.642451
6        0.434517       0.568952     0.399331  0.467901     0.618391
359      0.421345       0.172182     0.626017  0.644941     0.408422
220      0.223747       0.648866     0.168909  0.255440     0.303927
and the target looks like this:
     type
219   0.0
262   1.0
6     1.0
359   0.0
220   2.0
These are the two lines I am using:
tpot = TPOTClassifier(verbosity=3)
tpot.fit(p_train[train_cols], p_train[target_var])
p_train is a pandas DataFrame.
I checked the code: TPOT is not able to generate an optimized pipeline. Here is the traceback:
ValueError Traceback (most recent call last)
<ipython-input-24-670421e6f822> in <module>()
----> 1 tpot.fit(p_train[train_cols], p_train[target_var])
/home/jayant/anaconda3/lib/python3.5/site-packages/tpot/base.py in fit(self, features, classes)
354
355 if not self._optimized_pipeline:
--> 356 raise ValueError('There was an error in the TPOT optimization '
357 'process. This could be because the data was '
358 'not formatted properly, or because data for '
ValueError: There was an error in the TPOT optimization process. This could be because the data was not formatted properly, or because data for a regression problem was provided to the TPOT
Could you please share more details about the code you use to load the dataset?
Could you also try the code below?
import numpy as np
...
tpot.fit(np.array(p_train[train_cols]), np.array(p_train[target_var]))
@weixuanfu2016 thanks for replying
I tried what you suggested and still got the same error. The program runs up to 100%, but in every generation
I get
Generation 100 - Current Pareto front scores:
5000 inf GradientBoostingClassifier(input_matrix, 0.93000000000000005, 0.92000000000000004)
The data is from the Kaggle competition "Ghouls, Goblins, and Ghosts... Boo!"
and here is the notebook
The issue is caused by p_train[target_var], which should be a 1-D array, but a pandas DataFrame is a 2-D array-like data structure. Changing the tpot.fit() call as below will solve the input issue.
tpot.fit(np.array(p_train[train_cols]), np.array(p_train[target_var]).ravel())
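To see why the shape matters, here is a minimal sketch that reproduces the 2-D versus 1-D difference without running TPOT. It assumes target_var is a list of column names such as ["type"] (the question does not show how it is defined), and the small p_train frame is a hypothetical stand-in built from a few of the rows posted above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for p_train, using a few rows from the question.
p_train = pd.DataFrame(
    {
        "bone_length": [0.489446, 0.209997, 0.399331],
        "has_soul": [0.503475, 0.327212, 0.467901],
        "type": [0.0, 1.0, 1.0],
    },
    index=[219, 262, 6],
)
train_cols = ["bone_length", "has_soul"]
target_var = ["type"]  # assumption: a list, so indexing returns a DataFrame

X = np.array(p_train[train_cols])     # 2-D feature matrix, as expected
y_2d = np.array(p_train[target_var])  # 2-D column vector: the problematic input
y_1d = y_2d.ravel()                   # flattened to 1-D, as in the fix above
y_series = p_train[target_var[0]]     # indexing by a single name gives a 1-D Series

print(X.shape, y_2d.shape, y_1d.shape, y_series.shape)
```

Either y_1d or a single-column Series gives a 1-D target; which one to pass is a matter of preference, as long as the target is not a one-column DataFrame.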
@weixuanfu2016 thanks! it worked like a charm
(sorry for the late response)