TPOT: Keep hitting "data was not formatted properly" error

Created on 13 Nov 2016 · 4 Comments · Source: EpistasisLab/tpot

I am trying out tpot.fit with the Kaggle playground competition data "Ghouls, Goblins, and Ghosts... Boo!"
and it keeps raising

ValueError: There was an error in the TPOT optimization process. This could be because the data was
 not formatted properly, or because data for a regression problem was provided to the TPOTClassifier
 object. Please make sure you passed the data to TPOT correctly.

The features look like this:

     combined_var  rotting_flesh  bone_length  has_soul  hair_length
219      0.531771       0.415485     0.489446  0.503475     0.825953
262      0.387865       0.497112     0.209997  0.327212     0.642451
6        0.434517       0.568952     0.399331  0.467901     0.618391
359      0.421345       0.172182     0.626017  0.644941     0.408422
220      0.223747       0.648866     0.168909  0.255440     0.303927

and the target looks like this:

     type
219   0.0
262   1.0
6     1.0
359   0.0
220   2.0

These are the two lines I am using:

tpot = TPOTClassifier(verbosity=3)
tpot.fit(p_train[train_cols], p_train[target_var])

p_train is a pandas DataFrame.
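A quick way to check what is actually being handed to tpot.fit is to print the type and shape of the target selection. The sketch below is only an illustration: the small p_train frame is a stand-in built from a few of the rows shown above, and it assumes target_var is a list of column names (which would explain why the target prints with a "type" column header), something the question does not state.

```python
import pandas as pd

# Hypothetical stand-in for p_train, using three of the rows shown above.
p_train = pd.DataFrame(
    {
        "bone_length": [0.489446, 0.209997, 0.399331],
        "has_soul": [0.503475, 0.327212, 0.467901],
        "type": [0.0, 1.0, 1.0],
    },
    index=[219, 262, 6],
)

# Assumption: target_var is a list, so the selection returns a DataFrame.
target_as_list = ["type"]
target_as_str = "type"

frame_target = p_train[target_as_list]   # one-column DataFrame, 2-D
series_target = p_train[target_as_str]   # Series, 1-D

print(type(frame_target).__name__, frame_target.shape)
print(type(series_target).__name__, series_target.shape)
```

If the first line reports a DataFrame with a second dimension, the target is 2-D even though it holds a single column.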

I checked the code: it is not able to generate an optimized pipeline. Here is the traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-24-670421e6f822> in <module>()
----> 1 tpot.fit(p_train[train_cols], p_train[target_var])

/home/jayant/anaconda3/lib/python3.5/site-packages/tpot/base.py in fit(self, features, classes)
    354 
    355                 if not self._optimized_pipeline:
--> 356                     raise ValueError('There was an error in the TPOT optimization '
    357                                      'process. This could be because the data was '
    358                                      'not formatted properly, or because data for '

ValueError: There was an error in the TPOT optimization process. This could be because the data was not formatted properly, or because data for a regression problem was provided to the TPOT

All 4 comments

Could you please share more details about the code you use to load the dataset?

Could you please also try the code below?

import numpy as np
...
tpot.fit(np.array(p_train[train_cols]), np.array(p_train[target_var]))

@weixuanfu2016 thanks for replying.

I tried what you suggested and still got the same error. The program runs up to 100%, but in every generation I get

Generation 100 - Current Pareto front scores:
5000    inf GradientBoostingClassifier(input_matrix, 0.93000000000000005, 0.92000000000000004)

The data is from the Kaggle competition "Ghouls, Goblins, and Ghosts... Boo!"

and here is the notebook

The issue is caused by p_train[target_var], which should be a 1-D array, whereas a pandas DataFrame is a 2-D array-like data structure. Changing the tpot.fit() call as below will solve the input issue.

tpot.fit(pd.np.array(p_train[train_cols]), pd.np.array(p_train[target_var]).ravel())
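For readers on newer library versions: the pd.np alias was deprecated in pandas 1.0 and, as far as I know, is no longer available in recent releases, so importing numpy directly is the safer route. The sketch below shows the same flattening step under that assumption. The p_train frame and the contents of train_cols and target_var are hypothetical stand-ins (the thread never shows them), and the tpot.fit call is left as a comment so the snippet runs without TPOT installed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for p_train, using three of the rows from the question.
p_train = pd.DataFrame(
    {
        "bone_length": [0.489446, 0.209997, 0.399331],
        "has_soul": [0.503475, 0.327212, 0.467901],
        "type": [0.0, 1.0, 1.0],
    },
    index=[219, 262, 6],
)
train_cols = ["bone_length", "has_soul"]  # assumed column list
target_var = ["type"]                     # assumed to be a list, hence 2-D

features = np.asarray(p_train[train_cols])   # 2-D feature matrix
target_2d = np.asarray(p_train[target_var])  # still 2-D: one column
target_1d = target_2d.ravel()                # flattened to 1-D

print(target_2d.shape)  # (3, 1)
print(target_1d.shape)  # (3,)

# With a TPOTClassifier instance named tpot, as in the question:
# tpot.fit(features, target_1d)
```

This also shows why wrapping the target in np.array alone did not help earlier in the thread: the conversion keeps the single column as a second dimension, and only ravel() removes it.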

@weixuanfu2016 thanks! it worked like a charm
(sorry for the late response)
