TPOT: Titanic example - problem with second-to-last cell

Created on 5 Jun 2017 · 14 Comments · Source: EpistasisLab/tpot

Hi all!
I want to enter the AutoML competition.
I'm trying out the Titanic example to get some familiarity with the software.
I'm running into some trouble with the second-to-last cell.
I'm using Python 3.6 on a Linux machine.

screenshot_2017-06-05_10-59-23

question

All 14 comments


Can you please copy and paste the full stack trace from the ValueError? It looks like it's having issues reading the Titanic training data.


ValueError Traceback (most recent call last)
in ()
6
7 # NOTE: Make sure that the class is labeled 'class' in the data file
----> 8 tpot_data = np.recfromcsv('/home/andrewcz/tpot/tutorials/data/titanic_train.csv', delimiter=',', dtype=np.float64)
9 features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
10 training_features, testing_features, training_classes, testing_classes = train_test_split(features, tpot_data['class'], random_state=42)

/home/andrewcz/miniconda3/lib/python3.5/site-packages/numpy/lib/npyio.py in recfromcsv(fname, **kwargs)
2044 kwargs.setdefault("delimiter", ",")
2045 kwargs.setdefault("dtype", None)
-> 2046 output = genfromtxt(fname, **kwargs)
2047
2048 usemask = kwargs.get("usemask", False)

/home/andrewcz/miniconda3/lib/python3.5/site-packages/numpy/lib/npyio.py in genfromtxt(fname, dtype, comments, delimiter, skip_header, skip_footer, converters, missing_values, filling_values, usecols, names, excludelist, deletechars, replace_space, autostrip, case_sensitive, defaultfmt, unpack, usemask, loose, invalid_raise, max_rows)
1826 # Raise an exception ?
1827 if invalid_raise:
-> 1828 raise ValueError(errmsg)
1829 # Issue a warning ?
1830 else:

ValueError: Some errors were detected !
Line 2 (got 13 columns instead of 12)
...
Line 892 (got 13 columns instead of 12)

Cheers Randy, the above is the full error.
I want to try TPOT on the Numerai dataset.
Many thanks,
best,
Andrew

Great piece of software :)
Best,
Andrew

It does indeed look like it's an issue reading the dataset. Specifically, numpy's np.recfromcsv function is detecting that there are 12 columns in the Titanic dataset (correct) but thinks there are 13 columns in several of the rows. Are you working on a copy of the Titanic dataset directly from our tutorials directory?

Yes, I am using the data from the example.
It might be that my numpy is just not up to date?
I will update numpy and rerun the example.

# NOTE: Make sure that the class is labeled 'class' in the data file

tpot_data = np.recfromcsv('/home/andrewcz/tpot/tutorials/data/titanic_train.csv', delimiter=',', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
train_test_split(features, tpot_data['class'], random_state=42)

Is the TPOT data file correct?

I see the problem now. We're using numpy's recfromcsv to read the file in, and telling it that the delimiter is a comma (,). The problem arises when the data contains strings such as passenger names, because the names themselves contain commas. recfromcsv splits on those commas too, so it thinks those rows have 13 columns when there are in fact 12.
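To illustrate the point above, here is a minimal sketch using only the Python standard library. The sample row is an assumption written in the style of the Titanic data (a quoted name containing a comma); it is not read from the tutorial file. Splitting naively on every comma gives 13 fields, while a quote-aware CSV parser gives the expected 12.

```python
import csv
import io

# Illustrative sample (an assumption, in the style of the Titanic data):
# a 12-column header and one row whose quoted name contains a comma.
SAMPLE = (
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
)

header_line, row_line = SAMPLE.splitlines()

# Naive approach: split on every comma, ignoring the quotes.
naive_header = header_line.split(",")
naive_row = row_line.split(",")

# Quote-aware approach: csv.reader treats the quoted name as one field.
parsed_header, parsed_row = list(csv.reader(io.StringIO(SAMPLE)))

print(len(naive_header), len(naive_row))    # header vs. row, naive split
print(len(parsed_header), len(parsed_row))  # header vs. row, csv.reader
print(parsed_row[3])                        # the name, kept as one field
```

The mismatch between the naive header count and the naive row count is the same kind of discrepancy the ValueError reports.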

pandas.read_csv is smart enough to handle this situation, but apparently recfromcsv isn't. @weixuanfu2016 / @teaearlgraycold, maybe we should go back to using pandas to read the files in again? I don't think that pandas is that heavy of a dependency, and apparently the numpy data file reading functions are pretty inflexible.

In the meantime, @AIAdventures, you can change that code to use pandas:

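import pandas as pd
from sklearn.model_selection import train_test_split

# pandas.read_csv handles quoted fields that contain commas
tpot_data = pd.read_csv('/home/andrewcz/tpot/tutorials/data/titanic_train.csv')
features = tpot_data.drop('class', axis=1).values
# Parentheses allow the call to span several lines without a backslash
training_features, testing_features, training_classes, testing_classes = train_test_split(
    features, tpot_data['class'].values, random_state=42)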
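For anyone who wants to check this behavior without the tutorial file, here is a small sketch that runs pandas.read_csv on an in-memory sample. The sample data and its column names are assumptions made up for illustration; only the 'class' column name follows the convention used in this thread.

```python
import io

import pandas as pd

# Hypothetical sample data (not the tutorial file): the quoted names
# contain commas, and the target column is labeled 'class'.
SAMPLE_CSV = (
    "PassengerId,class,Name,Age\n"
    '1,0,"Braund, Mr. Owen Harris",22\n'
    '2,1,"Cumings, Mrs. John Bradley",38\n'
)

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Same split into features and classes as in the snippet above.
features = df.drop('class', axis=1).values
classes = df['class'].values

print(df.shape)        # rows x columns as parsed by pandas
print(df['Name'][0])   # the name stays in a single column
print(features.shape)  # one column fewer than df, since 'class' was dropped
```

Note that the features array still holds the name strings here; scikit-learn estimators expect numeric input, so text columns like this would need to be dropped or encoded before fitting.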

@rhiever I think we could go back to using pandas. If we use TFLearn in a future version of TPOT, tflearn.data_utils.load_csv could be a good alternative.

Yes, in my experience pandas data frames are more reliable than numpy arrays.
It's more of a focused product.
Thank you for your help, I am now going to try the tool with the Numerai dataset.
Many thanks,
Best,
Andrew

Great, please feel free to reopen the issue if you have any other questions!
