Machinelearningnotebooks: Dataset.Tabular.from_delimited_files creating an extra class

Created on 31 Aug 2020 · 5 Comments · Source: Azure/MachineLearningNotebooks

In my code, the DF locally is:

df.groupby('SUCCESS').count()['id']

yields:

SUCCESS
0     341
1    1041
Name: id, dtype: int64

Only two classes, no nulls.

Maybe ambitious of me to think that the uploaded dataframe might reflect that but:

filename = 'myfilename.csv'
container_client.upload_blob(filename,df.to_csv(index=False),overwrite=True)

dataset = Dataset.Tabular.from_delimited_files(
    path=[(datastore, filename)]
)

training_data, validation_data = dataset.random_split(percentage=0.8, seed=42)
print(training_data.to_pandas_dataframe().shape)
training_data.to_pandas_dataframe().groupby("SUCCESS").count()['id']

The output of that group by is:

SUCCESS
0.00     274
1.00     831
50.00      4

It looks like it took 4 cases and changed them to .5.

The testing set also has this problem:

validation_data.to_pandas_dataframe().groupby("SUCCESS").count()['id']

yields:

SUCCESS
0.00      67
1.00     210
50.00      1

Can anyone think of a valid reason why the dataset _should_ change the value of the target variable? Is there some kind of trick to get the dataframe to match the dataframe that was uploaded?

Training product-question

All 5 comments

It looks as if the dataprep library is interpreting what you'd like to be an integer column as a float column, then imputing missing values using MeanImputer or something (because the mean of 0 and 1 is 0.5?).

My suggestion is to use the set_column_types param to force the column to be an integer column.

There aren't any missing values. I suppose I should have mentioned it, but I'm using:

df = get_tables.run_query('training_query_cut').dropna(subset=['SUCCESS'])

Two spitball things to try.

First, turn on multi-line support when reading the file:

dataset = Dataset.Tabular.from_delimited_files(
    path=[(datastore, filename)],
    support_multi_line=True
)

Second, force the target column to an integer type:

from azureml.data.dataset_factory import DataType

dataset = Dataset.Tabular.from_delimited_files(
    path=[(datastore, filename)],
    set_column_types={'SUCCESS': DataType.to_long()}
)

It definitely looks like it has something to do with how Datasets reads files, or how it saves them. A \n inside a misplaced text string seems to be causing it. Removing more columns just solved the problem.
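For anyone wondering how a stray \n could produce extra rows and odd values in SUCCESS, here is a minimal sketch using only Python's standard csv module. It does not reproduce what Azure ML does internally; it only shows the general mechanism, assuming a reader that treats every physical line as a record. The column names and rows are made up for illustration.

```python
import csv
import io

# Hypothetical rows: the second 'notes' value contains an embedded newline.
rows = [
    {"id": 1, "notes": "fine", "SUCCESS": 1},
    {"id": 2, "notes": "line one\nline two", "SUCCESS": 0},
    {"id": 3, "notes": "also fine", "SUCCESS": 1},
]

# Write the rows as CSV; the writer quotes the field that contains "\n".
buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["id", "notes", "SUCCESS"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
text = buf.getvalue()

# Quote-aware parse: the embedded newline stays inside the 'notes' field.
parsed = list(csv.DictReader(io.StringIO(text)))

# Naive parse: every physical line after the header is treated as a record,
# so the multi-line field is split into two short, misaligned records.
naive = [line.split(",") for line in text.split("\n")[1:] if line]

print(len(parsed), "records with quote-aware parsing")
print(len(naive), "records with line-by-line parsing")
```

With the line-by-line reading, the record count goes up and the values after the break land in the wrong columns, which is consistent with seeing a few extra rows and an unexpected value in the target column.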

Closing this out, as I'm not 100% sure this isn't just a 'me' problem. You could argue that from_delimited_files should read the file back the same way it was written, but you could also argue that the current behavior is correct. Either way, for future viewers, the way to test this is to look at:

dataset.to_pandas_dataframe().groupby("SUCCESS").count()['id']

in order to confirm that the numbers are the same BEFORE splitting them for your experiment.
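A rough sketch of that check as a reusable helper, in plain Python so it works on any iterable of labels. The name class_count_diff is made up for this example and is not part of azureml or pandas. The stand-in lists use the local counts from the question and the sums of the two split outputs quoted there.

```python
from collections import Counter


def class_count_diff(expected_labels, actual_labels):
    """Return {label: (expected_count, actual_count)} for labels whose counts differ."""
    expected = Counter(expected_labels)
    actual = Counter(actual_labels)
    return {
        label: (expected.get(label, 0), actual.get(label, 0))
        for label in sorted(set(expected) | set(actual))
        if expected.get(label, 0) != actual.get(label, 0)
    }


# Stand-in data: local counts vs. the combined training + validation counts.
local = [0] * 341 + [1] * 1041
round_trip = [0.0] * 341 + [1.0] * 1041 + [50.0] * 5

diff = class_count_diff(local, round_trip)
print(diff)  # an empty dict would mean the class counts match
```

With pandas, the same helper should accept df["SUCCESS"] and dataset.to_pandas_dataframe()["SUCCESS"] directly, since both are iterable. An extra label or a changed total is the signal to inspect the file before calling random_split.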
