Machinelearningnotebooks: Dataset.Tabular.from_delimited_files creating an extra class

Created on 31 Aug 2020 · 5 Comments · Source: Azure/MachineLearningNotebooks

In my code, the DF locally is:

df.groupby('SUCCESS').count()['id']

yields:

SUCCESS
0     341
1    1041
Name: id, dtype: int64

Only two classes, no nulls.

Maybe ambitious of me to think that the uploaded dataframe might reflect that but:

filename = 'myfilename.csv'
container_client.upload_blob(filename,df.to_csv(index=False),overwrite=True)

dataset = Dataset.Tabular.from_delimited_files(
    path=[(datastore, filename)]
)

training_data, validation_data = dataset.random_split(percentage=0.8, seed=42)
print(training_data.to_pandas_dataframe().shape)
training_data.to_pandas_dataframe().groupby("SUCCESS").count()['id']

The output of that group by is:

SUCCESS
0.00     274
1.00     831
50.00      4

It looks like it took 4 cases and changed them to .5

The validation set also has this problem:

validation_data.to_pandas_dataframe().groupby("SUCCESS").count()['id']

yields:

SUCCESS
0.00      67
1.00     210
50.00      1

Can anyone think of a valid reason why the dataset _should_ change the value of the target variable? Is there some kind of trick to get the dataframe to match the dataframe that was uploaded?

Training product-question

BillmanH

All 5 comments

It looks as if the dataprep library is interpreting what you'd like to be an integer column as a float column, then imputing missing values using MeanImputer or something (because the mean of 0 and 1 is 0.5?).

My suggestion is to use the set_column_types param to force the column to be an integer column.

swanderz on 31 Aug 2020

There aren't any missing values. I suppose I should have added it but I'm using

df = get_tables.run_query('training_query_cut').dropna(subset=['SUCCESS'])

BillmanH on 31 Aug 2020

Two spitball things to try:

# Option 1: keep newlines inside quoted fields as part of the value
dataset = Dataset.Tabular.from_delimited_files(
    path=[(datastore, filename)],
    support_multi_line=True
)

# Option 2: force the target column to be read as an integer
from azureml.data.dataset_factory import DataType

dataset = Dataset.Tabular.from_delimited_files(
    path=[(datastore, filename)],
    set_column_types={'SUCCESS': DataType.to_long()}
)

swanderz on 31 Aug 2020

Def looks like it has something to do with how datasets reads files, or how it saves them. It looks like a \n inside a misplaced text string is causing it. Removing further columns just solved the problem.

BillmanH on 1 Sep 2020
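
For future readers, here is a minimal sketch of the mechanism described in the comment above. It uses only Python's standard csv module, not the azureml parser, so it illustrates the idea rather than reproducing the actual bug. The rows and the "notes" column are hypothetical. A field containing a newline is quoted when the file is written, and a reader that treats every newline as a record break (which, as far as I understand, is what from_delimited_files does unless support_multi_line=True) ends up with an extra row whose values land in the wrong columns.

```python
import csv
import io

# Hypothetical data: the "notes" value of record 2 contains a stray newline.
rows = [
    ["id", "notes", "SUCCESS"],
    [1, "fine", 1],
    [2, "line one\nline two", 0],
    [3, "also fine", 1],
]

# Write the CSV; fields that contain a newline are wrapped in quotes.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
text = buffer.getvalue()

# Quote-aware parsing keeps the embedded newline inside a single field.
quote_aware = list(csv.reader(io.StringIO(text)))

# Naive parsing treats every newline as a record boundary.
naive = [line.split(",") for line in text.splitlines()]

print(len(quote_aware))  # header plus the three original records
print(len(naive))        # one extra row created by the embedded newline
print(naive[3])          # the broken-off fragment, with values in the wrong columns
```

In the naive result, the target value of record 2 ends up in the second column of a fragment row, which is the kind of shift that could put unrelated values into SUCCESS.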

Closing this out, as I'm not 100% sure this isn't just a 'me' problem. You could argue that from_delimited_files should read the file the same way it was written, but you could also argue that the current behavior is correct. Either way, for future viewers, the way to test this is to run:

dataset.to_pandas_dataframe().groupby("SUCCESS").count()['id']

in order to confirm that the numbers are the same BEFORE splitting them for your experiment.

BillmanH on 1 Sep 2020
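
The check suggested above can be automated. The sketch below is one possible way to do it, using a hypothetical helper named class_count_diff that compares per-class counts between the local labels and the labels read back from the dataset. The sample lists are built from the counts reported in this thread (the loaded counts are the training and validation splits summed); in practice you would presumably pass df["SUCCESS"] and dataset.to_pandas_dataframe()["SUCCESS"] instead.

```python
from collections import Counter


def class_count_diff(expected, actual):
    """Return {label: (expected_count, actual_count)} for every label whose counts differ."""
    exp, act = Counter(expected), Counter(actual)
    return {
        label: (exp.get(label, 0), act.get(label, 0))
        for label in exp.keys() | act.keys()
        if exp.get(label, 0) != act.get(label, 0)
    }


# Local counts from the question: 341 zeros and 1041 ones.
local = [0] * 341 + [1] * 1041

# Loaded counts, summing the training and validation outputs from the question.
loaded = [0.0] * (274 + 67) + [1.0] * (831 + 210) + [50.0] * (4 + 1)

print(class_count_diff(local, local))   # empty dict: nothing differs
print(class_count_diff(local, loaded))  # only the unexpected 50.0 label differs
```

An empty result means the uploaded data round-tripped with the same class balance. Anything else is worth investigating before calling random_split.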
