Locally, in my code, this group-by on the DataFrame:
df.groupby('SUCCESS').count()['id']
yields:
SUCCESS
0 341
1 1041
Name: id, dtype: int64
Only two classes, no nulls.
Maybe it was ambitious of me to expect the uploaded dataframe to reflect that, but:
filename = 'myfilename.csv'
container_client.upload_blob(filename,df.to_csv(index=False),overwrite=True)
dataset = Dataset.Tabular.from_delimited_files(
path=[(datastore, filename)]
)
training_data, validation_data = dataset.random_split(percentage=0.8, seed=42)
print(training_data.to_pandas_dataframe().shape)
training_data.to_pandas_dataframe().groupby("SUCCESS").count()['id']
The output of that group by is:
SUCCESS
0.00 274
1.00 831
50.00 4
It looks like it took 4 cases and changed them to .5.
The validation set has the same problem:
validation_data.to_pandas_dataframe().groupby("SUCCESS").count()['id']
yields:
SUCCESS
0.00 67
1.00 210
50.00 1
Can anyone think of a valid reason why the dataset _should_ change the value of the target variable? Is there some kind of trick to get the dataframe to match the dataframe that was uploaded?
It looks as if the dataprep library is interpreting what you'd like to be an integer column as a float column, then imputing missing values using MeanImputer or something (because the mean of 0 and 1 is 0.5?).
My suggestion is to use the set_column_types param to force the column to be an integer column.
There aren't any missing values. I suppose I should have mentioned it, but I build the DataFrame with:
df = get_tables.run_query('training_query_cut').dropna(subset=['SUCCESS'])
Two spitball things to try. First, enable multi-line support; second, force the column type:
dataset = Dataset.Tabular.from_delimited_files(
path=[(datastore, filename)],
support_multi_line=True
)
from azureml.data.dataset_factory import DataType
dataset = Dataset.Tabular.from_delimited_files(
path=[(datastore, filename)],
set_column_types={'SUCCESS': DataType.to_long()}
)
It definitely looks like it has something to do with how Datasets reads files, or how it saves them. It looks like a \n in a misplaced text string is causing it. Removing more columns just solved the problem.
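For future readers, here is a minimal sketch of that failure mode using only Python's standard csv module, not azureml. The column names and values are made up for illustration. It assumes the reader treats every line break as a record boundary, which is how I understand from_delimited_files to behave when support_multi_line is left at its default of False.

```python
import csv
import io

# Hypothetical rows: one text field contains an embedded newline.
rows = [
    ["id", "notes", "SUCCESS"],
    [1, "plain text", 1],
    [2, "first line\nsecond line", 0],
]

# Write the rows as CSV; the writer quotes the field that contains "\n".
buf = io.StringIO()
csv.writer(buf).writerows(rows)
text = buf.getvalue()

# A quote-aware parser keeps the embedded newline inside a single field.
parsed = list(csv.reader(io.StringIO(text, newline="")))

# A reader that breaks records on every newline sees an extra, misaligned record.
naive = [line.split(",") for line in text.splitlines()]

print(len(parsed), "records when quoted newlines are honoured")
print(len(naive), "records when every newline ends a record")
print("misaligned fragment:", naive[-1])
```

In the naive reading the tail of the text field becomes its own record, so whatever follows it lands in the wrong columns. That would be consistent with an unexpected value showing up under SUCCESS.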
Closing this out, as I'm not 100% sure this isn't just a 'me' problem. You could argue that from_delimited_files should read the file the same way it was written, but you could also argue that it should stay the way it is. Either way, for future viewers, the way to test this is to run:
dataset.to_pandas_dataframe().groupby("SUCCESS").count()['id']
and confirm that the counts match your local dataframe BEFORE splitting the data for your experiment.
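That check can be wrapped in a small helper. This is only a sketch: class_counts_match is a hypothetical name, and both arguments are plain pandas DataFrames (for the uploaded side, the result of dataset.to_pandas_dataframe()). The toy frames below are invented to show a passing and a failing case.

```python
import pandas as pd


def class_counts_match(local_df, loaded_df, target="SUCCESS"):
    """Return True if both frames have the same per-class row counts."""
    local_counts = local_df[target].value_counts().to_dict()
    loaded_counts = loaded_df[target].value_counts().to_dict()
    # Dict comparison treats 0 and 0.0 as the same key, so an int column
    # that comes back as float still matches if the values are unchanged.
    return local_counts == loaded_counts


# Toy data standing in for the local and the re-read dataframes.
local = pd.DataFrame({"id": [1, 2, 3, 4], "SUCCESS": [0, 1, 1, 1]})
loaded_ok = local.astype({"SUCCESS": float})
loaded_bad = pd.DataFrame({"id": [1, 2, 3, 4], "SUCCESS": [0.0, 1.0, 1.0, 50.0]})

print(class_counts_match(local, loaded_ok))
print(class_counts_match(local, loaded_bad))
```

Running it right after creating the dataset, and before random_split, catches a corrupted target column before it reaches training.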