Machinelearningnotebooks: Run input_datasets parquet to dataframe results in empty dataframe

Created on 4 Nov 2020 · 5 Comments · Source: Azure/MachineLearningNotebooks

In the following merge.py script, used by the nyc-taxi-data-regression-model-building notebook that demonstrates Azure ML pipelines, the line cleansed_green_data = run.input_datasets["cleansed_green_data"] imports the dataset (printing cleansed_green_data shows the dataset attributes).

However, converting it to a pandas dataframe with green_df = cleansed_green_data.to_pandas_dataframe() results in an empty dataframe (the same happens with yellow_data). I have tried similar code in my own notebook, using parquet files persisted to a dataset, and hit the same issue with run.input_datasets and to_pandas_dataframe().

Using Azure Storage Explorer I can see that the parquet storage blob has content (size > 0 KB). I would be most grateful for any advice, particularly if you are able to get a dataframe with content when executing the example notebook and its supporting scripts. Thanks!

https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/nyc-taxi-data-regression-model-building/scripts/prepdata/merge.py

Data4ML Pipelines product-issue

All 5 comments

@corticalstack Thank you for bringing this to our attention. I am investigating the issue and will update here with my findings.

@corticalstack Currently there is a bug in azureml-dataprep 2.4.* where it cannot read the parquet file from the Azure workspaceblobstore. dataprep 2.4.* maps to azureml-sdk 1.17.* and higher. There are 2 ways to work around it.

  1. If possible, use an older version of azureml-sdk, 1.16.* or lower, which pins to a lower version of azureml-dataprep (2.3.* or lower).
  2. If running from a Python script, you can change the call that reads the parquet files so that it passes no arguments. By default it falls back to parquet in that case.

The bug is being fixed and should be rolled out with the new release of azureml-dataprep 2.7*.
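
To check whether an environment falls in the range described above, a small version test can help. The following is a minimal sketch, not part of the SDK. The helper names are hypothetical, and it assumes that every azureml-dataprep version from 2.4.0 up to (but not including) 2.7.0 may be affected; the thread only explicitly names 2.4.* as broken and 2.7* as fixed.

```python
import re
from importlib import metadata

# Assumption: versions in [2.4.0, 2.7.0) may be affected. The thread only
# names 2.4.* as broken and 2.7* as the release carrying the fix.
FIRST_AFFECTED = (2, 4, 0)
FIRST_FIXED = (2, 7, 0)


def version_tuple(text):
    """Convert a version string such as '2.4.1' into a tuple like (2, 4, 1)."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:  # pad short versions such as '2.7'
        numbers.append(0)
    return tuple(numbers)


def possibly_affected(version_text):
    """Return True if the version lies in the assumed affected range."""
    return FIRST_AFFECTED <= version_tuple(version_text) < FIRST_FIXED


def installed_dataprep_version():
    """Return the installed azureml-dataprep version, or None if absent."""
    try:
        return metadata.version("azureml-dataprep")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    installed = installed_dataprep_version()
    if installed is None:
        print("azureml-dataprep is not installed in this environment")
    elif possibly_affected(installed):
        print(f"azureml-dataprep {installed} may hit the empty-dataframe bug")
    else:
        print(f"azureml-dataprep {installed} is outside the assumed affected range")
```

Running this inside the pipeline's compute environment, rather than only on the local machine, is what matters, since the remote environment may resolve a different azureml-dataprep version.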

@shbijlan Many thanks for the update.

Can you provide an example of workaround proposal 2 to clarify, please?

Any estimate on the release of azureml-dataprep 2.7*?

reopening until fixed in latest azureml-sdk

@corticalstack it looks like azureml-sdk==1.20.0 will release on pypi on December 21st, which will contain azureml-dataprep>=2.7.0 - will comment back once confirmed

editing instead: additionally, azureml-dataprep==2.7.0 will be released on pypi today and you should be able to use it. it will automatically be used in azureml-sdk==1.20.0

@corticalstack an example of the second workaround would be removing 'file_extension=None' from the nyc-taxi-data notebook.

Use:

    inputs=[cleansed_green_data.parse_parquet_files(),
            cleansed_yellow_data.parse_parquet_files()]

instead of:

    inputs=[cleansed_green_data.parse_parquet_files(file_extension=None),
            cleansed_yellow_data.parse_parquet_files(file_extension=None)]
