In the following merge.py script, used by the nyc-taxi-data-regression-model-building notebook that demonstrates Azure ML pipelines,
the line cleansed_green_data = run.input_datasets["cleansed_green_data"] loads the dataset (printing cleansed_green_data shows the dataset attributes).
However, converting it to a pandas DataFrame with green_df = cleansed_green_data.to_pandas_dataframe() produces an empty DataFrame (the same happens with yellow_data). I have tried similar code in my own notebook using parquet files persisted to a dataset, and I hit the same issue with run.input_datasets and to_pandas_dataframe().
Using Azure Storage Explorer I can see that the parquet storage blob has content (size > 0 KB). I would be most grateful for any advice, particularly if you are able to get a DataFrame with content when executing the example notebook and its supporting scripts. Thanks!
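For anyone reproducing this, here is a minimal sketch of a guard that could sit in a step script such as merge.py, so the step fails loudly instead of silently continuing with empty data. The helper name load_input_as_dataframe is hypothetical and is not part of the notebook or the SDK; it relies only on run.input_datasets and to_pandas_dataframe() as described above.

```python
def load_input_as_dataframe(run, name):
    """Hypothetical helper: fetch a named input dataset from the run and
    convert it to a pandas DataFrame, raising if the result has no rows."""
    dataset = run.input_datasets[name]
    df = dataset.to_pandas_dataframe()
    # len() of a DataFrame is its row count, so 0 means nothing was read.
    if len(df) == 0:
        raise ValueError(
            f"Input dataset '{name}' converted to an empty DataFrame; "
            "check the installed azureml-dataprep version."
        )
    return df


# Usage inside a pipeline step script (requires azureml-core):
# from azureml.core import Run
# run = Run.get_context()
# green_df = load_input_as_dataframe(run, "cleansed_green_data")
```

The usage lines are left commented out so the snippet can be read and tried without the Azure ML SDK installed.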
@corticalstack Thank you for bringing this to our attention. I am investigating the issue and will update here with my findings.
@corticalstack There is currently a bug in azureml-dataprep 2.4.* whereby it cannot read parquet files from the Azure workspaceblobstore. dataprep 2.4.* maps to azureml-sdk 1.17.* and higher. There are 2 ways to work around it.
The bug is being fixed and should be rolled out with the new release of azureml-dataprep 2.7*.
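To see whether an environment is likely to be affected, a quick check of the installed azureml-dataprep version can help. This is only a sketch: the function names are hypothetical, and the affected range used here (2.4 up to, but not including, 2.7) is an assumption inferred from the versions mentioned in this thread, not a confirmed list.

```python
from importlib import metadata


def parse_major_minor(version):
    """Return (major, minor) as integers from a version string such as '2.4.2'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def is_possibly_affected(version):
    """Assumption: releases from 2.4 up to, but excluding, 2.7 carry the bug."""
    return (2, 4) <= parse_major_minor(version) < (2, 7)


def installed_dataprep_version():
    """Return the installed azureml-dataprep version, or None if it is absent."""
    try:
        return metadata.version("azureml-dataprep")
    except metadata.PackageNotFoundError:
        return None


version = installed_dataprep_version()
if version is None:
    print("azureml-dataprep is not installed in this environment")
elif is_possibly_affected(version):
    print(f"azureml-dataprep {version} may be affected by the parquet read bug")
else:
    print(f"azureml-dataprep {version} is outside the assumed affected range")
```

Running this inside the pipeline step's environment, rather than only in the notebook kernel, matters because the two can have different package versions.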
@shbijlan Many thanks for the update.
Could you provide an example of workaround 2 to clarify, please?
Any estimate for the release of azureml-dataprep 2.7*?
Reopening until this is fixed in the latest azureml-sdk.
@corticalstack It looks like azureml-sdk==1.20.0 will be released on PyPI on December 21st and will contain azureml-dataprep>=2.7.0. I will comment back once confirmed.
Edit: additionally, azureml-dataprep==2.7.0 will be released on PyPI today and you should be able to use it. It will automatically be used in azureml-sdk==1.20.0.
@corticalstack An example of the second workaround would be removing 'file_extension=None' from the parse_parquet_files() calls in the nyc-taxi-data notebook.
Use:
inputs=[cleansed_green_data.parse_parquet_files(),
        cleansed_yellow_data.parse_parquet_files()]
instead of:
inputs=[cleansed_green_data.parse_parquet_files(file_extension=None),
        cleansed_yellow_data.parse_parquet_files(file_extension=None)]