I'm trying to fetch the column header of a dataset registered in my workspace. Here are the steps I'm following:
from azureml.core import Dataset
# ws is the Workspace the dataset is registered in
test_data = Dataset.get_by_name(ws, "testing_data_prep", 1)
dataset_header = test_data.take(1).to_pandas_dataframe()
dataset_cols = dataset_header.columns
I have a wide dataset (roughly 5000 columns), so I'm calling .take(1) before converting it to a pandas dataframe (I assumed that loading a single record would be far more efficient than loading the entire dataset).
In the case above, .take(1) raises an error (full traceback at the bottom):
azureml.data.dataset_error_handling.DatasetExecutionError: (Column 1: In chunk 0: Invalid: Buffer #1 too small in array of type bool and length 1: expected at least 1 byte(s), got 0)|session_id=
What's weird is that if I remove the .take(1) and call to_pandas_dataframe() on a sample dataset (same number of columns, 100 records), it takes a second or two, but it works and returns the column headers.
Any idea what's going on there?
Traceback (most recent call last):
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\data\dataset_error_handling.py", line 83, in _try_execute
return action()
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\data\tabular_dataset.py", line 146, in
out_of_range_datetime=out_of_range_datetime))
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\dataprep\api_loggerfactory.py", line 149, in wrapper
return func(*args, **kwargs)
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\dataprep\api\dataflow.py", line 708, in to_pandas_dataframe
ExecuteAnonymousActivityMessageArguments(anonymous_activity=Dataflow._dataflow_to_anonymous_activity_data(dataflow_to_execute)))
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\dataprep\api_aml_helper.py", line 38, in wrapper
return send_message_func(op_code, message, cancellation_token)
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\dataprep\api\engineapi\api.py", line 94, in execute_anonymous_activity
response = self._message_channel.send_message('Engine.ExecuteActivity', message_args, cancellation_token)
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\dataprep\api\engineapi\engine.py", line 120, in send_message
raise_engine_error(response['error'])
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\dataprep\api\errorhandlers.py", line 24, in raise_engine_error
raise ExecutionError(error_response)
azureml.dataprep.api.errorhandlers.ExecutionError: (Column 1: In chunk 0: Invalid: Buffer #1 too small in array of type bool and length 1: expected at least 1 byte(s), got 0)|session_id=3901bb11-05dc-4fd2-8d79-656071bfb8bf
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\miniconda3\envs\creditscore\lib\site-packages\IPython\core\interactiveshell.py", line 3331, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
a.to_pandas_dataframe()
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\data_loggerfactory.py", line 78, in wrapper
return func(*args, **kwargs)
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\data\tabular_dataset.py", line 145, in to_pandas_dataframe
df = _try_execute(lambda: dataflow.to_pandas_dataframe(on_error=on_error,
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\data\dataset_error_handling.py", line 85, in _try_execute
raise DatasetExecutionError(str(e))
azureml.data.dataset_error_handling.DatasetExecutionError: (Column 1: In chunk 0: Invalid: Buffer #1 too small in array of type bool and length 1: expected at least 1 byte(s), got 0)|session_id=
@anliakho2 @shuyums2
@jadhosn Thanks for the question. We are investigating the issue and will update you shortly.
hi @jadhosn ,
When you call ds.to_pandas_dataframe(), does it return only the header, or the entire dataframe correctly?
You can run pip list to check your package versions. It would help the investigation if you could share your azureml-dataprep, azureml-sdk, and pyarrow versions.
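As an alternative to scanning the full pip list output, the relevant versions can be collected from Python. This is only a sketch: the helper name `installed_versions` and the package tuple are illustrative choices, and it assumes Python 3.8+ for `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages relevant to this issue; the exact list is an assumption.
PACKAGES = ("azureml-dataprep", "azureml-sdk", "azureml-core", "pyarrow", "pandas")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Run it inside the same environment that raises the error, so the reported versions match what the failing code actually imports.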
Thanks!
@jadhosn This is a bug that was recently fixed (assuming you are using pyarrow version 0.16).
Please install azureml-dataprep 1.3.5 to pick up the fix.
@anliakho2 Yup, I'm using pyarrow=0.16.0. I'll move towards dataprep 1.3.5 and let you know how that works.
@MayMSFT, it doesn't return anything even with take(1).
Name                                  Version   Build   Channel
azureml-automl-core                   1.0.85.5  pypi_0  pypi
azureml-automl-runtime                1.0.85.5  pypi_0  pypi
azureml-core                          1.0.85.5  pypi_0  pypi
azureml-dataprep                      1.1.38    pypi_0  pypi
azureml-dataprep-native               13.2.0    pypi_0  pypi
azureml-defaults                      1.0.85.1  pypi_0  pypi
azureml-explain-model                 1.0.85    pypi_0  pypi
azureml-interpret                     1.0.85    pypi_0  pypi
azureml-pipeline                      1.0.85    pypi_0  pypi
azureml-pipeline-core                 1.0.85.1  pypi_0  pypi
azureml-pipeline-steps                1.0.85    pypi_0  pypi
azureml-sdk                           1.0.85    pypi_0  pypi
azureml-telemetry                     1.0.85.2  pypi_0  pypi
azureml-train                         1.0.85    pypi_0  pypi
azureml-train-automl                  1.0.85    pypi_0  pypi
azureml-train-automl-client           1.0.85.4  pypi_0  pypi
azureml-train-automl-runtime          1.0.85.5  pypi_0  pypi
azureml-train-core                    1.0.85    pypi_0  pypi
azureml-train-restclients-hyperdrive  1.0.85    pypi_0  pypi
@anliakho2 Are we sure that dataprep 1.3.5 plays nice with 1.0.85? Right off the bat, moving to dataprep==1.3.5 produces several version mismatches (see below).
ERROR: azureml-train-automl 1.0.85 has requirement azureml-dataprep[fuse,pandas]<1.2.0a,>=1.1.37a, but you'll have azureml-dataprep 1.3.5 which is incompatible.
ERROR: azureml-train-automl-runtime 1.0.85.5 has requirement azureml-dataprep[fuse,pandas]<1.2.0a,>=1.1.37a, but you'll have azureml-dataprep 1.3.5 which is incompatible.
ERROR: azureml-train-automl-runtime 1.0.85.5 has requirement pandas<=0.23.4,>=0.21.0, but you'll have pandas 1.0.1 which is incompatible.
ERROR: azureml-train-automl-client 1.0.85.4 has requirement azureml-dataprep<1.2.0a,>=1.1.37a, but you'll have azureml-dataprep 1.3.5 which is incompatible.
ERROR: azureml-sdk 1.0.85 has requirement azureml-dataprep[fuse]<1.2.0a,>=1.1.37a, but you'll have azureml-dataprep 1.3.5 which is incompatible.
ERROR: azureml-automl-runtime 1.0.85.5 has requirement azureml-dataprep[fuse,pandas]<1.2.0a,>=1.1.37a, but you'll have azureml-dataprep 1.3.5 which is incompatible.
ERROR: azureml-automl-runtime 1.0.85.5 has requirement pandas<=0.23.4,>=0.21.0, but you'll have pandas 1.0.1 which is incompatible.
ERROR: azureml-automl-core 1.0.85.5 has requirement azureml-dataprep<1.2.0a,>=1.1.37a, but you'll have azureml-dataprep 1.3.5 which is incompatible.
Yeah, sorry, I thought you were already on the latest azureml-sdk (1.1.5). You'd be better off upgrading to azureml-sdk 1.1.5.1; that will work for sure.
Or is there a reason why you want to stay on azureml-sdk 1.0.85?
@anliakho2 Oh, it's been a while since I checked the release notes page. We wanted to stick with 1.0.85 as it was the most recent stable version. I guess with 1.1.5.1 not being an rc, we might consider moving up. Just in case we stick with 1.0.85 (as we've developed quite a bit on that SDK version): is there no fix for the "Buffer #1 too small" error unless we move up?
@jadhosn 1.1.5 is stable and should also work (dataprep dependency will be 1.3.4 - which is also fine)
The other option, if you want to stay on 1.0.85, is to downgrade pyarrow to 0.15, since the error comes from a new check they introduced...
A third option is to read more than 8 rows, which should also work just fine (it satisfies the pyarrow check :) ). Hope this helps.
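The third option could look roughly like the sketch below. The helper name `get_column_names` and the default of 10 rows are illustrative assumptions, not part of the SDK; the only calls it relies on are `take(n)` and `to_pandas_dataframe()`, which are the ones already used in the question.

```python
def get_column_names(dataset, sample_rows=10):
    """Return column names by loading a small sample instead of one row.

    `dataset` is assumed to expose take(n).to_pandas_dataframe(), as the
    dataset in the question does. The "more than 8 rows" threshold comes
    from the workaround suggested in this thread.
    """
    if sample_rows <= 8:
        raise ValueError("sample_rows must be greater than 8 for this workaround")
    sample = dataset.take(sample_rows).to_pandas_dataframe()
    return list(sample.columns)


# Hypothetical usage, reusing the names from the question:
# test_data = Dataset.get_by_name(ws, "testing_data_prep", 1)
# dataset_cols = get_column_names(test_data)
```

Loading 10 rows of a 5000-column dataset should still be far cheaper than materializing the whole dataset, so the original motivation for using take() is preserved.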
Thank you @anliakho2 for providing the different options. That was very helpful.