MachineLearningNotebooks: Dataset.take(1) returning "Buffer too small"

Created on 13 Mar 2020 · 10 Comments · Source: Azure/MachineLearningNotebooks

I'm trying to fetch the column header of a dataset registered in my workspace. Here are the steps I'm following:

Define a datastore
Access the dataset by name and specify the version
Take a single row as a sample
Convert sample into pandas data frame
Call df.columns

from azureml.core import Dataset

# ws is an existing Workspace object
test_data = Dataset.get_by_name(ws, "testing_data_prep", 1)
dataset_header = test_data.take(1).to_pandas_dataframe()
dataset_cols = dataset_header.columns

I have a wide dataset (roughly 5000 columns), so I'm calling .take(1) before converting it into a pandas dataframe (I assumed that loading a single record would be far more efficient than loading the entire dataset).

In the case above, .take(1) returns an error (full traceback at the bottom):

azureml.data.dataset_error_handling.DatasetExecutionError: (Column 1: In chunk 0: Invalid: Buffer #1 too small in array of type bool and length 1: expected at least 1 byte(s), got 0)|session_id=

What's weird is that if I remove the .take(1) and call to_pandas_dataframe() on a sample dataset (same number of columns, 100 records), it takes a second or two, but it works and returns the column headers.

Any idea what's going on there?

Traceback

Traceback (most recent call last):
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\data\dataset_error_handling.py", line 83, in _try_execute
return action()
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\data\tabular_dataset.py", line 146, in
out_of_range_datetime=out_of_range_datetime))
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\dataprep\api_loggerfactory.py", line 149, in wrapper
return func(*args, **kwargs)
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\dataprep\api\dataflow.py", line 708, in to_pandas_dataframe
ExecuteAnonymousActivityMessageArguments(anonymous_activity=Dataflow._dataflow_to_anonymous_activity_data(dataflow_to_execute)))
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\dataprep\api_aml_helper.py", line 38, in wrapper
return send_message_func(op_code, message, cancellation_token)
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\dataprep\api\engineapi\api.py", line 94, in execute_anonymous_activity
response = self._message_channel.send_message('Engine.ExecuteActivity', message_args, cancellation_token)
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\dataprep\api\engineapi\engine.py", line 120, in send_message
raise_engine_error(response['error'])
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\dataprep\api\errorhandlers.py", line 24, in raise_engine_error
raise ExecutionError(error_response)
azureml.dataprep.api.errorhandlers.ExecutionError: (Column 1: In chunk 0: Invalid: Buffer #1 too small in array of type bool and length 1: expected at least 1 byte(s), got 0)|session_id=3901bb11-05dc-4fd2-8d79-656071bfb8bf
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\miniconda3\envs\creditscore\lib\site-packages\IPython\core\interactiveshell.py", line 3331, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
a.to_pandas_dataframe()
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\data_loggerfactory.py", line 78, in wrapper
return func(*args, **kwargs)
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\data\tabular_dataset.py", line 145, in to_pandas_dataframe
df = _try_execute(lambda: dataflow.to_pandas_dataframe(on_error=on_error,
File "C:\miniconda3\envs\creditscore\lib\site-packages\azureml\data\dataset_error_handling.py", line 85, in _try_execute
raise DatasetExecutionError(str(e))
azureml.data.dataset_error_handling.DatasetExecutionError: (Column 1: In chunk 0: Invalid: Buffer #1 too small in array of type bool and length 1: expected at least 1 byte(s), got 0)|session_id=

Data4ML Machine Learning cxp product-question triaged

jadhosn

Most helpful comment

@jadhosn 1.1.5 is stable and should also work (its dataprep dependency will be 1.3.4, which is also fine).
If you want to stay on 1.0.85, another option is to downgrade pyarrow to 0.15, since the failing check is a new one that pyarrow introduced.
A third option is to read more than 8 rows, which should also work just fine (it satisfies the pyarrow check :) ). Hope this helps.

anliakho2 on 16 Mar 2020

🚀2
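
The third option above (reading more than 8 rows) could look roughly like the sketch below. It is only an illustration, not an official SDK recipe: `column_names` and `min_rows` are hypothetical names, the default of 9 rows is chosen purely because the comment suggests "more than 8", and the function assumes nothing beyond the dataset exposing `take(n)` and `to_pandas_dataframe()`, as the dataset in the question does.

```python
def column_names(dataset, min_rows=9):
    """Return the column names of a tabular dataset by sampling a few rows.

    Assumption: `dataset` exposes take(n) and to_pandas_dataframe(), like the
    dataset in the question. min_rows=9 is a hypothetical default, picked only
    because the comment above suggests reading more than 8 rows.
    """
    if min_rows <= 8:
        raise ValueError("min_rows must be greater than 8 for this workaround")
    # Sample a handful of rows instead of one, then read the header from them.
    sample = dataset.take(min_rows).to_pandas_dataframe()
    return list(sample.columns)


# Usage, assuming `ws` is an existing Workspace as in the question:
# from azureml.core import Dataset
# test_data = Dataset.get_by_name(ws, "testing_data_prep", 1)
# dataset_cols = column_names(test_data)
```

Sampling nine rows of a 5000-column dataset should still be much cheaper than materializing the whole dataset, which was the original motivation for calling .take(1).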

All 10 comments

@anliakho2 @shuyums2

skasturi on 14 Mar 2020

@jadhosn Thanks for the question. We are investigating the issue and will update you shortly.

MohitGargMSFT on 14 Mar 2020

Hi @jadhosn,
When you call ds.to_pandas_dataframe(), does it return only the header, or the entire dataframe correctly?

You can run pip list to check your package versions. It would help our investigation if you could share your azureml-dataprep, azureml-sdk, and pyarrow versions.

Thanks!

MayMSFT on 16 Mar 2020
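
As an alternative to scanning the full pip list output, the three versions requested above can be printed from Python. This is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); `installed_versions` is a hypothetical helper name, and the package names are the ones mentioned in the comment.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_versions(packages=("azureml-dataprep", "azureml-sdk", "pyarrow")):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```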

@jadhosn This is a bug that was recently fixed (assuming you are using pyarrow version 0.16)
Please install azureml-dataprep 1.3.5 to pick up the fix

anliakho2 on 16 Mar 2020

@anliakho2 Yup, I'm using pyarrow=0.16.0. I'll move towards dataprep 1.3.5 and let you know how that works.

@MayMSFT, it doesn't return anything even with take(1).

Name                                  Version   Build   Channel
azureml-automl-core                   1.0.85.5  pypi_0  pypi
azureml-automl-runtime                1.0.85.5  pypi_0  pypi
azureml-core                          1.0.85.5  pypi_0  pypi
azureml-dataprep                      1.1.38    pypi_0  pypi
azureml-dataprep-native               13.2.0    pypi_0  pypi
azureml-defaults                      1.0.85.1  pypi_0  pypi
azureml-explain-model                 1.0.85    pypi_0  pypi
azureml-interpret                     1.0.85    pypi_0  pypi
azureml-pipeline                      1.0.85    pypi_0  pypi
azureml-pipeline-core                 1.0.85.1  pypi_0  pypi
azureml-pipeline-steps                1.0.85    pypi_0  pypi
azureml-sdk                           1.0.85    pypi_0  pypi
azureml-telemetry                     1.0.85.2  pypi_0  pypi
azureml-train                         1.0.85    pypi_0  pypi
azureml-train-automl                  1.0.85    pypi_0  pypi
azureml-train-automl-client           1.0.85.4  pypi_0  pypi
azureml-train-automl-runtime          1.0.85.5  pypi_0  pypi
azureml-train-core                    1.0.85    pypi_0  pypi
azureml-train-restclients-hyperdrive  1.0.85    pypi_0  pypi

jadhosn on 16 Mar 2020

@anliakho2 Are we sure that dataprep 1.3.5 plays nicely with 1.0.85? Right off the bat, moving to dataprep 1.3.5 returns some version mismatches (see below):

ERROR: azureml-train-automl 1.0.85 has requirement azureml-dataprep[fuse,pandas]<1.2.0a,>=1.1.37a, but you'll have azureml-dataprep 1.3.5 which is incompatible.
ERROR: azureml-train-automl-runtime 1.0.85.5 has requirement azureml-dataprep[fuse,pandas]<1.2.0a,>=1.1.37a, but you'll have azureml-dataprep 1.3.5 which is incompatible.
ERROR: azureml-train-automl-runtime 1.0.85.5 has requirement pandas<=0.23.4,>=0.21.0, but you'll have pandas 1.0.1 which is incompatible.
ERROR: azureml-train-automl-client 1.0.85.4 has requirement azureml-dataprep<1.2.0a,>=1.1.37a, but you'll have azureml-dataprep 1.3.5 which is incompatible.
ERROR: azureml-sdk 1.0.85 has requirement azureml-dataprep[fuse]<1.2.0a,>=1.1.37a, but you'll have azureml-dataprep 1.3.5 which is incompatible.
ERROR: azureml-automl-runtime 1.0.85.5 has requirement azureml-dataprep[fuse,pandas]<1.2.0a,>=1.1.37a, but you'll have azureml-dataprep 1.3.5 which is incompatible.
ERROR: azureml-automl-runtime 1.0.85.5 has requirement pandas<=0.23.4,>=0.21.0, but you'll have pandas 1.0.1 which is incompatible.
ERROR: azureml-automl-core 1.0.85.5 has requirement azureml-dataprep<1.2.0a,>=1.1.37a, but you'll have azureml-dataprep 1.3.5 which is incompatible.

jadhosn on 16 Mar 2020
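
The conflict pip reports can be checked by hand: the 1.0.85 packages pin azureml-dataprep to the range >=1.1.37a,<1.2.0a, and 1.3.5 falls outside it. The sketch below illustrates that comparison with a deliberately simplified parser. `parse_version` and `in_range` are hypothetical helpers, and pre-release suffixes such as the trailing "a" are simply dropped, so this is not a full implementation of pip's version rules.

```python
from itertools import takewhile


def parse_version(text):
    """Parse '1.3.5' into (1, 3, 5), keeping only the leading digits of each part.

    Simplification: pre-release suffixes such as the 'a' in '1.2.0a' are dropped.
    """
    return tuple(
        int("".join(takewhile(str.isdigit, part)) or 0)
        for part in text.split(".")
    )


def in_range(version, lower, upper):
    """Return True if lower <= version < upper under the simplified parsing."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# Bounds taken from the pip message: azureml-dataprep<1.2.0a,>=1.1.37a
print(in_range("1.1.38", "1.1.37a", "1.2.0a"))  # True: the version in the environment listed above
print(in_range("1.3.5", "1.1.37a", "1.2.0a"))   # False: the version that carries the fix
```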

Yeah, sorry, I thought that you were already on the latest azureml-sdk (1.1.5). You'd better upgrade to azureml-sdk 1.1.5.1; that would work for sure.
Or is there a reason why you want to stay behind on azureml-sdk 1.0.85?

anliakho2 on 16 Mar 2020

@anliakho2 Oh, it's been a while since I checked the release notes page. We wanted to stick with 1.0.85 as it was the most recent stable version. I guess with 1.1.5.1 not being an rc, we might consider moving up. Just in case we stick with 1.0.85 (as we've developed quite a bit on that SDK version): does the "buffer too small" error have no fix unless we move up?

jadhosn on 16 Mar 2020

anliakho2 on 16 Mar 2020

🚀2

Thank you @anliakho2 for providing the different options. That was very helpful.

jadhosn on 16 Mar 2020

🚀1
