See here for full details https://stackoverflow.com/q/55730926/6067848, but basically I have a dead-simple Parquet file in Azure Blob, and when I follow the steps above to load it via PolyBase and T-SQL, I get a conversion error for any numeric data: class java.lang.Long cannot be cast to class parquet.io.api.Binary. While it might be a problem with the file, the file works with other tools, so it seems to be an issue with PolyBase. Please help! Driving me nuts.
@yehoshuadimarsky Thank you very much for your interest in Azure cloud services. I responded on the SO question regarding the specific CREATE EXTERNAL DATA SOURCE statement, but it would be good to see the other two T-SQL statements you need, for the FILE FORMAT and the EXTERNAL TABLE. In total, you should have run three separate T-SQL statements, such as:
Create an external data source:
-- PolyBase only: Azure Storage Blob as data source
-- (on SQL Server 2016)
CREATE EXTERNAL DATA SOURCE data_source_name
WITH (
    TYPE = BLOB_STORAGE,
    LOCATION = 'wasb[s]://container@account_name.blob.core.windows.net'
    [, CREDENTIAL = credential_name ]
)
[;]
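For illustration, a filled-in version might look like the following (all names and the storage key are placeholders, and note that for PolyBase external tables the TYPE value is typically HADOOP, as comes up later in this thread):

-- Assumes a database master key already exists.
-- Placeholder names throughout; the SECRET is the storage account access key.
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH
    IDENTITY = 'user',
    SECRET = '<storage-account-key>';

-- For PolyBase external tables, TYPE = HADOOP is used with wasbs:// locations.
CREATE EXTERNAL DATA SOURCE MyAzureBlobStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@myaccount.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);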
Define the file format with a compression library (SnappyCodec works most of the time):
-- Create an external file format for PARQUET files.
CREATE EXTERNAL FILE FORMAT file_format_name
WITH (
    FORMAT_TYPE = PARQUET
    [ , DATA_COMPRESSION = {
        'org.apache.hadoop.io.compress.SnappyCodec'
      | 'org.apache.hadoop.io.compress.GzipCodec' }
    ]
);
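A filled-in sketch might look like this (the format name is a placeholder); use the GzipCodec value instead if the file was written with gzip:

-- Placeholder name; Snappy is the default codec for many Parquet writers (e.g. Spark).
CREATE EXTERNAL FILE FORMAT ParquetSnappyFormat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);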
And finally define the table where you want the data to go, if you wish to represent the data in an external table for further ELT activity:
-- Create a new external table
CREATE EXTERNAL TABLE [ database_name . [ schema_name ] . | schema_name. ] table_name
    ( <column_definition> [ ,...n ] )
    WITH (
        LOCATION = 'folder_or_filepath',
        DATA_SOURCE = external_data_source_name,
        FILE_FORMAT = external_file_format_name
        [ , <reject_options> [ ,...n ] ]
    )
[;]
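For illustration, a filled-in version might look like the following (the table name, columns, and folder path are placeholders, reusing the example data source and file format names from above):

-- Placeholder table, columns, and folder path.
CREATE EXTERNAL TABLE dbo.MyExternalTable (
    [name]  NVARCHAR(100),
    [value] FLOAT
)
WITH (
    LOCATION = '/parquet/mydata/',
    DATA_SOURCE = MyAzureBlobStorage,
    FILE_FORMAT = ParquetSnappyFormat
);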
The following tutorial might be helpful in walking you through the process of creating external tables and loading compressed data sources: Load New York Taxicab data to Azure SQL Data Warehouse
I suspect that either the column schema for the affected data or the compression library used in the CREATE EXTERNAL FILE FORMAT T-SQL is not compatible.
Thanks. I just updated the SQL code in the SO question. In fact, I quoted the NYC taxi cab tutorial in my SO question.
Also, I verified that the Parquet file itself is fine - I tested by reading it into Databricks/Spark and it worked. So it seems to be an issue with PolyBase.
Thank you for this additional detail. HADOOP is the correct TYPE value for Databricks, Data Lake Store, and Hadoop itself. I just need to check the possibilities, as there does seem to be an issue with PolyBase. Do you have an Azure Support Plan? If you do, please open a support request with Azure Support. If you do not have a support plan, please send your Subscription ID to AzCommunity and I will return instructions to have a support request created.
Ok thanks.
One thing to check:
This issue has been resolved by Miro, who checked the file and discovered that some data types in the Parquet file were not matching the data.
Further checking of the customer's loading process, we discovered that the order of the columns was not matching from source to SQL DW.
So, a similar issue (this is an issue from the past where the resolution was as quoted above), but check to see whether the schema in the file matches the target table schema.
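For illustration only (hypothetical column names, reusing the placeholder data source and file format from the examples above), this is the kind of mismatch being described - for example, a column stored as int64 in the Parquet file but declared as a string type in the external table is the sort of thing that can surface as a Long-to-Binary cast error like the one reported:

-- Hypothetical Parquet schema: id (int64), amount (double)

-- Mismatched declaration: id is a string type in SQL while the file stores int64,
-- so the reader can end up trying to treat a Long value as Binary.
CREATE EXTERNAL TABLE dbo.Trips_Mismatched (
    [id]     VARCHAR(20),
    [amount] FLOAT
)
WITH (
    LOCATION = '/trips/',
    DATA_SOURCE = MyAzureBlobStorage,
    FILE_FORMAT = ParquetSnappyFormat
);

-- Matching declaration: types and column order follow the file.
CREATE EXTERNAL TABLE dbo.Trips (
    [id]     BIGINT,
    [amount] FLOAT
)
WITH (
    LOCATION = '/trips/',
    DATA_SOURCE = MyAzureBlobStorage,
    FILE_FORMAT = ParquetSnappyFormat
);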
The schema does match. Use my code on Stack Overflow to reproduce - very simple schema, 1 string and 1 float64. Run that code, it doesn't work.
Hate to sound obnoxious here but why do I need to pay for a support plan for this? Microsoft offers a product (PolyBase via Azure SQL Data Warehouse) that is not working as specified, seems there is a defect in the product. So why as a paying customer do I need to pay extra simply to get the product to work?
@yehoshuadimarsky I believe support can make that assessment and credit your support plan accordingly.
I can still enable a free support incident if you wish to go that route.
Yes thanks let's start with the free support incident and go from there.
I need you to send your Subscription ID to AzCommunity, and I will send you instructions to get this created with support. Thanks!
We will now proceed to close this thread. If there are further questions regarding this matter, please comment and we will gladly continue the discussion.
I have the same issue