See here for full details https://stackoverflow.com/q/55730926/6067848, but basically I have a dead-simple Parquet file in Azure Blob, and when I follow the steps above to load it via PolyBase and T-SQL, I get a conversion error for any numeric data: class java.lang.Long cannot be cast to class parquet.io.api.Binary. While it might be a problem with the file, the file works with other tools, so it seems to be an issue with PolyBase. Please help! Driving me nuts.
@yehoshuadimarsky Thank you very much for your interest in Azure cloud services. I responded on the SO question regarding the specific CREATE EXTERNAL DATA SOURCE statement, but it would be good to see the other two T-SQL statements you need, for the FILE FORMAT and the EXTERNAL TABLE. In total, you should have run three separate T-SQL statements, such as:
Create an external data source:
-- PolyBase only: Azure Storage Blob as data source
-- (on SQL Server 2016)
CREATE EXTERNAL DATA SOURCE data_source_name
WITH (
    TYPE = BLOB_STORAGE,
    LOCATION = 'wasb[s]://container@account_name.blob.core.windows.net'
    [, CREDENTIAL = credential_name ]
)
[;]
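For illustration, a filled-in version might look like the following (all names and the storage key are placeholders, and note that for PolyBase external tables the TYPE value is typically HADOOP, as comes up later in this thread):

-- Assumes a database master key already exists.
-- Placeholder names throughout; the SECRET is the storage account access key.
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH
    IDENTITY = 'user',
    SECRET = '<storage-account-key>';

-- For PolyBase external tables, TYPE = HADOOP is used with wasbs:// locations.
CREATE EXTERNAL DATA SOURCE MyAzureBlobStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@myaccount.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);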
Define the file format with a compression library (SnappyCodec works most of the time):
-- Create an external file format for PARQUET files.
CREATE EXTERNAL FILE FORMAT file_format_name
WITH (
    FORMAT_TYPE = PARQUET
    [ , DATA_COMPRESSION = {
        'org.apache.hadoop.io.compress.SnappyCodec'
      | 'org.apache.hadoop.io.compress.GzipCodec' }
    ]
);
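A filled-in sketch might look like this (the format name is a placeholder); use the GzipCodec value instead if the file was written with gzip:

-- Placeholder name; Snappy is the default codec for many Parquet writers (e.g. Spark).
CREATE EXTERNAL FILE FORMAT ParquetSnappyFormat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);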
And finally define the table where you want the data to go, if you wish to represent the data in an external table for further ELT activity:
-- Create a new external table
CREATE EXTERNAL TABLE [ database_name . [ schema_name ] . | schema_name. ] table_name
    ( <column_definition> [ ,...n ] )
    WITH (
        LOCATION = 'folder_or_filepath',
        DATA_SOURCE = external_data_source_name,
        FILE_FORMAT = external_file_format_name
        [ , <reject_options> [ ,...n ] ]
    )
[;]
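For illustration, a filled-in version might look like the following (the table name, columns, and folder path are placeholders, reusing the example data source and file format names from above):

-- Placeholder table, columns, and folder path.
CREATE EXTERNAL TABLE dbo.MyExternalTable (
    [name]  NVARCHAR(100),
    [value] FLOAT
)
WITH (
    LOCATION = '/parquet/mydata/',
    DATA_SOURCE = MyAzureBlobStorage,
    FILE_FORMAT = ParquetSnappyFormat
);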
The following tutorial might be helpful in walking you through the process of creating external tables and loading compressed data sources: Load New York Taxicab data to Azure SQL Data Warehouse
I suspect that either the column schema for the affected data or the compression library used in the CREATE EXTERNAL FILE FORMAT T-SQL is not compatible.
Thanks. I just updated the SQL code in the SO question. In fact, I quoted the NYC taxi cab tutorial in my SO question.
Also, I verified that the Parquet file itself is fine - I tested by reading it into Databricks/Spark and it worked. So it seems to be an issue with PolyBase.
Thank you for this additional detail. HADOOP is the correct TYPE value for Databricks, Data Lake Store, and Hadoop itself. I just need to check the possibilities, as there does seem to be an issue with PolyBase. Do you have an Azure Support Plan? If you do, please open a support request with Azure Support. If you do not have a support plan, please send your Subscription ID to AzCommunity and I will return instructions to have a support request created.
Ok thanks.
One thing to check:
This issue has been resolved by Miro, who checked the file and discovered that some data types in the Parquet file were not matching the data.
Further checking of the customer's loading process, we discovered that the order of the columns was not matching from source to SQL DW.
So, a similar issue (this is an issue from the past where the resolution was as quoted above), but check to see whether the schema in the file matches the target table schema.
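For illustration only (hypothetical column names, reusing the placeholder data source and file format from the examples above), this is the kind of mismatch being described - for example, a column stored as int64 in the Parquet file but declared as a string type in the external table is the sort of thing that can surface as a Long-to-Binary cast error like the one reported:

-- Hypothetical Parquet schema: id (int64), amount (double)

-- Mismatched declaration: id is a string type in SQL while the file stores int64,
-- so the reader can end up trying to treat a Long value as Binary.
CREATE EXTERNAL TABLE dbo.Trips_Mismatched (
    [id]     VARCHAR(20),
    [amount] FLOAT
)
WITH (
    LOCATION = '/trips/',
    DATA_SOURCE = MyAzureBlobStorage,
    FILE_FORMAT = ParquetSnappyFormat
);

-- Matching declaration: types and column order follow the file.
CREATE EXTERNAL TABLE dbo.Trips (
    [id]     BIGINT,
    [amount] FLOAT
)
WITH (
    LOCATION = '/trips/',
    DATA_SOURCE = MyAzureBlobStorage,
    FILE_FORMAT = ParquetSnappyFormat
);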
The schema does match. Use my code on Stack Overflow to reproduce - very simple schema, 1 string and 1 float64. Run that code, it doesn't work.
Hate to sound obnoxious here but why do I need to pay for a support plan for this? Microsoft offers a product (PolyBase via Azure SQL Data Warehouse) that is not working as specified, seems there is a defect in the product. So why as a paying customer do I need to pay extra simply to get the product to work?
@yehoshuadimarsky I believe support can make that assessment and credit your support plan accordingly.
I can still enable a free support incident if you wish to go that route.
Yes thanks let's start with the free support incident and go from there.
I need you to send your Subscription ID to AzCommunity, and I will send you instructions to get this created with support. Thanks!
We will now proceed to close this thread. If there are further questions regarding this matter, please comment and we will gladly continue the discussion.
I have the same issue