Query fails when table schema is defined as string but partition schema is decimal. Error message below:
Query failed (#1111): There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'aaaa' in table 'ttttt' is declared as type 'string', but partition ‘pppp’ declared column 'aaaa' as type 'decimtal(4,0)'. [DB Errorcode=16777224]
The Hive Metastore has this column aaaa defined as type string but older partitions (like pppp) of parquet schema has that column defined as type decimal. Hive is able to read from the older partitions.
Ended up in this situation as the field type had to be changed from decimal to string, for business reasons.
Can Presto cast decimal to type string in above situation to return a result? (Not asking for string to decimal casting)
The Hive connector should support any conversion that is supported consistently in Hive (certain conversions work different for different file formats which is harder). Any numeric type to string should work. It starts with HiveCoercionPolicy and likely needs to actually be implemented elsewhere.
This commit is a good starting point to see what needs to be changed and where the tests go: https://github.com/prestosql/presto/commit/f0eb4f32f34d2e12eb1bdc0913b3a0602bbac642
decimtal instead of decimal ? @anirbanch
Is anyone looking into this issue?
Also 1 qq, why Presto do check for partition metadata for all columns? For the above reported issue, user is not using any problematic column in the query anywhere but still Presto reports an issue as it compares the full table/partition metadata. We noticed that hive also reports the issue for same column but only when the column is used in the query.
Now since the data type was changed for only 1-2 columns unsupported by Presto, all our users who are accessing this table are paying the penalty as none of our users are able to query this table now. This is our detail table which is used very heavily.
Until this issue is fixed, as a workaround, we are thinking of creating another external table with the old structure on the older set of partitions or historical data specific to our Presto users to unblock them but would be great if the code can be fixed ASAP to support this.
The reason is that Presto connectors need to return metadata for all columns in the table. The engine doesn’t ask for metadata about specific columns.
The table/partition mismatch is actually a different issue. This happens at runtime during scheduling and it might be possible to only do it for used columns, but I don’t know the complexity of that change.
Is there a workaround to this problem? We are getting the same issue:
There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'equipments' in table 'xxxxx' is declared as type 'array<struct<action:string,devicename:string,devicetype:string,model:string,outlet:string,serialnumber:string,isplaceholder:boolean,reason:string,categorycode:string,reasoncategory:string,reasoncode:string,otherreason:string>>', but partition 'scheduledate=2020-01-10' declared column 'equipments' as type 'array<struct<action:string,deviceName:string,deviceName:string,deviceType:string,model:string,outlet:string,serialNumber:string,isPlaceHolder:boolean,reason:string,categoryCode:string,reasonCategory:string,reasonCode:string,otherReason:string>>'.
All we are trying to do is to get the data for the previous day of the current date using code WHERE date(SCHEDULEDATE) = date_add('day', -1, current_date)
We are not sure why it's looking into scheduledate=2020-01-10 and we are not even interested in column equipments.
I'm willing to pick this up starting the coming weekend. This is a pain point for me too.
The existing workarounds are to create a new external table with the problematic column removed but it would be nice to support such conversions where possible.
@electrum As a first step will allowing coercion from any numeric type to string be useful? I've started work on this in https://github.com/prestosql/presto/pull/4426.
Having Presto be silent in case of schema mismatch on an unused column would require changes to the SPI as you said and I'm not sure it's a better decision than failing earlier with a clear message rather than people discovering the issue so late that they cannot even fix it.
Our general rule for the Hive connector is that it should have the same behavior as Hive. If Hive can read the data in this evolution case, then we should as well (returning the same result).
Thanks @hashhar and @electrum . Is it ready to be merged to master?
@degoyal No. I hadn't been able to spend a lot of time on this. Pending things are to verify if Hive allows implicit conversion between ANY numeric type and Strings. If someone could create a matrix of supported conversions it would help since I don't have a Hive setup handy and only devote weekends to this.
For Hive setup, you can use this.
Hive 1.2
./presto-product-tests-launcher/bin/run-launcher env up --environment singlenode --without-presto
Hive 3.1
./presto-product-tests-launcher/bin/run-launcher env up --environment singlenode-hdp3 --without-presto