Presto: unable to insert empty map data type into parquet format

Created on 24 Aug 2016 · 14 comments · Source: prestodb/presto

Hi,

I am trying to insert some data that may contain an empty value for a map column into a Parquet table, and I keep getting:

Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead

Looks similar to this Avro issue: https://github.com/Parquet/parquet-mr/pull/169

Edward

This is a known issue and the root cause boils down to the Hive Parquet writers, which Presto uses to write Parquet data (see HIVE-11625).

But when I used Hive/Hadoop to perform the same operation, it completed without error.

Which Hive version are you using?

I was using Hive 1.0 on EMR. And the issue I am having is a bit different from the ticket you linked: Presto seems to fail if the map being inserted is empty (like {}).

Here is a concrete example:

presto:default> create table test (a map<varchar(3), integer>) with (format='PARQUET');
CREATE TABLE
presto:default> show columns from test;
 Column |           Type           | Comment 
--------+--------------------------+---------
 a      | map(varchar(3), integer) |         
(1 row)

Query 20160824_221422_00259_hnsq7, FINISHED, 2 nodes
Splits: 2 total, 2 done (100.00%)
0:00 [1 rows, 70B] [2 rows/s, 196B/s]

presto:default> insert into test select MAP(ARRAY['foo', 'bar'], ARRAY[1, 2]);
INSERT: 1 row

Query 20160824_221457_00260_hnsq7, FINISHED, 3 nodes
Splits: 4 total, 4 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

presto:default> select * from test;
       a        
----------------
 {bar=2, foo=1} 
(1 row)

Query 20160824_221502_00261_hnsq7, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:00 [1 rows, 344B] [3 rows/s, 1.05KB/s]

presto:default> insert into test select MAP(ARRAY[], ARRAY[]);

Query 20160824_221509_00262_hnsq7, FAILED, 3 nodes
Splits: 4 total, 2 done (50.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20160824_221509_00262_hnsq7 failed: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead

presto:default> select MAP(ARRAY[], ARRAY[]);
 _col0 
-------
 {}    
(1 row)

Query 20160824_221535_00263_hnsq7, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

If you re-run your query and check the complete stack trace (pass --debug to the CLI) you will see that the root cause is the same as HIVE-11625. So it's really a problem with the underlying Hive Parquet writers, and it's hard to fix at the Presto level unless Presto has its own Parquet writers.
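
For reference, the flag goes on the CLI invocation itself; a typical session (server address, catalog, and schema here are illustrative) would be started as:

presto --server localhost:8080 --catalog hive --schema default --debug

With --debug enabled, the failed query prints the full Java stack trace: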

Query 20160825_165119_00008_3zd6n failed: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
java.lang.RuntimeException: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
        at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
        at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
        at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
        at com.facebook.presto.hive.HivePageSink$HiveRecordWriter.addRow(HivePageSink.java:747)
        at com.facebook.presto.hive.HivePageSink.doAppend(HivePageSink.java:411)
        at com.facebook.presto.hive.HivePageSink.lambda$appendPage$2(HivePageSink.java:390)
        at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
        at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:76)
        at com.facebook.presto.hive.HivePageSink.appendPage(HivePageSink.java:390)
        at com.facebook.presto.spi.classloader.ClassLoaderSafeConnectorPageSink.appendPage(ClassLoaderSafeConnectorPageSink.java:42)
        at com.facebook.presto.operator.TableWriterOperator.addInput(TableWriterOperator.java:207)
        at com.facebook.presto.operator.Driver.processInternal(Driver.java:384)
        at com.facebook.presto.operator.Driver.processFor(Driver.java:301)
        at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
        at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:529)
        at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:665)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: parquet.io.ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead
        at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:244)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60)
        ... 23 more

Got it. I will upvote there.

I am having the same issue. How did you overcome this? Did you change your schema or replace null values?

I am also facing this issue. Any workaround, please?

@nezihyigitbasi You are right that the Parquet writer does not allow empty maps, but I may have found something specific to Presto: if you create a column typed as an array of structs and insert an empty array, you'll get the same error.

CREATE TABLE empty_map_array (some_list ARRAY(ROW(thing varchar))) WITH (format = 'PARQUET');
INSERT INTO empty_map_array (some_list) VALUES (ARRAY[]);

you'll get:

Query 20190215_193022_00006_auzvk, FAILED, 1 node
Splits: 51 total, 33 done (64.71%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20190215_193022_00006_auzvk failed: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead

If I'm not mistaken, empty arrays should be allowed in Parquet (tested in Spark).
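
For what it's worth, here is a minimal Spark SQL sketch of the same shape that, as far as I recall, writes the empty array without complaint (table and field names are illustrative; the explicit CAST just gives the empty array the right element type):

CREATE TABLE empty_array_test (some_list ARRAY<STRUCT<thing: STRING>>) USING PARQUET;
INSERT INTO empty_array_test SELECT CAST(array() AS ARRAY<STRUCT<thing: STRING>>);
SELECT * FROM empty_array_test;  -- returns [] rather than failing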

I stumbled upon the same problem with Athena: when I do a CTAS query there that has empty arrays in the result set, I get this error.

Not really a solution to the actual issue here, but re: @PowerToThePeople111's problem, you can work around the write-to-Parquet error by writing the Athena CTAS query to JSON instead.

There are performance/compression tradeoffs, but they may be preferable to the hours I spent fruitlessly trying to coalesce my nulls into compliance.
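
For anyone going that route, a hedged sketch of the JSON CTAS (table name, source query, and S3 location are illustrative):

CREATE TABLE my_table_json
WITH (format = 'JSON',
      external_location = 's3://my-bucket/my_table_json/')
AS SELECT * FROM source_with_empty_arrays;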

Has anyone come across a way to write an Athena CTAS query that checks the cardinality of a nested structure? In my case, I have a struct that can contain empty arrays.

In our case, this happens with an Athena CTAS trying to write some rows with empty arrays. Our fix is to use nullif, e.g.:

INSERT INTO ... 
SELECT
  nullif(my_col, ARRAY[]) as my_col
FROM ...

Not sure if this is an option in your case, @alecbw, but if it is, it may be better than writing to JSON. @abhishekupadhyaya perhaps you can try the same trick with an empty map, maybe:

nullif(my_col, MAP(ARRAY[], ARRAY[])) as my_col
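
For the nested case @abhishekupadhyaya describes, the same idea can be pushed inside the struct by rebuilding the ROW and nulling out any inner array whose cardinality is zero; a hedged Presto SQL sketch, with illustrative field names and types:

SELECT
  CAST(ROW(
    my_struct.id,
    IF(cardinality(my_struct.items) = 0, NULL, my_struct.items)
  ) AS ROW(id varchar, items array(varchar))) AS my_struct
FROM source_table

On the read side, coalesce(my_col, ARRAY[]) can turn the NULLs back into empty arrays if downstream code expects [] rather than NULL.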