Presto: unable to insert empty map data type into parquet format

Created on 24 Aug 2016 · 14 comments · Source: prestodb/presto

Hi,

I am trying to insert some data that may contain an empty value for a map column into a Parquet table, and I keep getting:

Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead

Looks similar to this Avro issue: https://github.com/Parquet/parquet-mr/pull/169

Edward

This is a known issue and the root cause boils down to the Hive Parquet writers, which Presto uses to write Parquet data (see HIVE-11625).

But when I used Hive/Hadoop to perform the same operation, it completed without error.

Which Hive version are you using?

I was using Hive 1.0 on EMR. And the issue I am having is a bit different from the ticket you linked: Presto seems to fail if the map being inserted is empty (like {}).

Here is a concrete example:

presto:default> create table test (a map<varchar(3), integer>) with (format='PARQUET');
CREATE TABLE
presto:default> show columns from test;
 Column |           Type           | Comment 
--------+--------------------------+---------
 a      | map(varchar(3), integer) |         
(1 row)

Query 20160824_221422_00259_hnsq7, FINISHED, 2 nodes
Splits: 2 total, 2 done (100.00%)
0:00 [1 rows, 70B] [2 rows/s, 196B/s]

presto:default> insert into test select MAP(ARRAY['foo', 'bar'], ARRAY[1, 2]);
INSERT: 1 row

Query 20160824_221457_00260_hnsq7, FINISHED, 3 nodes
Splits: 4 total, 4 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

presto:default> select * from test;
       a        
----------------
 {bar=2, foo=1} 
(1 row)

Query 20160824_221502_00261_hnsq7, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:00 [1 rows, 344B] [3 rows/s, 1.05KB/s]

presto:default> insert into test select MAP(ARRAY[], ARRAY[]);

Query 20160824_221509_00262_hnsq7, FAILED, 3 nodes
Splits: 4 total, 2 done (50.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20160824_221509_00262_hnsq7 failed: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead

presto:default> select MAP(ARRAY[], ARRAY[]);
 _col0 
-------
 {}    
(1 row)

Query 20160824_221535_00263_hnsq7, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

If you re-run your query and check the complete stack trace (pass --debug to the CLI) you will see that the root cause is the same as HIVE-11625. So it's really a problem with the underlying Hive Parquet writers, and it's hard to fix at the Presto level unless Presto has its own Parquet writers.
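
For reference, the flag goes on the CLI invocation itself; a typical session (server address, catalog, and schema here are illustrative) would be started as:

presto --server localhost:8080 --catalog hive --schema default --debug

With --debug enabled, the failed query prints the full Java stack trace: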

Query 20160825_165119_00008_3zd6n failed: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
java.lang.RuntimeException: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
        at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
        at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
        at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
        at com.facebook.presto.hive.HivePageSink$HiveRecordWriter.addRow(HivePageSink.java:747)
        at com.facebook.presto.hive.HivePageSink.doAppend(HivePageSink.java:411)
        at com.facebook.presto.hive.HivePageSink.lambda$appendPage$2(HivePageSink.java:390)
        at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
        at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:76)
        at com.facebook.presto.hive.HivePageSink.appendPage(HivePageSink.java:390)
        at com.facebook.presto.spi.classloader.ClassLoaderSafeConnectorPageSink.appendPage(ClassLoaderSafeConnectorPageSink.java:42)
        at com.facebook.presto.operator.TableWriterOperator.addInput(TableWriterOperator.java:207)
        at com.facebook.presto.operator.Driver.processInternal(Driver.java:384)
        at com.facebook.presto.operator.Driver.processFor(Driver.java:301)
        at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
        at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:529)
        at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:665)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: parquet.io.ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead
        at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:244)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60)
        ... 23 more

Got it. I will upvote there.

I am having the same issue. How did you overcome this? Did you change your schema or replace null values?

I am also facing this issue. Any workaround, please?

@nezihyigitbasi You are right that the Parquet writer does not allow empty maps, but I may have found something specific to Presto: if you create a column typed as an array of structs and insert an empty array, you'll get the same error.

CREATE TABLE empty_map_array (some_list ARRAY(ROW(thing varchar))) WITH (format = 'PARQUET');
INSERT INTO empty_map_array (some_list) VALUES (ARRAY[]);

you'll get:

Query 20190215_193022_00006_auzvk, FAILED, 1 node
Splits: 51 total, 33 done (64.71%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20190215_193022_00006_auzvk failed: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead

If I'm not mistaken, empty arrays should be allowed in Parquet (tested in Spark).
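
For what it's worth, here is a minimal Spark SQL sketch of the same shape that, as far as I recall, writes the empty array without complaint (table and field names are illustrative; the explicit CAST just gives the empty array the right element type):

CREATE TABLE empty_array_test (some_list ARRAY<STRUCT<thing: STRING>>) USING PARQUET;
INSERT INTO empty_array_test SELECT CAST(array() AS ARRAY<STRUCT<thing: STRING>>);
SELECT * FROM empty_array_test;  -- returns [] rather than failing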

I stumbled upon the same problem with Athena: when I do a CTAS query there that has empty arrays in the result set, I get this error.

Not really a solution to the actual issue here, but re: @PowerToThePeople111's problem, you can work around the write-to-Parquet error by writing the Athena CTAS query to JSON instead.

There are performance/compression tradeoffs, but they may be preferable to the hours I spent fruitlessly trying to coalesce my nulls into compliance.
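
For anyone going that route, a hedged sketch of the JSON CTAS (table name, source query, and S3 location are illustrative):

CREATE TABLE my_table_json
WITH (format = 'JSON',
      external_location = 's3://my-bucket/my_table_json/')
AS SELECT * FROM source_with_empty_arrays;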

Has anyone come across a way to write an Athena CTAS query that checks the cardinality of a nested structure? In my case, I have a struct that can contain empty arrays.

In our case, this happens with an Athena CTAS trying to write some rows with empty arrays. Our fix is to use nullif, e.g.:

INSERT INTO ... 
SELECT
  nullif(my_col, ARRAY[]) as my_col
FROM ...

Not sure if this is an option in your case, @alecbw, but if it is, it may be better than writing to JSON. @abhishekupadhyaya perhaps you can try the same trick with an empty map, maybe:

nullif(my_col, MAP(ARRAY[], ARRAY[])) as my_col
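
For the nested case @abhishekupadhyaya describes, the same idea can be pushed inside the struct by rebuilding the ROW and nulling out any inner array whose cardinality is zero; a hedged Presto SQL sketch, with illustrative field names and types:

SELECT
  CAST(ROW(
    my_struct.id,
    IF(cardinality(my_struct.items) = 0, NULL, my_struct.items)
  ) AS ROW(id varchar, items array(varchar))) AS my_struct
FROM source_table

On the read side, coalesce(my_col, ARRAY[]) can turn the NULLs back into empty arrays if downstream code expects [] rather than NULL.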