Hi,
I am trying to insert data that may contain an empty value for a map column into a Parquet table, and I keep getting:
Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
Looks similar to this avro issue: https://github.com/Parquet/parquet-mr/pull/169
Edward
This is a known issue and the root cause boils down to the Hive Parquet writers, which Presto uses to write Parquet data (see HIVE-11625).
But when I used hive/hadoop to perform the operation, it was able to complete the operation without error
Which hive version are you using?
On Tue, Aug 23, 2016 at 7:43 PM Edward X. Wu wrote:
But when I used hive/hadoop to perform the operation, it was able to complete the operation without error
I was using Hive 1.0 on EMR. The issue I am having is a bit different from the ticket you linked: Presto seems to fail if the map being inserted is empty (i.e. {}).
Here is a concrete example:
presto:default> create table test (a map<varchar(3), integer>) with (format='PARQUET');
CREATE TABLE
presto:default> show columns from test;
Column | Type | Comment
--------+--------------------------+---------
a | map(varchar(3), integer) |
(1 row)
Query 20160824_221422_00259_hnsq7, FINISHED, 2 nodes
Splits: 2 total, 2 done (100.00%)
0:00 [1 rows, 70B] [2 rows/s, 196B/s]
presto:default> insert into test select MAP(ARRAY['foo', 'bar'], ARRAY[1, 2]);
INSERT: 1 row
Query 20160824_221457_00260_hnsq7, FINISHED, 3 nodes
Splits: 4 total, 4 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
presto:default> select * from test;
a
----------------
{bar=2, foo=1}
(1 row)
Query 20160824_221502_00261_hnsq7, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:00 [1 rows, 344B] [3 rows/s, 1.05KB/s]
presto:default> insert into test select MAP(ARRAY[], ARRAY[]);
Query 20160824_221509_00262_hnsq7, FAILED, 3 nodes
Splits: 4 total, 2 done (50.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
Query 20160824_221509_00262_hnsq7 failed: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
presto:default> select MAP(ARRAY[], ARRAY[]);
_col0
-------
{}
(1 row)
Query 20160824_221535_00263_hnsq7, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
If you re-run your query and check the complete stack trace (pass --debug to the CLI), you will see that the root cause is the same as HIVE-11625. So it's really a problem with the underlying Hive Parquet writers, and it's hard to fix at the Presto level unless Presto had its own Parquet writers.
Query 20160825_165119_00008_3zd6n failed: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
java.lang.RuntimeException: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
at com.facebook.presto.hive.HivePageSink$HiveRecordWriter.addRow(HivePageSink.java:747)
at com.facebook.presto.hive.HivePageSink.doAppend(HivePageSink.java:411)
at com.facebook.presto.hive.HivePageSink.lambda$appendPage$2(HivePageSink.java:390)
at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:76)
at com.facebook.presto.hive.HivePageSink.appendPage(HivePageSink.java:390)
at com.facebook.presto.spi.classloader.ClassLoaderSafeConnectorPageSink.appendPage(ClassLoaderSafeConnectorPageSink.java:42)
at com.facebook.presto.operator.TableWriterOperator.addInput(TableWriterOperator.java:207)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:384)
at com.facebook.presto.operator.Driver.processFor(Driver.java:301)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:529)
at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:665)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: parquet.io.ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead
at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:244)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60)
... 23 more
Got it. I will upvote there.
I am having the same issue. How did you overcome this? Did you change your schema or replace null values?
I am also facing this issue. Is there any workaround, please?
@nezihyigitbasi You are right that the Parquet writer does not allow empty maps, but I may have found something specific to Presto. If you create an empty array of structs, you'll get the same error.
CREATE TABLE empty_map_array (some_list ARRAY(ROW(thing varchar))) WITH (format = 'PARQUET');
INSERT INTO empty_map_array (some_list) VALUES (ARRAY[]);
you'll get
Query 20190215_193022_00006_auzvk, FAILED, 1 node
Splits: 51 total, 33 done (64.71%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]
Query 20190215_193022_00006_auzvk failed: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
If I'm not mistaken, empty arrays should be allowed in Parquet (tested in Spark).
I stumbled upon the same problem with Athena: when I do a CTAS query there that has empty arrays in the result set, I get this error.
Not really a solution to the actual issue here, but re: @PowerToThePeople111's problem, you can work around the write-to-Parquet error by writing the Athena CTAS output to JSON instead.
There are performance/compression tradeoffs, but they may be preferable to the hours I spent fruitlessly trying to coalesce my nulls into compliance.
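As a rough sketch of that workaround (the table names, column list, and S3 location below are made up for illustration), an Athena CTAS can target JSON instead of Parquet so that rows with empty arrays or maps never reach the Parquet writer:

```sql
-- Hypothetical example: write the CTAS output as JSON rather than Parquet.
-- Rows containing empty arrays/maps are fine in a text-based format.
CREATE TABLE my_json_copy
WITH (
    format = 'JSON',
    external_location = 's3://my-bucket/my_json_copy/'
)
AS
SELECT *
FROM source_table;
```

The resulting table can later be re-materialized as Parquet once the empty collections have been replaced with NULLs.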
Has anyone come across a way to write an Athena CTAS query that checks the cardinality of a nested structure? In my case, I have a struct that can contain empty arrays.
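One way such a check could look (sketch only: my_struct and items are placeholder names; cardinality and IF are standard Presto/Athena functions):

```sql
-- Replace an empty nested array with NULL before the Parquet write,
-- so the writer never sees an empty collection.
SELECT
    IF(cardinality(my_struct.items) = 0, NULL, my_struct.items) AS items
FROM source_table;
```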
In our case, this happens when an Athena CTAS tries to write rows with empty arrays. Our fix is to use nullif, e.g.:
INSERT INTO ...
SELECT
nullif(my_col, ARRAY[]) as my_col
FROM ...
Not sure if this is an option in your case, @alecbw, but if it is, it may be better than writing to JSON. @abhishekupadhyaya perhaps you can try checking for an empty map, maybe:
nullif(my_col, MAP(ARRAY[], ARRAY[])) as my_col
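Putting the two suggestions together, a minimal sketch (column and table names are placeholders) that blanks out both empty arrays and empty maps before the write might look like:

```sql
-- Convert empty collections to NULL so the Hive Parquet writer
-- omits the field instead of writing an illegal empty one.
INSERT INTO target_table
SELECT
    nullif(my_array_col, ARRAY[])             AS my_array_col,
    nullif(my_map_col, MAP(ARRAY[], ARRAY[])) AS my_map_col
FROM source_table;
```

Note that nullif relies on equality comparison, so whether it works on a given map column depends on the map's key/value types supporting equality.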