Presto: INSERT creating new Hive table partition uses wrong field delimiters for text format

Created on 3 Feb 2017  路  9Comments  路  Source: prestodb/presto

Hi - When running INSERT INTO a hive table as defined below, it seems Presto is writing valid data files. However running subsequent SELECTs on the table will return all NULL values. When running the SELECT in Hive, the same files are read and displayed correctly. Is this a known/fixed bug or is there an issue with my table definition? I'm running Presto 0.152.3.

CREATE TABLE `test`(
  `d` timestamp,
  `a` string,
  `b` string,
  `c` string
)
PARTITIONED BY (
  `p` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='\t',
  'line.delim'='\n',
  'serialization.format'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bucket'
TBLPROPERTIES (
  'serialization.null.format'=''
)

Thanks

bug

Most helpful comment

Hi,
I found the same behaviour in presto version 0.170. Is it possible?

Thanks

All 9 comments

More details of the issue

1) Initial INSERT reads NULL values

presto> insert into test select TIMESTAMP '2016-12-01', 'a','b','c','p';
INSERT: 1 row
presto> select * from test;
  d   |  a   |  b   |  c   | p
------+------+------+------+---
 NULL | NULL | NULL | NULL | p

2) Second INSERT reads NULLs for first row but reads new row correctly

presto> insert into test select TIMESTAMP '2016-12-01', 'a','b','c','p';
INSERT: 1 row
presto> select * from test;
            d            |  a   |  b   |  c   | p
-------------------------+------+------+------+---
 NULL                    | NULL | NULL | NULL | p
 2016-12-01 00:00:00.000 | a    | b    | c    | p

I reproduced latest Presto build. This is just FYI, It seems that this issue occurs when you specify new partition as the value.

presto:default> insert into test2 select TIMESTAMP '2016-12-01', 'a','b','c','a';
INSERT: 1 row

presto:default> select * from test2;
            d            |  a   |  b   |  c   | p 
-------------------------+------+------+------+---
 NULL                    | NULL | NULL | NULL | p 
 2016-12-01 00:00:00.000 | a    | b    | c    | p 
 NULL                    | NULL | NULL | NULL | a 
 2016-12-01 00:00:00.000 | a    | b    | c    | p 
 2016-12-01 00:00:00.000 | a    | b    | c    | p 
(5 rows)

presto:default> insert into test2 select TIMESTAMP '2016-12-01', 'a','b','c','b';
INSERT: 1 row

presto:default> select * from test2;
            d            |  a   |  b   |  c   | p 
-------------------------+------+------+------+---
 2016-12-01 00:00:00.000 | a    | b    | c    | p 
 NULL                    | NULL | NULL | NULL | p 
 2016-12-01 00:00:00.000 | a    | b    | c    | p 
 NULL                    | NULL | NULL | NULL | a 
 NULL                    | NULL | NULL | NULL | b 
 2016-12-01 00:00:00.000 | a    | b    | c    | p 
(6 rows)

I could confirmed following create definition.

CREATE TABLE test2(
d timestamp,
a string,
b string,
c string
)
PARTITIONED BY (p string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='\t',
'line.delim'='\n',
'serialization.format'='\t')

I tried to debug this issue in code and found out that the value of feild.delim is not being passed correctly in the case of first insert in a partition.
In MetastoreUtil file, getHiveSchema() method is called which is picking its Storage schema and serde parameters by eventually making a getschema for partition call which does not have information about table delimiters, hence it tried to use default delimiters and returns null when it fails to read the value.

Nice found @ankitdixit ! Are you going to send the Pull Request?

Looking some more, i found out that the issue is presto does not use field delimiters specified while creating a table if there is not data in the partition. The first insert is done using default delimiters and subsequently the read fails.

Hi,
I found the same behaviour in presto version 0.170. Is it possible?

Thanks

This has been open for awhile. Should probably be flagged as a bug by one of the guys?

I also faced similar issue of getting Null value through Presto select query every time, I'm inserting data into hive partitioned table from presto (happens only for new partition only). Is there any work around ?

Just to confirm, experiencing the same issue on Presto 0.189. Seems like only text format is affected, using the target table with ORC storage format resolved the issue for me.

Was this page helpful?
0 / 5 - 0 ratings