Pulsar: Use timestamp as "__publish_time__" type in Pulsar SQL.

Created on 16 Jul 2019  路  6Comments  路  Source: apache/pulsar

Motivation

"__publish_time__" is Pulsar SQL internal column, as Pulsar only stores timestamps, it doesn鈥檛 store the timezone information. Use timestamp as "__publish_time__" type is more correct way in Pulsar SQL.

componensql typbug typfeature

Most helpful comment

Please also make sure that the publish time predicate push down is still working correctly after this change:

https://github.com/apache/pulsar/blob/master/pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarSplitManager.java#L323

All 6 comments

@congbobo184 please take a look this issue.

Please also make sure that the publish time predicate push down is still working correctly after this change:

https://github.com/apache/pulsar/blob/master/pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarSplitManager.java#L323

So I have done some experiments with changing __publish_time__ from TimestampWithTimeZoneType.TIMESTAMP_WITH_TIME_ZONE -> TimestampType .TIMESTAMP. When I change the type, the predicate doesn't get pushed down correctly from Presto. There seems to some issues with Presto.

Also this is another problem:

https://github.com/prestodb/presto/issues/6595

From my testing it seems in Presto if the column is of type TimestampWithTimeZoneType.TIMESTAMP_WITH_TIME_ZONE then timestamp in the predicate of a query with no timezone '2019-07-16 01:47:28.905' can be coerced to have a timezone and the predicate gets pushed down.

However if the column TimestampType .TIMESTAMP that doesn't work.

@jerrypeng I try to change type of " __publish_time__" to TimestampType .TIMESTAMP, and can pushed down correctly by using function from_iso8601_timestamp();

Here is my tests at superset:
image

Change value of " __publish_time__", the returned count of rows will be reduced accordingly:
image

And i check the detail of returned rows, looks correctly:
image

@codelipenghui predicate pushdown does not have anything to do with the correctness of the results. Its an optimization we use in Pulsar SQL to retrieve only the data that satisfies the predicate from bookies since entries are sorted by publish time.

Please take a look at this code:

https://github.com/apache/pulsar/blob/master/pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarSplitManager.java#L295

In my tests, when I can TIMESTAMP_WITH_TIME_ZONE -> TIMESTAMP presto does not provide the correct predicate information for our connector to perform this optimization.

@jerrypeng I have create a new PR and verified "__publish_time__" push down is work well, PTAL.

Was this page helpful?
0 / 5 - 0 ratings