Presto: Add support for dynamic filtering

Created on 22 Feb 2017  路  16Comments  路  Source: prestodb/presto

fact_table - large fact table, partitioned on date_key
d_date - small dimension table, with significant correlation on it's columns and date_key.

Currently for the following query:

    count(*)
FROM 
    fact_table a 
JOIN d_date b 
ON a.date_key = b.date_key 
WHERE 
    b.year = 2017;

larger probe table is fully scanned. When presto after scaning build table (right) could extract information that for b.year = 2017 there are only small number of matching b.date_key values. This information could be used to narrow down the table scan. In other words, above query could be dynamically "rewritten" to equivalent of:

    count(*)
FROM 
    fact_table a 
JOIN d_date b 
ON a.date_key = b.date_key 
WHERE a.date_key IN (20170101,20170102, etc...)

All 16 comments

Here's a doc with the design, please mind it's a work in progress ... https://docs.google.com/document/d/1NYhEqYfmOPUqYFVzmOWvIXds5reFlaAi7z-zeJWqSLE

Hello,
is there any progress?

Yes, @maciejgrzybek and @arhimondr are well advanced in this effort.

Here's a PR addressing this issue: #8588. This is first version, handling inner joins on split pruning and stripe pruning level. Already after first pass of review.

good issues

I'm facing a similar issue and was hoping that #8588 would address it, but I built Teradata/presto@c701552 and it looks like Presto is still doing full table scans even with dynamic_partition_pruning enabled.

In my case, I have a huge Hive table of comments partitioned on post_id, and a Postgres table of posts, each of which belongs to a user. I'd like to be able to efficiently get the comment count for a given user's posts:

SELECT COUNT(*)
FROM hive.default.comments c
JOIN postgresql.public.posts p
ON c.post_id = p.id
WHERE p.user_id = '123'

Running this query as-is does a full table scan (which eventually causes the Presto workers to exceed their memory limit), whereas running two separate queries (one query on posts to get the post ids, and a second on comments with a predicate on post_id) is lightning-fast.

Since this query is virtually identical to the original one mentioned in this issue, I figured that #8588 would fix the issue, but it doesn't seem to have made any difference. Enabling or disabling the dynamic_partition_pruning session variable has no effect on the query plan.

Am I missing something? Is there some other configuration setting that needs to be changed in order to use this feature?

Thanks!

This could be solved by adding index join support to the PostgreSQL connector.

Is that to say that this PR is only relevant if the underlying connectors implement index joins?

@afn Sorry for the confusion. I was responding to your comment specifically, not talking about this PR. Presto supports index joins for connectors that implement it. Your use case looks like a perfect fit for index joins. There is an example of this in IndexedTpchPlugin.

I have raised a new PR based on earlier PR:https://github.com/prestodb/presto/pull/9453
New PR has bug fixes as mentioned in my review comments + new operators for dynamic filtering.
Can someone please review it ? I can break this PR into few smaller PRs if that is a concern.

@electrum I find the dynamic filtering also solve this same case mentioned by @afn . When should we use index join and when should we use dynamic filtering?

Is this feature still being added? I would love an update. I have seen there is an open pull request on the Presto master but there has been no activity for over a year

@tastyqbit Presto Foundation actively works on getting this feature in (https://github.com/prestosql/presto/pull/91). Once it is there - we will cherry-pick it.

@tooptoop4 https://github.com/prestosql/presto/pull/91 is only the planner part. Are the remaining parts are still in review in Presto SQL?

@tooptoop4

If the feature is fully implemented please feel free to do the required cherry-picks assembling the whole feature and open a PR. We will be happy to do a quick review of it and have it in.

@maciejgrzybek @arhimondr Do we know if anyone working on this feature? Thanks!

Was this page helpful?
0 / 5 - 0 ratings